
Hadoop is supported by many cloud vendors, as the popularity of MapReduce has grown over the past few years. Accumulo is another story: even though its popularity is growing, cloud vendor support hasn't caught up.

Amazon EC2

Amazon has great support for Accumulo, Hadoop, and ZooKeeper. For Hadoop and ZooKeeper, there is a set of libraries called Apache Whirr. Apache Whirr supports Amazon EC2, Rackspace, and many more cloud providers, using the low-level jclouds API libraries under the hood. For Accumulo, you have two options: one is to use the Amazon EMR command-line interface, and the other is to create a new virtual machine and then set it up manually.

Prerequisites for Amazon EC2

Prerequisites needed to complete the setup phase for Amazon EC2 are as follows:

  • An Amazon AWS account
  • A Linux machine, or a Windows machine with Cygwin installed
  • Java installed (required by Apache Whirr)

Creating Amazon EC2 Hadoop and ZooKeeper cluster

The following steps are required to create the Amazon EC2 Hadoop and ZooKeeper cluster:

  1. Log in to https://console.aws.amazon.com.
  2. The management console for Amazon Services gives a nice graphical overview of all available actions. In our case, we use the Amazon AWS Console to verify what we have done while setting up the cluster.
  3. From the drop-down menu under your name at the top-right corner, select Security Credentials.
  4. Under Access Keys, you need to create a new root key and download the file containing AWSAccessKeyId and AWSSecretKey.
  5. Normally, you would create an AWS Identity and Access Management (IAM) user with limited permissions, and give only that user access to the cluster. But in this case, we are creating a demo cluster and will destroy it after use.
  6. Create a new key by running the following command:
    • For Linux and Windows Cygwin:

      ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa_whirr

    The RSA key is used later when configuring Whirr. It is not required to copy the key to the ~/.ssh/authorized_keys file because the RSA key is going to be used from its current location.
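    To confirm that both the private and the public key files were created (a quick sanity check; the filenames match the ssh-keygen command above):

      ls -l ~/.ssh/id_rsa_whirr ~/.ssh/id_rsa_whirr.pub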

  7. Download Whirr and set it up using the following commands:

    cd /usr/local
    sudo wget http://apache.claz.org/whirr/stable/whirr-0.8.2.tar.gz
    sudo tar xzf whirr-0.8.2.tar.gz
    sudo mv whirr-0.8.2 whirr
    sudo chown -R hadoopuser:hadoopgroup whirr

    This downloads Whirr into the /usr/local folder, unpacks it, and renames the directory to whirr. For Cygwin, don't run the last (chown) command in the script.
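    To verify that the unpacked distribution works before configuring it, you can print the Whirr version (a quick check, assuming the /usr/local/whirr layout created above):

      cd /usr/local/whirr
      bin/whirr version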

  8. Set up the credentials for Amazon EC2:
    • For Linux and Cygwin:

      sudo cp /usr/local/whirr/conf/credentials.sample /usr/local/whirr/conf/credentials
      sudo nano /usr/local/whirr/conf/credentials

    • Skip the sudo command in Cygwin. Elevated privileges in Windows are usually acquired by right-clicking on the icon and choosing Run as administrator.
    • Edit the /usr/local/whirr/conf/credentials file and change the following lines:

      PROVIDER=aws-ec2
      IDENTITY=<The value from the variable AWSAccessKeyId>
      CREDENTIAL=<The value from the variable AWSSecretKey>

    • By default, Whirr will look for the credentials file in the home directory; if it's not found there, it will look in /usr/local/whirr/conf. I prefer to use the /usr/local/whirr/conf directory to keep everything in the same place.
  9. The first step in simplifying the creation of the cluster is to create a configuration file, which will be named cluster.properties for this example.
    • For Linux:

      sudo nano /usr/local/whirr/conf/cluster.properties

    • For Cygwin:

      nano /usr/local/whirr/conf/cluster.properties

      Add the following lines:

      whirr.cluster-name=demo-cluster
      whirr.instance-templates=1 zookeeper,1 hadoop-jobtracker+hadoop-namenode,1 hadoop-datanode+hadoop-tasktracker
      whirr.provider=aws-ec2
      whirr.private-key-file=${sys:user.home}/.ssh/id_rsa_whirr
      whirr.public-key-file=${sys:user.home}/.ssh/id_rsa_whirr.pub

    This file describes a single cluster with one ZooKeeper node, one Hadoop node running JobTracker and NameNode, and one Hadoop node running DataNode and TaskTracker.
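    The whirr.instance-templates property also controls the cluster size. For example, to run three DataNode/TaskTracker workers instead of one (a hypothetical variation, not used in this demo), the line would become:

      whirr.instance-templates=1 zookeeper,1 hadoop-jobtracker+hadoop-namenode,3 hadoop-datanode+hadoop-tasktracker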

  10. Create our cluster as described in the cluster.properties file:
    • For Linux, first switch to the hadoopuser account:

      su - hadoopuser

    • For Linux and Windows Cygwin:

      cd /usr/local/whirr
      bin/whirr launch-cluster --config conf/cluster.properties

    If you get the error message java.io.FileNotFoundException: whirr.log (Permission denied), the current user does not have permission to write the whirr.log file.
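    One possible fix (a sketch, assuming the hadoopuser account and the /usr/local/whirr layout from the earlier steps) is to give that user ownership of the Whirr directory, so the log file can be written there:

      sudo chown -R hadoopuser:hadoopgroup /usr/local/whirr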

    After a few seconds, the script will start to print status messages and information about what is going to be done. This output is very detailed and is important for troubleshooting and monitoring purposes: every instance has a role and an external and internal IP address, and the ID of every node is in the form <region>/<unique id>.

  11. After creating the cluster, visit https://console.aws.amazon.com/ec2/home?region=us-east-1#s=Instances to see your new cluster. If the cluster was created in another region, switch to the correct region at the top.

  12. When you are finished, destroy the cluster described in the cluster.properties file by running the following commands for Linux and Windows Cygwin:

    cd /usr/local/whirr
    bin/whirr destroy-cluster --config conf/cluster.properties

  13. The directory ~/.whirr/demo-cluster was created when the cluster was launched, and contains information about the cluster in three files:
    • hadoop-proxy.sh: Run this script to create a proxy tunnel so that you can connect to the cluster through SSH (see the usage sketch after this list). Use this example to create a proxy auto-config (PAC) file: https://svn.apache.org/repos/asf/whirr/trunk/resources/hadoop-ec2-proxy.pac.
    • hadoop-site.xml: It contains information about the Hadoop cluster.
    • instances: It contains information about each node instance (location, instance, role(s), external IP address, and internal IP address).
  14. All nodes in the preceding example were created in the same security group, which allows them to talk to each other.
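As a minimal usage sketch (assuming the demo-cluster name used above), you can start the proxy and inspect the node list from these files:

    sh ~/.whirr/demo-cluster/hadoop-proxy.sh
    cat ~/.whirr/demo-cluster/instances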

 

Setting up Accumulo

The easiest way to set up Accumulo on Amazon is to use the Amazon EMR CLI (command-line interface). The single ZooKeeper node that is up and running from the cluster created earlier should be used while setting up Accumulo.
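To find the ZooKeeper node's IP address, you can look it up in the instances file that Whirr wrote for the cluster (a sketch, assuming the demo-cluster name from the previous section):

    grep zookeeper ~/.whirr/demo-cluster/instances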

  1. Browse to the Amazon S3 console at https://console.aws.amazon.com/s3/home?region=us-east-1#, and create a new bucket with a unique name. For this example, the name demo-accumulo will be used.
  2. To create an instance of Accumulo, we use the following command in the Amazon EMR CLI:

    For Linux and Windows:

    elastic-mapreduce --create --alive --name "Accumulo" \
      --bootstrap-action s3://elasticmapreduce/samples/accumulo/accumulo-install.sh \
      --args "<zookeeper ip address>,Demo-Database,DBPassword" \
      --bootstrap-name "install Accumulo" \
      --enable-debugging --log-uri s3://demo-accumulo/Accumulo-logs/ \
      --instance-type m1.large --instance-count 4 --key-pair <Key Pair Name>

    Locate the key pair name at https://console.aws.amazon.com/ec2/home?region=us-east-1#s=KeyPairs.
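    Once the job flow is created, you can check its status and terminate it when you are done (a sketch, assuming the same elastic-mapreduce CLI; the job flow ID is printed by the --create command):

      elastic-mapreduce --list --active
      elastic-mapreduce --terminate --jobflow <jobflow id>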
