Hadoop/HBase was designed to crunch huge amounts of data in batch mode and produce meaningful results from that data.
This article is an excerpt taken from the book ‘HBase High Performance Cookbook’ written by Ruchir Choudhry. This book provides a solid understanding of the HBase basics.
However, as the technology evolved over the years, the original architecture was fine-tuned to move from the world of big iron to cloud infrastructure, which offers:
- Optimum pricing for provisioning new hardware and storage, and for monitoring the infrastructure.
- One-click setup of additional nodes and storage.
- Elastic load-balancing to different clusters within the HBase ecosystem.
- Ability to resize the cluster on-demand.
- Sharing capacity across time zones, for example, running batch jobs in one data center and real-time analytics in another, closer to the customer.
- Easy integration with other Cloud-based services.
- HBase on Amazon EMR provides the ability to back up your HBase data directly to Amazon Simple Storage Service (Amazon S3). You can also restore from a previously created backup when launching an HBase cluster.
Configuring HBase for the cloud
Before we start, let’s take a quick look at the supported versions and the prerequisites you need to move ahead.
The list of supported versions is as follows:
- HBase version 0.94.18: AMI versions 3.1.0 and later
- HBase version 0.94: AMI versions 3.0-3.0.4
When using the AWS CLI, select the matching image with the --ami-version configuration parameter (2.2 or later).
Now let’s look at the prerequisites:
- At least two instances (optional): The cluster’s master node runs the HBase master server and ZooKeeper, and slave nodes run the HBase region servers. For optimum performance, production HBase clusters should run on at least two EC2 instances, but you can run HBase on a single node for evaluation purposes.
- Long-running clusters: HBase only runs on long-running clusters. By default, the CLI and Amazon EMR console create long-running clusters.
- An Amazon EC2 key pair set (Recommended): To use the Secure Shell (SSH) network protocol to connect with the master node and run HBase shell commands, you must use an Amazon EC2 key pair when you create the cluster.
- The correct AMI and Hadoop versions: HBase clusters are currently supported only on Hadoop 0.20.205 or later.
- The AWS CLI: This is needed to interact with HBase using command-line options.
- The Ganglia tool: For monitoring, it’s advisable to use Ganglia; it provides all performance-related information and can be installed as a client library when we create the cluster.
- The HBase logs: They are available on the master node; it’s standard practice in a production environment to copy these logs to Amazon S3.
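The console walkthrough below can also be scripted with the AWS CLI mentioned in the prerequisites. The following is a rough sketch, not the book’s exact command: the cluster name, key-pair name, and instance settings are illustrative assumptions, and it assumes the AMI-version syntax from the versions table above.

```shell
# Hedged sketch: provision a two-node EMR cluster with HBase installed.
# All names and sizes here are illustrative assumptions; adjust for your account.
aws emr create-cluster \
  --name "hbase-cookbook-cluster" \
  --ami-version 3.1.0 \
  --applications Name=HBase \
  --instance-type m1.large \
  --instance-count 2 \
  --ec2-attributes KeyName=hbase03 \
  --use-default-roles
```

The command prints the new cluster ID (for example, j-XXXXXXXXXXXXX), which you will need later for backup and restore operations.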
How to do it
- Open a browser and go to https://console.aws.amazon.com/elasticmapreduce/; if you don’t have an Amazon AWS account, you will need to create one.
- Then choose Create cluster as shown in the following:
- Provide the cluster name and set Launch mode to Cluster.
- Let’s proceed to the software configuration section. There are two options: the Amazon template or the MapR template. We are going to use the Amazon template. It will load the default applications, which include HBase.
- Security is key when you use SSH to log in to the cluster. Let’s create a security key by selecting NETWORK & SECURITY in the left section of the panel (as shown in the following). We have created one named hbase03:
- Once you create this security key, you will be prompted to download a .pem file, in this case hbase03.pem.
- Copy this file to your user directory and restrict its permissions so that only the owner can read it. This ensures the key file is not readable or writable by anyone else; SSH will refuse to use a key with looser permissions.
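Concretely, assuming the key was downloaded as hbase03.pem as described above, the permission change is:

```shell
# Make the private key readable by the owner only;
# ssh refuses to use a key that group or others can read.
chmod 400 hbase03.pem
```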
- Now select this pair from the EC2 Key pair drop-down box; this registers the key with the instance while it is being provisioned. You can do this later too, but I had some challenges doing so, so it is always better to provision the instance with this property set.
- Now you are ready to provision the EMR cluster. Go ahead and provision it; it will take around 10 to 20 minutes for the cluster to be fully accessible and in a running condition.
- Verify it by observing the console:
How it works
When you select the cluster name, it is mapped to your account internally, and that mapping is kept for as long as the cluster is alive (that is, not yet destroyed). When you select an installation template, it loads all the respective JAR files, which allows the cluster to run in a fully distributed environment. You can select the EMR or MapR stable release, which loads compatible libraries and lets us focus on the solution rather than troubleshooting integration issues within the Hadoop/HBase farms. Internally, all the slaves connect to the master, and hence we considered an extra-large VM for it.
Connecting to an HBase cluster using the command line
How to do it
- You can alternatively SSH to the node and see the details as follows:
- Once you have connected to the cluster, you can perform all the tasks which you can perform on local clusters.
The preceding screenshot gives the details of the components we selected while installing the cluster.
- Let’s connect to the HBase shell to make sure all the components are connecting internally and we are able to create a sample table.
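The connection and table-creation steps above can be sketched as follows. The master DNS placeholder and the table and column-family names are illustrative assumptions; the default login user on EMR nodes is hadoop.

```shell
# SSH to the master node using the key pair created earlier;
# the master public DNS is shown on the cluster's console page.
ssh -i hbase03.pem hadoop@<master-public-dns>

# On the master node, open the HBase shell and create a sample table:
hbase shell
#   create 'customers', 'cf'
#   put 'customers', 'row1', 'cf:name', 'Ruchir'
#   scan 'customers'
```

If the scan returns the row you just put, the region servers and master are communicating correctly.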
How it works
The communication between your machine and the HBase cluster works by passing a key every time a command is executed; this keeps the communication private. The shell becomes a remote shell that connects to the HBase master over this private connection. All the standard HBase shell commands, such as put, create, and scan, work just as they do on a local cluster.
Backing up and restoring Hbase
Amazon Elastic MapReduce provides multiple ways to back up and restore HBase data to Amazon S3. It also allows incremental backups; during the backup process, HBase continues to execute write commands, so the cluster keeps working while the backup runs. This does carry a risk of inconsistency in the data: if consistency is of prime importance, writes need to be stopped during the initial backup process and synchronized across nodes. This can be achieved by passing the --consistent parameter when requesting a backup.

When you back up HBase data, you should specify a different backup directory for each cluster. An easy way to do this is to use the cluster identifier as part of the path specified for the backup directory, for example, s3://mybucket/backups/j-3AEXXXXXX16F2. This ensures that any future incremental backups reference the correct HBase cluster.
How to do it
When you are ready to delete old backup files that are no longer needed, we recommend that you first do a full backup of your HBase data. This ensures that all data is preserved and provides a baseline for future incremental backups. Once the full backup is done, you can navigate to the backup location and manually delete the old backup files:
- While creating a cluster, add an additional step scheduling regular backups, as shown in the following.
- You have to specify the backup location where the backup files will be kept, based on the backup frequency selected. For highly valuable data, you can back up on an hourly basis; for less sensitive data, a daily backup can be planned:
- It’s good practice to back up to a separate location in Amazon S3 to ensure that incremental backups are calculated correctly.
- It’s important to specify the exact time from which backups will start; the time zone specified for our cluster is UTC.
- We can proceed with creating the cluster as planned; it will create a backup of the data to the location specified.
- To restore, you have to provide the exact location of the backup file.
- You must also specify the backup version to restore; this determines which saved snapshot of the data is loaded.
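The same backup and restore operations can also be driven from the AWS CLI, which exposes EMR HBase backup subcommands for AMI-version clusters. In the sketch below, the cluster ID matches the example path shown earlier, while the bucket name, schedule, start time, and backup version are illustrative assumptions.

```shell
# One-off consistent backup: writes are queued while nodes synchronize.
aws emr create-hbase-backup --cluster-id j-3AEXXXXXX16F2 \
  --dir s3://mybucket/backups/j-3AEXXXXXX16F2 --consistent

# Schedule a full backup every 7 days, starting at a fixed UTC time:
aws emr schedule-hbase-backup --cluster-id j-3AEXXXXXX16F2 --type full \
  --dir s3://mybucket/backups/j-3AEXXXXXX16F2 \
  --interval 7 --unit days --start-time 2015-06-15T20:00Z --consistent

# Restore a specific backup version into a running cluster:
aws emr restore-from-hbase-backup --cluster-id j-3AEXXXXXX16F2 \
  --dir s3://mybucket/backups/j-3AEXXXXXX16F2 --backup-version 20150615T200000Z
```

Using the cluster identifier in the backup path, as here, keeps each cluster’s incremental backups separate.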
How it works
During the backup process, HBase continues to execute write commands; this ensures the cluster remains available throughout the backup. Internally, the operation runs in parallel, so there is a chance of the backup being inconsistent. If the use case requires consistency, we have to pause writes to HBase. This can be achieved by passing the --consistent parameter when requesting a backup; internally, it queues the writes and executes them as soon as the synchronization completes.
We learned how to configure HBase for the cloud, connect to an HBase cluster using the command line, and back up and restore HBase data.
If you found this post useful, do check out the book ‘HBase High Performance Cookbook’ to learn other concepts such as terminating an HBase cluster, accessing HBase data with Hive, and viewing HBase log files.