In this article by Ruchir Choudhry, the author of the book HBase High Performance Cookbook, we will cover the configuration and deployment of HBase.
(For more resources related to this topic, see here.)
Introduction
HBase is an open source, nonrelational, column-oriented distributed database modeled after Google’s Cloud BigTable and written in Java. It is developed as part of Apache Software Foundation’s Apache Hadoop project, and it runs on top of Hadoop Distributed File System (HDFS), providing BigTable-like capabilities for Hadoop. It’s a column-oriented database, which is empowered by a fault-tolerant distributed file structure knows as HDFS. In addition to this, it also provides advanced features, such as auto sharding, load balancing, in-memory caching, replication, compression, near real-time lookups, strong consistency (using multiversions), block caches, and bloom filters for real-time queries and an array of client APIs.
Throughout the chapter, we will discuss how to effectively set up mid and large size HBase clusters on top of the Hadoop and HDFS framework. This article will help you set up an HBase on a fully distributed cluster.
For the cluster setup, we will consider redhat-6.2 Linux 2.6.32-220.el6.x86_64 #1 SMP Wed Nov 9 08:03:13 EST 2011 x86_64 x86_64 GNU/Linux, which will have six nodes.
Configuration and Deployment
Before we start HBase in a fully distributed mode, we will first be setting up Hadoop-2.4.0 in a distributed mode, and then, on top of a Hadoop cluster, we will set up HBase because it stores data in Hadoop Distributed File System (HDFS).
Check the permissions of the users; HBase must have the ability to create a directory.
Let’s create two directories in which the data for NameNode and DataNode will reside:
drwxrwxr-x 2 app app 4096 Jun 19 22:22 NameNodeData
drwxrwxr-x 2 app app 4096 Jun 19 22:22 DataNodeData
-bash-4.1$ pwd
/u/HbaseB/hadoop-2.4.0
-bash-4.1$ ls -lh
total 60K
drwxr-xr-x 2 app app 4.0K Mar 31 08:49 bin
drwxrwxr-x 2 app app 4.0K Jun 19 22:22 DataNodeData
drwxr-xr-x 3 app app 4.0K Mar 31 08:49 etc
Getting Ready
Following are the steps to install and configure HBase:
- The first step to start is to choose a Hadoop cluster.
- Then, get the hardware details required for it.
- Get the software required to perform the setup.
- Get the OS required to do the setup.
- Perform the configuration steps.
We will require the following components for NameNode:
Components |
Details |
Type of systems |
An operating system |
redhat-6.2 Linux 2.6.32-220.el6.x86_64 #1 SMP Wed Nov 9 08:03:13 EST 2011 x86_64 x86_64 GNU/Linux, or other standard linux kernel. |
|
Hardware/CPUS |
16 to 24 CPUS cores. |
NameNode /Secondry NameNode. |
Hardware/RAM |
64 to 128 GB. In special cases, 128 GB to 512 GB RAM. |
NameNode/Secondry NameNodes. |
Hardware/storage |
Both NameNode servers should have highly reliable storage for their namespace storage and edit log journaling. Typically, hardware RAID and/or reliable network storage are justifiable options. Note that the previous commands including an onsite disk replacement option in your support contract so that a failed RAID disk can be replaced quickly. |
NameNode/Secondry Namenodes. |
RAID: Raid is nothing but a Random Access Inexpensive Drive or Independent Disk; there are many levels of RAID drives, but for Master or NameNode, RAID-1 will be enough.
JBOD: This stands for Just a Bunch of Disk. The design is to have multiple hard drives stacked over each other with no redundancy. The calling software needs to take care of the failure and redundancy. In essence, it works as a single logical volume.
The following screenshot shows the working mechanism of RAID and JBOD:
Before we start for the cluster setup, a quick recap of the Hadoop setup is essential, with brief descriptions.
How to do it…
Let’s create a directory where you will have all the software components to be downloaded:
- For simplicity, let’s take this as /u/HbaseB.
- Create different users for different purposes. The format will be user/group; this is essentially required to differentiate various roles for specific purposes:
- HDFS/Hadoop: This is for the handling of Hadoop-related setups
- Yarn/Hadoop: This is for Yarn-related setups
- HBase/Hadoop
- Pig/Hadoop
- Hive/Hadoop
- Zookeeper/Hadoop
- HCat/Hadoop
- Set up directories for the Hadoop cluster: let’s assume /u as a shared mount point; we can create specific directories, which will be used for specific purposes:
-bash-4.1$ ls -ltr total 32 drwxr-xr-x 9 app app 4096 Oct 7 2013 hadoop-2.2.0 drwxr-xr-x 10 app app 4096 Feb 20 10:58 zookeeper-3.4.6 drwxr-xr-x 15 app app 4096 Apr 5 08:44 pig-0.12.1 drwxrwxr-x 7 app app 4096 Jun 30 00:57 hbase-0.98.3-hadoop2 drwxrwxr-x 8 app app 4096 Jun 30 00:59 apache-hive-0.13.1-bin
drwxrwxr-x 7 app app 4096 Jun 30 01:04 mahout-distribution-0.9Make sure that you have adequate privileges in the folder to add, edit, and execute a command. Also, you must set up password-less communication between different machines, such as from the name node to DataNode and from HBase Master to all the region server nodes. Refer to this webpage to learn how to do this: http://www.debian-administration.org/article/152/Password-less_logins_with_OpenSSH.
- Here, we will list the procedure to achieve the end result of the recipe. This section will follow a numbered bullet form. We do not need to explain the reason we are following a procedure. Numbered single sentences will do fine.
- Let’s assume there is a /u directory and you have downloaded the entire stack of software from /u/HbaseB/hadoop-2.2.0/etc/hadoop/; look for the core-site.xml file. Place the following lines in this file:
configuration> <property> <name>fs.default.name</name> <value>hdfs://mynamenode-hadoop:9001</value> <description>The name of the default file system. </description> </property> </configuration>
You can specify a port that you want to use; it should not clash with the ports that are already in use by the system for various purposes. A quick look at this link can provide more specific details about this; complete detail on this topic is out of the scope of this book. You can refer to http://en.wikipedia.org/wiki/List_of_TCP_and_UDP_port_numbers.
- Save the file. This helps us create a master/NameNode directory.
- Now let’s move on to set up secondary nodes. Edit /u/HbaseB/hadoop-2.4.0/etc/hadoop/ and look for the core-site.xml file:
<configuration> <property> <name>fs.checkpoint.dir</name> <value>/u/dn001/hadoop/hdf/secdn /u/dn002/hadoop/hdfs/secdn </value> <description>A comma separated list of paths. Use the list of directories from $FS_CHECKPOINT_DIR. example, /u/dn001/hadoop/hdf/secdn,/u/dn002/hadoop/hdfs/secd n </description> </property> </configuration>
The separation of the directory structure is for the purpose of the clean separation of the hdfs block separation and to keep the configurations as simple as possible. This also allows us to do proper maintenance.
- Now let’s move toward changing the setup for hdfs; the file location will be /u/HbaseB/hadoop-2.4.0/etc/hadoop/hdfs-site.xmlfor NameNode:
<property> <name>dfs.name.dir</name> <value> /u/nn01/hadoop/hdfs/nn/u/nn02/hadoop/hdfs/nn </value> <description> Comma separated list of path, Use the list of directories </description> </property> for DataNode: <property> <name>dfs.data.dir</name> <value>/u/dnn01/hadoop/hdfs/dn,/u/dnn02/hadoop/hdfs/dn </value> <description>Comma separated list of path, Use the list of directories </description> </property>
- Now let’s go for NameNode for the HTTP address or to NameNode using the HTTP protocol:
<property> <name>dfs.http.address</name> <value>namenode.full.hostname:50070</value> <description>Enter your NameNode hostname for http access. </description> </property>
- The HTTP address for the secondary NameNode is as follows:
<property> <name>dfs.secondary.http.address</name> <value> secondary.namenode.full.hostname:50090 </value> <description> Enter your Secondary NameNode hostname. </description> </property>
We can go for an HTTPS setup for NameNode as well, but let’s keep this optional for now:
- Now let’s look for the Yarn setup in the /u/HbaseB/ hadoop-2.2.0/etc/hadoop/ yarn-site.xml file:
- For the resource tracker that’s a part of the Yarn resource manager, execute the following code:
<property> <name>yarn.resourcemanager.resourcetracker.address</name> <value>yarnresourcemanager.full.hostname:8025</value> <description>Enter your yarn Resource Manager hostname.</description> </property>
- For the resource schedule that’s part of the Yarn resource scheduler, execute the following code:
<property> <name>yarn.resourcemanager.scheduler.address</name> <value>resourcemanager.full.hostname:8030</value> <description>Enter your ResourceManager hostname</description> </property>
- For scheduler address, execute the following code:
<property> <name>yarn.resourcemanager.address</name> <value>resourcemanager.full.hostname:8050</value> <description>Enter your ResourceManager hostname.</description> </property>
- For scheduler admin address, execute the following code:
<property> <name>yarn.resourcemanager.admin.address</name> <value>resourcemanager.full.hostname:8041</value> <description>Enter your ResourceManager hostname.</description> </property>
- To set up the local directory, execute the following code:
<property> <name>yarn.nodemanager.local-dirs</name> <value>/u/dnn01/hadoop/hdfs /yarn,/u/dnn02/hadoop/hdfs/yarn </value> <description>Comma separated list of paths. Use the list of directories from,.</description> </property>
- To set up the log location, execute the following code:
<property> <name>yarn.nodemanager.logdirs</name> <value>/u/var/log/hadoop/yarn</value> <description>Use the list of directories from $YARN_LOG_DIR. <description> </property>
This completes the configuration changes required for Yarn
- For the resource tracker that’s a part of the Yarn resource manager, execute the following code:
- Now let’s make the changes for MapReduce. Open /u/HbaseB/ hadoop-2.2.0/etc/hadoop/mapred-site.xml.
- Now let’s place this configuration setup in mapred-site.xml and place this between <configuration></configuration>:
<property> <name>mapreduce.jobhistory.address</name> <value>jobhistoryserver.full.hostname:10020</value> <description>Enter your JobHistoryServer hostname.</description> </property>
- Once we have configured MapReduce, we can move on to configuring HBase. Let’s go to the /u/HbaseB/hbase-0.98.3-hadoop2/conf path and open the hbase-site.xml file.
- You will see a template that has <configuration></configurations>.
- We need to add the following lines between the starting and ending tags:
<property> <name>hbase.rootdir</name> <value>hdfs://hbase.namenode.full.hostname:8020/apps/hbase/data</value> <description> Enter the HBase NameNode server hostname</description> </property> <property> <!—this id for binding address --> <name>hbase.master.info.bindAddress</name> <value>$hbase.master.full.hostname</value> <description>Enter the HBase Master server hostname</description> </property>
This competes the HBase changes.
- ZooKeeper: Now let’s focus on the setup of ZooKeeper. In distributed a environment, let’s go to /u/HbaseB/zookeeper-3.4.6/conf locations, rename zoo_sample.cfg to zoo.cfg, and place the details as follows:
yourzooKeeperserver.1=zoo1:2888:3888 yourZooKeeperserver.2=zoo2:2888:3888
If you want to test this setup locally, use different port combinations.
Atomic broadcasting is an atomic messaging system that keeps all the servers in sync and provides reliable delivery, total orders, casual orders, and so on.
- Region servers: Before concluding, let’s go to the region server setup process. Go to the /u/HbaseB/hbase-0.98.3-hadoop2/conf folder and edit the regionserver file. Specify the region servers accordingly:
RegionServer1 RegionServer2 RegionServer3 RegionServer4
- Copy all the configuration files of Hbase and ZooKeeper to the relative host dedicated for Hbase and ZooKeeper.
- Let’s quickly validate the setup that we worked on:
Sudo su $HDFS_USER /u/HbaseB/hadoop-2.2.0/bin/hadoop namenode -format /u/HbaseB/hadoop-2.4.0/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR start namenode
- Now let’s go to the secondary nodes:
Sudo su $HDFS_USER /u/HbaseB/hadoop-2.2.0/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR start secondarynamenode
- Now let’s perform all the steps for DataNode:
Sudo su $HDFS_USER /u/HbaseB/hadoop-2.2.0/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR start datanode Test 01> See if you can reach from your browser http://namenode.full.hostname:50070 Test 02> sudo su $HDFS_USER /u/HbaseB/hadoop-2.2.0/sbin/hadoop dfs -copyFromLocal /tmp/hello.txt /u/HbaseB/hadoop-2.2.0/sbin/hadoop dfs –ls you must see hello.txt once the command executes. Test 03> Browse http://datanode.full.hostname:50075/browseDirectory.jsp?namenodeInfoPort=50070&dir=/&nnaddr=$datanode.full.hostname:8020 you should see the details on the datanode.
- Validate the Yarn and MapReduce setup by following these steps:
- Execute the command from Resource Manager:
<login as $YARN_USER and source the directories.sh companion script> /u/HbaseB/hadoop-2.2.0/sbin /yarn-daemon.sh --config $HADOOP_CONF_DIR start resourcemanager
- Execute the command from Node Manager
<login as $YARN_USER and source the directories.sh companion script> /usr/lib/hadoop-yarn/sbin/yarn-daemon.sh --config $HADOOP_CONF_DIR start nodemanager
- Execute the following commands:
hadoop fs -mkdir /app-logs hadoop fs -chown $YARN_USER /app-logs hadoop fs -chmod 1777 /app-logs Execute MapReduce Sudo su $HDFS_USER /u/HbaseB/hadoop-2.2.0/sbin/hadoop fs -mkdir -p /mapred/history/done_intermediate /u/HbaseB/hadoop-2.2.0/sbin/hadoop fs -chmod -R 1777 /mapred/history/done_intermediate /u/HbaseB/hadoop-2.2.0/sbin/hadoop fs -mkdir -p /mapred/history/done /u/HbaseB/hadoop-2.2.0/sbin/hadoop fs -chmod -R 1777 /mapred/history/done /u/HbaseB/hadoop-2.2.0/sbin/hadoop fs -chown -R mapred /mapred export HADOOP_LIBEXEC_DIR=/u/HbaseB/hadoop-2.2.0/libexec/ export HADOOP_MAPRED_HOME=/=/u/HbaseB/hadoop-2.2.0/hadoop-mapreduce
export HADOOP_MAPRED_LOG_DIR==/u/HbaseB/hadoop-2.2.0//mapred - Start the jobhistory servers:
<login as $MAPRED_USER and source the directories.sh companion script> /u/HbaseB/hadoop-2.2.0/sbin/mr-jobhistory-daemon.sh start historyserver --config $HADOOP_CONF_DIR Test 01: from the browser or from curl use the link to browse. http://resourcemanager.full.hostname:8088/ Test 02: Sudo su $HDFS_USER /u/HbaseB/hadoop-2.2.0/bin/hadoop jar /u/HbaseB/hadoop-2.2.0/hadoop-mapreduce/hadoop-mapreduce-examples-2.0.2.1-alpha.jar teragen 100 /test/10gsort/input /u/HbaseB/hadoop-2.2.0/bin/hadoop jar /u/HbaseB/hadoop-2.2.0/hadoop-mapreduce/hadoop-mapreduce-examples-2.0.2.1-alpha.jar
- Execute the command from Resource Manager:
- Validate the HBase setup:
- Login as $HDFS_USER
/u/HbaseB/hadoop-2.2.0/bin/hadoop fs –mkdir /apps/hbase /u/HbaseB/hadoop-2.2.0/bin/hadoop fs –chown –R /apps/hbase
- Now login as $HBASE_USER
/u/HbaseB/hbase-0.98.3-hadoop2/bin/hbas-daemon.sh –-config $HBASE_CONF_DIR start master this will start the master node
- Now let’s move to HBase Region server nodes:
/u/HbaseB/hbase-0.98.3-hadoop2/bin/hbase-daemon.sh –config $HBASE_CONF_DIR start regionservers this will start the regionservers
For single machine direct sudo ./hbase master start can also be used. Please check the logs in case of any logs.
Now lets login using
Sudo su- $HBASE_USER ./hbase shell
will connect us to the hbase to the master.
- Login as $HDFS_USER
- Validate the ZooKeeper setup:
-bash-4.1$ sudo ./zkServer.sh start JMX enabled by default Using config: /u/HbaseB/zookeeper-3.4.6/bin/../conf/zoo.cfg Starting zookeeper ... STARTED You can also pipe the log to the ZooKeeper logs. /u/logs//u/HbaseB/zookeeper-3.4.6/zoo.out 2>&1
Summary
In this article, we learned how to configure and set up HBase. We set up HBase to store data in Hadoop Distributed File System.
We explored the working structure of RAID and JBOD and the differences between both filesystems.
Resources for Article:
Further resources on this subject:
- Understanding the HBase Ecosystem[article]
- The HBase’s Data Storage[article]
- HBase Administration, Performance Tuning[article]