Configuring and deploying HBase [Tutorial]

20 min read

HBase is inspired by the Google big table architecture, and is fundamentally a non-relational, open source, and column-oriented distributed NoSQL. Written in Java, it is designed and developed by many engineers under the framework of Apache Software Foundation. Architecturally it sits on Apache Hadoop and runs by using Hadoop Distributed File System (HDFS) as its foundation.

It is a column-oriented database, empowered by a fault-tolerant distributed file structure known as HDFS. In addition to this, it also provides very advanced features, such as auto sharding, load-balancing, in-memory caching, replication, compression, near real-time lookups, strong consistency (using multi-version). It uses the latest concepts of block cache and bloom filter to provide faster response to online/real-time request. It supports multiple clients running on heterogeneous platforms by providing user-friendly APIs.

In this tutorial, we will discuss how to effectively set up mid and large size HBase cluster on top of Hadoop/HDFS framework.

We will also help you set up HBase on a fully distributed cluster. For cluster setup, we will consider REH (RedHat Enterprise-6.2 Linux 64 bit); for the setup we will be using six nodes.

This article is an excerpt taken from the book ‘HBase High Performance Cookbook’ written by Ruchir Choudhry. This book provides a solid understanding of the HBase basics.

Let’s get started!

Configuring and deploying Hbase

Before we start HBase in fully distributed mode, we will be setting up first Hadoop-2.2.0 in a distributed mode, and then on top of Hadoop cluster we will set up HBase because HBase stores data in HDFS.

Getting Ready

The first step will be to create a directory at user/u/HBase B and download the tar file from the location given later. The location can be local, mount points or in cloud environments; it can be block storage:

wget wget –b http://apache.mirrors.pair.com/hadoop/common/hadoop-2.2.0/hadoop-2.2.0.tar.gz

This –b option will download the tar file as a background process. The output will be piped to wget-log. You can tail this log file using tail -200f wget-log.

Untar it using the following commands:

tar -xzvf hadoop-2.2.0.tar.gz

This is used to untar the file in a folder hadoop-2.2.0 in your current diectory location.

Once the untar process is done, for clarity it’s recommended use two different folders one for NameNode and other for DataNode.

I am assuming app is a user and app is a group on a Linux platform which has access to read/write/execute access to the locations, if not please create a user app and group app if you have sudo su – or root/admin access, in case you don’t have please ask your administrator to create this user and group for you in all the nodes and directorates you will be accessing.

To keep the NameNodeData and the DataNodeData for clarity let’s create two folders by using the following command, inside /u/HBase B:

Mkdir NameNodeData DataNodeData

NameNodeData will have the data which is used by the name nodes and DataNodeData will have the data which will be used by the data nodes:

ls –ltr will show the below results.

drwxrwxr-x 2 app app  4096 Jun 19 22:22 NameNodeData

drwxrwxr-x 2 app app  4096 Jun 19 22:22 DataNodeData

-bash-4.1$ pwd

/u/HBase B/hadoop-2.2.0

-bash-4.1$ ls -ltr

total 60K

drwxr-xr-x 2 app app 4.0K Mar 31 08:49 bin

drwxrwxr-x 2 app app 4.0K Jun 19 22:22 DataNodeData

drwxr-xr-x 3 app app 4.0K Mar 31 08:49 etc

The steps in choosing Hadoop cluster are:

Hardware details required for it

Software required to do the setup

OS required to do the setup

Configuration steps

HDFS core architecture is based on master/slave, where an HDFS cluster comprises of solo NameNode, which is essentially used as a master node, and owns the accountability for that orchestrating, handling the file system, namespace, and controling access to files by client. It performs this task by storing all the modifications to the underlying file system and propagates these changes as logs, appends to the native file system files, and edits. SecondaryNameNode is designed to merge the fsimage and the edits log files regularly and controls the size of edit logs to an acceptable limit.

In a true cluster/distributed environment, it runs on a different machine. It works as a checkpoint in HDFS.

We will require the following for the NameNode:

Components

Details

Used for nodes/systems

Operating System

Redhat-6.2 Linux x86_64 GNU/Linux, or other standard linux kernel.

All the setup for Hadoop/HBase and other components used

Hardware /CPUS

16 to 32 CPU cores

NameNode/Secondary NameNode

2 quad-hex-/octo-core CPU

DataNodes

Hardware/RAM

128 to 256 GB, In special caes 128 GB to 512 GB RAM

NameNode/Secondary NameNodes

128 GB -512 GB of RAM

DataNodes

Hardware/storage

It’s pivotal to have NameNode server on robust and reliable storage platform as it responsible for many key activities like edit-log journaling. As the importance of these machines are very high and the NameNodes plays a central role in orchestrating everything,thus RAID or any robust storage device is acceptable.

NameNode/Secondary Namenodes

2 to 4 TB hard disk in a JBOD

DataNodes

RAID is nothing but a random access inexpensive drive or independent disk. There are many levels of RAID drives, but for master or a NameNode, RAID 1 will be enough.

JBOD stands for Just a bunch of Disk. The design is to have multiple hard drives stacked over each other with no redundancy. The calling software needs to take care of the failure and redundancy. In essence, it works as a single logical volume:

Before we start for the cluster setup, a quick recap of the Hadoop setup is essential with brief descriptions.

How to do it

Let’s create a directory where you will have all the software components to be downloaded:

For the simplicity, let’s take it as /u/HBase B.

Create different users for different purposes.

The format will be as follows user/group, this is essentially required to differentiate different roles for specific purposes:

Hdfs/hadoop is for handling Hadoop-related setup

Yarn/hadoop is for yarn related setup

HBase /hadoop

Pig/hadoop

Hive/hadoop

Zookeeper/hadoop

Hcat/hadoop

Set up directories for Hadoop cluster. Let’s assume /u as a shared mount point. We can create specific directories that will be used for specific purposes.

Please make sure that you have adequate privileges on the folder to add, edit, and execute commands. Also, you must set up password less communication between different machines like from name node to the data node and from HBase master to all the region server nodes.

Once the earlier-mentioned structure is created; we can download the tar files from the following locations:

-bash-4.1$ ls -ltr

total 32

drwxr-xr-x  9 app app 4096 hadoop-2.2.0

drwxr-xr-x 10 app app 4096 zookeeper-3.4.6

drwxr-xr-x 15 app app 4096 pig-0.12.1

drwxrwxr-x  7 app app 4096 HBase -0.98.3-hadoop2

drwxrwxr-x  8 app app 4096 apache-hive-0.13.1-bin

drwxrwxr-x  7 app app 4096 Jun 30 01:04 mahout-distribution-0.9

You can download these tar files from the following location:

wget –o https://archive.apache.org/dist/HBase /HBase -0.98.3/HBase -0.98.3-hadoop1-bin.tar.gz

wget -o https://www.apache.org/dist/zookeeper/zookeeper-3.4.6/zookeeper-3.4.6.tar.gz

wget –o https://archive.apache.org/dist/mahout/0.9/mahout-distribution-0.9.tar.gz

wget –o https://archive.apache.org/dist/hive/hive-0.13.1/apache-hive-0.13.1-bin.tar.gz

wget -o https://archive.apache.org/dist/pig/pig-0.12.1/pig-0.12.1.tar.gz

Here, we will list the procedure to achieve the end result of the recipe. This section will follow a numbered bullet form. We do not need to give the reason that we are following a procedure. Numbered single sentences would do fine.

Let’s assume that there is a /u directory and you have downloaded the entire stack of software from: /u/HBase B/hadoop-2.2.0/etc/hadoop/ and look for the file core-site.xml.

Place the following lines in this configuration file:

<configuration>

<property>

   <name>fs.default.name</name>

   <value>hdfs://addressofbsdnsofmynamenode-hadoop:9001</value>

</property>

</configuration>

You can specify a port that you want to use, and it should not clash with the ports that are already in use by the system for various purposes.

Save the file. This helps us create a master /NameNode.

Now, let’s move to set up SecondryNodes, let’s edit /u/HBase B/hadoop-2.2.0/etc/hadoop/ and look for the file core-site.xml:

<property>

 <name>fs.defaultFS</name>

 <value>hdfs://custome location of your hdfs</value>

</property>

<configuration>

<property>       

   <name>fs.checkpoint.dir</name>       

   <value>/u/HBase B/dn001/hadoop/hdf/secdn

       /u/HBase B/dn002/hadoop/hdfs/secdn

</value>   

</property>

</configuration>

The separation of the directory structure is for the purpose of a clean separation of the HDFS block separation and to keep the configurations as simple as possible. This also allows us to do a proper maintenance.

Now, let’s move towards changing the setup for hdfs; the file location will be /u/HBase B/hadoop-2.2.0/etc/hadoop/hdfs-site.xml.

Add these properties in hdfs-site.xml:

For NameNode:

<property>         

<name>dfs.name.dir</name>         

<value>

/u/HBase B/nn01/hadoop/hdfs/nn,/u/HBase B/nn02/hadoop/hdfs/nn

</value>     

</property>

For DataNode:

<property>         

<name>dfs.data.dir</name>         

<value>

/u/HBase B/dnn01/hadoop/hdfs/dn,/HBase B/u/dnn02/hadoop/hdfs/dn

</value>

</property>

Now, let’s go for NameNode for http address or to access using http protocol:

<property>

<name>dfs.http.address</name>

<value>yournamenode.full.hostname:50070</value>

</property>

<property>

<name>dfs.secondary.http.address</name>

<value>

secondary.yournamenode.full.hostname:50090

</value>     

</property>

We can go for the https setup for the NameNode too, but let’s keep it optional for now:

Let’s set up the yarn resource manager:

Let’s look for Yarn setup:

/u/HBase B/hadoop-2.2.0/etc/hadoop/ yarn-site.xml

For resource tracker a part of yarn resource manager:

<property>

 <name>yarn.yourresourcemanager.resourcetracker.address</name>

<value>youryarnresourcemanager.full.hostname:8025</value>

</property>

For resource schedule part of yarn resource scheduler:

<property>

<name>yarn.yourresourcemanager.scheduler.address</name>

<value>yourresourcemanager.full.hostname:8030</value>

</property>

For scheduler address:

<property>

<name>yarn.yourresourcemanager.address</name>

<value>yourresourcemanager.full.hostname:8050</value>

</property>

For scheduler admin address:

<property>

<name>yarn.yourresourcemanager.admin.address</name>

<value>yourresourcemanager.full.hostname:8041</value>

</property>

To set up a local dir:

<property>         <name>yarn.yournodemanager.local-dirs</name>         <value>/u/HBase /dnn01/hadoop/hdfs /yarn,/u/HBase B/dnn02/hadoop/hdfs/yarn </value>    </property>

To set up a log location:

<property>

<name>

yarn.yournodemanager.logdirs

</name>         

<value>/u/HBase B/var/log/hadoop/yarn</value>

</property>

This completes the configuration changes required for Yarn.

Now, let’s make the changes for Map reduce:

Let’s open the mapred-site.xml:

/u/HBase B/hadoop-2.2.0/etc/hadoop/mapred-site.xml

Now, let’s place this property configuration setup in the mapred-site.xml and place it between the following:

<configuration >

</configurations >

<property><name>mapreduce.yourjobhistory.address</name>

<value>yourjobhistoryserver.full.hostname:10020</value>

</property>

Once we have configured Map reduce job history details, we can move on to configure HBase .

Let’s go to this path /u/HBase B/HBase -0.98.3-hadoop2/conf and open HBase -site.xml.

You will see a template having the following:

<configuration >

</configurations >

We need to add the following lines between the starting and ending tags:

<property>

<name>HBase .rootdir</name>

<value>hdfs://HBase .yournamenode.full.hostname:8020/apps/HBase /data

</value>

</property>

<property>

<name>HBase .yourmaster.info.bindAddress</name>

<value>$HBase .yourmaster.full.hostname</value>

</property>

This competes the HBase changes.

ZooKeeper: Now, let’s focus on the setup of ZooKeeper. In distributed env, let’s go to this location and rename the zoo_sample.cfg to zoo.cfg:

/u/HBase B/zookeeper-3.4.6/conf

Open zoo.cfg by vi zoo.cfg and place the details as follows; this will create two instances of zookeeper on different ports:

yourzooKeeperserver.1=zoo1:2888:3888

yourZooKeeperserver.2=zoo2:2888:3888

If you want to test this setup locally, please use different port combinations. In a production-like setup as mentioned earlier, yourzooKeeperserver.1=zoo1:2888:3888 is server.id=host:port:port:

yourzooKeeperserver.1= server.id

zoo1=host

2888=port

3888=port

Atomic broadcasting is an atomic messaging system that keeps all the servers in sync and provides reliable delivery, total order, casual order, and so on.

Region servers: Before concluding it, let’s go through the region server setup process.

Go to this folder /u/HBase B/HBase -0.98.3-hadoop2/conf and edit the regionserver file.

Specify the region servers accordingly:

RegionServer1

RegionServer2

RegionServer3

RegionServer4

RegionServer1 equal to the IP or fully qualified CNAME of 1 Region server.

You can have as many region servers (1. N=4 in our case), but its CNAME and mapping in the region server file need to be different.

Copy all the configuration files of HBase and ZooKeeper to the relative host dedicated for HBase and ZooKeeper. As the setup is in a fully distributed cluster mode, we will be using a different host for HBase and its components and a dedicated host for ZooKeeper.

Next, we validate the setup we’ve worked on by adding the following to the bashrc, this will make sure later we are able to configure the NameNode as expected:

It preferred to use it in your profile, essentially /etc/profile; this will make sure the shell which is used is only impacted.

Now let’s format NameNode:

Sudo su $HDFS_USER

/u/HBase B/hadoop-2.2.0/bin/hadoop namenode -format

HDFS is implemented on the existing local file system of your cluster. When you want to start the Hadoop setup first time you need to start with a clean slate and hence any existing data needs to be formatted and erased.

Before formatting we need to take care of the following.

Check whether there is a Hadoop cluster running and using the same HDFS; if it’s done accidentally all the data will be lost.

/u/HBase B/hadoop-2.2.0/sbin/hadoop-daemon.sh --config

$HADOOP_CONF_DIR start namenode

Now let’s go to the SecondryNodes:

Sudo su $HDFS_USER

/u/HBase B/hadoop-2.2.0/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR start secondarynamenode

Repeating the same procedure in DataNode:

Sudo su $HDFS_USER

/u/HBase B/hadoop-2.2.0/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR start datanode

Test 01>

See if you can reach from your browser http://namenode.full.hostname:50070:

Test 02> sudo su $HDFS_USER touch /tmp/hello.txt

Now, hello.txt file will be created in tmp location:

/u/HBase B/hadoop-2.2.0/bin/hadoop dfs  -mkdir -p /app

/u/HBase B/hadoop-2.2.0/bin/hadoop dfs  -mkdir -p /app/apphduser

This will create a specific directory for this application user in the HDFS FileSystem location(/app/apphduser)

/u/HBase B/hadoop-2.2.0/bin/hadoop dfs -copyFromLocal /tmp/hello.txt /app/apphduser

/u/HBase B/hadoop-2.2.0/bin/hadoop dfs –ls /app/apphduser

apphduser is a dirctory which is created in hdfs for a specific user.

So that the data is sepreated based on the users, in a true production env many users will be using it.

You can also use hdfs dfs –ls / commands if it shows hadoop command as depricated.

You must see hello.txt once the command executes:

Test 03> Browse http://datanode.full.hostname:50075/browseDirectory.jsp?namenodeInfoPort=50070&dir=/&nnaddr=$datanode.full.hostname:8020

It is important to change the data host name and other parameters accordingly.

You should see the details on the DataNode. Once you hit the preceding URL you will get the following screenshot:

On the command line it will be as follows:

Validate Yarn/MapReduce setup and execute this command from the resource manager:

<login as $YARN_USER> /u/HBase B/hadoop-2.2.0/sbin/yarn-daemon.sh --config $HADOOP_CONF_DIR start resourcemanager

Execute the following command from NodeManager:

<login as $YARN_USER >

/u/HBase B/hadoop-2.2.0/sbin /yarn-daemon.sh --config

$HADOOP_CONF_DIR start nodemanager

Executing the following commands will create the directories in the hdfs and apply the respective access rights:

Cd u/HBase B/hadoop-2.2.0/bin

hadoop fs -mkdir /app-logs // creates the dir in HDFS

hadoop fs -chown $YARN_USER /app-logs //changes the ownership

hadoop fs -chmod 1777 /app-logs // explained in the note section

Execute MapReduce

Start jobhistory servers:

<login as $MAPRED_USER>

/u/HBase B/hadoop-2.2.0/sbin/mr-jobhistory-daemon.sh start historyserver --config $HADOOP_CONF_DIR

Let’s have a few tests to be sure we have configured properly:

Test 01: From the browser or from curl use the link to browse: http://yourresourcemanager.full.hostname:8088/.

Test 02:

Sudo su $HDFS_USER

/u/HBase B/hadoop-2.2.0/bin/hadoop jar /u/HBase B/hadoop-2.2.0/hadoop-mapreduce/hadoop-mapreduce-examples-2.0.2.1-alpha.jar teragen 100 /test/10gsort/input

/u/HBase B/hadoop-2.2.0/bin/hadoop jar /u/HBase B/hadoop-2.2.0/hadoop-mapreduce/hadoop-mapreduce-examples-2.0.2.1-alpha.jar

Validate the HBase setup:

Login as $HDFS_USER

/u/HBase B/hadoop-2.2.0/bin/hadoop fs –mkdir -p /apps/HBase

/u/HBase B/hadoop-2.2.0/bin/hadoop fs –chown app:app –R  /apps/HBase

Now login as $HBase _USER:

/u/HBase B/HBase -0.98.3-hadoop2/bin/HBase -daemon.sh –-config $HBase _CONF_DIR start master

This command will start the master node. Now let’s move to HBase Region server nodes:

/u/HBase B/HBase -0.98.3-hadoop2/bin/HBase -daemon.sh –-config $HBase _CONF_DIR start regionserver

This command will start the regionservers:

For a single machine, direct sudo ./HBase master start can also be used.

Please check the logs in case of any logs at this location /opt/HBase B/HBase -0.98.5-hadoop2/logs.

You can check the log files and check for any errors:

Now let’s login using:

Sudo su- $HBase _USER

/u/HBase B/HBase -0.98.3-hadoop2/bin/HBase shell

We will connect HBase to the master.

Validate the ZooKeeper setup. If you want to use an external zookeeper, make sure there is no internal HBase based zookeeper running while working with the external zookeeper or existing zookeeper and is not managed by HBase :

For this you have to edit /opt/HBase B/HBase -0.98.5-hadoop2/conf/ HBase -env.sh.

Change the following statement (HBase _MANAGES_ZK=false):

# Tell HBase whether it should manage its own instance of Zookeeper or not.

export HBase _MANAGES_ZK=true.

Once this is done we can add zoo.cfg to HBase ‘s CLASSPATH.

HBase looks into zoo.cfg as a default lookup for configurations

dataDir=/opt/HBase B/zookeeper-3.4.6/zooData

# this is the place where the zooData will be present

server.1=172.28.182.45:2888:3888

# IP and port for server 01

server.2=172.29.75.37:4888:5888

# IP and port for server 02

You can edit the log4j.properties file which is located at /opt/HBase B/zookeeper-3.4.6/conf and point the location where you want to keep the logs.

# Define some default values that can be overridden by system properties:

zookeeper.root.logger=INFO, CONSOLE

zookeeper.console.threshold=INFO

zookeeper.log.dir=.

zookeeper.log.file=zookeeper.log

zookeeper.log.threshold=DEBUG

zookeeper.tracelog.dir=. # you can specify the location here

zookeeper.tracelog.file=zookeeper_trace.log

Once this is done you start zookeeper with the following command:

-bash-4.1$ sudo /u/HBase B/zookeeper-3.4.6/bin/zkServer.sh start

Starting zookeeper ... STARTED

You can also pipe the log to the ZooKeeper logs:

/u/logs//u/HBase B/zookeeper-3.4.6/zoo.out 2>&1

2 : refers to the second file descriptor for the process, that is stderr.

> : means re-direct

&1:  means the target of the rediretion should be the same location as the first file descriptor i.e stdout

How it works

Sizing of the environment is very critical for the success of any project, and it’s a very complex task to optimize it to the needs.

We dissect it into two parts, master and slave setup. We can divide it in the following parts:

Master-NameNode

Master-Secondary NameNode

Master-Jobtracker

Master-Yarn Resource Manager

Master-HBase Master

Slave-DataNode

Slave-Map Reduce Tasktracker

Slave-Yarn Node Manager

Slave-HBase Region server

NameNode: The architecture of Hadoop provides us a capability to set up a fully fault tolerant/high availability Hadoop/HBase cluster. In doing so, it requires a master and slave setup. In a fully HA setup, nodes are configured in active passive way; one node is always active at any given point of time and the other node remains as passive.

Active node is the one interacting with the clients and works as a coordinator to the clients. The other standby node keeps itself synchronized with the active node and to keep the state intact and live, so that in case of failover it is ready to take the load without any downtime.

Now we have to make sure that when the passive node comes up in the event of a failure, the passive node is in perfect sync with the active node, which is currently taking the traffic. This is done by Journal Nodes(JNs), these Journal Nodes use daemon threads to keep the primary and sercodry in perfect sync.

Journal Node: By design, JournalNodes will only have single NameNode acting as a active/primary to be a writer at a time. In case of failure of the active/primary, the passive NameNode immediately takes the charge and transforms itself as active, this essentially means this newly active node starts writing to Journal Nodes. Thus it totally avoids the other NameNode to stay in active state, this also acknowledges that the newly active node work as a fail over node.

JobTracker: This is an integral part of Hadoop EcoSystem. It works as a service which farms MapReduce task to specific nodes in the cluster.

ResourceManager (RM): This responsibility is limited to scheduling, that is, only mediating available resources in the system between different needs for the application like registering new nodes, retiring dead nodes, it dose it by constantly monitoring the heartbeats based on the internal configuration. Due to this core design practice of explicit separation of responsibilities and clear orchestrations of modularity and with the inbuilt and robust scheduler API, This allows the resource manager to scale and support different design needs at one end, and on the other, it allows us to cater to different programming models.

HBase Master: The Master server is the main orchestrator for all the region servers in the HBase cluster . Usually, it’s placed on the ZooKeeper nodes. In a real cluster configuration, you will have 5 to 6 nodes of Zookeeper.

DataNode: It’s a real workhorse and does most of the heavy lifting; it runs the MapReduce Job and stores the chunks of HDFS data. The core objective of the data node was to be available on the commodity hardware and should be agnostic to the failures.

It keeps some data of HDFS, and the multiple copy of the same data is sprinkled around the cluster. This makes the DataNode architecture fully fault tolerant. This is the reason a data node can have JBOD01 rather rely on the expensive RAID02.

MapReduce: Jobs are run on these DataNodes in parallel as a subtask. These subtasks provides the consistent data across the cluster and stays consistent.

So we learned about the HBase basics and how to configure and set it up. We set up HBase to store data in Hadoop Distributed File System. We also explored the working structure of RAID and JBOD and the differences between both filesystems.

If you found this post useful, be sure to check out the book ‘HBase High Perforamnce Cookbook’ to learn more about configuring HBase in terms of administering and managing clusters as well as other concepts in HBase.