This article, written by Mark Kerzner and Sujee Maniyam, the authors of HBase Design Patterns, covers how to write high-performance and scalable HBase applications.
In particular, we will take a look at the following topics:
- The bulk loading of data into HBase
- Profiling HBase applications
- Tips to get good performance on writes
- Tips to get good performance on reads
Loading bulk data into HBase
When deploying HBase for the first time, we usually need to import a significant amount of data. This is called initial loading or bootstrapping. There are three methods that can be used to import data into HBase, given as follows:
- Using the Java API to insert data into HBase. This can be done in a single client, using single or multiple threads.
- Using MapReduce to insert data in parallel (this approach also uses the Java API), as shown in the following diagram:
- Using MapReduce to generate HBase store files in parallel in bulk and then import them into HBase directly. (This approach does not require the use of the API; it does not require code and is very efficient.)
Comparing the three methods in terms of speed, we have the following order:
Java client < MapReduce insert < HBase file import
The Java client and MapReduce use HBase APIs to insert data. MapReduce runs on multiple machines and can exploit parallelism. However, both of these methods go through the write path in HBase.
Importing HBase files directly, however, skips the usual write path. HBase files already have data in the correct format that HBase understands. That’s why importing them is much faster than using MapReduce and the Java client.
We covered the Java API earlier. Let’s start with how to insert data using MapReduce.
Importing data into HBase using MapReduce
MapReduce is the distributed processing engine of Hadoop. Usually, programs read/write data from HDFS. Luckily, HBase supports MapReduce. HBase can be the source and the sink for MapReduce programs. A source means MapReduce programs can read from HBase, and sink means results from MapReduce can be sent to HBase.
The following diagram illustrates various sources and sinks for MapReduce:
The diagram we just saw can be summarized as follows:
| Scenario | Source | Sink | Description |
|---|---|---|---|
| 1 | HDFS | HDFS | This is a typical MapReduce method that reads data from HDFS and writes the results back to HDFS. |
| 2 | HDFS | HBase | This imports the data from HDFS into HBase. It is a very common method that is used to import data into HBase for the first time. |
| 3 | HBase | HBase | Data is read from one HBase table and written to another. Most likely, these will be two separate HBase clusters. This is usually used for backups and mirroring. |
Importing data from HDFS into HBase
Let’s say we have lots of data in HDFS and want to import it into HBase. We are going to write a MapReduce program that reads from HDFS and inserts data into HBase. This is depicted in the second scenario in the table we just saw.
Now, we’ll be setting up the environment for the following discussion. In addition, you can find the code and the data for this discussion in our GitHub repository at https://github.com/elephantscale/hbase-book.
The dataset we will use is the sensor data. Our (imaginary) sensor data is stored in HDFS as CSV (comma-separated values) text files. This is how their format looks:
Sensor_id, max temperature, min temperature
Here is some sample data:
- sensor11,90,70
- sensor22,80,70
- sensor31,85,72
- sensor33,75,72
We have two sample files (sensor-data1.csv and sensor-data2.csv) in our repository under the /data directory. Feel free to inspect them.
The first thing we have to do is copy these files into HDFS.
Create a directory in HDFS as follows:
$ hdfs dfs -mkdir hbase-import
Now, copy the files into HDFS:
$ hdfs dfs -put sensor-data* hbase-import/
Verify that the files exist as follows:
$ hdfs dfs -ls hbase-import
We are ready to insert this data into HBase. Note that we are designing the table to match the CSV files we are loading for ease of use.
Our row key is sensor_id.
We have one column family and we call it f (short for family).
Now, we will store two columns, max temperature and min temperature, in this column family.
Pig for MapReduce
Pig allows you to write MapReduce programs at a very high level, and inserting data into HBase is just as easy.
Here’s a Pig script that reads the sensor data from HDFS and writes it in HBase:
-- ## hdfs-to-hbase.pig
data = LOAD 'hbase-import/' using PigStorage(',') as (sensor_id:chararray, max:int, min:int);
-- describe data;
-- dump data;
Now, store the data into hbase://sensors using the following line of code:
STORE data INTO 'hbase://sensors' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('f:max,f:min');
Let's walk through the script. The first command loads the data from the hbase-import directory in HDFS.
The schema for the data is defined as follows:
sensor_id : chararray (string)
max : int
min : int
The describe and dump statements can be used to inspect the data; in Pig, describe will give you the structure of the data object you have, and dump will output all the data to the terminal.
The final STORE command is the one that inserts the data into HBase. Let’s analyze how it is structured:
- INTO ‘hbase://sensors’: This tells Pig to connect to the sensors HBase table.
- org.apache.pig.backend.hadoop.hbase.HBaseStorage: This is the Pig class that will be used to write in HBase. Pig has adapters for multiple data stores.
- The first field in the tuple, sensor_id, will be used as the row key.
- We are specifying the column names for the max and min fields (f:max and f:min, respectively). Note that we have to specify the column family (f:) to qualify the columns.
Before running this script, we need to create an HBase table called sensors. We can do this from the HBase shell, as follows:
$ hbase shell
hbase> create 'sensors', 'f'
hbase> quit
Then, run the Pig script as follows:
$ pig hdfs-to-hbase.pig
Now watch the console output. Pig will execute the script as a MapReduce job. Even though we are only importing two small files here, we can insert a fairly large amount of data by exploiting the parallelism of MapReduce.
At the end of the run, Pig will print out some statistics:
Input(s):
Successfully read 7 records (591 bytes) from: "hdfs://quickstart.cloudera:8020/user/cloudera/hbase-import"
Output(s):
Successfully stored 7 records in: "hbase://sensors"
Looks good! We should have seven rows in our HBase sensors table. We can inspect the table from the HBase shell with the following commands:
$ hbase shell
hbase> scan 'sensors'
This is how your output might look:
ROW COLUMN+CELL
sensor11 column=f:max, timestamp=1412373703149, value=90
sensor11 column=f:min, timestamp=1412373703149, value=70
sensor22 column=f:max, timestamp=1412373703177, value=80
sensor22 column=f:min, timestamp=1412373703177, value=70
sensor31 column=f:max, timestamp=1412373703177, value=85
sensor31 column=f:min, timestamp=1412373703177, value=72
sensor33 column=f:max, timestamp=1412373703177, value=75
sensor33 column=f:min, timestamp=1412373703177, value=72
sensor44 column=f:max, timestamp=1412373703184, value=55
sensor44 column=f:min, timestamp=1412373703184, value=42
sensor45 column=f:max, timestamp=1412373703184, value=57
sensor45 column=f:min, timestamp=1412373703184, value=47
sensor55 column=f:max, timestamp=1412373703184, value=55
sensor55 column=f:min, timestamp=1412373703184, value=42
7 row(s) in 0.0820 seconds
There you go; you can see that seven rows have been inserted!
With Pig, it was very easy. It took us just two lines of Pig script to do the import.
Java MapReduce
We have just demonstrated MapReduce using Pig, which, as the two-line script shows, is a concise and high-level way to write MapReduce programs. However, there are situations where the Java API makes more sense than a Pig script, for example when you need to call Java libraries or perform detailed processing for which Pig is not a good match. For those cases, we have provided the Java version of the MapReduce code in our GitHub repository. A simplified sketch of such a job follows.
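The exact code is in the repository; as a rough, hedged sketch only (the class and method names here are ours, not necessarily those used in the repository), a map-only job that reads the sensor CSV files and writes Puts to the sensors table might look like this, using the same classic HTable-era API as the rest of this article:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class HdfsToHBase {

    // Map-only job: each CSV line (sensor_id,max,min) becomes one Put in the 'sensors' table.
    static class SensorMapper extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] fields = line.toString().split(",");
            if (fields.length < 3) {
                return; // skip malformed lines
            }
            byte[] rowKey = Bytes.toBytes(fields[0].trim()); // sensor_id is the row key
            Put put = new Put(rowKey);
            put.add(Bytes.toBytes("f"), Bytes.toBytes("max"), Bytes.toBytes(fields[1].trim()));
            put.add(Bytes.toBytes("f"), Bytes.toBytes("min"), Bytes.toBytes(fields[2].trim()));
            context.write(new ImmutableBytesWritable(rowKey), put);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration config = HBaseConfiguration.create();
        Job job = Job.getInstance(config, "hdfs-to-hbase");
        job.setJarByClass(HdfsToHBase.class);
        job.setMapperClass(SensorMapper.class);
        FileInputFormat.addInputPath(job, new Path("hbase-import/"));
        // Wires up TableOutputFormat so the mapper's Puts land in the 'sensors' table.
        TableMapReduceUtil.initTableReducerJob("sensors", null, job);
        job.setNumReduceTasks(0); // mappers write directly; there is nothing to aggregate
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}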
Using HBase’s bulk loader utility
HBase is shipped with a bulk loader tool called ImportTsv that can import files from HDFS into HBase tables directly. It is very easy to use, and as a bonus, it uses MapReduce internally to process files in parallel.
Perform the following steps to use ImportTsv:
- Stage data files into HDFS (remember that the files are processed using MapReduce).
- Create a table in HBase if required.
- Run the import.
Staging data files into HDFS
The first step, staging the data files in HDFS, has already been outlined in the previous section. The following sections explain the remaining two steps.
Creating an HBase table
We will do this from the HBase shell. A note on regions is in order here. Regions are shards created automatically by HBase, and it is the regions that are responsible for HBase's distributed nature. However, you need to pay some attention to them in order to ensure good performance. If you put all the data in one region, you will cause what is called region hotspotting.
What is especially helpful for a bulk load is that HBase lets you presplit the table into multiple regions when you create it.
Precreating regions allows faster imports, because the insert requests go out to multiple region servers.
Here, we are creating a single column family:
$ hbase shell
hbase> create 'sensors', {NAME => 'f'}, {SPLITS => ['sensor20', 'sensor40', 'sensor60']}
0 row(s) in 1.3940 seconds
=> Hbase::Table - sensors
hbase> describe 'sensors'
DESCRIPTION ENABLED
'sensors', {NAME => 'f', DATA_BLOCK_ENCODING => true
'NONE', BLOOMFILTER => 'ROW', REPLICATION_SCOPE
=> '0', VERSIONS => '1', COMPRESSION => 'NONE',
MIN_VERSIONS => '0', TTL => 'FOREVER', KEEP_DELE
TED_CELLS => 'false', BLOCKSIZE => '65536', IN_M
EMORY => 'false', BLOCKCACHE => 'true'}
1 row(s) in 0.1140 seconds
We are precreating regions here. The three split points give us exactly four regions, as will be clear from the following diagram:
On inspecting the table in the HBase Master UI, we will see the four regions, along with the Start Key and End Key values we specified.
Run the import
Ok, now it’s time to insert data into HBase.
To see the usage of ImportTsv, do the following:
$ hbase org.apache.hadoop.hbase.mapreduce.ImportTsv
Running it without arguments prints the usage. To run our import, use the following command:
$ hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
    -Dimporttsv.separator=, \
    -Dimporttsv.columns=HBASE_ROW_KEY,f:max,f:min \
    sensors hbase-import/
The following table explains what the parameters mean:
| Parameter | Description |
|---|---|
| -Dimporttsv.separator | Here, our separator is a comma (,). The default value is a tab (\t). |
| -Dimporttsv.columns=HBASE_ROW_KEY,f:max,f:min | This maps the fields in our input files onto the HBase table. The first field, sensor_id, is our row key, denoted by HBASE_ROW_KEY. The remaining fields go into column family f: the second field (max temp) maps to f:max, and the last field (min temp) maps to f:min. |
| sensors | This is the table name. |
| hbase-import | This is the HDFS directory where the data files are located. |
When we run this command, we will see that a MapReduce job is being kicked off. This is how an import is parallelized.
Also, from the console output, we can see that MapReduce is importing two files as follows:
[main] mapreduce.JobSubmitter: number of splits:2
While the job is running, we can inspect the progress from YARN (or the JobTracker UI).
One thing that we can note is that the MapReduce job only consists of mappers. This is because we are reading a bunch of files and inserting them into HBase directly. There is nothing to aggregate. So, there is no need for reducers.
After the job is done, inspect the counters and we can see this:
Map-Reduce Framework
Map input records=7
Map output records=7
This tells us that mappers read seven records from the files and inserted seven records into HBase.
Let’s also verify the data in HBase:
$ hbase shell
hbase> scan 'sensors'
ROW COLUMN+CELL
sensor11 column=f:max, timestamp=1409087465345, value=90
sensor11 column=f:min, timestamp=1409087465345, value=70
sensor22 column=f:max, timestamp=1409087465345, value=80
sensor22 column=f:min, timestamp=1409087465345, value=70
sensor31 column=f:max, timestamp=1409087465345, value=85
sensor31 column=f:min, timestamp=1409087465345, value=72
sensor33 column=f:max, timestamp=1409087465345, value=75
sensor33 column=f:min, timestamp=1409087465345, value=72
sensor44 column=f:max, timestamp=1409087465345, value=55
sensor44 column=f:min, timestamp=1409087465345, value=42
sensor45 column=f:max, timestamp=1409087465345, value=57
sensor45 column=f:min, timestamp=1409087465345, value=47
sensor55 column=f:max, timestamp=1409087465345, value=55
sensor55 column=f:min, timestamp=1409087465345, value=42
7 row(s) in 2.1180 seconds
Your output might vary slightly.
We can see that seven rows are inserted, confirming the MapReduce counters!
Let’s take another quick look at the HBase UI, which is shown here:
As you can see, the inserts go to different regions. So, on an HBase cluster with many region servers, the load will be spread across the cluster, because we have presplit the table into regions. Here is a question to test your understanding.
Run the same ImportTsv command again and see how many records are in the table. Do you get duplicates? Work out the answer and explain why it is correct, then check it against the GitHub repository (https://github.com/elephantscale/hbase-book).
Bulk import scenarios
Here are a few bulk import scenarios:
| Scenario | Methods | Notes |
|---|---|---|
| The data is already in HDFS and needs to be imported into HBase. | The two methods covered earlier can be used: write a MapReduce job that inserts the data through the HBase API (for example, via Pig), or use HBase's bulk loader utility, ImportTsv. | It is probably a good idea to presplit the table before a bulk import. This spreads the insert requests across the cluster and results in a higher insert rate. If you are writing a custom MapReduce job, consider using a high-level MapReduce platform such as Pig or Hive; they are much more concise to write than Java code. |
| The data is in another database (RDBMS/NoSQL) and needs to be imported into HBase. | Use a utility such as Sqoop to bring the data into HDFS, and then use the tools outlined in the first scenario. | Avoid writing MapReduce code that queries the database directly. Most databases cannot handle many simultaneous connections. It is best to bring the data into Hadoop (HDFS) first and then use MapReduce. |
Profiling HBase applications
As in any software development process, once we have our HBase application working correctly, we want to make it faster. At times, developers get carried away and start optimizing before the application is finalized. Remember the well-known rule, usually attributed to Donald Knuth, that premature optimization is the root of all evil.
We can perform some ad hoc profiling in our code by timing various function calls. Also, we can use profiling tools to pinpoint the trouble spots. Using profiling tools is highly encouraged for the following reasons:
- Profiling takes out the guesswork (and a good majority of developers’ guesses are wrong).
- There is no need to modify the code. Manual profiling means inserting instrumentation code throughout the codebase, whereas profilers work by inspecting the runtime behavior.
- Most profilers have a nice and intuitive UI to visualize the program flow and time flow.
The authors use JProfiler. It is a pretty effective profiler; however, it is neither free nor open source. So, for the purpose of this article, we are going to show you simple manual profiling, as follows:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class UserInsert {

    static String tableName = "users";
    static String familyName = "info";

    public static void main(String[] args) throws Exception {
        Configuration config = HBaseConfiguration.create();
        // change the following to connect to remote clusters
        // config.set("hbase.zookeeper.quorum", "localhost");

        long t1a = System.currentTimeMillis();
        HTable htable = new HTable(config, tableName);
        long t1b = System.currentTimeMillis();
        System.out.println("Connected to HTable in : " + (t1b - t1a) + " ms");

        int total = 100;
        long t2a = System.currentTimeMillis();
        for (int i = 0; i < total; i++) {
            int userid = i;
            String email = "user-" + i + "@foo.com";
            String phone = "555-1234";

            byte[] key = Bytes.toBytes(userid);
            Put put = new Put(key);
            put.add(Bytes.toBytes(familyName), Bytes.toBytes("email"), Bytes.toBytes(email));
            put.add(Bytes.toBytes(familyName), Bytes.toBytes("phone"), Bytes.toBytes(phone));
            htable.put(put);
        }
        long t2b = System.currentTimeMillis();
        System.out.println("inserted " + total + " users in " + (t2b - t2a) + " ms");
        htable.close();
    }
}
The code we just saw inserts some sample user data into HBase. We are profiling two operations, that is, connection time and actual insert time.
A sample run of the Java application yields the following:
Connected to HTable in : 1139 ms
inserted 100 users in 350 ms
We spent a lot of time in connecting to HBase. This makes sense. The connection process has to go to ZooKeeper first and then to HBase. So, it is an expensive operation. How can we minimize the connection cost?
The answer is by using connection pooling. Luckily, for us, HBase comes with a connection pool manager. The Java class for this is HConnectionManager. It is very simple to use.
Let’s update our class to use HConnectionManager:
Code: File name: hbase_dp.ch8.UserInsert2.java

package hbase_dp.ch8;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HConnection;
import org.apache.hadoop.hbase.client.HConnectionManager;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.HTableInterface;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class UserInsert2 {

    static String tableName = "users";
    static String familyName = "info";

    public static void main(String[] args) throws Exception {
        Configuration config = HBaseConfiguration.create();
        // change the following to connect to remote clusters
        // config.set("hbase.zookeeper.quorum", "localhost");

        long t1a = System.currentTimeMillis();
        HConnection hConnection = HConnectionManager.createConnection(config);
        long t1b = System.currentTimeMillis();
        System.out.println("Connection manager in : " + (t1b - t1a) + " ms");

        // simulate the first 'connection'
        long t2a = System.currentTimeMillis();
        HTableInterface htable = hConnection.getTable(tableName);
        long t2b = System.currentTimeMillis();
        System.out.println("first connection in : " + (t2b - t2a) + " ms");

        // second connection
        long t3a = System.currentTimeMillis();
        HTableInterface htable2 = hConnection.getTable(tableName);
        long t3b = System.currentTimeMillis();
        System.out.println("second connection : " + (t3b - t3a) + " ms");

        int total = 100;
        long t4a = System.currentTimeMillis();
        for (int i = 0; i < total; i++) {
            int userid = i;
            String email = "user-" + i + "@foo.com";
            String phone = "555-1234";

            byte[] key = Bytes.toBytes(userid);
            Put put = new Put(key);
            put.add(Bytes.toBytes(familyName), Bytes.toBytes("email"), Bytes.toBytes(email));
            put.add(Bytes.toBytes(familyName), Bytes.toBytes("phone"), Bytes.toBytes(phone));
            htable.put(put);
        }
        long t4b = System.currentTimeMillis();
        System.out.println("inserted " + total + " users in " + (t4b - t4a) + " ms");
        hConnection.close();
    }
}
A sample run yields the following timings:
Connection manager in : 98 ms
first connection in : 808 ms
second connection : 0 ms
inserted 100 users in 393 ms
The first connection takes a long time, but then take a look at the time of the second connection. It is almost instant! This is cool!
If you are connecting to HBase from web applications (or interactive applications), use connection pooling.
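For example, here is a minimal sketch of sharing one connection across a web application (the holder class and its names are purely illustrative, not part of the HBase API):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HConnection;
import org.apache.hadoop.hbase.client.HConnectionManager;
import org.apache.hadoop.hbase.client.HTableInterface;

public class HBaseConnectionHolder {

    // One shared connection for the whole application; HConnection is thread-safe.
    private static HConnection connection;

    public static synchronized HConnection get() throws IOException {
        if (connection == null) {
            Configuration config = HBaseConfiguration.create();
            connection = HConnectionManager.createConnection(config);
        }
        return connection;
    }

    // Table handles are cheap to obtain from the shared connection; close them per request.
    public static HTableInterface table(String tableName) throws IOException {
        return get().getTable(tableName);
    }
}

Each request then calls HBaseConnectionHolder.table("users"), uses it, and closes the table handle, while the underlying connection stays open for the life of the application.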
More tips for high-performing HBase writes
Here we will discuss some techniques and best practices to improve writes in HBase.
Batch writes
Currently, in our code, each time we call htable.put (one_put), we make an RPC call to an HBase region server. This round-trip delay can be minimized if we call htable.put() with a bunch of put records. Then, with one round trip, we can insert a bunch of records into HBase. This is called batch puts.
Here is an example of batch puts. Only the relevant section is shown for clarity. For the full code, see hbase_dp.ch8.UserInsert3.java:
int total = 100;
long t4a = System.currentTimeMillis();
List<Put> puts = new ArrayList<>();
for (int i = 0; i < total; i++) {
    int userid = i;
    String email = "user-" + i + "@foo.com";
    String phone = "555-1234";

    byte[] key = Bytes.toBytes(userid);
    Put put = new Put(key);
    put.add(Bytes.toBytes(familyName), Bytes.toBytes("email"), Bytes.toBytes(email));
    put.add(Bytes.toBytes(familyName), Bytes.toBytes("phone"), Bytes.toBytes(phone));
    puts.add(put); // just add to the list
}
htable.put(puts); // do a batch put
long t4b = System.currentTimeMillis();
System.out.println("inserted " + total + " users in " + (t4b - t4a) + " ms");
A sample run with a batch put is as follows:
inserted 100 users in 48 ms
The same code with individual puts took around 350 milliseconds!
Use batch writes when you can to minimize latency.
Note that the HTableUtil class that comes with HBase implements some smart batching options for your use and enjoyment.
Setting memory buffers
We can control when puts are flushed by setting the client write buffer option. Once the data in memory exceeds this setting, it is flushed to the region servers. The default setting is 2 MB. Its purpose is to limit how much data is buffered on the client before a write is sent to the server.
There are two ways of setting this:
- In hbase-site.xml (this setting will be cluster-wide):
  <property>
    <name>hbase.client.write.buffer</name>
    <value>8388608</value> <!-- 8 MB -->
  </property>
- In the application (only applies to that application):
  htable.setWriteBufferSize(1024*1024*10); // 10 MB
Keep in mind that a bigger buffer takes more memory on both the client side and the server side. As a practical guideline, estimate how much memory you can dedicate to the client and put the rest of the load on the cluster.
Turning off autoflush
If autoflush is enabled, each htable.put() call incurs a round-trip RPC to the HRegionServer. Turning autoflush off can reduce the number of round trips and decrease latency. To turn it off, use this code:
htable.setAutoFlush(false);
The risk of turning off autoflush is that if the client crashes before the buffered data is sent to HBase, that data is lost. So when would you want to do this? When the risk of losing some data is acceptable and speed is paramount. Also, see the batch write recommendations we saw previously. Remember that with autoflush off, you must flush the buffer explicitly; a sketch of the typical pattern follows.
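Here is a minimal sketch of that pattern (assuming an htable and a list of Put objects called puts, as in the earlier examples):

htable.setAutoFlush(false);   // buffer puts on the client side
for (Put put : puts) {
    htable.put(put);          // no RPC yet; puts accumulate in the client write buffer
}
htable.flushCommits();        // push the buffered puts to the region servers
htable.close();               // close() also flushes anything still buffered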
Turning off WAL
Before we discuss this, we need to emphasize that the write-ahead log (WAL) is there to prevent data loss in the case of server crashes. By turning it off, we are bypassing this protection. Be very careful when choosing this. Bulk loading is one of the cases where turning off WAL might make sense.
To turn off WAL, set it for each put:
put.setDurability(Durability.SKIP_WAL);
More tips for high-performing HBase reads
So far, we looked at tips to write data into HBase. Now, let’s take a look at some tips to read data faster.
The scan cache
When reading a large number of rows, it is better to set scan caching to a high number (in the hundreds or even thousands). Otherwise, each row that is scanned will result in a round trip to the HRegionServer. This is especially encouraged for MapReduce jobs, as they will likely consume a lot of rows sequentially.
To set scan caching, use the following code:
Scan scan = new Scan();
scan.setCaching(1000);
Only read the families or columns needed
When fetching a row, by default, HBase returns all the families and all the columns. If you only care about one family or a few attributes, specifying them will save needless I/O.
To specify a family, use this:
scan.addFamily(Bytes.toBytes("family1"));
To specify columns, use this:
scan.addColumn(Bytes.toBytes("family1"), Bytes.toBytes("col1"));
The block cache
When scanning a large number of rows sequentially (say, in MapReduce), it is recommended that you turn off the block cache. This might seem counter-intuitive; however, caches are only effective when we access the same rows repeatedly. During a sequential scan, each block is read only once, so keeping the block cache on merely introduces a lot of churn (new data is constantly brought into the cache and old data is evicted to make room for it).
So, we have the following points to consider:
- Turn off the block cache for sequential scans (as shown in the snippet below)
- Leave the block cache on for random/repeated access
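As a small illustrative snippet (building on the Scan object from the scan caching tip above), turning the block cache off for a sequential scan looks like this:

Scan scan = new Scan();
scan.setCaching(1000);       // fetch many rows per RPC (scan caching)
scan.setCacheBlocks(false);  // do not pollute the block cache with a one-pass scan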
Benchmarking or load testing HBase
Benchmarking is a good way to verify HBase’s setup and performance. There are a few good benchmarks available:
- HBase’s built-in benchmark
- The Yahoo Cloud Serving Benchmark (YCSB)
- JMeter for custom workloads
HBase’s built-in benchmark
HBase’s built-in benchmark is PerformanceEvaluation.
To find its usage, use this:
$ hbase org.apache.hadoop.hbase.PerformanceEvaluation
To perform a write benchmark, use this:
$ hbase org.apache.hadoop.hbase.PerformanceEvaluation --nomapred randomWrite 5
Here we are using five threads and no MapReduce.
To accurately measure the throughput, we need to presplit the table that the benchmark writes to, which is called TestTable.
$ hbase org.apache.hadoop.hbase.PerformanceEvaluation --nomapred --presplit=3 randomWrite 5
Here, the table is presplit into three regions. It is good practice to split the table into as many regions as there are region servers.
There is a read option along with a whole host of scan options.
YCSB
The YCSB is a comprehensive benchmark suite that works with many systems such as Cassandra, Accumulo, and HBase.
Download it from GitHub, as follows:
$ git clone git://github.com/brianfrankcooper/YCSB.git
Build it like this:
$ mvn -DskipTests package
Create an HBase table to test against:
$ hbase shell
hbase> create 'ycsb', 'f1'
Now, copy the hbase-site.xml for your cluster into the hbase/src/main/conf/ directory and run the benchmark:
$ bin/ycsb load hbase -P workloads/workloada -p columnfamily=f1 -p table=ycsb
YCSB offers lots of workloads and options. Please refer to its wiki page at https://github.com/brianfrankcooper/YCSB/wiki.
JMeter for custom workloads
The standard benchmarks will give you an idea of your HBase cluster’s performance. However, nothing can substitute measuring your own workload.
We want to measure at least the insert speed and the query speed.
We also want to run a stress test, so that we can measure the ceiling of what our HBase cluster can support.
We can do a simple instrumentation as we did earlier too. However, there are tools such as JMeter that can help us with load testing.
Please refer to the JMeter website and check out the Hadoop or HBase plugins for JMeter.
Monitoring HBase
Running any distributed system requires decent monitoring, and HBase is no exception. Luckily, HBase has the following capabilities:
- HBase exposes a lot of metrics
- These metrics can be directly consumed by monitoring systems such as Ganglia
- We can also obtain these metrics in the JSON format via the REST interface and JMX
Monitoring is a big subject, and we consider it part of HBase administration. So, in this section, we will give pointers to tools and utilities that allow you to monitor HBase.
Ganglia
Ganglia is a generic system monitor that can monitor hosts (such as CPU, disk usage, and so on). The Hadoop stack has had a pretty good integration with Ganglia for some time now. HBase and Ganglia integration is set up by modern installers from Cloudera and Hortonworks.
To enable Ganglia metrics, update the hadoop-metrics.properties file in the HBase configuration directory. Here’s a sample file:
hbase.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
hbase.period=10
hbase.servers=ganglia-server:PORT
jvm.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
jvm.period=10
jvm.servers=ganglia-server:PORT
rpc.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
rpc.period=10
rpc.servers=ganglia-server:PORT
This file has to be uploaded to all the HBase servers (master servers as well as region servers).
Here are some sample graphs from Ganglia (these are Wikimedia statistics, for example):
These graphs show cluster-wide resource utilization.
OpenTSDB
OpenTSDB is a scalable time series database. It can collect and visualize metrics on a large scale. OpenTSDB uses collectors, lightweight agents that send metrics to the OpenTSDB server, and there is a collector library that can gather metrics from HBase.
You can see all the collectors at http://opentsdb.net/docs/build/html/user_guide/utilities/tcollector.html.
An interesting factoid is that OpenTSDB is built on Hadoop/HBase.
Collecting metrics via the JMX interface
HBase exposes a lot of metrics via JMX.
These metrics can be accessed from the web dashboard at http://<hbase master>:60010/jmx.
For example, for an HBase instance that is running locally, it will be http://localhost:60010/jmx.
Here is a sample screenshot of the JMX metrics via the web UI:
Here’s a quick example of how to programmatically retrieve these metrics using curl:
$ curl 'localhost:60010/jmx'
Since this is a web service, we can write a script/application in any language (Java, Python, or Ruby) to retrieve and inspect the metrics.
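For example, a minimal Java sketch (the class name and the localhost address are just placeholders) that pulls the JMX page could look like this:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

public class JmxMetricsFetcher {
    public static void main(String[] args) throws Exception {
        // Point this at your HBase master; 60010 is the default web UI port used above.
        URL url = new URL("http://localhost:60010/jmx");
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(url.openStream()))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line); // raw JSON; hand it to the JSON parser of your choice
            }
        }
    }
}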
Summary
In this article, you learned how to push up the performance of your HBase applications. We looked at how to effectively load a large amount of data into HBase. You also learned about benchmarking and monitoring HBase and saw tips on how to achieve high-performing reads and writes.