How to interact with HBase using HBase shell [Tutorial]

HBase is among the top five most popular and widely-deployed NoSQL databases. It is used to support critical production workloads across hundreds of organizations. It is supported by multiple vendors (in fact, it is one of the few databases that is multi-vendor), and more importantly has an active and diverse developer and user community.

In this article, we see how to use the HBase shell to work efficiently with massive amounts of data.

The following excerpt is taken from the book ‘Seven NoSQL Databases in a Week’, authored by Aaron Ploetz et al.

Working with the HBase shell

The best way to get started with understanding HBase is through the HBase shell.

Before we do that, we need to first install HBase. An easy way to get started is to use the Hortonworks sandbox. You can download the sandbox for free from https://hortonworks.com/products/sandbox/. The sandbox can be installed on Linux, Mac and Windows. Follow the instructions to get this set up.

On any cluster where the HBase client or server is installed, type hbase shell to get a prompt into HBase:

hbase(main):004:0> version
1.1.2.2.3.6.2-3, r2873b074585fce900c3f9592ae16fdd2d4d3a446, Thu Aug  4 18:41:44 UTC 2016

This tells you the version of HBase that is running on the cluster. In this instance, the HBase version is 1.1.2, provided by a particular Hadoop distribution, in this case HDP 2.3.6. The help command lists what the shell can do:

hbase(main):001:0> help
HBase Shell, version 1.1.2.2.3.6.2-3, r2873b074585fce900c3f9592ae16fdd2d4d3a446, Thu Aug  4 18:41:44 UTC 2016
Type 'help "COMMAND"', (e.g. 'help "get"' -- the quotes are necessary) for help on a specific command.
Commands are grouped. Type 'help "COMMAND_GROUP"', (e.g. 'help "general"') for help on a command group.

The help output lists the operations that are possible through the HBase shell, which include DDL, DML, and admin operations. Let’s start by creating a table:

hbase(main):001:0> create 'sensor_telemetry', 'metrics'
0 row(s) in 1.7250 seconds
=> Hbase::Table - sensor_telemetry

This creates a table called sensor_telemetry, with a single column family called metrics. As we discussed before, HBase doesn’t require column names to be defined in the table schema (and in fact, has no provision for you to be able to do so):

hbase(main):001:0> describe 'sensor_telemetry'
Table sensor_telemetry is ENABLED 
sensor_telemetry 
COLUMN FAMILIES DESCRIPTION
{NAME => 'metrics', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false',
KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING
=> 'NONE', TTL => 'FOREVER', COMPRESSION => 'NONE', MIN_VERSIONS => '0',
BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE =>'0'}
1 row(s) in 0.5030 seconds

This describes the structure of the sensor_telemetry table. The command output indicates that there’s a single column family present called metrics, with various attributes defined on it.

BLOOMFILTER indicates the type of bloom filter defined for the column family. A bloom filter of the ROW type probes for the presence or absence of a given row key, while one of the ROWCOL type probes for the presence or absence of a given combination of row key and column qualifier. You can also choose to have BLOOMFILTER set to NONE.
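To make the ROW versus ROWCOL distinction concrete, here is a minimal toy bloom filter in plain Ruby. It is only a sketch: the class name, the bit-array size, the MD5-based hashing, and the `|` separator between row key and qualifier are all assumptions made for illustration, not how HBase implements its filters. The point is what gets fed into the filter in each mode.

```ruby
require 'digest/md5'

# Toy bloom filter: a fixed bit array plus a few hash positions per key.
# Sizes and hashing scheme are illustrative assumptions, not HBase internals.
class ToyBloomFilter
  def initialize(bits = 1024, hashes = 3)
    @bits = Array.new(bits, false)
    @hashes = hashes
  end

  def add(key)
    positions(key).each { |i| @bits[i] = true }
  end

  # false means "definitely absent"; true means "possibly present".
  def might_contain?(key)
    positions(key).all? { |i| @bits[i] }
  end

  private

  # Derive several bit positions by salting the key with the hash index.
  def positions(key)
    (0...@hashes).map do |n|
      Digest::MD5.hexdigest("#{n}:#{key}").to_i(16) % @bits.size
    end
  end
end

row_key = '/94555/20170308/18:30'

# ROW: the filter is keyed on the row key alone.
row_filter = ToyBloomFilter.new
row_filter.add(row_key)

# ROWCOL: the filter is keyed on row key plus column qualifier.
rowcol_filter = ToyBloomFilter.new
rowcol_filter.add("#{row_key}|temperature")

puts row_filter.might_contain?(row_key)                       # true
puts rowcol_filter.might_contain?("#{row_key}|temperature")   # true
```

A bloom filter never reports a stored key as absent, which is what lets a read skip files that cannot contain the key. A ROWCOL-style filter only helps when reads name both the row and the qualifier.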

BLOCKSIZE configures the minimum granularity of an HBase read. By default, the block size is 64 KB, so if your cells are, on average, much smaller than 64 KB and there’s not much locality of reference, you can lower the block size to ensure there’s no more I/O than necessary and, more importantly, that your block cache isn’t wasted on data that is not needed.
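A rough back-of-the-envelope calculation shows why block size matters for random reads. The sketch below assumes uncompressed blocks and a hypothetical average cell size of 100 bytes; both numbers are illustrative, and real blocks also carry key and index overhead that is ignored here.

```ruby
# Roughly how many cells are loaded along with the one cell you asked for.
# avg_cell_bytes is a hypothetical figure chosen for illustration.
def cells_per_block(block_size_bytes, avg_cell_bytes)
  block_size_bytes / avg_cell_bytes # integer division
end

avg_cell = 100
[65_536, 16_384].each do |block_size|
  n = cells_per_block(block_size, avg_cell)
  puts "BLOCKSIZE #{block_size}: about #{n} cells read to serve one cell"
end
```

Under these assumptions a 64 KB block drags in about 655 cells per random read, while a 16 KB block drags in about 163. Smaller blocks mean a larger block index, so this is a trade-off rather than a free win.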

VERSIONS refers to the maximum number of cell versions that are to be kept around. Column family attributes such as these can be changed with the alter command:

hbase(main):004:0> alter 'sensor_telemetry', {NAME => 'metrics', BLOCKSIZE => '16384', COMPRESSION => 'SNAPPY'}
Updating all regions with the new schema...
1/1 regions updated.
Done.
0 row(s) in 1.9660 seconds

Here, we are altering the table and column family definition to change the BLOCKSIZE to 16 KB and the COMPRESSION codec to SNAPPY. Describing the table again confirms the change:

hbase(main):004:0> version
1.1.2.2.3.6.2-3, r2873b074585fce900c3f9592ae16fdd2d4d3a446, Thu Aug 4 18:41:44 UTC 2016
hbase(main):005:0> describe 'sensor_telemetry'
Table sensor_telemetry is ENABLED
sensor_telemetry
COLUMN FAMILIES DESCRIPTION
{NAME => 'metrics', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false',
KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING
=> 'NONE', TTL => 'FOREVER', COMPRESSION => 'SNAPPY', MIN_VERSIONS => '0',
BLOCKCACHE => 'true', BLOCKSIZE => '16384', REPLICATION_SCOPE => '0'} 
1 row(s) in 0.0410 seconds

This is what the table definition looks like after our alter command. Next, let’s scan the table to see what it contains:

hbase(main):007:0> scan 'sensor_telemetry'
ROW COLUMN+CELL
0 row(s) in 0.0750 seconds

No surprises, the table is empty. So, let’s populate some data into the table:

hbase(main):007:0> put 'sensor_telemetry', '/94555/20170308/18:30', 'temperature', '65'
ERROR: Unknown column family! Valid column names: metrics:*

Here, we are attempting to insert data into the sensor_telemetry table. We are attempting to store the value '65' for the column qualifier 'temperature' for a row key '/94555/20170308/18:30'. This is unsuccessful because the column 'temperature' is not associated with any column family.
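The rule at work here (column families are fixed when the table is created, qualifiers are free-form) can be modelled in a few lines of plain Ruby. This is a toy sketch, not HBase code: the ToyTable class and the humidity qualifier are hypothetical, and the error text simply mirrors the shell message above.

```ruby
# Toy table: families are declared up front, qualifiers are not.
class ToyTable
  def initialize(*families)
    @families = families
    @rows = Hash.new { |hash, key| hash[key] = {} }
  end

  # column is 'family:qualifier'; a bare name is treated as a family name.
  def put(row_key, column, value)
    family, qualifier = column.split(':', 2)
    unless @families.include?(family)
      valid = @families.map { |f| "#{f}:*" }.join(', ')
      raise ArgumentError, "Unknown column family! Valid column names: #{valid}"
    end
    @rows[row_key]["#{family}:#{qualifier}"] = value
  end

  def get(row_key)
    @rows.fetch(row_key, {})
  end
end

table = ToyTable.new('metrics')

begin
  table.put('/94555/20170308/18:30', 'temperature', '65')
rescue ArgumentError => e
  puts "ERROR: #{e.message}"
end

table.put('/94555/20170308/18:30', 'metrics:temperature', '65')
# A brand-new qualifier (hypothetical) needs no schema change.
table.put('/94555/20170308/18:30', 'metrics:humidity', '40')
p table.get('/94555/20170308/18:30')
```

The first put fails because 'temperature' is read as a family name that does not exist, while any qualifier under the existing metrics family is accepted without touching the schema.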

In HBase, you always need the row key, the column family and the column qualifier to uniquely specify a value. So, let’s try this again:

hbase(main):008:0> put 'sensor_telemetry', '/94555/20170308/18:30',
 'metrics:temperature', '65'
 0 row(s) in 0.0120 seconds

OK, that seemed to be successful. Let’s confirm that we now have some data in the table:

hbase(main):009:0> count 'sensor_telemetry'
 1 row(s) in 0.0620 seconds
 => 1

OK, it looks like we are on the right track. Let’s scan the table to see what it contains:

hbase(main):010:0> scan 'sensor_telemetry'
ROW COLUMN+CELL
/94555/20170308/18:30 column=metrics:temperature, timestamp=1501810397402, value=65
1 row(s) in 0.0190 seconds

This tells us we’ve got data for a single row and a single column. The insert time epoch in milliseconds was 1501810397402.
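Since the timestamp is just milliseconds since the Unix epoch, it is easy to turn into a readable date. Here is a small plain-Ruby helper (the name cell_time is our own, not part of the HBase shell):

```ruby
# Convert a cell timestamp (milliseconds since the Unix epoch) to a UTC Time.
def cell_time(millis)
  # Time.at takes whole seconds plus a microsecond remainder.
  Time.at(millis / 1000, (millis % 1000) * 1000).utc
end

t = cell_time(1501810397402)
puts t.strftime('%Y-%m-%d %H:%M:%S.%L UTC') # 2017-08-04 01:33:17.402 UTC
```

Note that the cell timestamp records when the put happened. It is unrelated to the date embedded in the row key, which is part of the application's own key design.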

In addition to a scan operation, which scans through all of the rows in the table, HBase also provides a get operation, which retrieves the data for a specific row if you know its key:

hbase(main):011:0> get 'sensor_telemetry', '/94555/20170308/18:30'
COLUMN CELL
metrics:temperature timestamp=1501810397402, value=65

OK, that returns the row as expected. Next, let’s look at the effect of cell versions. As we’ve discussed before, a value in HBase is identified by the combination of row key, column family, column qualifier, and timestamp.

To understand this, let’s insert the value '66', for the same row key and column qualifier as before:

hbase(main):012:0> put 'sensor_telemetry', '/94555/20170308/18:30',
'metrics:temperature', '66'
0 row(s) in 0.0080 seconds

Now let’s read the value for the row key back:

hbase(main):013:0> get 'sensor_telemetry', '/94555/20170308/18:30'
COLUMN CELL 
metrics:temperature timestamp=1501810496459, value=66
1 row(s) in 0.0130 seconds

This is in line with what we expect, and this is the standard behavior we’d expect from any database. A put in HBase is the equivalent of an upsert in an RDBMS. Like an upsert, put inserts a value if it doesn’t already exist and updates it if a prior value exists.

Now, this is where things get interesting. The get operation in HBase allows us to retrieve data associated with a particular timestamp:

hbase(main):015:0> get 'sensor_telemetry', '/94555/20170308/18:30', {COLUMN =>
'metrics:temperature', TIMESTAMP => 1501810397402} 
COLUMN CELL
metrics:temperature timestamp=1501810397402,value=65 
1 row(s) in 0.0120 seconds

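We are able to retrieve the old value of 65 by providing the right timestamp. So, a put in HBase doesn’t immediately overwrite the old value; it merely hides it behind a newer version. Note that old versions are not kept forever: once a cell has more versions than the column family’s VERSIONS setting allows (1 for this table), HBase is free to discard the excess, which typically happens when data is flushed or compacted. If you need to read older versions reliably, raise VERSIONS on the column family.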
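The way a newer version hides an older one can be sketched with a toy cell in plain Ruby. The ToyCell class is hypothetical, and its compact! method is only a stand-in for the cleanup HBase performs on its own schedule when a cell holds more versions than VERSIONS allows. It is an assumption for illustration, not a shell command.

```ruby
# Toy versioned cell: every put adds a (timestamp => value) entry.
class ToyCell
  def initialize(max_versions)
    @max_versions = max_versions
    @versions = {}
  end

  def put(timestamp, value)
    @versions[timestamp] = value
  end

  # Without a timestamp, return the newest value; with one, that exact version.
  def get(timestamp = nil)
    return @versions[timestamp] if timestamp
    newest = @versions.keys.max
    newest && @versions[newest]
  end

  # Stand-in for later cleanup: keep only the newest max_versions entries.
  def compact!
    keep = @versions.keys.sort.last(@max_versions)
    @versions.select! { |ts, _value| keep.include?(ts) }
  end
end

cell = ToyCell.new(1) # mirrors VERSIONS => '1' from the describe output
cell.put(1501810397402, '65')
cell.put(1501810496459, '66')

puts cell.get                  # 66
puts cell.get(1501810397402)   # 65
cell.compact!
p cell.get(1501810397402)      # nil
```

Before the cleanup step, the older value is still reachable by its timestamp even though a plain get returns only the newest one. After it, only as many versions as the limit allows remain.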

Now, let’s insert more data into the table:

hbase(main):028:0> put 'sensor_telemetry', '/94555/20170307/18:30',
'metrics:temperature', '43'
0 row(s) in 0.0080 seconds

hbase(main):029:0> put 'sensor_telemetry', '/94555/20170306/18:30',
'metrics:temperature', '33'
0 row(s) in 0.0070 seconds

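Now, let’s scan the table again. The value 67 shown for /94555/20170308/18:30 comes from a further put to that row that is not shown above: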

hbase(main):030:0> scan 'sensor_telemetry' 
ROW COLUMN+CELL 
/94555/20170306/18:30 column=metrics:temperature, timestamp=1501810843956, value=33 
/94555/20170307/18:30 column=metrics:temperature, timestamp=1501810835262, value=43 
/94555/20170308/18:30 column=metrics:temperature, timestamp=1501810615941, value=67
3 row(s) in 0.0310 seconds

We can also scan the table in reverse key order:

hbase(main):031:0> scan 'sensor_telemetry', {REVERSED => true} 
ROW COLUMN+CELL 
/94555/20170308/18:30 column=metrics:temperature, timestamp=1501810615941, value=67 
/94555/20170307/18:30 column=metrics:temperature, timestamp=1501810835262, value=43 
/94555/20170306/18:30 column=metrics:temperature, timestamp=1501810843956, value=33
3 row(s) in 0.0520 seconds
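Notice that the scans return rows in row-key order, not in the order we inserted them. HBase keeps rows sorted by the bytes of the row key, which we can mimic with plain Ruby string sorting (a sketch with no HBase connection):

```ruby
# Row keys in the order they were put in the transcript.
inserted = [
  '/94555/20170308/18:30',
  '/94555/20170307/18:30',
  '/94555/20170306/18:30'
]

forward  = inserted.sort    # byte-wise string order, like a plain scan
reversed = forward.reverse  # like a scan with REVERSED => true

puts forward
puts reversed

# Byte-wise order is why the dates are zero-padded: unpadded numbers
# sort as text, so '10' comes before '9'.
p %w[9 10].sort
```

This is also why the key packs the date as yyyymmdd: for fixed-width, zero-padded fields, byte order and chronological order agree.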

What if we wanted all the rows, but in addition, wanted all the cell versions from each row? We can easily retrieve that:

hbase(main):032:0> scan 'sensor_telemetry', {RAW => true, VERSIONS => 10} 
ROW COLUMN+CELL 
/94555/20170306/18:30 column=metrics:temperature, timestamp=1501810843956, value=33 
/94555/20170307/18:30 column=metrics:temperature, timestamp=1501810835262, value=43 
/94555/20170308/18:30 column=metrics:temperature, timestamp=1501810615941, value=67 
/94555/20170308/18:30 column=metrics:temperature, timestamp=1501810496459, value=66 
/94555/20170308/18:30 column=metrics:temperature, timestamp=1501810397402, value=65

Here, we are retrieving all three versions (65, 66, and 67) stored for the row key /94555/20170308/18:30 in the scan result set.

HBase scan operations don’t need to go from the beginning to the end of the table; you can optionally specify the row to start scanning from and the row to stop the scan operation at:

hbase(main):034:0> scan 'sensor_telemetry', {STARTROW => '/94555/20170307'} 
ROW COLUMN+CELL 
/94555/20170307/18:30 column=metrics:temperature, timestamp=1501810835262, value=43 
/94555/20170308/18:30 column=metrics:temperature, timestamp=1501810615941, value=67 
2 row(s) in 0.0550 seconds
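The start and stop semantics can be sketched over a sorted list of keys in plain Ruby. The range_scan helper is hypothetical. It assumes the usual HBase convention that the start row is inclusive and the stop row (STOPROW in the shell) is exclusive.

```ruby
# Range scan over sorted row keys: start inclusive, stop exclusive.
def range_scan(sorted_keys, start_row: nil, stop_row: nil)
  sorted_keys.select do |key|
    (start_row.nil? || key >= start_row) && (stop_row.nil? || key < stop_row)
  end
end

keys = [
  '/94555/20170306/18:30',
  '/94555/20170307/18:30',
  '/94555/20170308/18:30'
]

# Same result as the STARTROW scan above: the rows for the 7th and the 8th.
puts range_scan(keys, start_row: '/94555/20170307')

# Adding a stop row bounds the scan to a single day.
puts range_scan(keys, start_row: '/94555/20170307', stop_row: '/94555/20170308')
```

The start row does not have to be an existing key. The scan simply begins at the first key that sorts at or after it, which is why a date prefix works as a starting point.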

HBase also provides the ability to supply filters to the scan operation to restrict what rows are returned by the scan operation. It’s possible to implement your own filters, but there’s rarely a need to. There’s a large collection of filters that are already implemented:

hbase(main):033:0> scan 'sensor_telemetry', {ROWPREFIXFILTER => '/94555/20170307'} 
ROW COLUMN+CELL 
/94555/20170307/18:30 column=metrics:temperature, timestamp=1501810835262, value=43 
1 row(s) in 0.0300 seconds

This returns all the rows whose keys have the prefix /94555/20170307. Filters can also be constructed explicitly and passed in through the FILTER option:

hbase(main):033:0> scan 'sensor_telemetry', { FILTER =>     
   SingleColumnValueFilter.new( 
         Bytes.toBytes('metrics'),       
         Bytes.toBytes('temperature'),  
         CompareFilter::CompareOp.valueOf('EQUAL'), 
         BinaryComparator.new(Bytes.toBytes('66')))}

The SingleColumnValueFilter can be used to scan a table and look for all rows with a given column value. In this example, it looks for rows whose metrics:temperature value equals 66.
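Two defaults of this filter are worth keeping in mind: it compares only the latest version of the column, and rows that do not contain the column at all are passed through unless the filter is told otherwise. The plain-Ruby sketch below models that behavior. The function name, the filter_if_missing keyword, and the extra humidity-only row are hypothetical, chosen to mirror the filter's documented options rather than to reproduce its implementation.

```ruby
# Toy rows: row key => { column => latest value }.
rows = {
  '/94555/20170306/18:30' => { 'metrics:temperature' => '33' },
  '/94555/20170307/18:30' => { 'metrics:temperature' => '43' },
  '/94555/20170308/18:30' => { 'metrics:temperature' => '67' },
  # Hypothetical row that lacks the temperature column altogether.
  '/94555/20170309/18:30' => { 'metrics:humidity' => '40' }
}

# Keep rows whose latest value for column equals expected.
# Rows missing the column pass through unless filter_if_missing is true.
def single_column_value_scan(rows, column, expected, filter_if_missing: false)
  rows.select do |_row_key, cells|
    if cells.key?(column)
      cells[column] == expected
    else
      !filter_if_missing
    end
  end
end

# The latest temperature for 20170308 is 67, so nothing matches 66;
# only the row without the column comes back by default.
p single_column_value_scan(rows, 'metrics:temperature', '66').keys

# Dropping rows that lack the column gives the stricter result.
p single_column_value_scan(rows, 'metrics:temperature', '43',
                           filter_if_missing: true).keys
```

In other words, searching for 66 against the table built in this article would not find the 20170308 row, because its latest value is 67. To match older versions or exclude rows without the column, the filter has to be configured accordingly.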

We saw how easy it is to work with your data in HBase using the HBase shell.

If you found this excerpt useful, make sure you check out the book ‘Seven NoSQL Databases in a Week’ to get more hands-on information about HBase and the other popular NoSQL databases out there today.

Amey Varangaonkar

Data Science Enthusiast. A massive science fiction and Manchester United fan. Loves to read, write and listen to music.