
This article by Shashwat Shriparv, author of the book Learning HBase, will introduce you to the world of HBase.


HBase is a horizontally scalable, distributed, open source, sorted map database. It runs on top of the Hadoop Distributed File System (HDFS). HBase is a NoSQL, nonrelational database that doesn't always require a predefined schema. It can be seen as a flexible, scalable, multidimensional spreadsheet that fits any structure of data: new column fields can be added on the fly, without a fully defined column structure before data can be inserted or queried. In other words, HBase is a column-based database that runs on top of the Hadoop Distributed File System and supports features such as linear scalability (scale out), automatic failover, automatic sharding, and a flexible schema.

HBase is modeled on, and was inspired by, Google BigTable, a compressed, high-performance, proprietary data store built on the Google filesystem. HBase was developed as a Hadoop subproject to support storage of structured data, taking advantage of distributed filesystems (typically, the Hadoop Distributed File System, known as HDFS).

The following table contains key information about HBase and its features:

| Feature | Description |
| --- | --- |
| Developed by | Apache |
| Written in | Java |
| Type | Column oriented |
| License | Apache License |
| Lacking features of relational databases | SQL support; relations; primary, foreign, and unique key constraints; normalization |
| Website | http://hbase.apache.org |
| Distributions | Apache, Cloudera |
| Download link | http://mirrors.advancedhosters.com/apache/hbase/ |
| Mailing lists | |
| Blog | http://blogs.apache.org/hbase/ |

HBase layout on top of Hadoop

The following figure shows the layout of HBase on top of Hadoop:

There is more than one ZooKeeper node in the setup, which provides high availability of master status. The RegionServers run on the machines where DataNodes run, so there can be as many RegionServers as DataNodes. A RegionServer can host multiple HRegions; an HRegion has one HLog and multiple HFiles, each with its associated MemStore.

HBase can be seen as a master-slave database. The master, called HMaster, is responsible for coordination between client applications and HRegionServers, and for monitoring and recording metadata changes and management. The slaves, called HRegionServers, serve the actual tables in the form of regions. These regions are the basic building blocks of HBase tables, across which table data is distributed. So, HMaster and the RegionServers work in coordination to serve the HBase tables and the HBase cluster.

Usually, HMaster is co-hosted with the Hadoop NameNode daemon process on a server and communicates with the DataNode daemons for reading and writing data on HDFS. The RegionServers run, or are co-hosted, on the Hadoop DataNodes.
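To make this coordination concrete, here is a minimal sketch of how a client reaches such a cluster, assuming an HBase 1.x or later Java client; the ZooKeeper hostnames are hypothetical. The client only needs to know the ZooKeeper quorum; ZooKeeper then tells it where HMaster and the RegionServers live:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class HBaseConnectSketch {
    public static void main(String[] args) throws Exception {
        // Client-side configuration: only the ZooKeeper quorum is required;
        // master and region locations are discovered through ZooKeeper.
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum",
                "zk1.example.com,zk2.example.com,zk3.example.com"); // hypothetical hosts
        try (Connection connection = ConnectionFactory.createConnection(conf)) {
            System.out.println("Connected to HBase: " + !connection.isClosed());
        }
    }
}
```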

Comparing architectural differences between an RDBMS and HBase

Let’s list the major differences between relational databases and HBase:

| Relational databases | HBase |
| --- | --- |
| Use tables as databases | Uses regions as databases |
| Supported filesystems are FAT, NTFS, and ext | Supported filesystem is HDFS |
| The technique used to store logs is commit logs | The technique used to store logs is write-ahead logs (WAL) |
| The reference system used is a coordinate system | The reference system used is ZooKeeper |
| Uses the primary key | Uses the row key |
| Partitioning is supported | Sharding is supported |
| Use of rows, columns, and cells | Use of rows, column families, columns, and cells |

HBase features

Let's see the major features of HBase that make it one of the most useful databases for the industry, now and in the future:

  • Automatic failover and load balancing: HBase runs on top of HDFS, which is internally distributed and automatically recovered using multiple block allocation and replication. It works with multiple HMasters and RegionServers, and failover is also facilitated using HBase and RegionServer replication.
  • Automatic sharding: An HBase table is made up of regions that are hosted by RegionServers, and these regions are distributed throughout the RegionServers on different DataNodes. HBase provides automatic (and manual) splitting of regions into smaller subregions once they reach a threshold size, to reduce I/O time and overhead.
  • Hadoop/HDFS integration: It's important to note that HBase can run on top of other filesystems as well. However, HDFS is the most common choice, as it supports data distribution and high availability out of the box; we just need to set some configuration parameters to enable HBase to communicate with Hadoop.
  • Real-time, random big data access: HBase internally uses a log-structured merge-tree (LSM-tree) as its data storage architecture, which periodically merges smaller files into larger files to reduce disk seeks.
  • MapReduce: HBase has built-in support for the Hadoop MapReduce framework for fast, parallel processing of the data stored in HBase.

    You can explore the package org.apache.hadoop.hbase.mapreduce for more details; a minimal sketch of such a job follows this list.

  • Java API for client access: HBase has solid Java API support (client/server) for easy development and programming.
  • Thrift and RESTful web services: Besides the Java client API, HBase provides Thrift and RESTful gateways for integrating with and accessing HBase from other languages and environments.
  • Support for exporting metrics via the Hadoop metrics subsystem: HBase exposes Java Management Extensions (JMX) and exports metrics for monitoring with tools such as Ganglia and Nagios.
  • Distributed: When used with HDFS, HBase coordinates with Hadoop so that distribution of tables, high availability, and consistency are supported.
  • Linear scalability (scale out): HBase scales out rather than up: instead of making servers more powerful, we add more machines to the cluster, and we can do so on the fly. As soon as a RegionServer starts on a new node, the cluster can begin rebalancing regions onto it; it is as simple as that.
  • Column oriented: In contrast with most relational databases, which use row-based storage, HBase stores columns contiguously rather than rows. More about row- and column-oriented databases will follow.
  • HBase shell support: HBase provides a full-fledged command-line tool for interacting with HBase and performing operations such as creating tables, adding and removing data, scanning data, and a few other administrative tasks.
  • Sparse, multidimensional, sorted map database: HBase is a sparse, multidimensional, sorted map-based database that supports multiple versions of the same record.
  • Snapshot support: HBase supports taking snapshots of metadata, which can be used to restore data to a previous or correct state.
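As mentioned in the MapReduce point above, the org.apache.hadoop.hbase.mapreduce package lets a MapReduce job read HBase tables directly. The following is a minimal sketch of a map-only row counter, similar in spirit to the RowCounter tool that ships with HBase; the table name my_table is hypothetical, and the API shown assumes an HBase 1.x or later Java client:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class RowCounterSketch {
    // Receives one HBase row per map() call; regions are scanned in parallel.
    static class CountMapper extends TableMapper<ImmutableBytesWritable, Result> {
        @Override
        protected void map(ImmutableBytesWritable rowKey, Result row, Context context) {
            context.getCounter("hbase", "rows").increment(1);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "row-counter-sketch");
        job.setJarByClass(RowCounterSketch.class);
        Scan scan = new Scan(); // full-table scan; tune caching/batching for real jobs
        TableMapReduceUtil.initTableMapperJob(
                "my_table",     // hypothetical table name
                scan, CountMapper.class,
                null, null,     // the mapper emits nothing, so no output key/value classes
                job);
        job.setNumReduceTasks(0);                         // map-only job
        job.setOutputFormatClass(NullOutputFormat.class); // counters are the only output
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```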

HBase in the Hadoop ecosystem

Let's see where HBase sits in the Hadoop ecosystem, where it provides a persistent, structured, schema-based data store. The following figure illustrates the Hadoop ecosystem:

HBase can work as a separate entity on the local filesystem (which is not really effective, as no distribution is provided), as well as in coordination with Hadoop as a separate but connected entity. As we know, Hadoop provides two services: a distributed filesystem (HDFS) for storage and a MapReduce framework for processing in a parallel mode. When there was a need to store structured data (data in the form of tables, rows, and columns), the format most programmers are already familiar with, programmers found it difficult to process data that was stored on HDFS in an unstructured flat-file format. This led to the evolution of HBase, which provides a way to store data in a structured fashion.

Consider that we have a CSV file stored on HDFS and we need to query it. Writing Java code to parse the file for every query wouldn't be a good option; it would be better if we could specify a key and fetch the data directly. So, what we can do here is create a schema or table with the same structure as the CSV file, store the CSV data in the HBase table, and query it by key using the HBase APIs or the HBase shell.
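A minimal sketch of that round trip might look as follows, assuming an HBase 1.x or later Java client and a pre-created table users with a single column family d (both hypothetical names). One CSV line becomes one HBase row keyed by its first field, which we can then fetch directly by key:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class CsvToHBaseSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) {
            // One CSV line (id,name,city) becomes one HBase row keyed by id.
            String[] fields = "42,Alice,Pune".split(",");
            Put put = new Put(Bytes.toBytes(fields[0]));
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("name"), Bytes.toBytes(fields[1]));
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("city"), Bytes.toBytes(fields[2]));
            table.put(put);

            // Fetch the row back directly by key -- no file scan needed.
            Result result = table.get(new Get(Bytes.toBytes("42")));
            System.out.println(Bytes.toString(
                    result.getValue(Bytes.toBytes("d"), Bytes.toBytes("name"))));
        }
    }
}
```

The equivalent lookup in the HBase shell would simply be get 'users', '42'.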

Data representation in HBase

Let's look into the representation of rows and columns in an HBase table:

An HBase table is divided into rows, column families, columns, and cells. Row keys are unique keys that identify a row, column families are groups of columns, columns are fields of the table, and the cell contains the actual value or data.
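The Java client API mirrors this structure directly: every value read from a table comes back as a cell addressed by row key, column family, column qualifier, and timestamp (the version dimension). The following sketch, assuming an HBase 1.x or later client, walks a fetched Result and prints the coordinates of each cell:

```java
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class CellAnatomy {
    // Prints each cell as: rowkey / family:qualifier @ timestamp = value
    static void dump(Result result) {
        if (result.isEmpty()) {
            return; // listCells() returns null for an empty Result
        }
        for (Cell cell : result.listCells()) {
            System.out.printf("%s / %s:%s @ %d = %s%n",
                    Bytes.toString(CellUtil.cloneRow(cell)),
                    Bytes.toString(CellUtil.cloneFamily(cell)),
                    Bytes.toString(CellUtil.cloneQualifier(cell)),
                    cell.getTimestamp(),
                    Bytes.toString(CellUtil.cloneValue(cell)));
        }
    }
}
```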

So, we have been through an introduction to HBase; now, let's look briefly at Hadoop and its components. It is assumed here that you are already familiar with Hadoop; if not, the following brief introduction will help you understand it.

Hadoop

Hadoop is the underlying technology of HBase, providing high availability, fault tolerance, and distribution. It is an Apache-sponsored, free, open source, Java-based programming framework that supports the storage of large datasets. It provides a distributed filesystem and MapReduce, a distributed programming framework, and thus offers a scalable, reliable, distributed storage and development environment. Hadoop makes it possible to run applications on systems with tens to tens of thousands of nodes, and its underlying distributed filesystem provides large-scale storage and rapid data access. It has the following submodules:

  • Hadoop Common: This is the core component that supports the other Hadoop modules, facilitating communication and coordination between them.
  • Hadoop Distributed File System (HDFS): This is the underlying distributed filesystem, abstracted on top of the local filesystem, which provides high-throughput read and write operations on data in Hadoop.
  • Hadoop YARN: This is the new framework shipped with newer releases of Hadoop. It provides job scheduling as well as job and resource management.
  • Hadoop MapReduce: This is the Hadoop-based processing system that provides parallel processing of large datasets.

Other projects related to Hadoop include HBase, Hive, Ambari, Avro, Cassandra (not a Hadoop subproject, but a related project that solves similar problems in different ways), Mahout, Pig, Spark, ZooKeeper (also not a Hadoop subproject, but a dependency shared by many distributed systems), and so on. All of these have different uses, and the combination of these projects forms the Hadoop ecosystem.

Core daemons of Hadoop

The following are the core daemons of Hadoop:

  • NameNode: This stores and manages all the metadata about the data present on the cluster, so it is the single point of contact to Hadoop. Newer releases of Hadoop offer the option of more than one NameNode for high availability.
  • JobTracker: This runs on the NameNode machine and coordinates the MapReduce processing of jobs submitted to the cluster.
  • SecondaryNameNode: This maintains a backup of the metadata present on the NameNode and also records filesystem changes.
  • DataNode: This contains the actual data.
  • TaskTracker: This performs tasks on the local data, as assigned by the JobTracker.

The preceding are the daemons in Hadoop v1 and earlier. In newer versions of Hadoop, we have the ResourceManager instead of the JobTracker, NodeManagers instead of TaskTrackers, and the YARN framework instead of the plain MapReduce framework. The following is a comparison between the daemons in Hadoop 1 and Hadoop 2:

| | Hadoop 1 | Hadoop 2 |
| --- | --- | --- |
| HDFS | NameNode; Secondary NameNode; DataNode | NameNode (more than one, active/standby); Checkpoint node; DataNode |
| Processing | MapReduce v1: JobTracker; TaskTracker | YARN (MRv2): ResourceManager; NodeManager; Application Master |

Comparing HBase with Hadoop

Now that we know what HBase and Hadoop are, let's compare HDFS and HBase for a better understanding:

| Hadoop/HDFS | HBase |
| --- | --- |
| Provides a filesystem for distributed storage | Provides tabular, column-oriented data storage |
| Optimized for the storage of huge files, with no random read/write of those files | Optimized for tabular data, with random read/write support |
| Uses flat files | Uses key-value pairs of data |
| The data model is not flexible | Provides a flexible data model |
| Uses a filesystem and a processing framework | Uses tabular storage with built-in Hadoop MapReduce support |
| Mostly optimized for write-once, read-many | Optimized for both read-many and write-many |

Summary

So, in this article, we discussed the introductory aspects of HBase and its features. We also discussed HBase's components and their place in the Hadoop ecosystem.
