
Apache Druid is a distributed, high-performance columnar store that lets us keep both real-time and historical time-series data. It provides fast data aggregation and flexible data exploration, and its architecture scales to trillions of data points and petabytes of data. In this tutorial, we will explore the Apache Druid components, see how to set Druid up in a Hadoop cluster, and use it to visualize data and build the analytics that drive business decisions.

In order to understand more about the Druid architecture, you may refer to this white paper.

This article is an excerpt from a book written by Naresh Kumar and Prashant Shindgikar titled Modern Big Data Processing with Hadoop.

Apache Druid components

Let’s take a quick look at the different components of the Druid cluster:

Druid Broker: These nodes know where the data lies in the cluster and are contacted by applications/clients to retrieve that data from Druid.
Druid Coordinator: These nodes manage the data on the historical nodes (they load, drop, and load-balance it).
Druid Overlord: This component is responsible for accepting tasks and returning the statuses of the tasks.
Druid Router: These nodes are needed when the data volume is in the terabyte range or higher; they route requests to the brokers.
Druid Historical: These nodes store immutable segments and are the backbone of the Druid cluster. They handle requests to load segments, drop segments, and serve queries on segments.
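
Once the cluster is up and running (installation is covered later in this article), each of these services exposes a simple HTTP /status endpoint that is handy for quick health checks. A minimal sketch, assuming the Druid default ports (coordinator 8081, broker 8082, overlord 8090) and that the commands are run on the node hosting these services:

# Query the /status endpoint of the coordinator, broker, and overlord
# (ports shown are the Druid defaults; adjust them to match your configuration)
for port in 8081 8082 8090; do
  echo "--- port ${port} ---"
  curl -s "http://localhost:${port}/status"
  echo
done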

Other required components

The following table presents a couple of other required components:

Zookeeper: Apache Zookeeper is a highly reliable distributed coordination service.
Metadata Storage: MySQL and PostgreSQL are the popular RDBMSes used to keep track of all segments, supervisors, tasks, and configurations.
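
To confirm that Zookeeper itself is healthy before proceeding, the classic four-letter-word check is a quick option. A sketch, assuming Zookeeper listens on the default port 2181 on the local node (newer Zookeeper releases may require this command to be whitelisted):

# Send the "ruok" command to Zookeeper; a healthy server replies with "imok"
echo ruok | nc localhost 2181; echo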

Apache Druid installation

Apache Druid can be installed either in standalone mode or as part of a Hadoop cluster. In this section, we will see how to install Druid via Apache Ambari.

Add service

First, we invoke the Actions drop-down below the list of services in the Hadoop cluster.

The screen looks like this:

[Screenshot: Apache Druid installation]

Select Druid and Superset

In this setup, we will install both Druid and Superset at the same time. Superset is the visualization application that we will learn about in the next step.

The selection screen looks like this:

[Screenshot: Select Druid and Superset]

Click on Next when both the services are selected.

Service placement on servers

In this step, we will be given a choice to select the servers on which the application has to be installed. I have selected node 3 for this purpose. You can select any node you wish.

The screen looks something like this:

[Screenshot: Add Service wizard]

Click on Next when the changes are done.

Choose Slaves and Clients

Here, we are given a choice to select the nodes on which we need the Slaves and Clients for the installed components. I have left the options that are already selected for me:

[Screenshot: Slaves and Clients]

Service configurations

In this step, we need to select the databases, usernames, and passwords for the metadata store used by the Druid and Superset applications. Feel free to choose the default ones. I have given MySQL as the backend store for both of them.

The screen looks like this:

[Screenshot: Customize Services]

Once the changes look good, click on the Next button at the bottom of the screen.
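
If the metadata databases and users have not been created on the MySQL server yet, the following sketch sets them up. The database names, user names, and passwords here are placeholders only and must match whatever you entered on the configuration screen:

# Create metadata databases and users for Druid and Superset
# (placeholder names and passwords -- replace with the values used in the wizard)
mysql -u root -p -e "
CREATE DATABASE druid DEFAULT CHARACTER SET utf8;
GRANT ALL PRIVILEGES ON druid.* TO 'druid'@'%' IDENTIFIED BY 'druid_password';
CREATE DATABASE superset DEFAULT CHARACTER SET utf8;
GRANT ALL PRIVILEGES ON superset.* TO 'superset'@'%' IDENTIFIED BY 'superset_password';
FLUSH PRIVILEGES;"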

Service installation

In this step, the applications will be installed automatically and the status will be shown at the end of the plan.

Once the installation is complete, click on Next. The screen looks like this:

[Screenshot: Service installation]

Installation summary

Once everything is successfully completed, we are shown a summary of what has been done. Click on Complete when done:

[Screenshot: Add Service wizard]

Sample data ingestion into Druid

Once we have all the Druid-related applications running in our Hadoop cluster, we need to load a sample dataset so that we can run some analytics tasks.

Let’s see how to load sample data. Download the Druid archive from the internet:

[druid@node-3 ~]$ curl -O http://static.druid.io/artifacts/releases/druid-0.12.0-bin.tar.gz
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  222M  100  222M    0     0  1500k      0  0:02:32  0:02:32 --:--:--  594k

Extract the archive:

[druid@node-3 ~]$ tar -xzf druid-0.12.0-bin.tar.gz

Copy the sample Wikipedia data to Hadoop:

[druid@node-3 ~]$ cd druid-0.12.0
[druid@node-3 ~/druid-0.12.0]$ hadoop fs -mkdir /user/druid/quickstart
[druid@node-3 ~/druid-0.12.0]$ hadoop fs -put quickstart/wikiticker-2015-09-12-sampled.json.gz /user/druid/quickstart/
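
To confirm that the sample file is now in HDFS, list the target directory:

[druid@node-3 ~/druid-0.12.0]$ hadoop fs -ls /user/druid/quickstart/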

Submit the import request:

[druid@node-3 druid-0.12.0]$ curl -X 'POST' -H 'Content-Type:application/json' -d @quickstart/wikiticker-index.json localhost:8090/druid/indexer/v1/task;echo
{"task":"index_hadoop_wikiticker_2018-03-16T04:54:38.979Z"}

After this step, Druid will automatically import the data into the Druid cluster and the progress can be seen in the overlord console.

The interface is accessible via http://<overlord-ip>:8090/console.html. The screen looks like this:

[Screenshot: Running tasks in the overlord console]

Once the ingestion is complete, we will see the status of the job as SUCCESS.
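
The task status can also be queried from the command line. A minimal sketch, assuming the overlord runs locally on port 8090 and using the task ID returned by the earlier POST:

# Ask the overlord for the status of the indexing task submitted above
curl -s "http://localhost:8090/druid/indexer/v1/task/index_hadoop_wikiticker_2018-03-16T04:54:38.979Z/status"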

In case of FAILED imports, please make sure that the backend configured to store the metadata for the Druid cluster is up and running. Also, even though Druid generally works well with OpenJDK, I have faced a problem with a few classes not being available at runtime; to overcome this, I had to use Oracle Java version 1.8 to run all the Druid applications.
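
Two quick checks that help when troubleshooting such failures (a sketch; credentials and paths depend on your setup):

# Confirm that the metadata store (MySQL in this setup) is up and reachable
mysqladmin -u root -p status

# Confirm which JDK the Druid services will pick up on this node
java -version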

Now we are ready to start using Druid for our visualization tasks.

MySQL database with Apache Druid

We will use a MySQL database to store the data. Apache Druid allows us to read the data present in an RDBMS system such as MySQL.

Sample database

The employees database is a standard dataset that has a sample organization and their employee, salary, and department data. We will see how to set it up for our tasks.

This section assumes that the MySQL database is already configured and running.
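
A quick way to verify that assumption from the command line (credentials are placeholders):

# A successful reply confirms that the MySQL/MariaDB server is reachable
mysql -u root -e "SELECT VERSION();"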

Download the sample dataset

Download the sample dataset from GitHub with the following command on any server that has access to the MySQL database:

[user@master ~]$ sudo yum install git -y
[user@master ~]$ git clone https://github.com/datacharmer/test_db
Cloning into 'test_db'...
remote: Counting objects: 98, done.
remote: Total 98 (delta 0), reused 0 (delta 0), pack-reused 98
Unpacking objects: 100% (98/98), done.

Copy the data to MySQL

In this step, we will import the contents of the data in the files to the MySQL database:

[user@master test_db]$ mysql -u root < employees.sql
INFO
CREATING DATABASE STRUCTURE
INFO
storage engine: InnoDB
INFO
LOADING departments
INFO
LOADING employees
INFO
LOADING dept_emp
INFO
LOADING dept_manager
INFO
LOADING titles
INFO
LOADING salaries
data_load_time_diff
NULL

Verify integrity of the tables

This is an important step, just to make sure that all of the data we have imported is correctly stored in the database. The summary of the integrity check is shown as the verification happens:

[user@master test_db]$ mysql -u root -t < test_employees_sha.sql
+----------------------+
| INFO                 |
+----------------------+
| TESTING INSTALLATION |
+----------------------+
+--------------+------------------+------------------------------------------+
| table_name   | expected_records | expected_crc                             |
+--------------+------------------+------------------------------------------+
| employees    |           300024 | 4d4aa689914d8fd41db7e45c2168e7dcb9697359 |
| departments  |                9 | 4b315afa0e35ca6649df897b958345bcb3d2b764 |
| dept_manager |               24 | 9687a7d6f93ca8847388a42a6d8d93982a841c6c |
| dept_emp     |           331603 | d95ab9fe07df0865f592574b3b33b9c741d9fd1b |
| titles       |           443308 | d12d5f746b88f07e69b9e36675b6067abb01b60e |
| salaries     |          2844047 | b5a1785c27d75e33a4173aaa22ccf41ebd7d4a9f |
+--------------+------------------+------------------------------------------+
+--------------+---------------+------------------------------------------+
| table_name   | found_records | found_crc                                |
+--------------+---------------+------------------------------------------+
| employees    |        300024 | 4d4aa689914d8fd41db7e45c2168e7dcb9697359 |
| departments  |             9 | 4b315afa0e35ca6649df897b958345bcb3d2b764 |
| dept_manager |            24 | 9687a7d6f93ca8847388a42a6d8d93982a841c6c |
| dept_emp     |        331603 | d95ab9fe07df0865f592574b3b33b9c741d9fd1b |
| titles       |        443308 | d12d5f746b88f07e69b9e36675b6067abb01b60e |
| salaries     |       2844047 | b5a1785c27d75e33a4173aaa22ccf41ebd7d4a9f |
+--------------+---------------+------------------------------------------+
+--------------+---------------+-----------+
| table_name   | records_match | crc_match |
+--------------+---------------+-----------+
| employees    | OK            | ok        |
| departments  | OK            | ok        |
| dept_manager | OK            | ok        |
| dept_emp     | OK            | ok        |
| titles       | OK            | ok        |
| salaries     | OK            | ok        |
+--------------+---------------+-----------+
+------------------+
| computation_time |
+------------------+
| 00:00:11         |
+------------------+
+---------+--------+
| summary | result |
+---------+--------+
| CRC     | OK     |
| count   | OK     |
+---------+--------+

Now the data is correctly loaded in the MySQL database called employees.
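
As an extra sanity check, you can count rows in a couple of the tables directly; the figures should match the expected_records column shown above:

# Row counts should be 300024 and 2844047 respectively
mysql -u root employees -e "SELECT COUNT(*) FROM employees; SELECT COUNT(*) FROM salaries;"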

Single Denormalized Table

In data warehouses, it is standard practice to work with a single flattened (denormalized) table rather than many small related tables. Let's create one such table that contains details of employees, salaries, departments, and titles:

MariaDB [employees]> create table employee_norm as
    -> select e.emp_no, e.birth_date, CONCAT_WS(' ', e.first_name, e.last_name) full_name,
    ->   e.gender, e.hire_date, s.salary, s.from_date, s.to_date, d.dept_name, t.title
    -> from employees e, salaries s, departments d, dept_emp de, titles t
    -> where e.emp_no = t.emp_no and e.emp_no = s.emp_no and d.dept_no = de.dept_no
    ->   and e.emp_no = de.emp_no and s.to_date < de.to_date and s.to_date < t.to_date
    -> order by emp_no, s.from_date;
Query OK, 3721923 rows affected (1 min 7.14 sec)
Records: 3721923  Duplicates: 0  Warnings: 0

MariaDB [employees]> select * from employee_norm limit 1\G
*************************** 1. row ***************************
    emp_no: 10001
birth_date: 1953-09-02
 full_name: Georgi Facello
    gender: M
 hire_date: 1986-06-26
    salary: 60117
 from_date: 1986-06-26
   to_date: 1987-06-26
 dept_name: Development
     title: Senior Engineer
1 row in set (0.00 sec)

MariaDB [employees]>

Once we have this flattened data in place, we will see how to use it to generate rich visualizations.
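
As a quick preview of the kind of aggregates the visualization layer will compute, here is a hypothetical query that averages salaries per department and title directly against the new table:

# Average salary per department and title from the flattened table
mysql -u root employees -e "
SELECT dept_name, title, COUNT(*) AS num_rows, ROUND(AVG(salary)) AS avg_salary
FROM employee_norm
GROUP BY dept_name, title
ORDER BY avg_salary DESC
LIMIT 10;"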

To summarize, we walked through Apache Druid, a Hadoop-ecosystem application used to visualize data, and learned how to use it with RDBMSes such as MySQL. We also set up a sample database to help us understand the application better.

To know more about how to visualize data using Apache Superset and how to use it with data in RDBMSes such as MySQL, do check out the book Modern Big Data Processing with Hadoop.
