In this article, Dmitry Anoshin, the author of Learning Hunk, talks about Hadoop and Hunk: how to extract Hunk on a VM, set up a connection with Hadoop, and create dashboards.
We are living in the century of information technology. There are a lot of electronic devices around us that generate a lot of data. For example, you can surf the Internet, visit a couple of news portals, order new Airmax on a web store, write a couple of messages to a friend, and chat on Facebook. Every action produces data; if we multiply these actions by the number of people who have access to the Internet, or who just use a mobile phone, we get really big data. Of course, you may ask: how big is it? I suppose it now starts at terabytes, or even petabytes. The volume is not the only issue; we also struggle with the variety of data. As a result, it is not enough to analyze only structured data. We should dive deep into unstructured data, such as the machine data generated by various devices.
World-famous enterprises try to collect this extremely big data in order to monetize it and find business insights. Big data offers us new opportunities; for example, we can enrich customer data via social networks using the Facebook or Twitter APIs. We can build customer profiles and try to predict customer wishes in order to sell our products or improve the customer experience. This is easy to say, but difficult to do. However, organizations try to overcome these challenges and use big data stores, such as Hadoop.
Hadoop is a distributed file system and a framework for distributed computation. It is relatively easy to get data into Hadoop, and there are plenty of tools for ingesting data in different formats. However, it is extremely difficult to get value out of the data you put into Hadoop.
Let’s look at the path from data to value. First, we have to start with the collection of data. Then, we spend a lot of time preparing the data and making sure it is available for analysis, so that we can ask questions of it. The process looks as follows:
Unfortunately, the questions you asked may turn out to be the wrong ones, or the answers you got may be unclear, and you have to repeat the cycle all over again, perhaps transforming and reformatting your data. In other words, it is a long and challenging process.
What you actually want is to collect data, spend some time preparing it, and then be able to ask questions and get answers from the data repeatedly. That way, you can spend most of your time asking multiple questions and iterating on them to refine the answers you are looking for.
What if we could take Splunk and put it on top of all the data stored in Hadoop? That is exactly what the Splunk company did. The following figure shows how the new product got its name, Hunk:
Let’s discuss some of the solution goals the Hunk inventors had in mind when they were planning the product:
In order to start exploring the Hadoop data, we have to install Hunk on top of our Hadoop cluster. Hunk is easy to install and configure. Let’s learn how to deploy Hunk version 6.2.1 on top of an existing CDH cluster. It’s assumed that your VM is up and running.
To extract Hunk on the VM, perform the following steps:
[cloudera@quickstart ~]$ cd ~
[cloudera@quickstart ~]$ ls -la | grep hunk
-rw-r--r-- 1 root root 113913609 Mar 23 04:09 hunk-6.2.1-249325-Linux-x86_64.tgz
[cloudera@quickstart ~]$ cd /opt
[cloudera@quickstart opt]$ sudo tar xvzf /home/cloudera/hunk-6.2.1-249325-Linux-x86_64.tgz -C /opt
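If the extraction succeeded, the Hunk distribution now lives under /opt/hunk. A quick optional check (the exact contents may vary between builds, but you should at least see the bin and etc directories):
[cloudera@quickstart opt]$ ls /opt/hunk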
Perform the following steps to set up the Hunk environment variables and configuration files:
export SPLUNK_HOME=/opt/hunk
sudo cp /opt/hunk/etc/splunk-launch.conf.default /opt/hunk/etc/splunk-launch.conf
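Note that the export above lasts only for the current shell session. If you want SPLUNK_HOME to survive a re-login, you could append it to your shell profile; this is an optional convenience that assumes the /opt/hunk install location used above:
echo 'export SPLUNK_HOME=/opt/hunk' >> ~/.bashrc
source ~/.bashrc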
Perform the following steps to run Hunk:
sudo /opt/hunk/bin/splunk start --accept-license
Here is the sample output from the first run:
sudo /opt/hunk/bin/splunk start --accept-license
This appears to be your first time running this version of Splunk.
Copying '/opt/hunk/etc/openldap/ldap.conf.default' to '/opt/hunk/etc/openldap/ldap.conf'.
Generating RSA private key, 1024 bit long modulus
[Some output lines were omitted here to reduce the amount of log text]
Waiting for web server at http://127.0.0.1:8000 to be available…. Done
If you get stuck, we’re here to help.
Look for answers here: http://docs.splunk.com
The Splunk web interface is at http://vm-cluster-node1.localdomain:8000
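From now on, you can manage the instance with the standard control commands that ship with any Splunk or Hunk installation:
sudo /opt/hunk/bin/splunk status
sudo /opt/hunk/bin/splunk restart
sudo /opt/hunk/bin/splunk stop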
We need to accomplish two tasks: provide a technical connector to the underlying data storage and create a virtual index for the data on this storage.
Log in to http://quickstart.cloudera:8000. The system will ask you to change the default admin user password; I set it to admin:
Right now, we are ready to set up the integration between Hadoop and Hunk. First, we need to specify the way Hunk connects to the current Hadoop installation. We are using the most recent approach: YARN with MR2. Then, we have to point virtual indexes to the data stored in Hadoop. To do this, perform the following steps:
| Property name | Value |
| --- | --- |
| Name | hadoop-hunk-provider |
| Java home | /usr/java/jdk1.7.0_67-cloudera |
| Hadoop home | /usr/lib/hadoop |
| Hadoop version | Hadoop 2.x (YARN) |
| File system | hdfs://quickstart.cloudera:8020 |
| Resource Manager Address | quickstart.cloudera:8032 |
| Resource Scheduler Address | quickstart.cloudera:8030 |
| HDFS Working Directory | /user/hunk |
| Job Queue | default |
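Under the hood, Hunk persists these provider settings as vix.* properties in indexes.conf. The following is a minimal sketch of what the generated stanza might look like for the values above; the exact key set and file location (typically $SPLUNK_HOME/etc/apps/search/local/indexes.conf) may differ in your installation:
[provider:hadoop-hunk-provider]
vix.family = hadoop
vix.env.JAVA_HOME = /usr/java/jdk1.7.0_67-cloudera
vix.env.HADOOP_HOME = /usr/lib/hadoop
vix.fs.default.name = hdfs://quickstart.cloudera:8020
vix.yarn.resourcemanager.address = quickstart.cloudera:8032
vix.yarn.resourcemanager.scheduler.address = quickstart.cloudera:8030
vix.splunk.home.hdfs = /user/hunk
vix.mapreduce.job.queuename = default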
Hunk needs its HDFS working directory to exist; create it if necessary:
sudo -u hdfs hadoop fs -mkdir -p /user/hunk
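It is also worth making sure that the directory is owned by the user under which the Hunk jobs will run; I assume the cloudera user here, so adjust it to your setup:
sudo -u hdfs hadoop fs -chown cloudera /user/hunk
sudo -u hdfs hadoop fs -ls /user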
If you did everything correctly, you should see a screen similar to the following screenshot:
Let’s discuss briefly what we have done:
/usr/bin/hadoop jar "/opt/hunk/bin/jars/SplunkMR-s6.0-hy2.0.jar" "com.splunk.mr.SplunkMR"
Now it’s time to create a virtual index. As example data, we are going to add a dataset of Avro files to the virtual index.
A virtual index is metadata: it tells Hunk where the data is located and which provider should be used to read it.
| Property name | Value |
| --- | --- |
| Name | milano_cdr_aggregated_10_min_activity |
| Path to data in HDFS | /masterdata/stream/milano_cdr |
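As with the provider, the virtual index ends up as a stanza in indexes.conf. Here is a sketch of what it might look like; the trailing /... recursion wildcard on vix.input.1.path is an assumption, so check the file generated by your installation:
[milano_cdr_aggregated_10_min_activity]
vix.provider = hadoop-hunk-provider
vix.input.1.path = /masterdata/stream/milano_cdr/...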
Here is an example screen you should see after you create your first virtual index:
To access data through the virtual index, perform the following steps:
You should see how Hunk automatically extracts the timestamp from our CDR data:
Pay attention to the Time column and the field named time_interval in the Event column. The time_interval field keeps the time of the record, and Hunk should automatically use it as the time field:
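If you want to double-check the timestamping without the UI, a quick sanity search from the CLI puts the extracted _time next to the raw field. This is a hypothetical check rather than part of the original walkthrough, and it assumes the admin password set earlier:
/opt/hunk/bin/splunk search 'index=milano_cdr_aggregated_10_min_activity | head 5 | table _time, time_interval' -auth admin:admin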
Now it’s time to see how dashboards work. Let’s find the regions where visitors face problems (status = 500) while using our online store:
index="digital_analytics" status=500 | iplocation clientip | geostats latfield=lat longfield=lon count by Country
You should see the map and the proportion of errors for each country:
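The map itself renders only in the web UI, but the underlying numbers are ordinary search results, so the same query can also be run from the CLI when you need them in scripted form (again assuming the admin:admin credentials set earlier):
/opt/hunk/bin/splunk search 'index="digital_analytics" status=500 | iplocation clientip | geostats latfield=lat longfield=lon count by Country' -auth admin:admin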
Now let’s save it as a dashboard. Click on Save As and select Dashboard panel from the drop-down menu. Name it Web Operations.
You should get a new dashboard with a single panel containing our report. We also have several previously created reports; let’s add them to the newly created dashboard as separate panels:
In this article, you learned how to extract Hunk on the VM, set up the Hunk variables and configuration files, and run Hunk. You also saw how to set up a connection to Hadoop by configuring a data provider, how to create a virtual index for the CDR data stored in Hadoop, and how to build a dashboard.