In this article, Dmitry Anoshin, the author of Learning Hunk, talks about Hadoop and Hunk: how to extract Hunk on a VM, set up a connection with Hadoop, and create dashboards.
We are living in the century of information technology. There are a lot of electronic devices around us that generate a lot of data. For example, you can surf the Internet, visit a couple of news portals, order new Airmax on a web store, write a couple of messages to a friend, and chat on Facebook. Every action produces data; if we multiply these actions by the number of people who have access to the Internet, or who just use a mobile phone, we get really big data. Of course, you may ask: how big is it? I suppose it now starts at terabytes, or even petabytes. The volume is not the only issue; we also struggle with the variety of data. As a result, it is not enough to analyze only structured data. We should dive deep into unstructured data, such as the machine data generated by various devices.
World-famous enterprises try to collect this extremely big data in order to monetize it and find business insights. Big data offers us new opportunities; for example, we can enrich customer data via social networks using the Facebook or Twitter APIs. We can build customer profiles and try to predict customer wishes in order to sell our products or improve the customer experience. This is easy to say, but difficult to do. However, organizations try to overcome these challenges and use big data stores, such as Hadoop.
Hadoop is a distributed file system and a framework for distributed computation. It is relatively easy to get data into Hadoop, and there are plenty of tools for ingesting data in different formats. However, it is extremely difficult to get value out of the data you put into Hadoop.
Let’s look at the path from data to value. First, we have to start with the collection of data. Then, we spend a lot of time preparing the data and making sure it is available for analysis, so that we can ask questions of it. The process looks as follows:
Unfortunately, the questions you asked may turn out to be the wrong ones, or the answers you got may be unclear, and you have to repeat the cycle all over again, perhaps transforming and reformatting your data. In other words, it is a long and challenging process.
What you actually want is to collect data, spend some time preparing it, and then be able to ask questions and get answers from the data repeatedly. That way, you can spend most of your time asking multiple questions and iterating on them to refine the answers you are looking for.
What if we could take Splunk and put it on top of all the data stored in Hadoop? That is exactly what the Splunk company did. The following figure shows how the new product got its name, Hunk:
Let’s discuss some of the solution goals the Hunk inventors had in mind when they were planning the product:
In order to start exploring the Hadoop data, we have to install Hunk on top of our Hadoop cluster. Hunk is easy to install and configure. Let’s learn how to deploy Hunk version 6.2.1 on top of an existing CDH cluster. It’s assumed that your VM is up and running.
To extract Hunk on the VM, perform the following steps:
[cloudera@quickstart ~]$ cd ~
[cloudera@quickstart ~]$ ls -la | grep hunk
-rw-r--r-- 1 root root 113913609 Mar 23 04:09 hunk-6.2.1-249325-Linux-x86_64.tgz
[cloudera@quickstart ~]$ cd /opt
[cloudera@quickstart opt]$ sudo tar xvzf /home/cloudera/hunk-6.2.1-249325-Linux-x86_64.tgz -C /opt
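If the extraction succeeded, the Hunk distribution now lives under /opt/hunk. A quick optional check (the exact contents may vary between builds, but you should at least see the bin and etc directories):
[cloudera@quickstart opt]$ ls /opt/hunk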
Perform the following steps to set up the Hunk environment variables and configuration files:
export SPLUNK_HOME=/opt/hunk
sudo cp /opt/hunk/etc/splunk-launch.conf.default /opt/hunk/etc/splunk-launch.conf
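Note that the export above lasts only for the current shell session. If you want SPLUNK_HOME to survive a re-login, you could append it to your shell profile; this is an optional convenience that assumes the /opt/hunk install location used above:
echo 'export SPLUNK_HOME=/opt/hunk' >> ~/.bashrc
source ~/.bashrc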
Perform the following steps to run Hunk:
sudo /opt/hunk/bin/splunk start --accept-license
Here is the sample output from the first run:
sudo /opt/hunk/bin/splunk start --accept-license
This appears to be your first time running this version of Splunk.
Copying '/opt/hunk/etc/openldap/ldap.conf.default' to '/opt/hunk/etc/openldap/ldap.conf'.
Generating RSA private key, 1024 bit long modulus
[Some output lines were omitted here to reduce the amount of log text]
Waiting for web server at http://127.0.0.1:8000 to be available…. Done
If you get stuck, we’re here to help.
Look for answers here: http://docs.splunk.com
The Splunk web interface is at http://vm-cluster-node1.localdomain:8000
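From now on, you can manage the instance with the standard control commands that ship with any Splunk or Hunk installation:
sudo /opt/hunk/bin/splunk status
sudo /opt/hunk/bin/splunk restart
sudo /opt/hunk/bin/splunk stop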
We need to accomplish two tasks: provide a technical connector to the underlying data storage and create a virtual index for the data on this storage.
Log in to http://quickstart.cloudera:8000. The system will ask you to change the default admin user password; I set it to admin:
Right now, we are ready to set up the integration between Hadoop and Hunk. First, we need to specify the way Hunk connects to the current Hadoop installation. We are using the most recent approach: YARN with MR2. Then, we have to point virtual indexes to the data stored in Hadoop. To do this, perform the following steps:
| Property name | Value |
| --- | --- |
| Name | hadoop-hunk-provider |
| Java home | /usr/java/jdk1.7.0_67-cloudera |
| Hadoop home | /usr/lib/hadoop |
| Hadoop version | Hadoop 2.x (YARN) |
| File system | hdfs://quickstart.cloudera:8020 |
| Resource Manager Address | quickstart.cloudera:8032 |
| Resource Scheduler Address | quickstart.cloudera:8030 |
| HDFS Working Directory | /user/hunk |
| Job Queue | default |
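Under the hood, Hunk persists these provider settings as vix.* properties in indexes.conf. The following is a minimal sketch of what the generated stanza might look like for the values above; the exact key set and file location (typically $SPLUNK_HOME/etc/apps/search/local/indexes.conf) may differ in your installation:
[provider:hadoop-hunk-provider]
vix.family = hadoop
vix.env.JAVA_HOME = /usr/java/jdk1.7.0_67-cloudera
vix.env.HADOOP_HOME = /usr/lib/hadoop
vix.fs.default.name = hdfs://quickstart.cloudera:8020
vix.yarn.resourcemanager.address = quickstart.cloudera:8032
vix.yarn.resourcemanager.scheduler.address = quickstart.cloudera:8030
vix.splunk.home.hdfs = /user/hunk
vix.mapreduce.job.queuename = default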
Hunk needs its HDFS working directory to exist; create it if necessary:
sudo -u hdfs hadoop fs -mkdir -p /user/hunk
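It is also worth making sure that the directory is owned by the user under which the Hunk jobs will run; I assume the cloudera user here, so adjust it to your setup:
sudo -u hdfs hadoop fs -chown cloudera /user/hunk
sudo -u hdfs hadoop fs -ls /user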
If you did everything correctly, you should see a screen similar to the following screenshot:
Let’s discuss briefly what we have done:
/usr/bin/hadoop jar "/opt/hunk/bin/jars/SplunkMR-s6.0-hy2.0.jar" "com.splunk.mr.SplunkMR"
Now it’s time to create a virtual index. As example data, we are going to add a dataset of Avro files to the virtual index.
A virtual index is metadata: it tells Hunk where the data is located and which provider should be used to read it.
| Property name | Value |
| --- | --- |
| Name | milano_cdr_aggregated_10_min_activity |
| Path to data in HDFS | /masterdata/stream/milano_cdr |
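As with the provider, the virtual index ends up as a stanza in indexes.conf. Here is a sketch of what it might look like; the trailing /... recursion wildcard on vix.input.1.path is an assumption, so check the file generated by your installation:
[milano_cdr_aggregated_10_min_activity]
vix.provider = hadoop-hunk-provider
vix.input.1.path = /masterdata/stream/milano_cdr/...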
Here is an example screen you should see after you create your first virtual index:
To access data through the virtual index, perform the following steps:
You should see how Hunk automatically extracts the timestamp from our CDR data:
Pay attention to the Time column and the field named time_interval in the Event column. The time_interval field keeps the time of the record, and Hunk should automatically use it as the time field:
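If you want to double-check the timestamping without the UI, a quick sanity search from the CLI puts the extracted _time next to the raw field. This is a hypothetical check rather than part of the original walkthrough, and it assumes the admin password set earlier:
/opt/hunk/bin/splunk search 'index=milano_cdr_aggregated_10_min_activity | head 5 | table _time, time_interval' -auth admin:admin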
Now it’s time to see how dashboards work. Let’s find the regions where visitors face problems (status = 500) while using our online store:
index="digital_analytics" status=500 | iplocation clientip | geostats latfield=lat longfield=lon count by Country
You should see the map and the proportion of errors for each country:
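The map itself renders only in the web UI, but the underlying numbers are ordinary search results, so the same query can also be run from the CLI when you need them in scripted form (again assuming the admin:admin credentials set earlier):
/opt/hunk/bin/splunk search 'index="digital_analytics" status=500 | iplocation clientip | geostats latfield=lat longfield=lon count by Country' -auth admin:admin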
Now let’s save it as a dashboard. Click on Save As and select Dashboard panel from the drop-down menu. Name it Web Operations.
You should get a new dashboard with a single panel containing our report. We also have several previously created reports; let’s add them to the newly created dashboard as separate panels:
In this article, you learned how to extract Hunk on the VM, set up the Hunk variables and configuration files, and run Hunk. You also saw how to set up a connection to Hadoop by configuring a data provider, how to create a virtual index for the CDR data stored in Hadoop, and how to build a dashboard.