Often, we use Hadoop to calculate analytics, which are basic statistics about data. In such cases, we walk through the data using Hadoop and calculate interesting statistics about it. Common examples of such analytics are frequency distributions, histograms, and simple aggregates such as counts, sums, and averages.
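Before moving to the MapReduce version, the core of such a frequency analytic can be sketched in plain Java. This is a minimal in-memory illustration, not the real job; the input here is a hypothetical list of customer IDs, not the actual Amazon metadata format:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal sketch of a frequency analytic: count how many items each
// customer bought. This mirrors what the MapReduce job does at scale,
// but runs in memory on a toy input (hypothetical customer IDs).
public class FrequencySketch {

    // Count occurrences of each customer ID in a list of purchase records.
    static Map<String, Integer> countPurchases(List<String> customerIds) {
        Map<String, Integer> counts = new HashMap<>();
        for (String id : customerIds) {
            // merge() increments the existing count, or starts it at 1
            counts.merge(id, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts =
            countPurchases(List.of("A10", "A10", "A10", "B22", "B22", "C31"));
        System.out.println(counts.get("A10")); // 3 purchases for customer A10
    }
}
```

Hadoop lets us run the same kind of aggregation over datasets far too large for a single machine's memory, which is what the jobs below do.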
However, Hadoop will only generate numbers. Although the numbers contain all the information, we humans are very bad at figuring out overall trends by just looking at numbers. On the other hand, the human eye is remarkably good at detecting patterns, and plotting the data often yields us a deeper understanding of the data. Therefore, we often plot the results of Hadoop jobs using some plotting program.
>bin/hadoop dfs -mkdir /data/
>bin/hadoop dfs -mkdir /data/amazon-dataset
>bin/hadoop dfs -put <SAMPLE_DIR>/amazon-meta.txt /data/amazon-dataset/
>bin/hadoop dfs -ls /data/amazon-dataset
$ bin/hadoop jar hadoop-microbook.jar microbook.frequency.BuyingFrequencyAnalyzer /data/amazon-dataset /data/frequency-output1
$ bin/hadoop jar hadoop-microbook.jar microbook.frequency.SimpleResultSorter /data/frequency-output1 /data/frequency-output2
$ bin/hadoop dfs -get /data/frequency-output2/part-r-00000 1.data
$ gnuplot buyfreq.plot
As the figure depicts, a few buyers have bought a very large number of items. The distribution is much steeper than a normal distribution, and often follows what we call a power-law distribution. This is an example of how analytics and plotting the results can give us insight into underlying patterns in the dataset.
You can find the mapper and reducer code at src/microbook/frequency/BuyingFrequencyAnalyzer.java.
This figure shows the execution of two MapReduce jobs. Also, the following code listing shows the map function and the reduce function of the first job:
public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
    // Parse all buyer records contained in this line of the input
    List<BuyerRecord> records =
        BuyerRecord.parseAItemLine(value.toString());
    for (BuyerRecord record : records) {
        // Emit (customerID, number of items bought)
        context.write(new Text(record.customerID),
            new IntWritable(record.itemsBrought.size()));
    }
}
public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
    int sum = 0;
    // Sum up all the item counts emitted for this customer ID
    for (IntWritable val : values) {
        sum += val.get();
    }
    IntWritable result = new IntWritable();
    result.set(sum);
    context.write(key, result);
}
As shown in the figure, Hadoop reads the input file from the input folder using the custom formatter we introduced in the Writing a formatter (Intermediate) article. It invokes the mapper once for each record, passing the record as input.
The mapper extracts the customer ID and the number of items the customer has bought, and emits the customer ID as the key and the number of items as the value.
Hadoop then sorts the key-value pairs by key and invokes the reducer once for each key, passing all the values for that key as input. Each reducer invocation sums up the item counts for one customer ID and emits the customer ID as the key and the total count as the value.
The second job then sorts the results. It reads the output of the first job as its input and passes each line to the map function. The map function extracts the customer ID and the number of items from the line and emits the number of items as the key and the customer ID as the value. Hadoop sorts the key-value pairs by key, thus sorting them by the number of items, and invokes the reducer once per key in that order. The reducer therefore writes the records out in sorted order, essentially sorting the dataset.
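Outside Hadoop, the key/value swap and sort that the second job performs can be sketched in plain Java. This is a simplification under stated assumptions: the real job emits (count, customerID) pairs and lets Hadoop's shuffle phase do the sorting, while here we sort a small map in memory:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Sketch of what the second (sorting) job achieves: treat each
// (customerID, itemCount) pair as if the count were the key, then
// order the records by that key. In the real job, Hadoop's shuffle
// performs this sort; here we sort in memory for illustration.
public class ResultSorterSketch {

    // Return customer IDs ordered by their item counts, ascending.
    static List<String> sortByCount(Map<String, Integer> counts) {
        List<Map.Entry<String, Integer>> entries =
            new ArrayList<>(counts.entrySet());
        // Equivalent to emitting (count, customerID) and letting the
        // framework sort by key
        entries.sort((a, b) -> Integer.compare(a.getValue(), b.getValue()));
        List<String> sortedIds = new ArrayList<>();
        for (Map.Entry<String, Integer> e : entries) {
            sortedIds.add(e.getKey());
        }
        return sortedIds;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = Map.of("A10", 3, "B22", 2, "C31", 1);
        System.out.println(sortByCount(counts)); // [C31, B22, A10]
    }
}
```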
Now that we have generated the results, let us look at the plotting. You can find the source for the gnuplot file in buyfreq.plot. The source for the plot looks like the following:
set terminal png
set output "buyfreq.png"
set title "Frequency Distribution of Items Bought by Buyer"
set ylabel "Number of Items Bought"
set xlabel "Buyers Sorted by Item Count"
set key left top
set log y
set log x
plot "1.data" using 2 title "Frequency" with linespoints
Here, the first two lines define the output format. This example uses png, but gnuplot supports many other terminals such as screen, pdf, and eps. The next four lines set the title, the axis labels, and the legend position, and the following two lines set the scale of each axis; this plot uses a log scale on both.
The last line defines the plot. Here, it asks gnuplot to read the data from the 1.data file, to take the values from the second column of the file (via using 2), and to draw them with lines and points. Columns must be separated by whitespace.
If you want to plot one column against another, for example data from column 1 against column 2, you should write using 1:2 instead of using 2.
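For instance, a hypothetical variant of the plot command that takes the x values from column 1 and the y values from column 2 would look like this:

```gnuplot
plot "1.data" using 1:2 title "Frequency" with linespoints
```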
We can use a similar method to calculate most types of analytics and plot the results. Refer to the freely available article from Hadoop MapReduce Cookbook, Srinath Perera and Thilina Gunarathne, Packt Publishing, at http://www.packtpub.com/article/advanced-hadoop-mapreduce-administration for more information.
In this article, we have learned how to process Amazon data with MapReduce, generate data for a histogram, and plot it using gnuplot.