Often, we use Hadoop to calculate analytics, which are basic statistics about data. In such cases, we walk through the data using Hadoop and calculate interesting statistics about it. Common examples of such analytics are frequency distributions, histograms, and simple aggregates such as counts, sums, and averages.
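Before moving to the MapReduce version, the core of such a frequency analytic can be sketched in plain Java. This is a minimal in-memory illustration, not the real job; the input here is a hypothetical list of customer IDs, not the actual Amazon metadata format:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal sketch of a frequency analytic: count how many items each
// customer bought. This mirrors what the MapReduce job does at scale,
// but runs in memory on a toy input (hypothetical customer IDs).
public class FrequencySketch {

    // Count occurrences of each customer ID in a list of purchase records.
    static Map<String, Integer> countPurchases(List<String> customerIds) {
        Map<String, Integer> counts = new HashMap<>();
        for (String id : customerIds) {
            // merge() increments the existing count, or starts it at 1
            counts.merge(id, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts =
            countPurchases(List.of("A10", "A10", "A10", "B22", "B22", "C31"));
        System.out.println(counts.get("A10")); // 3 purchases for customer A10
    }
}
```

Hadoop lets us run the same kind of aggregation over datasets far too large for a single machine's memory, which is what the jobs below do.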
However, Hadoop will only generate numbers. Although the numbers contain all the information, we humans are very bad at figuring out overall trends by just looking at numbers. On the other hand, the human eye is remarkably good at detecting patterns, and plotting the data often yields us a deeper understanding of the data. Therefore, we often plot the results of Hadoop jobs using some plotting program.
>bin/hadoop dfs -mkdir /data/
>bin/hadoop dfs -mkdir /data/amazon-dataset
>bin/hadoop dfs -put <SAMPLE_DIR>/amazon-meta.txt /data/amazon-dataset/
>bin/hadoop dfs -ls /data/amazon-dataset
$ bin/hadoop jar hadoop-microbook.jar microbook.frequency.BuyingFrequencyAnalyzer /data/amazon-dataset /data/frequency-output1
$ bin/hadoop jar hadoop-microbook.jar microbook.frequency.SimpleResultSorter /data/frequency-output1 /data/frequency-output2
$ bin/hadoop dfs -get /data/frequency-output2/part-r-00000 1.data
$ gnuplot buyfreq.plot
As the figure depicts, a few buyers have bought a very large number of items. The distribution is much steeper than a normal distribution, and often follows what we call a power-law distribution. This is an example of how analytics and plotting the results can give us insight into underlying patterns in the dataset.
You can find the mapper and reducer code at src/microbook/frequency/BuyingFrequencyAnalyzer.java.
This figure shows the execution of two MapReduce jobs. Also, the following code listing shows the map function and the reduce function of the first job:
public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
    // Parse all buyer records contained in this line of the input
    List<BuyerRecord> records =
        BuyerRecord.parseAItemLine(value.toString());
    for (BuyerRecord record : records) {
        // Emit (customerID, number of items bought)
        context.write(new Text(record.customerID),
            new IntWritable(record.itemsBrought.size()));
    }
}
public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
    int sum = 0;
    // Sum up all the item counts emitted for this customer ID
    for (IntWritable val : values) {
        sum += val.get();
    }
    IntWritable result = new IntWritable();
    result.set(sum);
    context.write(key, result);
}
As shown in the figure, Hadoop reads the input file from the input folder using the custom formatter we introduced in the Writing a formatter (Intermediate) article. It invokes the mapper once for each record, passing the record as input.
The mapper extracts the customer ID and the number of items the customer has bought, and emits the customer ID as the key and the number of items as the value.
Hadoop then sorts the key-value pairs by key and invokes the reducer once for each key, passing all the values for that key as input. Each reducer invocation sums up the item counts for one customer ID and emits the customer ID as the key and the total count as the value.
The second job then sorts the results. It reads the output of the first job as its input and passes each line to the map function. The map function extracts the customer ID and the number of items from the line and emits the number of items as the key and the customer ID as the value. Hadoop sorts the key-value pairs by key, thus sorting them by the number of items, and invokes the reducer once per key in that order. The reducer therefore writes the records out in sorted order, essentially sorting the dataset.
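Outside Hadoop, the key/value swap and sort that the second job performs can be sketched in plain Java. This is a simplification under stated assumptions: the real job emits (count, customerID) pairs and lets Hadoop's shuffle phase do the sorting, while here we sort a small map in memory:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Sketch of what the second (sorting) job achieves: treat each
// (customerID, itemCount) pair as if the count were the key, then
// order the records by that key. In the real job, Hadoop's shuffle
// performs this sort; here we sort in memory for illustration.
public class ResultSorterSketch {

    // Return customer IDs ordered by their item counts, ascending.
    static List<String> sortByCount(Map<String, Integer> counts) {
        List<Map.Entry<String, Integer>> entries =
            new ArrayList<>(counts.entrySet());
        // Equivalent to emitting (count, customerID) and letting the
        // framework sort by key
        entries.sort((a, b) -> Integer.compare(a.getValue(), b.getValue()));
        List<String> sortedIds = new ArrayList<>();
        for (Map.Entry<String, Integer> e : entries) {
            sortedIds.add(e.getKey());
        }
        return sortedIds;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = Map.of("A10", 3, "B22", 2, "C31", 1);
        System.out.println(sortByCount(counts)); // [C31, B22, A10]
    }
}
```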
Now that we have generated the results, let us look at the plotting. You can find the source for the gnuplot file in buyfreq.plot. The source for the plot looks like the following:
set terminal png
set output "buyfreq.png"
set title "Frequency Distribution of Items Bought by Buyer"
set ylabel "Number of Items Bought"
set xlabel "Buyers Sorted by Item Count"
set key left top
set log y
set log x
plot "1.data" using 2 title "Frequency" with linespoints
Here, the first two lines define the output format. This example uses png, but gnuplot supports many other terminals such as screen, pdf, and eps. The next four lines set the title, the axis labels, and the legend position, and the following two lines set the scale of each axis; this plot uses a log scale on both.
The last line defines the plot. Here, it asks gnuplot to read the data from the 1.data file, to take the values from the second column of the file (via using 2), and to draw them with lines and points. Columns must be separated by whitespace.
If you want to plot one column against another, for example data from column 1 against column 2, you should write using 1:2 instead of using 2.
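For instance, a hypothetical variant of the plot command that takes the x values from column 1 and the y values from column 2 would look like this:

```gnuplot
plot "1.data" using 1:2 title "Frequency" with linespoints
```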
We can use a similar method to calculate most types of analytics and plot the results. Refer to the freely available article from Hadoop MapReduce Cookbook, Srinath Perera and Thilina Gunarathne, Packt Publishing, at http://www.packtpub.com/article/advanced-hadoop-mapreduce-administration for more information.
In this article, we have learned how to process Amazon data with MapReduce, generate data for a histogram, and plot it using gnuplot.