
[box type="note" align="" class="" width=""]The following article is an excerpt from the book Java Data Analysis, written by John R. Hubbard. The book will help you get the most out of popular Java libraries and tools to perform efficient data analysis.[/box]

In this article, we will explore Google’s MapReduce framework to analyze big data.

How do you quickly sort a list of a billion elements? Or multiply two matrices, each with a million rows and a million columns?

In implementing their PageRank algorithm, Google quickly discovered the need for a systematic framework for processing massive datasets. That could be done only by distributing the data and the processing over many storage units and processors. Implementing a single algorithm, such as PageRank, in that environment is difficult, and maintaining the implementation as the dataset grows is even more challenging.

The solution: MapReduce framework

The answer lay in separating the software into two levels: a framework that manages the big data access and parallel processing at a lower level, and a couple of user-written methods at an upper level. The independent user who writes the two methods need not be concerned with the details of the big data management at the lower level.

How does it function?

Specifically, the data flows through a sequence of stages:

  1. The input stage divides the input into chunks, usually 64 MB or 128 MB.
  2. The mapping stage applies a user-defined map() function that generates, from each input key-value pair, a collection of key-value pairs of a different type.
  3. The partition/grouping stage applies hash sharding to those keys to group them.
  4. The reduction stage applies a user-defined reduce() function that runs some specific algorithm on the data in the value of each key-value pair.
  5. The output stage writes the output from the reduce() method.

The user’s choice of map() and reduce() methods determines the outcome of the entire process; hence the name MapReduce.
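
To make that division of labor concrete, here is a minimal sketch of the shape those two user-supplied methods might take in plain Java. The Emitter and MapReduceJob interfaces and their names are hypothetical stand-ins for whatever a real framework provides; they are not taken from the book.

    import java.util.List;

    // Hypothetical callback the framework passes to map() and reduce()
    // so they can emit key-value pairs without knowing where they go.
    interface Emitter<K, V> {
        void emit(K key, V value);
    }

    // The two plug-in points the user writes; everything else
    // (splitting, shuffling, grouping, output) belongs to the framework.
    interface MapReduceJob<K1, V1, K2, V2, V3> {
        // Called once per input key-value pair; may emit many pairs.
        void map(K1 key, V1 value, Emitter<K2, V2> out);

        // Called once per distinct intermediate key, with all its values grouped.
        void reduce(K2 key, List<V2> values, Emitter<K2, V3> out);
    }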

This idea is a variation on the old algorithmic paradigm called divide and conquer. Think of the prototypical mergesort, where an array is sorted by repeatedly dividing it into two halves until the pieces have only one element each, and then systematically merging the pieces back together pairwise.
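
As a reminder of that paradigm, here is a tiny recursive mergesort sketch in plain Java. It is included only to illustrate the divide-and-merge structure; it is not part of the MapReduce framework or the book's code.

    import java.util.Arrays;

    public class MergeSort {
        // Sort by splitting in half, sorting each half, then merging.
        static int[] sort(int[] a) {
            if (a.length <= 1) return a;                      // divide until trivial
            int mid = a.length / 2;
            int[] left  = sort(Arrays.copyOfRange(a, 0, mid));
            int[] right = sort(Arrays.copyOfRange(a, mid, a.length));
            return merge(left, right);                        // conquer by merging
        }

        // Merge two sorted arrays into one sorted array.
        static int[] merge(int[] x, int[] y) {
            int[] out = new int[x.length + y.length];
            int i = 0, j = 0, k = 0;
            while (i < x.length && j < y.length)
                out[k++] = (x[i] <= y[j]) ? x[i++] : y[j++];
            while (i < x.length) out[k++] = x[i++];
            while (j < y.length) out[k++] = y[j++];
            return out;
        }
    }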

MapReduce is actually a meta-algorithm: a framework within which specific algorithms can be implemented through its map() and reduce() methods. Extremely powerful, it has been used to sort a petabyte of data in only a few hours. Recall that a petabyte is 1000⁵ = 10¹⁵ bytes, which is a thousand terabytes or a million gigabytes.

Some examples of MapReduce applications

Here are a few examples of big data problems that can be solved with the MapReduce framework:

  1. Given a repository of text files, find the frequency of each word. This is called the WordCount problem.
  2. Given a repository of text files, find the number of words of each word length (see the sketch after this list).
  3. Given two matrices in a sparse matrix format, compute their product.
  4. Factor a matrix given in sparse matrix format.
  5. Given a symmetric graph whose nodes represent people and edges represent friendship, compile a list of common friends.
  6. Given a symmetric graph whose nodes represent people and edges represent friendship, compute the average number of friends by age.
  7. Given a repository of weather records, find the global minima and maxima by year.
  8. Sort a large list. Note that in most implementations of the MapReduce framework, this problem is trivial, because the framework automatically sorts the output from the map() function.
  9. Reverse a graph.
  10. Find a minimal spanning tree (MST) of a given weighted graph.
  11. Join two large relational database tables.
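
To see how the same two plug-in points cover quite different problems, here is a hedged sketch of map() and reduce() for example 2 (counting words of each length), written against the hypothetical MapReduceJob and Emitter interfaces sketched earlier; a real framework's signatures would differ.

    import java.util.List;

    // Example 2 as a plug-in: the key emitted by map() is a word length,
    // the value is a count of 1; reduce() sums the counts for each length.
    class WordLengthJob implements MapReduceJob<String, String, Integer, Integer, Integer> {

        @Override
        public void map(String fileName, String line, Emitter<Integer, Integer> out) {
            for (String word : line.split("\\W+")) {
                if (!word.isEmpty()) {
                    out.emit(word.length(), 1);   // one pair per word
                }
            }
        }

        @Override
        public void reduce(Integer length, List<Integer> counts, Emitter<Integer, Integer> out) {
            int total = 0;
            for (int c : counts) total += c;      // number of words of this length
            out.emit(length, total);
        }
    }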

The WordCount example

In this section, we present the MapReduce solution to the WordCount problem, sometimes called the Hello World example for MapReduce.

The diagram in the figure below shows the data flow for the WordCount program. On the left are two of the 80 files that are read into the program:

During the mapping stage, each word, followed by the number 1, is copied into a temporary file, one pair per line. Notice that many words are duplicated many times. For example, image appears five times among the 80 files (including both files shown), so the string image 1 will appear five times in the temporary file. Each of the input files has about 110 words, so over 8,000 word-number pairs will be written to the temporary file.
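
A plain-Java sketch of that mapping step might look like the following. The directory and file names (input, temp.txt) and the use of a PrintWriter are assumptions for illustration only, not the book's actual code.

    import java.io.IOException;
    import java.io.PrintWriter;
    import java.nio.file.*;

    public class WordCountMapStage {
        // Read each input file and write one "word 1" pair per line to a temporary file.
        public static void main(String[] args) throws IOException {
            try (PrintWriter temp = new PrintWriter("temp.txt");
                 DirectoryStream<Path> files = Files.newDirectoryStream(Paths.get("input"))) {
                for (Path file : files) {
                    for (String line : Files.readAllLines(file)) {
                        for (String word : line.split("\\W+")) {
                            if (!word.isEmpty()) {
                                temp.println(word + " 1");   // one key-value pair per line
                            }
                        }
                    }
                }
            }
        }
    }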

Note that this figure shows only a very small part of the data involved. The output from the mapping stage includes every word that is input, as many times as it appears. And the output from the grouping stage includes every one of those words, but without duplication.

The grouping process reads all the words from the temporary file into a key-value hash table, where the key is the word, and the value is a string of 1s, one for each occurrence of that word in the temporary file. Notice that the 1s written to the temporary file are not actually used; they are included simply because the MapReduce framework in general expects the map() function to generate key-value pairs. The reducing stage transcribes the contents of the hash table to an output file, replacing each string of 1s with the count of them. For example, the key-value pair ("book", "1 1 1 1") is written as book 4 in the output file.
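
Continuing the same illustrative sketch, the grouping and reducing stages could be written along these lines. Again, this is a simplified single-machine version with assumed file names, not the book's exact classes.

    import java.io.IOException;
    import java.io.PrintWriter;
    import java.nio.file.*;
    import java.util.*;

    public class WordCountReduceStage {
        // Group the "word 1" lines by word, then write one "word count" line per word.
        public static void main(String[] args) throws IOException {
            Map<String, StringBuilder> groups = new HashMap<>();
            for (String line : Files.readAllLines(Paths.get("temp.txt"))) {
                String word = line.split(" ")[0];
                groups.computeIfAbsent(word, w -> new StringBuilder()).append("1 ");
            }
            try (PrintWriter out = new PrintWriter("counts.txt")) {
                for (Map.Entry<String, StringBuilder> e : groups.entrySet()) {
                    // Count the 1s in the value string, e.g. "1 1 1 1" becomes 4.
                    int count = e.getValue().toString().trim().split(" ").length;
                    out.println(e.getKey() + " " + count);   // e.g., book 4
                }
            }
        }
    }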

Keep in mind that this is a toy example of the MapReduce process. The input consists of 80 text files containing about 9073 words, so the temporary file has 9073 lines, with one word-number pair per line. Only 2149 of those words are distinct, so the hash table has 2149 entries and the output file has 2149 lines, with one word and its count per line.

The main idea

So, this is the main idea of the MapReduce meta-algorithm: provide a framework for processing massive datasets, a framework that allows the independent programmer to plug in specialized map() and reduce() methods that actually implement the required particular algorithm. If that particular algorithm is to count words, then write the map() method to extract each individual word from a specified file and write the key-value pair (word, 1) to wherever the specified writer will put them, and write the reduce() method to take a key-value pair such as (word, 1 1 1 1) and return the corresponding key-value pair as (word, 4) to wherever its specified writer will put it. These two methods are completely localized: they simply operate on key-value pairs. And they are completely independent of the size of the dataset.

The diagram below illustrates the general flow of data through an application of the MapReduce framework:

The original dataset could be in various forms and locations: a few files in a local directory, a large collection of files distributed over several nodes on the same cluster, a database on a database system (relational or NoSQL), or data sources available on the World Wide Web. The MapReduce controller then carries out these five tasks:

  1. Split the data into smaller datasets, each of which can be easily accessed on a single machine.
  2. Simultaneously (that is, in parallel), run a copy of the user-supplied map() method, one on each dataset, producing a set of key-value pairs in a temporary file on that local machine.
  3. Redistribute the datasets among the machines, so that all instances of each key are in the same dataset. This is typically done by hashing the keys (see the sketch after this list).
  4. Simultaneously (in parallel), run a copy of the user-supplied reduce() method, one on each of the temporary files, producing one output file on each machine.
  5. Combine the output files into a single result. If the reduce() method also sorts its output, then this last step could also include merging those outputs.
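
Task 3 usually amounts to a simple hash computation on each key. A minimal sketch of that kind of hash sharding is shown below; the method name and the use of the string's hashCode() are illustrative assumptions, and such a method would live inside the framework's partitioner, not in user code.

    // Assign each intermediate key to one of numReducers machines.
    // Masking with Integer.MAX_VALUE keeps the index non-negative
    // even when hashCode() returns a negative value.
    static int partitionFor(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }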

The genius of the MapReduce framework is that it separates the data management (moving, partitioning, grouping, sorting, and so on) from the data crunching (counting, averaging, maximizing, and so on). The former is done with no attention required by the user. The latter is done in parallel, separately in each node, by invoking the two user-supplied methods map() and reduce(). Essentially, the only obligation of the user is to devise the correct implementations of these two methods that will solve the given problem.

As we mentioned earlier, these examples are presented mainly to elucidate how the MapReduce algorithm works. Real-world implementations, however, would use a framework such as Hadoop or MongoDB.
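
For instance, with Apache Hadoop the user still writes only a mapper and a reducer and then configures a job roughly as follows. This is a standard Hadoop-style sketch, not code from the book; WordCountMapper and WordCountReducer stand in for user-written classes that are not shown here.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCountDriver.class);
            job.setMapperClass(WordCountMapper.class);      // user-supplied map()
            job.setReducerClass(WordCountReducer.class);    // user-supplied reduce()
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory
            FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }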

If you enjoyed this excerpt, check out the book Java Data Analysis to get an understanding of the various data analysis techniques, and how to implement them using Java.
