How to Process Big Data with MapReduce

5 min read

What’s happening in Data?

This week, Microsoft, AWS, Google, and Meta released the following announcements and updates.

Microsoft Dataverse 

  • Govern low-code assets with Managed Environments for Microsoft Power Platform – Managed Environments offer a set of enhanced out-of-the-box governance capabilities that will simplify, automate, and streamline IT administration of Microsoft Power Platform at scale. 

  • Enable data self-sufficiency with datamarts in Microsoft Power BI – A new Power BI Premium self-service feature that allows users to analyze relational databases and discover actionable data insights. Datamarts in Power BI accelerate time to insight while alleviating demands on IT. 

  • Azure Cosmos DB API for MongoDB in the cloud – now easier than ever – The API for MongoDB allows MongoDB developers to treat Azure Cosmos DB as if it were a MongoDB database. This makes it easier for developers to leverage their MongoDB skills while gaining the many benefits of Azure Cosmos DB, such as instantaneous scalability, automatic transparent sharding, and five-nines (99.999%) availability.


AWS Big Data 

Google Cloud 

Meta research 

Weekly Picks 

We’ve selected some interesting articles from the world of data for you. 

  • Build Complex Time Series Regression Pipelines with sktime – Using sktime, we can convert a time series forecasting problem into a regression problem. You will also learn how to build a complex time series forecaster with the popular XGBoost library.
  • Computational Linear Algebra for Coders – This post focuses on one question: how do we do matrix computations with acceptable speed and acceptable accuracy? It is taught in Python with Jupyter Notebooks, using libraries such as scikit-learn and NumPy for most lessons, as well as Numba and PyTorch.

Tutorial of the Week 

How to Process Big Data with MapReduce 

Post credit: Sridhar Alla 

This section will focus on the practical use case of building an end-to-end pipeline to perform big data analytics. 

The MapReduce framework 

The MapReduce framework enables you to write distributed applications that process large amounts of data from a filesystem, such as the Hadoop Distributed File System (HDFS), in a reliable and fault-tolerant manner.

Consider the classic example of using a MapReduce job to count word frequencies: each mapper reads a split of the input and emits a (word, 1) pair for every word it encounters; the framework then shuffles and sorts these intermediate pairs so that all pairs for a given word reach the same reducer, which sums them to produce the final count.
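To make that data flow concrete, here is a minimal, Hadoop-free sketch of the word-count pipeline in plain Java. The class and method names are illustrative, not part of the Hadoop API; the shuffle and reduce steps are collapsed into a single grouping pass for brevity.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// A plain-Java simulation of the word-count flow: map emits (word, 1)
// pairs, then shuffle groups pairs by key and reduce sums each group.
public class WordCountFlow {

    // Map phase: emit a (word, 1) pair for every token in the input line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String token : line.toLowerCase().split("\\s+")) {
            if (!token.isEmpty()) {
                pairs.add(Map.entry(token, 1));
            }
        }
        return pairs;
    }

    // Shuffle + reduce: group intermediate pairs by key and sum the values.
    static Map<String, Integer> shuffleAndReduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            counts.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
        for (String line : new String[] {"big data big insights", "big wins"}) {
            intermediate.addAll(map(line));
        }
        // Prints {big=3, data=1, insights=1, wins=1}
        System.out.println(shuffleAndReduce(intermediate));
    }
}
```

In a real cluster, the map calls run in parallel on many nodes and the grouping happens during the shuffle over the network; the logic per record, however, is exactly this simple.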

MapReduce uses YARN (Yet Another Resource Negotiator) as its resource manager. The YARN ResourceManager allocates cluster resources among applications, while a NodeManager on each worker node launches and monitors the containers in which the map and reduce tasks run.
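In practice, a Hadoop cluster is told to run MapReduce jobs on YARN through mapred-site.xml; a typical minimal configuration looks like this:

```xml
<!-- mapred-site.xml: run MapReduce jobs on the YARN resource manager -->
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
```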

Each map task in Hadoop is broken into the following phases: record reader, mapper, combiner, and partitioner. The output of the map tasks, called the intermediate keys and values, is sent to the reducers. The reduce tasks are broken into the following phases: shuffle, sort, reducer, and output format. The map tasks are ideally scheduled on the nodes where their input data resides; this way, the data typically does not have to move over the network and can be processed on the local machine.
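The partitioner phase decides which reduce task receives each intermediate key. Hadoop's default HashPartitioner computes (hash & Integer.MAX_VALUE) % numReduceTasks, so every value for a given key lands on the same reducer. Here is a small standalone illustration of that formula (the class name and sample keys are invented for the demo):

```java
// Demonstrates the partitioning formula used by Hadoop's default
// HashPartitioner: the same key always maps to the same reduce task.
public class PartitionDemo {

    // Same formula as HashPartitioner.getPartition(): mask off the sign
    // bit so the result is a non-negative reducer index.
    static int partition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int reducers = 3;
        for (String key : new String[] {"big", "data", "big", "insights"}) {
            System.out.println(key + " -> reducer " + partition(key, reducers));
        }
        // Both occurrences of "big" print the same reducer index.
    }
}
```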

MapReduce job types 

MapReduce jobs can be written in multiple ways, depending on the desired outcome. The fundamental structure of a MapReduce job is as follows:

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class EnglishWordCounter {

    public static class WordMapper
            extends Mapper<Object, Text, Text, IntWritable> { … }

    public static class CountReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> { … }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "English Word Counter");
        job.setJarByClass(EnglishWordCounter.class);
        job.setMapperClass(WordMapper.class);
        job.setReducerClass(CountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The purpose of the driver is to orchestrate the job. The main method creates the job configuration and takes the input and output paths from the command-line arguments. It then sets up the job object by telling it which classes to use for the computation and which input and output paths to use.
This how-to was curated from the book Big Data Analytics with Hadoop 3. Explore big data in more depth by clicking the button below!

 

Read the Book