

Tuning Hadoop configurations for cluster deployments

Getting ready

If the Hadoop cluster is already running, shut it down by executing the bin/stop-dfs.sh and bin/stop-mapred.sh commands from HADOOP_HOME.

How to do it…

We can control Hadoop configurations through the following three configuration files:

  • conf/core-site.xml: This contains the configurations common to the whole Hadoop distribution

  • conf/hdfs-site.xml: This contains configurations for HDFS

  • conf/mapred-site.xml: This contains configurations for MapReduce

Each configuration file has name-value pairs expressed in an XML format, and they define the workings of different aspects of Hadoop. The following code snippet shows an example of a property in the configuration file. Here, the <configuration> tag is the top-level XML container, and the <property> tags that define individual properties go as child elements of the <configuration> tag.

<configuration>
  <property>
    <name>mapred.reduce.parallel.copies</name>
    <value>20</value>
  </property>
  ...
</configuration>

The following instructions show how to change the directory to which we write Hadoop logs and configure the maximum number of map and reduce tasks:

  1. Create a directory to store the logfiles. For example, /root/hadoop_logs.

  2. Uncomment the line that includes HADOOP_LOG_DIR in HADOOP_HOME/conf/hadoop-env.sh and point it to the new directory.

  3. Add the following lines to the HADOOP_HOME/conf/mapred-site.xml file:

    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>2</value>
    </property>
    <property>
      <name>mapred.tasktracker.reduce.tasks.maximum</name>
      <value>2</value>
    </property>

  4. Restart the Hadoop cluster by running the bin/stop-mapred.sh and bin/start-mapred.sh commands from the HADOOP_HOME directory.

  5. You can verify the number of processes created using OS process-monitoring tools. If you are on Linux, run the watch "ps -ef | grep hadoop" command. If you are on Windows or macOS, use the Task Manager or Activity Monitor, respectively.

How it works…

HADOOP_LOG_DIR redefines the location to which Hadoop writes its logs. The mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum properties define the maximum number of map and reduce tasks that can run within a single TaskTracker at a given moment.

These and other server-side parameters are defined in the HADOOP_HOME/conf/*-site.xml files. Hadoop reloads these configurations after a restart.
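
If you want to double-check which values the daemons will actually pick up, a minimal sketch such as the following can be compiled against the Hadoop JARs and run with HADOOP_HOME/conf on the classpath. This is not part of the recipe; the class name ShowTaskSlots is ours, and it simply prints the effective values of the two properties we just set:

import org.apache.hadoop.conf.Configuration;

public class ShowTaskSlots {
  public static void main(String[] args) {
    // Configuration loads core-default.xml and core-site.xml from the classpath;
    // mapred-site.xml has to be added explicitly when running outside a job.
    Configuration conf = new Configuration();
    conf.addResource("mapred-site.xml");
    System.out.println("mapred.tasktracker.map.tasks.maximum    = "
        + conf.get("mapred.tasktracker.map.tasks.maximum"));
    System.out.println("mapred.tasktracker.reduce.tasks.maximum = "
        + conf.get("mapred.tasktracker.reduce.tasks.maximum"));
  }
}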

There’s more…

There are many similar configuration properties defined in Hadoop. You can see some of them in the following tables.

The configuration properties for conf/core-site.xml are listed in the following table:

Name                  Default value   Description
fs.inmemory.size.mb   100             This is the amount of memory, in MB, allocated to the in-memory filesystem used to merge map outputs at the reducers.
io.sort.factor        100             This is the maximum number of streams merged while sorting files.
io.file.buffer.size   131072          This is the size of the read/write buffer used by sequence files.

The configuration properties for conf/mapred-site.xml are listed in the following table:

Name                            Default value   Description
mapred.reduce.parallel.copies   5               This is the maximum number of parallel copies the reduce step will execute to fetch outputs from the map tasks.
mapred.map.child.java.opts      -Xmx200M        This is for passing Java options into the map task JVM.
mapred.reduce.child.java.opts   -Xmx200M        This is for passing Java options into the reduce task JVM.
io.sort.mb                      200             This is the memory limit, in MB, used while sorting data.
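
Most of these MapReduce properties can also be overridden per job rather than cluster-wide. The following is a hedged sketch using the old-API JobConf; the class name PerJobTuning and the chosen values are illustrative, not recommendations:

import org.apache.hadoop.mapred.JobConf;

public class PerJobTuning {
  public static JobConf tunedJob() {
    JobConf job = new JobConf(PerJobTuning.class);
    // More parallel copies in the shuffle phase (default is 5).
    job.setInt("mapred.reduce.parallel.copies", 20);
    // Larger heaps for the child JVMs (default is -Xmx200M).
    job.set("mapred.map.child.java.opts", "-Xmx512M");
    job.set("mapred.reduce.child.java.opts", "-Xmx512M");
    // Larger sort buffer, in MB.
    job.setInt("io.sort.mb", 200);
    return job;
  }
}

Settings made this way apply only to the submitted job and take precedence over the values in mapred-site.xml, unless a property has been marked final on the cluster.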

The configuration properties for conf/hdfs-site.xml are listed in the following table:

Name                         Default value   Description
dfs.block.size               67108864        This is the HDFS block size in bytes (64 MB).
dfs.namenode.handler.count   40              This is the number of server threads that handle RPC calls in the NameNode.
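
dfs.block.size sets the cluster-wide default, but the block size can also be chosen per file when writing through the Java API. The following is a small sketch, assuming a running HDFS; the path /data/big-file is a hypothetical example:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CustomBlockSize {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    long blockSize = 128L * 1024 * 1024; // 128 MB instead of the 64 MB default
    FSDataOutputStream out = fs.create(
        new Path("/data/big-file"),                  // hypothetical output path
        true,                                        // overwrite if the file exists
        conf.getInt("io.file.buffer.size", 4096),    // write buffer size
        (short) 3,                                   // replication factor
        blockSize);
    out.writeUTF("hello HDFS");
    out.close();
  }
}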

Running benchmarks to verify the Hadoop installation

The Hadoop distribution comes with several benchmarks. We can use them to verify our Hadoop installation and measure Hadoop’s performance. This recipe introduces these benchmarks and explains how to run them.

Getting ready

Start the Hadoop cluster. You can run these benchmarks either on a cluster setup or on a pseudo-distributed setup.

How to do it…

Let us run the sort benchmark. The sort benchmark consists of two jobs. First, we generate some random data using the randomwriter Hadoop job, and then we sort it using the sort sample.

  1. Change the directory to HADOOP_HOME.

  2. Run the randomwriter Hadoop job using the following command:

    >bin/hadoop jar hadoop-examples-1.0.0.jar randomwriter
    -Dtest.randomwrite.bytes_per_map=100
    -Dtest.randomwriter.maps_per_host=10 /data/unsorted-data

    Here the two parameters, test.randomwrite.bytes_per_map and test.randomwriter.maps_per_host, specify the amount of data generated by each map and the number of maps per host, respectively.

  3. Run the sort program:

    >bin/hadoop jar hadoop-examples-1.0.0.jar sort /data/unsorted-data
    /data/sorted-data

  4. Verify the final results by running the following command:

    >bin/hadoop jar hadoop-test-1.0.0.jar testmapredsort -sortInput
    /data/unsorted-data -sortOutput /data/sorted-data

Finally, when everything is successful, the following message will be displayed:

The job took 66 seconds.
SUCCESS! Validated the MapReduce framework's 'sort' successfully.

How it works…

First, the randomwriter application runs a Hadoop job to generate random data that is then used by the sort program. Then, we verify the results through the testmapredsort job. If your computer has more capacity, you may run the initial randomwriter step with increased output sizes.

There’s more…

Hadoop includes several other benchmarks.

  • TestDFSIO: This tests the input/output (I/O) performance of HDFS

  • nnbench: This checks the NameNode hardware

  • mrbench: This runs many small jobs

  • TeraSort: This sorts one terabyte of data

More information about these benchmarks can be found at http://www.michael-noll.com/blog/2011/04/09/benchmarking-and-stress-testing-an-hadoop-cluster-with-terasort-testdfsio-nnbench-mrbench/.

Reusing Java VMs to improve performance

In its default configuration, Hadoop starts a new JVM for each map or reduce task. However, running multiple tasks from the same JVM can sometimes significantly speed up the execution. This recipe explains how to control this behavior.

How to do it…

  1. Run the WordCount sample by passing the following option as an argument:

    >bin/hadoop jar hadoop-examples-1.0.0.jar wordcount
    -Dmapred.job.reuse.jvm.num.tasks=-1 /data/input1 /data/output1

  2. Monitor the number of processes created by Hadoop (through the ps -ef | grep hadoop command on Unix or the Task Manager on Windows). Hadoop starts only a single JVM per task slot and then reuses it for an unlimited number of tasks in the job.

    However, passing arguments through the -D option only works if the job implements the org.apache.hadoop.util.Tool interface. Otherwise, you should set the option through the JobConf.setNumTasksToExecutePerJvm(-1) method, as shown in the sketch that follows.
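
The following is a minimal sketch of such a Tool-based driver; the class name WordCountDriver and the argument handling are illustrative, and the mapper and reducer classes are omitted. ToolRunner parses the generic -D options into the configuration returned by getConf(), and the same JVM-reuse setting can also be made programmatically:

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class WordCountDriver extends Configured implements Tool {
  @Override
  public int run(String[] args) throws Exception {
    // getConf() already contains any -D options parsed by ToolRunner.
    JobConf job = new JobConf(getConf(), WordCountDriver.class);
    job.setJobName("wordcount-jvm-reuse");
    // Equivalent to -Dmapred.job.reuse.jvm.num.tasks=-1: reuse each JVM without limit.
    job.setNumTasksToExecutePerJvm(-1);
    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    // setMapperClass(...), setReducerClass(...), and output types go here.
    JobClient.runJob(job);
    return 0;
  }

  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new WordCountDriver(), args));
  }
}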

How it works…

By setting the mapred.job.reuse.jvm.num.tasks job configuration property, we can control how many tasks a JVM started by Hadoop will run before it is replaced. When the value is set to -1, Hadoop reuses the same JVM for an unlimited number of tasks of a job.
