

Tuning Hadoop configurations for cluster deployments

Getting ready

If the Hadoop cluster is already running, shut it down by executing the bin/stop-dfs.sh and bin/stop-mapred.sh commands from HADOOP_HOME.

How to do it…

We can control Hadoop configurations through the following three configuration files:

  • conf/core-site.xml: This contains the configurations common to the whole Hadoop distribution

  • conf/hdfs-site.xml: This contains configurations for HDFS

  • conf/mapred-site.xml: This contains configurations for MapReduce

Each configuration file has name-value pairs expressed in an XML format, and they define the workings of different aspects of Hadoop. The following code snippet shows an example of a property in the configuration file. Here, the <configuration> tag is the top-level XML container, and the <property> tags that define individual properties go as child elements of the <configuration> tag.

<configuration>
  <property>
    <name>mapred.reduce.parallel.copies</name>
    <value>20</value>
  </property>
  ...
</configuration>

The following instructions show how to change the directory to which we write Hadoop logs and configure the maximum number of map and reduce tasks:

  1. Create a directory to store the logfiles. For example, /root/hadoop_logs.

  2. Uncomment the line that includes HADOOP_LOG_DIR in HADOOP_HOME/conf/hadoop-env.sh and point it to the new directory.

  3. Add the following lines to the HADOOP_HOME/conf/mapred-site.xml file:

    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>2</value>
    </property>
    <property>
      <name>mapred.tasktracker.reduce.tasks.maximum</name>
      <value>2</value>
    </property>

  4. Restart the Hadoop cluster by running the bin/stop-mapred.sh and bin/start-mapred.sh commands from the HADOOP_HOME directory.

  5. You can verify the number of processes created using OS process-monitoring tools. If you are on Linux, run the watch "ps -ef | grep hadoop" command. If you are on Windows or macOS, use the Task Manager or Activity Monitor, respectively.

How it works…

HADOOP_LOG_DIR redefines the location to which Hadoop writes its logs. The mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum properties define the maximum number of map and reduce tasks that can run within a single TaskTracker at a given moment.

These and other server-side parameters are defined in the HADOOP_HOME/conf/*-site.xml files. Hadoop reloads these configurations after a restart.
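
If you want to double-check which values the daemons will actually pick up, a minimal sketch such as the following can be compiled against the Hadoop JARs and run with HADOOP_HOME/conf on the classpath. This is not part of the recipe; the class name ShowTaskSlots is ours, and it simply prints the effective values of the two properties we just set:

import org.apache.hadoop.conf.Configuration;

public class ShowTaskSlots {
  public static void main(String[] args) {
    // Configuration loads core-default.xml and core-site.xml from the classpath;
    // mapred-site.xml has to be added explicitly when running outside a job.
    Configuration conf = new Configuration();
    conf.addResource("mapred-site.xml");
    System.out.println("mapred.tasktracker.map.tasks.maximum    = "
        + conf.get("mapred.tasktracker.map.tasks.maximum"));
    System.out.println("mapred.tasktracker.reduce.tasks.maximum = "
        + conf.get("mapred.tasktracker.reduce.tasks.maximum"));
  }
}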

There’s more…

There are many similar configuration properties defined in Hadoop. You can see some of them in the following tables.

The configuration properties for conf/core-site.xml are listed in the following table:

Name                  Default value   Description
fs.inmemory.size.mb   100             This is the amount of memory, in MB, allocated to the in-memory filesystem used to merge map outputs at the reducers.
io.sort.factor        100             This is the maximum number of streams merged while sorting files.
io.file.buffer.size   131072          This is the size of the read/write buffer used by sequence files.

The configuration properties for conf/mapred-site.xml are listed in the following table:

Name                            Default value   Description
mapred.reduce.parallel.copies   5               This is the maximum number of parallel copies the reduce step will execute to fetch outputs from the map tasks.
mapred.map.child.java.opts      -Xmx200M        This is for passing Java options into the map task JVM.
mapred.reduce.child.java.opts   -Xmx200M        This is for passing Java options into the reduce task JVM.
io.sort.mb                      200             This is the memory limit, in MB, used while sorting data.
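
Most of these MapReduce properties can also be overridden per job rather than cluster-wide. The following is a hedged sketch using the old-API JobConf; the class name PerJobTuning and the chosen values are illustrative, not recommendations:

import org.apache.hadoop.mapred.JobConf;

public class PerJobTuning {
  public static JobConf tunedJob() {
    JobConf job = new JobConf(PerJobTuning.class);
    // More parallel copies in the shuffle phase (default is 5).
    job.setInt("mapred.reduce.parallel.copies", 20);
    // Larger heaps for the child JVMs (default is -Xmx200M).
    job.set("mapred.map.child.java.opts", "-Xmx512M");
    job.set("mapred.reduce.child.java.opts", "-Xmx512M");
    // Larger sort buffer, in MB.
    job.setInt("io.sort.mb", 200);
    return job;
  }
}

Settings made this way apply only to the submitted job and take precedence over the values in mapred-site.xml, unless a property has been marked final on the cluster.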

The configuration properties for conf/hdfs-site.xml are listed in the following table:

Name                         Default value   Description
dfs.block.size               67108864        This is the HDFS block size in bytes (64 MB).
dfs.namenode.handler.count   40              This is the number of server threads that handle RPC calls in the NameNode.
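
dfs.block.size sets the cluster-wide default, but the block size can also be chosen per file when writing through the Java API. The following is a small sketch, assuming a running HDFS; the path /data/big-file is a hypothetical example:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CustomBlockSize {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    long blockSize = 128L * 1024 * 1024; // 128 MB instead of the 64 MB default
    FSDataOutputStream out = fs.create(
        new Path("/data/big-file"),                  // hypothetical output path
        true,                                        // overwrite if the file exists
        conf.getInt("io.file.buffer.size", 4096),    // write buffer size
        (short) 3,                                   // replication factor
        blockSize);
    out.writeUTF("hello HDFS");
    out.close();
  }
}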

Running benchmarks to verify the Hadoop installation

The Hadoop distribution comes with several benchmarks. We can use them to verify our Hadoop installation and measure Hadoop’s performance. This recipe introduces these benchmarks and explains how to run them.

Getting ready

Start the Hadoop cluster. You can run these benchmarks either on a cluster setup or on a pseudo-distributed setup.

How to do it…

Let us run the sort benchmark. The sort benchmark consists of two jobs. First, we generate some random data using the randomwriter Hadoop job, and then we sort it using the sort sample.

  1. Change the directory to HADOOP_HOME.

  2. Run the randomwriter Hadoop job using the following command:

    >bin/hadoop jar hadoop-examples-1.0.0.jar randomwriter
    -Dtest.randomwrite.bytes_per_map=100
    -Dtest.randomwriter.maps_per_host=10 /data/unsorted-data

    Here the two parameters, test.randomwrite.bytes_per_map and test.randomwriter.maps_per_host, specify the amount of data generated by each map and the number of maps per host, respectively.

  3. Run the sort program:

    >bin/hadoop jar hadoop-examples-1.0.0.jar sort /data/unsorted-data
    /data/sorted-data

  4. Verify the final results by running the following command:

    >bin/hadoop jar hadoop-test-1.0.0.jar testmapredsort -sortInput
    /data/unsorted-data -sortOutput /data/sorted-data

Finally, when everything is successful, the following message will be displayed:

The job took 66 seconds.
SUCCESS! Validated the MapReduce framework's 'sort' successfully.

How it works…

First, the randomwriter application runs a Hadoop job to generate random data that is then used by the sort program. Then, we verify the results through the testmapredsort job. If your computer has more capacity, you may run the initial randomwriter step with increased output sizes.

There’s more…

Hadoop includes several other benchmarks.

  • TestDFSIO: This tests the input/output (I/O) performance of HDFS

  • nnbench: This checks the NameNode hardware

  • mrbench: This runs many small jobs

  • TeraSort: This sorts one terabyte of data

More information about these benchmarks can be found at http://www.michael-noll.com/blog/2011/04/09/benchmarking-and-stress-testing-an-hadoop-cluster-with-terasort-testdfsio-nnbench-mrbench/.

Reusing Java VMs to improve performance

In its default configuration, Hadoop starts a new JVM for each map or reduce task. However, running multiple tasks from the same JVM can sometimes significantly speed up the execution. This recipe explains how to control this behavior.

How to do it…

  1. Run the WordCount sample by passing the following option as an argument:

    >bin/hadoop jar hadoop-examples-1.0.0.jar wordcount
    -Dmapred.job.reuse.jvm.num.tasks=-1 /data/input1 /data/output1

  2. Monitor the number of processes created by Hadoop (through the ps -ef | grep hadoop command on Unix or the Task Manager on Windows). Hadoop starts only a single JVM per task slot and then reuses it for an unlimited number of tasks in the job.

    However, passing arguments through the -D option only works if the job implements the org.apache.hadoop.util.Tool interface. Otherwise, you should set the option through the JobConf.setNumTasksToExecutePerJvm(-1) method, as shown in the sketch that follows.
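
The following is a minimal sketch of such a Tool-based driver; the class name WordCountDriver and the argument handling are illustrative, and the mapper and reducer classes are omitted. ToolRunner parses the generic -D options into the configuration returned by getConf(), and the same JVM-reuse setting can also be made programmatically:

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class WordCountDriver extends Configured implements Tool {
  @Override
  public int run(String[] args) throws Exception {
    // getConf() already contains any -D options parsed by ToolRunner.
    JobConf job = new JobConf(getConf(), WordCountDriver.class);
    job.setJobName("wordcount-jvm-reuse");
    // Equivalent to -Dmapred.job.reuse.jvm.num.tasks=-1: reuse each JVM without limit.
    job.setNumTasksToExecutePerJvm(-1);
    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    // setMapperClass(...), setReducerClass(...), and output types go here.
    JobClient.runJob(job);
    return 0;
  }

  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new WordCountDriver(), args));
  }
}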

How it works…

By setting the mapred.job.reuse.jvm.num.tasks job configuration property, we can control how many tasks a JVM started by Hadoop will run before it is replaced. When the value is set to -1, Hadoop reuses the same JVM for an unlimited number of tasks of a job.
