
Setting up Hadoop to spread disk I/O

Modern servers usually have multiple disk devices to provide large storage capacity. These disks usually come from the factory configured as RAID arrays. This is good for many use cases, but not for Hadoop.

The Hadoop slave node stores HDFS data blocks and MapReduce temporary files on its local disks. These local disk operations benefit from using multiple independent disks to spread disk I/O.

In this recipe, we will describe how to set up Hadoop to use multiple disks to spread its disk I/O.

Getting ready

We assume you have multiple disks on each DataNode. These disks are in a JBOD (Just a Bunch Of Disks) or RAID0 configuration. Assume that the disks are mounted at /mnt/d0, /mnt/d1, …, /mnt/dn, and the user who starts HDFS has write permission on each mount point.
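
Before changing any Hadoop settings, it is worth a quick check that each path really is a separate device and that the HDFS user can write to it. A minimal sketch (the .write_test file name is just an example; list your real mount points) might look like this:

    hadoop$ df -h /mnt/d0 /mnt/d1 /mnt/dn
    hadoop$ for d in /mnt/d0 /mnt/d1 /mnt/dn; do touch $d/.write_test && rm $d/.write_test && echo "$d is writable"; done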

How to do it…

In order to set up Hadoop to spread disk I/O, follow these instructions:

  1. On each DataNode, create directories on each disk for HDFS to store its data blocks:

    hadoop$ mkdir -p /mnt/d0/dfs/data
    hadoop$ mkdir -p /mnt/d1/dfs/data
    ...
    hadoop$ mkdir -p /mnt/dn/dfs/data

  2. Add the following to the HDFS configuration file (hdfs-site.xml):

    hadoop@master1$ vi $HADOOP_HOME/conf/hdfs-site.xml
    <property>
      <name>dfs.data.dir</name>
      <value>/mnt/d0/dfs/data,/mnt/d1/dfs/data,...,/mnt/dn/dfs/data</value>
    </property>

  3. Sync the modified hdfs-site.xml file across the cluster:

    hadoop@master1$ for slave in `cat $HADOOP_HOME/conf/slaves`
    do
      rsync -avz $HADOOP_HOME/conf/ $slave:$HADOOP_HOME/conf/
    done

  4. Restart HDFS:

    hadoop@master1$ $HADOOP_HOME/bin/stop-dfs.sh
    hadoop@master1$ $HADOOP_HOME/bin/start-dfs.sh

How it works…

We recommend JBOD or RAID0 for the DataNode disks because you don’t need the redundancy of RAID: HDFS already ensures data redundancy by replicating blocks between nodes, so no data is lost when a single disk fails.

Which one to choose, JBOD or RAID0? Theoretically, you will get better performance from a JBOD configuration than from a RAID configuration. In a RAID configuration, you have to wait for the slowest disk in the array before a write operation can complete, which makes the average I/O time equivalent to that of the slowest disk. In a JBOD configuration, operations on a faster disk complete independently of the slower ones, which makes the average I/O time better than that of the slowest disk. However, enterprise-class RAID cards might make a big difference. You might want to benchmark your JBOD and RAID0 configurations before deciding which one to go with.
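
If you want a quick first impression before running a full Hadoop-level benchmark, a very rough sequential write test with dd can serve as a starting point. This is only a sketch: it assumes GNU dd, writes about 1 GB to a throwaway file on one disk, and bypasses the page cache with oflag=direct:

    hadoop$ dd if=/dev/zero of=/mnt/d0/dd.test bs=1M count=1024 oflag=direct
    hadoop$ rm /mnt/d0/dd.test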

For both JBOD and RAID0 configurations, you will have the disks mounted at different paths. The key point here is to set the dfs.data.dir property to all the directories created on each disk. The dfs.data.dir property specifies where the DataNode should store its local blocks. By setting it to a comma-separated list of directories, the DataNode stores its blocks across all the disks in round-robin fashion, which lets Hadoop efficiently spread disk I/O over all the disks.

Warning
Do not leave blanks between the directory paths in the dfs.data.dir property value, or it won’t work as expected.

You will need to sync the changes across the cluster and restart HDFS to apply them.
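
Once HDFS has been running with the new settings for a while, a rough way to confirm that blocks are being spread across the disks is to compare the space used under each data directory; the numbers should be roughly similar:

    hadoop$ du -sh /mnt/d*/dfs/data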

There’s more…

If you run MapReduce, you might also want to set it up to spread its disk I/O, because MapReduce stores its temporary files on each TaskTracker’s local file system:

  1. On each TaskTracker node, create directories on each disk for MapReduce to store its intermediate data files:

    hadoop$ mkdir -p /mnt/d0/mapred/local
    hadoop$ mkdir -p /mnt/d1/mapred/local
    ...
    hadoop$ mkdir -p /mnt/dn/mapred/local

  2. Add the following to MapReduce’s configuration file (mapred-site.xml):

    hadoop@master1$ vi $HADOOP_HOME/conf/mapred-site.xml
    <property>
      <name>mapred.local.dir</name>
      <value>/mnt/d0/mapred/local,/mnt/d1/mapred/local,...,/mnt/dn/mapred/local</value>
    </property>

  3. Sync the modified mapred-site.xml file across the cluster and restart MapReduce (a sketch of these commands follows this list).
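
The sync and restart mirror the HDFS steps shown earlier. A sketch, assuming the standard Hadoop 1.x control scripts under $HADOOP_HOME/bin, might look like this:

    hadoop@master1$ for slave in `cat $HADOOP_HOME/conf/slaves`
    do
      rsync -avz $HADOOP_HOME/conf/ $slave:$HADOOP_HOME/conf/
    done
    hadoop@master1$ $HADOOP_HOME/bin/stop-mapred.sh
    hadoop@master1$ $HADOOP_HOME/bin/start-mapred.sh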

MapReduce generates a lot of temporary files on the TaskTrackers’ local disks during its execution. As with HDFS, setting up multiple directories on different disks helps spread the MapReduce disk I/O significantly.

Using network topology script to make Hadoop rack-aware

Hadoop has the concept of “Rack Awareness”. Administrators can define the rack of each DataNode in the cluster. Making Hadoop rack-aware is extremely important because:

  • Rack awareness prevents data loss
  • Rack awareness improves network performance

In this recipe, we will describe how to make Hadoop rack-aware and why it is important.

Getting ready

You will need to know the rack to which each of your slave nodes belongs. Log in to the master node as the user who started Hadoop.

How to do it…

The following steps describe how to make Hadoop rack-aware:

  1. Create a topology.sh script and store it under the Hadoop configuration directory. Change the path to topology.data on the exec line to fit your environment:

    hadoop@master1$ vi $HADOOP_HOME/conf/topology.sh
    #!/bin/bash
    # Resolve each argument (an IP address or hostname) to its rack name,
    # using the mapping defined in the topology.data file.
    while [ $# -gt 0 ] ; do
      nodeArg=$1
      exec< /usr/local/hadoop/current/conf/topology.data
      result=""
      while read line ; do
        ar=( $line )
        if [ "${ar[0]}" = "$nodeArg" ] ; then
          result="${ar[1]}"
        fi
      done
      shift
      if [ -z "$result" ] ; then
        # No entry found; fall back to the default rack
        echo -n "/default/rack "
      else
        echo -n "$result "
      fi
    done

    Don’t forget to set the execute permission on the script file:

    hadoop@master1$ chmod +x $HADOOP_HOME/conf/topology.sh

  2. Create a topology.data file, as shown in the following snippet; change the IP addresses and racks to fit your environment:

    hadoop@master1$ vi $HADOOP_HOME/conf/topology.data
    10.161.30.108 /dc1/rack1
    10.166.221.198 /dc1/rack2
    10.160.19.149 /dc1/rack3

  3. Add the following to the Hadoop core configuration file (core-site.xml):

    hadoop@master1$ vi $HADOOP_HOME/conf/core-site.xml
    <property>
      <name>topology.script.file.name</name>
      <value>/usr/local/hadoop/current/conf/topology.sh</value>
    </property>

  4. Sync the modified files across the cluster and restart HDFS and MapReduce.
  5. Make sure HDFS is now rack-aware. If everything works well, you should be able to find something like the following snippet in your NameNode log file:

    2012-03-10 13:43:17,284 INFO org.apache.hadoop.net.NetworkTopology: Adding a new node: /dc1/rack3/10.160.19.149:50010
    2012-03-10 13:43:17,297 INFO org.apache.hadoop.net.NetworkTopology: Adding a new node: /dc1/rack1/10.161.30.108:50010
    2012-03-10 13:43:17,429 INFO org.apache.hadoop.net.NetworkTopology: Adding a new node: /dc1/rack2/10.166.221.198:50010
  6. Make sure MapReduce is now rack-aware. If everything works well, you should be able to find something like the following snippet in your JobTracker log file:

    2012-03-10 13:50:38,341 INFO org.apache.hadoop.net.NetworkTopology: Adding a new node: /dc1/rack3/ip-10-160-19-149.us-west-1.compute.internal
    2012-03-10 13:50:38,485 INFO org.apache.hadoop.net.NetworkTopology: Adding a new node: /dc1/rack1/ip-10-161-30-108.us-west-1.compute.internal
    2012-03-10 13:50:38,569 INFO org.apache.hadoop.net.NetworkTopology: Adding a new node: /dc1/rack2/ip-10-166-221-198.us-west-1.compute.internal

How it works…

The following diagram shows the concept of Hadoop rack awareness:

Each block of an HDFS file is replicated to multiple DataNodes to prevent the loss of all data copies due to the failure of one machine. However, if all copies happen to be replicated on DataNodes in the same rack, and that rack fails, all the data copies will be lost. To avoid this, the NameNode needs to know the network topology so that it can place replicas intelligently.

As shown in the previous diagram, with the default replication factor of three, two data copies are placed on machines in the same rack, and the third is put on a machine in a different rack. This ensures that a single rack failure won’t result in the loss of all data copies. Normally, two machines in the same rack have more bandwidth and lower latency between them than two machines in different racks. With the network topology information, Hadoop can maximize network performance by reading data from the most appropriate DataNodes: if the data is available on the local machine, Hadoop reads it from there; if not, it tries a machine in the same rack, and only then falls back to machines in different racks.
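
One way to see the placement in practice is to run fsck with its rack-related options; the block report should then show replica locations prefixed with rack names (the path / below is just an example, and the exact option set may vary slightly between Hadoop versions):

    hadoop@master1$ $HADOOP_HOME/bin/hadoop fsck / -files -blocks -racks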

In step 1, we create a topology.sh script. The script takes DNS names as arguments and returns network topology (rack) names as the output. The mapping of DNS names to network topology is provided by the topology.data file, which was created in step 2. If an entry is not found in the topology.data file, the script returns /default/rack as a default rack name.
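
You can sanity-check the script from the command line before wiring it into Hadoop. With the example topology.data from step 2, an invocation like the following should print the matching rack names, falling back to /default/rack for an address that has no entry (10.0.0.1 here is purely an example):

    hadoop@master1$ $HADOOP_HOME/conf/topology.sh 10.161.30.108 10.160.19.149 10.0.0.1
    /dc1/rack1 /dc1/rack3 /default/rack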

Note that we use IP addresses, not hostnames, in the topology.data file. There is a known bug where Hadoop does not correctly process hostnames that start with the letters “a” to “f”. Check HADOOP-6682 for more details.

In step 3, we set the topology.script.file.name property in core-site.xml, telling Hadoop to invoke topology.sh to resolve DNS names to network topology names.

After restarting Hadoop, as shown in the logs of steps 5 and 6, HDFS and MapReduce add the correct rack name as a prefix to the DNS name of each slave node. This indicates that rack awareness works well for both HDFS and MapReduce with the aforementioned settings.
