
In this article by Gurmukh Singh, the author of the book Hadoop Administration 2.x Cookbook, we will cover some core Hadoop concepts.

Hadoop, being a distributed system, is complex to troubleshoot, tune, and operate at scale. In a multi-tenant environment with varied data sizes, formats, and rates of data flow, it is important that Hadoop clusters are set up optimally by following best practices. The recipes walk users through the phases of installing, planning, tuning, securing, and scaling out clusters.


HDFS as storage layer

One of the most important components of Hadoop is its distributed file system, HDFS. It provides high throughput through parallel operations across multiple nodes and multiple disks, and fault tolerance by replicating data across multiple nodes.

HDFS is a core part of Hadoop, designed not just for data storage but also for the distributed cache and the localization of files and JARs for easy access across all nodes in the cluster. We can have custom implementations as long as they go through the Hadoop FileSystem API, and we can extend its features.
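
As a quick illustration (not taken from the book's recipes), here is a minimal client sketch against the Hadoop FileSystem API; the Namenode URI and file path are placeholders:

```java
// A minimal HDFS client sketch using the Hadoop FileSystem API.
// The Namenode URI and file path below are placeholders, not real cluster values.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.IOException;
import java.net.URI;
import java.nio.charset.StandardCharsets;

public class HdfsClientSketch {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        // "hdfs://namenode.example.com:8020" is a placeholder Namenode address.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode.example.com:8020"), conf);
        Path file = new Path("/tmp/hello.txt");

        // Write: the client asks the Namenode where to place the blocks,
        // then streams the data directly to the chosen Datanodes.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }

        // Read: the Namenode returns block locations; the bytes come from Datanodes.
        byte[] buf = new byte[32];
        try (FSDataInputStream in = fs.open(file)) {
            int n = in.read(buf);
            System.out.println(new String(buf, 0, n, StandardCharsets.UTF_8));
        }
    }
}
```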

Hadoop is based on a client-server architecture, with the Namenode as master and the Datanodes as slaves. The Datanodes register with the Namenode and advertise their blocks, which are 128 MB in size by default. HDFS blocks are actual files on the underlying file system, with each block stored as one block file and one checksum file on the ext3/4 filesystem. The Namenode keeps the metadata of these objects in the form of a namespace and an inode structure.

In HDFS, everything is an object, whether a file, block, or directory, and each object takes approximately 150 bytes in the Namenode metadata. The entire metadata is loaded into the Namenode's memory for its operation; thus, the Namenode's memory is a limiting factor on the size of the cluster.
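
For a rough sense of scale, the following back-of-the-envelope calculation assumes a hypothetical namespace of 100 million objects at the approximate 150 bytes each mentioned above:

```java
// Back-of-the-envelope sizing of the Namenode heap, using the ~150 bytes per
// object figure. The object count below is an assumption for illustration only.
public class NamenodeHeapEstimate {
    public static void main(String[] args) {
        long objects = 100_000_000L;   // assumed: 100 million files, directories, and blocks
        long bytesPerObject = 150L;    // approximate metadata cost per object
        double heapGb = objects * bytesPerObject / (1024.0 * 1024 * 1024);
        // Roughly 14 GB of heap just to hold this namespace in memory.
        System.out.printf("~%.1f GB of Namenode heap for %,d objects%n", heapGb, objects);
    }
}
```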

During the operation of the cluster, Datanodes send heartbeats to the Namenode, updating their status and sending block reports. Any client reading or writing data must first contact the Namenode for the locations of the blocks on the Datanodes; once the Namenode returns the addresses of the Datanodes, the client communicates directly with them for read and write operations. Protocols such as RPC and HTTP handle the communication between the various parties in the cluster. Each Hadoop daemon has a lightweight Jetty server built in to present the web UI and to provide REST API calls, for data transfer between the primary, secondary, and other components in the cluster.
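
To illustrate this read path, the following hedged sketch asks the Namenode for the block locations of a file; the path is a placeholder, and the cluster configuration is assumed to be on the classpath:

```java
// Illustrative sketch: ask the Namenode which Datanodes hold each block of a file.
// The path is a placeholder; the cluster configuration is assumed to be on the classpath.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationsSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/data/events.log"));
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            // Each entry is one block (128 MB by default) and the Datanodes holding its replicas.
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    block.getOffset(), block.getLength(), String.join(",", block.getHosts()));
        }
    }
}
```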

The recipes talk about how we set up a Hadoop cluster and configure Hadoop to address various complex production issues: communication channels, failure recovery, the location of the Datanode blocks, the Namenode metadata location, and many more.

YARN framework

In Hadoop 2.x, the YARN framework provides independence from the processing engine used in Hadoop. It can run MapReduce, Spark (which can also run with Mesos), and so on. The master ResourceManager (RM) accepts jobs and launches an Application Master, which controls the tasks and resources for that job with the help of the master.

Having a separate Application Master (AM) per application relieves the RM of all the overhead of job management, letting it concentrate on being a pure scheduler and honouring the resource requests from the AMs. Task failures and the re-launch of containers are the responsibility of the AM, which asks for containers on a node after getting a grant from the ResourceManager.

Each application makes a resource request, which is a combination of node address, memory, CPU cores, and disks, and the AM has to take permission from the master RM, called a grant, before being allocated a container by a NodeManager (NM). This ensures that an errant AM does not bring the cluster to a deadlock by aggressively asking for resources. All the job configuration files and JARs are localized, with the help of HDFS, on the nodes where the containers for that particular job will run.
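
As an illustration of such a resource request, here is a hedged sketch using the AMRMClient API; the node name, memory, and priority values are invented, and a real AM would only run inside a container launched by YARN with the proper credentials:

```java
// A hedged sketch of an AM asking the RM for a container via the AMRMClient API.
// The node name, memory, and priority are invented values; a real AM only runs
// inside a container that YARN itself has launched.
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ContainerRequestSketch {
    public static void main(String[] args) throws Exception {
        AMRMClient<ContainerRequest> rmClient = AMRMClient.createAMRMClient();
        rmClient.init(new YarnConfiguration());
        rmClient.start();
        rmClient.registerApplicationMaster("", 0, "");

        // The ask: memory in MB, vcores, a preferred node, and a priority.
        Resource capability = Resource.newInstance(2048, 1);
        ContainerRequest ask = new ContainerRequest(
                capability,
                new String[]{"rt2.cyrus.com"},   // preferred node (placeholder)
                null,                            // no rack preference
                Priority.newInstance(1));
        rmClient.addContainerRequest(ask);

        // Grants arrive in later allocate() heartbeats; the AM then asks the
        // NodeManager on the granted node to actually launch the container.
        rmClient.allocate(0.0f);
    }
}
```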

The NodeManagers (NM) periodically send heartbeats to the RM, updating it on the capacity used, the capacity available, and their presence. A container reservation for a particular task can also be made on a particular node, due to locality of reference, and the framework will make its best effort to fulfil it. Any container allocation is given 600 seconds to move from the Allocated to the Running state, else it will be killed. The amount of data that a mapper or reducer can work on depends upon the memory it has; for example, mapreduce.map.memory.mb specifies the mapper's container memory. It is important to understand what percentage of that memory contributes to the JVM heap, as many buffers, such as mapred.io.sort.mb, are allocated from it. Increasing the HDFS block size does not mean that a job will run faster, as memory must be available to accommodate the larger block.

This is addressed with examples in the recipes, to help better understand the relationship between the various parameters and how changing one impacts the others.
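
Purely as an illustration of that relationship (the values below are hypothetical, not recommendations), a job-level configuration might look like this:

```java
// Hypothetical job-level settings showing how the pieces nest: the JVM heap
// (mapreduce.map.java.opts) must fit inside the container (mapreduce.map.memory.mb),
// and the sort buffer (mapreduce.task.io.sort.mb, formerly mapred.io.sort.mb)
// is carved out of that heap.
import org.apache.hadoop.conf.Configuration;

public class MapMemorySketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.setInt("mapreduce.map.memory.mb", 2048);      // container size in MB (assumed)
        conf.set("mapreduce.map.java.opts", "-Xmx1638m");  // ~80% of the container as heap
        conf.setInt("mapreduce.task.io.sort.mb", 512);     // sort buffer allocated from that heap
        System.out.println("container=" + conf.get("mapreduce.map.memory.mb") + " MB"
                + ", heap=" + conf.get("mapreduce.map.java.opts")
                + ", sort buffer=" + conf.get("mapreduce.task.io.sort.mb") + " MB");
    }
}
```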

The recipes address the layout of YARN components for optimal performance and distribution of load.

Hadoop ecosystem

The Hadoop ecosystem consists of Hive, HBase, Sqoop, Presto, Oozie, and so on. In any organization, there will be a Hadoop stack consisting of multiple components rather than just HDFS and YARN. The use case could be databases, analytics, NoSQL, ETL, or real-time streaming with Spark.

As the number of components in a cluster increases, so do the complexity and the challenges of managing it. If we design the cluster properly from the very beginning, keeping future growth in mind, it becomes easy to scale. Many traditional technologies co-exist with Hadoop, and there will be a need to migrate data across systems using tools like Hive, HBase, Sqoop, and so on.

In the recipes, we address the migration process, configuration, and best practices for Hive and HBase and their interaction. It is important to understand partitioning and bucketing in Hive to enable scaling to large data sets; the recipes address the partitioning schemes needed to work at scale and return results in a reasonable time. In large organizations with multiple tenants, each with different use cases and data formats, it is important to understand the cluster layout. With the ability to schedule jobs to run at a particular time or on a precondition, users do not need to babysit their jobs.
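
As an example of the kind of layout partitioning and bucketing give you (the HiveServer2 host, table, and columns below are hypothetical), a table could be declared over JDBC like this:

```java
// A hedged sketch of a partitioned, bucketed Hive table created over JDBC.
// The HiveServer2 host, credentials, table, and columns are all hypothetical.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class HivePartitionSketch {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://hiveserver2.example.com:10000/default", "hive", "");
             Statement stmt = conn.createStatement()) {
            // Partitioning by date lets queries prune whole directories; bucketing
            // by user_id bounds the file count per partition and helps joins/sampling.
            stmt.execute("CREATE TABLE IF NOT EXISTS clicks ("
                    + "  user_id BIGINT, url STRING) "
                    + "PARTITIONED BY (dt STRING) "
                    + "CLUSTERED BY (user_id) INTO 32 BUCKETS "
                    + "STORED AS ORC");
        }
    }
}
```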

Capturing the logs of the various components and jobs helps to diagnose issues and also helps to predict the need for cluster expansion, if any. The granularity at which log capture and events can be controlled in the different tools makes life easier for the Hadoop operations team. The Hadoop log4j configuration settings to capture job failures, memory errors, or GC events help to quickly troubleshoot any failures, as shown in the recipes.
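
For instance, a single job's task logging could be made more verbose and its GC events captured with settings along these lines (the values are illustrative, and the GC flags assume a Java 8 JVM):

```java
// Illustrative per-job logging knobs: raise the task log4j level and add JVM GC
// logging flags (Java 8 style) so GC pauses show up in the task logs.
// The values are examples, not recommendations.
import org.apache.hadoop.conf.Configuration;

public class JobLoggingSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.set("mapreduce.map.log.level", "DEBUG");      // per-task log4j level
        conf.set("mapreduce.reduce.log.level", "DEBUG");
        conf.set("mapreduce.map.java.opts",
                 "-Xmx1638m -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps");
        System.out.println("map opts: " + conf.get("mapreduce.map.java.opts"));
    }
}
```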

Cluster planning and tuning

In a production environment with many nodes in a cluster and a large number of operations per second, it is important to address performance, failures, and the backup and recovery of critical components, and to keep downtime to a minimum. At any given time, there can be many clients connected to the cluster, so it is important to lay out the Datanodes evenly across racks, with the right set of disks per node.

As the cluster size grows to a few thousand nodes, the load on the Namenode increases, as it has to address a larger namespace. What if, in a cluster with 3,000 Datanodes, all the Datanode daemons are started in one go? It will generate a block advertisement storm, overloading the Namenode and causing it to hang. To address this, we introduce an initial block report delay (dfs.blockreport.initialDelay) and tune the Namenode handler count, dfs.namenode.handler.count.
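
A commonly cited rule of thumb, rather than an official formula, sizes the handler pool at roughly 20 times the natural logarithm of the number of Datanodes; for the 3,000-node example above:

```java
// A commonly cited rule of thumb (not an official formula): size the Namenode
// RPC handler pool at roughly 20 * ln(number of Datanodes).
public class HandlerCountEstimate {
    public static void main(String[] args) {
        int datanodes = 3000;   // the cluster size from the example above
        long handlers = Math.round(20 * Math.log(datanodes));
        // The result (~160 here) would be set as dfs.namenode.handler.count in hdfs-site.xml.
        System.out.println("dfs.namenode.handler.count ~= " + handlers);
    }
}
```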

What about decommissioning a Datanode with 4 TB of disk? How much time will it take?

All of these need to be addressed by tuning the appropriate parameters. It is important to monitor the bandwidth and transfer rates to keep track of cluster health in terms of network, disk, and other resources. The AM container talks with all the containers of a job; in the example cluster, node master1.cyrus.com is the RM and the AM was allocated on node rt2.cyrus.com. The fan-in and fan-out of containers gives a good picture of the bottlenecks and constraints per application type.
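
As a rough, assumption-laden answer to the decommissioning question above (the effective re-replication rate varies widely between clusters):

```java
// A rough estimate for the decommissioning question above. The effective
// re-replication rate is an assumption; real clusters vary widely.
public class DecommissionEstimate {
    public static void main(String[] args) {
        double dataTb = 4.0;               // data held on the node being decommissioned
        double effectiveRateMBps = 200.0;  // assumed aggregate re-replication rate
        double hours = dataTb * 1024 * 1024 / effectiveRateMBps / 3600;
        System.out.printf("~%.1f hours to re-replicate %.1f TB at %.0f MB/s%n",
                hours, dataTb, effectiveRateMBps);
    }
}
```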

Despite running Hadoop master nodes on reliable hardware, it is important to plan for failures and upgrades. Namenode and ResourceManager HA eliminate single points of failure and avoid cluster outages. High availability (HA) can be achieved by using shared NFS storage or the Quorum Journal Manager (QJM).

Securing the cluster

When Hadoop becomes part of critical businesses such as banking, finance, aviation, weather forecasting, stock exchanges, medicine, and many more, it becomes important to secure the cluster and safeguard critical and sensitive data.

By default, Hadoop has no security enabled for HDFS access or job execution. Every block can be accessed by any container in the cluster, making it unsuitable for critical multi-tenant environments. In any organization, there will not be a separate cluster for each business unit (BU); instead, there might be a few clusters shared by finance, sales, and marketing. How do we safeguard each BU's data, apply data masking, and make sure that every HDFS block is accessed only by an authenticated user?

To enable security, we configure secure block access, in-transit encryption, Kerberos, and SSL. A container on a Datanode should access only the data blocks for which it is authorized, and at the same time the setup should be transparent to the end user, hiding the complexity of the implementation.
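
The client-side pieces of such a setup might look like the following hedged sketch; the principal, keytab path, and property values are placeholders, and in a real cluster these settings live in core-site.xml and hdfs-site.xml rather than in client code:

```java
// A hedged sketch of the client-side pieces of a secured cluster: Kerberos login
// from a keytab plus the core security-related properties. The principal, keytab
// path, and property values are placeholders.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.UserGroupInformation;

public class SecureClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("hadoop.security.authentication", "kerberos");  // simple | kerberos
        conf.set("hadoop.security.authorization", "true");       // service-level ACLs
        conf.set("hadoop.rpc.protection", "privacy");            // encrypt RPC in transit
        conf.set("dfs.block.access.token.enable", "true");       // secure block access tokens
        conf.set("dfs.encrypt.data.transfer", "true");           // encrypt block data in transit

        UserGroupInformation.setConfiguration(conf);
        UserGroupInformation.loginUserFromKeytab(
                "analyst@EXAMPLE.COM", "/etc/security/keytabs/analyst.keytab");
        System.out.println("Logged in as: " + UserGroupInformation.getLoginUser());
    }
}
```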

Hadoop clusters should also be given isolated networks, with VLAN segregation and firewalls to control the inflow and outflow of data.

This is well covered in the recipes, with details on secure access, in-transit transfer, Kerberos, and single sign-on.

Summary

A distributed Hadoop environment is a complex ecosystem with various challenges around installation, configuration, performance, and security. There is no one solution that fits all, as the workload, data formats, and data sizes of each job will be different. The recipes address the challenges faced by a Hadoop engineer in setting up, maintaining, optimizing, and scaling the cluster.
