In this article by Sourav Gulati and Sumit Kumar, authors of the book Apache Spark 2.x for Java Developers, we explain that, in the classical sense, Hadoop comprises two components: a storage layer called HDFS and a processing layer called MapReduce. Prior to Hadoop 2.x, resource management was handled by Hadoop's MapReduce framework itself; that changed with the introduction of YARN. In Hadoop 2.0, YARN was introduced as the third component of Hadoop, to manage the resources of a Hadoop cluster and make Hadoop more MapReduce-agnostic.
Hadoop Distributed File System (HDFS), as the name suggests, is a distributed file system written in Java along the lines of the Google File System. In practice HDFS closely resembles any other UNIX file system, with support for common file operations such as ls, cp, rm, du, cat, and so on. What makes HDFS stand out, despite its simplicity, is its mechanism for handling node failure in a Hadoop cluster without materially changing the seek time for accessing stored files. An HDFS cluster consists of two major components: Data Nodes and a Name Node.
HDFS has a unique way of storing data on HDFS clusters (networks of cheap commodity computers). It splits a regular file into smaller chunks called blocks and then makes a number of copies of each chunk, determined by the replication factor for that file. It then distributes those copies across different Data Nodes of the cluster.
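The splitting-and-replication idea can be sketched in a few lines of plain Python. This is a toy model, not the HDFS API: the block size and replication factor mirror the Hadoop 2.x defaults (128 MB and 3), the round-robin placement is purely illustrative, and all names are made up.

```python
def split_into_blocks(file_size, block_size):
    """Return the sizes of the blocks a file of file_size units is split into."""
    blocks = [block_size] * (file_size // block_size)
    if file_size % block_size:
        blocks.append(file_size % block_size)  # last block may be smaller
    return blocks

def place_replicas(num_blocks, data_nodes, replication=3):
    """Assign each block to `replication` distinct Data Nodes (toy round-robin placement)."""
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [data_nodes[(block_id + r) % len(data_nodes)]
                               for r in range(replication)]
    return placement

blocks = split_into_blocks(file_size=300, block_size=128)   # a 300 MB file
print(blocks)                                               # [128, 128, 44]
nodes = ["dn1", "dn2", "dn3", "dn4"]
print(place_replicas(len(blocks), nodes))
```

Note that the last block is allowed to be smaller than the block size; HDFS does the same, so a 300 MB file with 128 MB blocks occupies 128 + 128 + 44 MB, not three full blocks.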
The Name Node is responsible for managing the metadata of the HDFS cluster, such as the list of files and folders that exist in the cluster, the number of blocks each file is divided into, and the replication and placement of those blocks on different Data Nodes. It also maintains and manages the namespace and file permissions of all the files available in the HDFS cluster. Apart from bookkeeping, the Name Node has a supervisory role: it keeps a watch on the replication factor of all the files, and if some block goes missing, it issues commands to replicate the missing block of data. It also generates reports to ascertain cluster health. It is important to note that all communication for these supervisory tasks flows from Data Node to Name Node; that is, Data Nodes send reports, also known as block reports, to the Name Node, and the Name Node responds by issuing commands or instructions as needed.
An HDFS read operation from a client involves:
- The client requests the Name Node to determine where the actual data blocks for a given file are stored.
- The Name Node obliges by providing the block IDs and the locations of the hosts (Data Nodes) where the data can be found.
- The client contacts the Data Nodes with the respective block IDs to fetch the data, preserving the order of the blocks.
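The three read steps can be modeled as a toy sketch in plain Python. This is not the Hadoop client API; the metadata dictionary stands in for the Name Node, the storage dictionary for the Data Nodes, and all paths, block IDs, and host names are invented for illustration.

```python
# The "Name Node" holds only metadata: which blocks make up a file, in order,
# and which Data Nodes hold a replica of each block.
name_node_metadata = {
    "/logs/app.log": [("blk_1", ["dn1", "dn3"]), ("blk_2", ["dn2", "dn4"])],
}
# The "Data Nodes" hold the actual bytes, keyed by block ID.
data_node_storage = {
    "dn1": {"blk_1": b"hello "}, "dn3": {"blk_1": b"hello "},
    "dn2": {"blk_2": b"world"},  "dn4": {"blk_2": b"world"},
}

def hdfs_read(path):
    content = b""
    # Steps 1-2: ask the Name Node for block IDs and their locations
    for block_id, locations in name_node_metadata[path]:
        # Step 3: fetch each block from one of its Data Nodes, preserving
        # block order so the file is reassembled correctly
        content += data_node_storage[locations[0]][block_id]
    return content

print(hdfs_read("/logs/app.log"))  # b'hello world'
```

The key point the sketch captures is that the Name Node never serves file data itself; it only tells the client where to look.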
An HDFS write operation from a client involves:
- The client contacts the Name Node to update the namespace with the file name and to verify the necessary permissions.
- If the file already exists, the Name Node throws an error; otherwise it returns to the client an FSDataOutputStream, which points to the data queue.
- The data queue negotiates with the Name Node to allocate new blocks on suitable Data Nodes.
- The data is then copied to the first such Data Node and, as per the replication strategy, is further copied from that Data Node to the rest of the Data Nodes.
- It is important to note that the data never moves through the Name Node, as that would have created a performance bottleneck.
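The replication pipeline in the write path can be sketched in plain Python as well. Again this is a toy model with invented names, not the Hadoop API: the client hands the block to the first Data Node in the pipeline, and each node forwards it to the next, so the bytes never pass through the Name Node.

```python
data_node_storage = {"dn1": {}, "dn2": {}, "dn3": {}}

def write_block(block_id, data, pipeline):
    """The client writes to pipeline[0]; each node then forwards downstream."""
    if not pipeline:
        return
    first, rest = pipeline[0], pipeline[1:]
    data_node_storage[first][block_id] = data  # local write on this Data Node
    write_block(block_id, data, rest)          # forward to the next node

# Name Node (not shown) would have allocated blk_7 on these three nodes.
write_block("blk_7", b"payload", ["dn1", "dn2", "dn3"])
print(sorted(dn for dn in data_node_storage if "blk_7" in data_node_storage[dn]))
# ['dn1', 'dn2', 'dn3'] -- one replica per pipeline node
```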
The simplest way to understand Yet Another Resource Negotiator (YARN) is to think of it as an operating system for a cluster: provisioning resources, scheduling jobs, and maintaining nodes. With Hadoop 2.x, the MapReduce model of both processing the data and managing the cluster (JobTracker/TaskTracker) was split apart. While data processing was still left to MapReduce, the cluster's resource allocation (or rather, scheduling) task was assigned to a new component called YARN. Another objective YARN met was making MapReduce one of several techniques for processing data, rather than the only technology for processing data on HDFS, as was the case in Hadoop 1.x systems. This paradigm shift opened the floodgates for the development of interesting applications around Hadoop, and a new ecosystem beyond the classical MapReduce processing system evolved. It did not take much time after that for Apache Spark to break the hegemony of classical MapReduce and become arguably the most popular processing framework for parallel computing, as far as active development and adoption are concerned.
In order to provide multi-tenancy, fault tolerance, and resource isolation, YARN introduced the following components to manage the cluster seamlessly.
- ResourceManager: It negotiates resources for different compute programs on a Hadoop cluster while guaranteeing the following: resource isolation, data locality, fault tolerance, task prioritization, and effective cluster capacity utilization. A configurable scheduler gives the Resource Manager the flexibility to schedule and prioritize different applications as needed.
- Tasks served by RM while serving clients: Using a client or the APIs, a user can submit or terminate an application, and can also gather statistics on submitted applications, the cluster, and queues. The RM prioritizes admin tasks over all other tasks, to perform clean-up or maintenance activities on the cluster such as refreshing the node list or the queue configuration.
- Tasks served by RM while serving Cluster Nodes: Provisioning and de-provisioning of nodes is an important task of the RM. Each node sends a heartbeat at a configured interval; if no heartbeat is received within the expiry interval, 10 minutes by default, the node is treated as dead. As a clean-up activity, all its supposedly running processes, including containers, are marked dead too.
- Tasks served by RM while serving the Application Master: The RM registers new AMs and terminates the successfully completed ones. Just as with cluster nodes, if the heartbeat of an AM is not received within a preconfigured duration, 10 minutes by default, the AM is marked dead and all its associated containers are marked dead too. But since YARN is reliable as far as application execution is concerned, a new AM is then rescheduled on a new container to try another execution, until the configurable retry count, 4 by default, is reached.
- Scheduling and other miscellaneous tasks served by RM: The RM maintains a list of running, submitted, and executed applications along with their statistics, such as execution time and status. The privileges of users as well as of applications are maintained and checked while serving various requests across the application life cycle. The RM scheduler oversees resource allocation for applications, such as memory allocation. Two common scheduling algorithms used in YARN are fair scheduling and capacity scheduling.
- NodeManager: An NM exists on each node of the cluster, in a fashion similar to the slave nodes of a master-slave architecture. When an NM starts, it informs the RM of its availability to share its resources for upcoming jobs, and thereafter sends periodic signals, also called heartbeats, to the RM confirming that it is alive in the cluster. Primarily, the NM is responsible for launching the containers requested by the AM with certain resource requirements, such as memory, disk, and so on. Once the containers are up and running, the NM watches not the status of the container's task but the resource utilization of the container, and kills a container if it starts utilizing more resources than it was provisioned for. Apart from managing the life cycle of containers, the NM also keeps the RM informed about the node's health.
- ApplicationMaster: An AM is launched per submitted application and manages the life cycle of that application. The first and foremost task of the AM is to negotiate resources from the RM to launch task-specific containers on different nodes. Once the containers are launched, the AM keeps track of the status of all the containers' tasks. If any node goes down or a container gets killed, because of excess resource usage or otherwise, the AM renegotiates resources from the RM and launches the pending tasks again. The AM also reports the status of the submitted application directly to the user, and other statistics to the RM. The ApplicationMaster implementation is framework specific, and it is for this reason that application/framework-specific code is transferred to the AM, and it is the AM that distributes it further. This important feature also makes YARN technology agnostic, since any framework can implement its own ApplicationMaster and then utilize the resources of a YARN cluster seamlessly.
- Container: A container, in an abstract sense, is a set of minimal resources, such as CPU, RAM, disk I/O, and disk space, required to run a task independently on a node. The first container after job submission is launched by the RM to host the ApplicationMaster. It is the AM that then negotiates resources from the RM in the form of containers, which are hosted on different nodes across the Hadoop cluster.
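The container negotiation between the AM and the RM can be sketched as a toy allocator in plain Python. This is a deliberately simplified model, not YARN's scheduler: node names, free-memory figures, and the first-fit packing policy are all invented for illustration; real YARN schedulers also weigh priority, locality, and queue capacity.

```python
node_free_mem = {"node1": 4096, "node2": 2048}  # MB free per node (toy figures)

def allocate_containers(requested, mem_per_container):
    """Grant up to `requested` containers on nodes with enough free memory,
    returning (container ID, hostname) pairs, much like an RM grant."""
    granted = []
    cid = 0
    for node, free in node_free_mem.items():
        while free >= mem_per_container and len(granted) < requested:
            cid += 1
            granted.append((f"container_{cid}", node))
            free -= mem_per_container  # reserve the memory for this container
        node_free_mem[node] = free
    return granted

granted = allocate_containers(requested=4, mem_per_container=1024)
print(granted)  # node1 alone can host 4 x 1024 MB, so all four land there
```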
Process flow of application submission in YARN:
- Step 1: Using a client or the APIs, the user submits the application, say a Spark job JAR. The Resource Manager, whose primary task is to gather and report all the applications running on the entire Hadoop cluster and the available resources on the respective Hadoop nodes, accepts the newly submitted task, depending on the privileges of the user submitting the job.
- Step 2: The RM then delegates the task to its scheduler, which searches for a container that can host the application-specific Application Master. While the scheduler does take into consideration parameters such as availability of resources, task priority, and data locality before scheduling or launching an Application Master, it has no role in monitoring or restarting a failed job. It is the responsibility of the RM to keep track of an AM and restart it in a new container if it fails.
- Step 3: Once the Application Master is launched, it becomes the prerogative of the AM to oversee resource negotiation with the RM for launching task-specific containers. Negotiations with the RM typically cover:
- The priority of the tasks at hand.
- Number of containers to be launched to complete the tasks.
- The resources needed to execute the tasks, that is, RAM and CPU (since Hadoop 3.x).
- The available nodes where job containers can be launched with the required resources.
- Depending on the priority and the availability of resources, the RM grants containers, each represented by a container ID and the hostname of the node on which it can be launched.
- Step 4: The AM then requests the NMs of the respective hosts to launch the containers with the specified IDs and resource configurations. The NMs launch the containers but keep a watch on the resource usage of the tasks; if, for example, a container starts utilizing more resources than it has been provisioned for, the container is killed by the NM. This greatly improves the job isolation and fair sharing of resources that YARN guarantees, as the overuse would otherwise have impacted the execution of other containers. It is important to note, however, that the job status and the application status as a whole are managed by the AM. It falls in the domain of the AM to continuously monitor any delayed or dead containers, simultaneously negotiating with the RM to launch new containers to reassign the tasks of dead containers.
- Step 5: The containers executing on different nodes send application-specific statistics to the AM at specific intervals.
- Step 6: The AM also reports the status of the application directly to the client that submitted it, in our case a Spark job.
- Step 7: The NM monitors the resources being utilized by all the containers on its node and keeps sending periodic updates to the RM.
- Step 8: The AM sends periodic statistics, such as application status, task failures, and log information, to the RM.
Overview of MapReduce
Before delving deep into the MapReduce implementation in Hadoop, let's first understand MapReduce as a concept in parallel computing and why it is a preferred way of computing. MapReduce comprises two mutually exclusive but dependent phases, each capable of running on two different machines or nodes:
Map: In the Map phase, the transformation of data takes place. It splits the data into key-value pairs by splitting it on a splitting keyword (a delimiter).
Suppose we have a text file and we want to do an analysis such as counting the total number of words, or the frequency with which each word occurs in the text file. This is the classical Word Count problem of MapReduce. To address this problem, we first have to identify the splitting keyword so that the data can be split and converted into key-value pairs.
Let’s begin with John Lennon’s song Imagine.
Imagine there's no heaven It's easy if you try No hell below us Above us only sky Imagine all the people living for today
After running the Map phase on the sampled text and splitting it over whitespace, each word is emitted as a key paired with the value 1, for example (imagine, 1), (there's, 1), (no, 1), and so on. The key here represents the word and the value represents the count. It should also be noted that we have converted all the keys to lowercase, to reduce any further complexity arising out of matching case-sensitive keys.
Reduce: The Reduce phase deals with the aggregation of the Map phase's results, and hence all the key-value pairs are aggregated by key.
So the Map output of the text would get aggregated as follows: (imagine, 2), (no, 2), (us, 2), with every other word keeping a count of 1.
As we can see, the Map and Reduce phases can run independently of each other, and hence can use independent nodes in the cluster to process the data. This approach of separating the task into smaller units called Map and Reduce has revolutionized general-purpose distributed/parallel computing, which we now know as MapReduce.
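The Word Count example above can be written as plain-Python map and reduce functions, with the two phases kept strictly separate so either could, in principle, run on a different node. This is a conceptual sketch, not Hadoop code.

```python
text = ("Imagine there's no heaven It's easy if you try No hell below us "
        "Above us only sky Imagine all the people living for today")

def map_phase(line):
    # Emit one (word, 1) pair per token, lowercasing keys as discussed
    return [(word.lower(), 1) for word in line.split()]

def reduce_phase(pairs):
    # Aggregate all values sharing the same key
    counts = {}
    for key, value in pairs:
        counts[key] = counts.get(key, 0) + value
    return counts

counts = reduce_phase(map_phase(text))
print(counts["imagine"], counts["no"], counts["us"])  # 2 2 2
print(counts["today"])                                # 1
```

Notice that reduce_phase never looks at the raw text; it consumes only the key-value pairs the Map phase emitted, which is exactly what lets the two phases be distributed.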
Apache Hadoop's MapReduce has been implemented in much the same way as discussed, except for additional features governing how the data from the Map phase of each node gets transferred to its designated Reduce phase node.
Hadoop's implementation of MapReduce enriches the Map and Reduce phases by adding a few more concrete steps in between to make it fault tolerant and truly distributed. We can describe MR jobs on YARN in five stages.
Job Submission Stage: When a client submits an MR job, the following things happen:
- The RM is requested for an application ID.
- The input data location is checked and, if present, the file split size is computed.
- The job's output location must not already exist; if it does, the job fails.
If all three conditions are met, the MR job JAR, along with its configuration and the details of the input splits, is copied to HDFS in a directory named after the application ID provided by the RM. The job is then submitted to the RM to launch a job-specific Application Master, MRAppMaster.
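The three pre-flight checks above can be sketched as a small Python function. This is a local imitation using ordinary file-system paths, not the Hadoop submission code; the function name and the counter standing in for the RM's application IDs are invented for illustration.

```python
import itertools
import os
import tempfile

_app_ids = itertools.count(1)  # stand-in for the RM handing out application IDs

def preflight(input_path, output_path):
    """Mimic the MR job-submission checks: input must exist, output must not."""
    if not os.path.exists(input_path):
        raise FileNotFoundError(f"input does not exist: {input_path}")
    if os.path.exists(output_path):
        raise FileExistsError(f"output already exists: {output_path}")
    return f"application_{next(_app_ids)}"

with tempfile.TemporaryDirectory() as d:
    inp = os.path.join(d, "input.txt")
    with open(inp, "w") as f:
        f.write("some input data")
    print(preflight(inp, os.path.join(d, "out")))  # application_1
```

The output-must-not-exist check is worth remembering in practice: a real MR job fails at submission time if its output directory already exists, which protects earlier results from being silently overwritten.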
Map Stage: Once the RM receives the client's request to launch MRAppMaster, a call is made to the YARN scheduler to assign a container. As per resource availability, the container is granted and MRAppMaster is launched on the designated node with the provisioned resources. After this, MRAppMaster fetches the input split information from the HDFS path submitted by the client and computes the number of mapper tasks to launch based on the splits. Depending on the number of mappers, it also calculates the required number of reducers as per the configuration. If MRAppMaster finds the number of mappers and reducers and the size of the input files small enough to run in the same JVM, it goes ahead and does so; such a task is called an uber task. In other scenarios, MRAppMaster negotiates container resources from the RM for running these tasks, with mapper tasks given higher priority, since mapper tasks must finish before the sorting phase can start.
Data locality is another concern for containers hosting mappers, as data-local nodes are preferred over rack-local ones, with least preference given to data hosted on remote nodes. When it comes to the Reduce phase, no such data-locality preference exists for containers. Containers hosting the mapper function first copy the MapReduce JAR and configuration files locally and then launch a class called YarnChild in the JVM. The mapper then starts reading the input files, processes them by making key-value pairs, and writes them to a circular buffer.
Shuffle and Sort Phase: Since the circular buffer has a size constraint, once a certain percentage of it fills, 80 percent by default, a thread is spawned to spill the data from the buffer. Before the spilled data is copied to disk, it is first partitioned with respect to its reducer; the background thread then sorts the partitioned data by key and, if a combiner is specified, combines the data too. This process optimizes the data once it is copied to its respective partition folder, and it continues until all the data from the circular buffer has been written to disk. A background thread then checks whether the number of spilled files in each partition is within the range of a configurable parameter; if not, the files are merged and the combiner is run over them until the count falls within the limit of the parameter.
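The partition-sort-combine sequence of a single spill can be illustrated with a toy Python function. This sketch keeps everything in memory rather than spilling to disk; the hash partitioning mirrors Hadoop's default partitioner in spirit, and the combiner here simply pre-sums counts, as it would in Word Count.

```python
def spill(buffer, num_reducers):
    """Partition buffered (key, value) pairs by reducer, sort each partition
    by key, and run a summing combiner over it -- one simulated spill."""
    # Partition: every pair for a given key lands in exactly one partition
    partitions = [[] for _ in range(num_reducers)]
    for key, value in buffer:
        partitions[hash(key) % num_reducers].append((key, value))
    spilled = []
    for part in partitions:
        part.sort(key=lambda kv: kv[0])   # sort by key within the partition
        combined = {}
        for key, value in part:           # combiner: pre-aggregate locally
            combined[key] = combined.get(key, 0) + value
        spilled.append(sorted(combined.items()))
    return spilled

buffer = [("b", 1), ("a", 1), ("b", 1), ("c", 1)]
print(spill(buffer, num_reducers=2))
```

The combiner matters because it shrinks what must later be shuffled across the network: the two ("b", 1) pairs leave the map node as a single ("b", 2).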
The Map task keeps updating its status to the ApplicationMaster throughout its life cycle, and the Reduce tasks start only once 5 percent of the Map tasks have completed. An auxiliary service in the NodeManager serving the Reduce task starts a Netty web server, and a request is made to MRAppMaster for the mapper hosts holding the specific mapper partition files. All the partition files that pertain to a reducer are copied to its respective node in a similar fashion. Since multiple files get copied as the data from the various map nodes for that reducer is collected, a background thread merges the sorted map files, sorts them again, and, if a combiner is configured, combines the result too.
Reduce Stage: It is important to note that at this stage every input file of each reducer should already be sorted by key; this is the presumption with which the reducer starts processing these records and converts the key-value pairs into an aggregated list. Once the reducer has processed the data, it writes the output to the folder specified during job submission.
Clean-up Stage: Each reducer sends periodic updates to MRAppMaster about task completion. Once the Reduce task is over, the Application Master starts the clean-up activity: the submitted job's status is changed from running to successful, all the temporary and intermediate files and folders are deleted, and the application statistics are archived to the job history server.
In this article we saw what HDFS and YARN are, along with MapReduce, and learned about the different functions of MapReduce and HDFS I/O.