HDFS and MapReduce

0
122
10 min read

(For more resources related to this topic, see here.)

Essentials of HDFS

HDFS is a distributed filesystem that has been designed to run on top of a cluster of industry standard hardware. The architecture of HDFS is such that there is no specific need for high-end hardware. HDFS is a highly fault-tolerant system and can handle failures of nodes in a cluster without loss of data. The primary goal behind the design of HDFS is to serve large data files efficiently. HDFS achieves this efficiency and high throughput in data transfer by enabling streaming access to the data in the filesystem.

The following are the important features of HDFS:

  • Fault tolerance: Many computers working together as a cluster are bound to have hardware failures. Hardware failures such as disk failures, network connectivity issues, and RAM failures could disrupt processing and cause major downtime. This could lead to data loss as well slippage of critical SLAs. HDFS is designed to withstand such hardware failures by detecting faults and taking recovery actions as required.

    The data in HDFS is split across the machines in the cluster as chunks of data called blocks. These blocks are replicated across multiple machines of the cluster for redundancy. So, even if a node/machine becomes completely unusable and shuts down, the processing can go on with the copy of the data present on the nodes where the data was replicated.

  • Streaming data: Streaming access enables data to be transferred in the form of a steady and continuous stream. This means if data from a file in HDFS needs to be processed, HDFS starts sending the data as it reads the file and does not wait for the entire file to be read. The client who is consuming this data starts processing the data immediately, as it receives the stream from HDFS. This makes data processing really fast.

  • Large data store: HDFS is used to store large volumes of data. HDFS functions best when the individual data files stored are large files, rather than having large number of small files. File sizes in most Hadoop clusters range from gigabytes to terabytes. The storage scales linearly as more nodes are added to the cluster.

  • Portable: HDFS is a highly portable system. Since it is built on Java, any machine or operating system that can run Java should be able to run HDFS. Even at the hardware layer, HDFS is flexible and runs on most of the commonly available hardware platforms. Most production level clusters have been set up on commodity hardware.

  • Easy interface: The HDFS command-line interface is very similar to any Linux/Unix system. The commands are similar in most cases. So, if one is comfortable with the performing basic file actions in Linux/Unix, using commands with HDFS should be very easy.

The following two daemons are responsible for operations on HDFS:

  • Namenode

  • Datanode

The namenode and datanodes daemons talk to each other via TCP/IP.

Configuring HDFS

All HDFS-related configuration is done by adding/updating the properties in the hdfs-site.xml file that is found in the conf folder under the Hadoop installation folder.

The following are the different properties that are part of the hdfs-site.xml file:

  • dfs.namenode.servicerpc-address: This specifies the unique namenode RPC address in the cluster. Services/daemons such as the secondary namenode and datanode daemons use this address to connect to the namenode daemon whenever it needs to communicate. This property is shown in the following code snippet:

    <property> <name>dfs.namenode.servicerpc-address</name> <value>node1.hcluster:8022</value> </property>

  • dfs.namenode.http-address: This specifies the URL that can be used to monitor the namenode daemon from a browser. This property is shown in the following code snippet:

    <property> <name>dfs.namenode.http-address</name> <value>node1.hcluster:50070</value> </property>

  • dfs.replication: This specifies the replication factor for data block replication across the datanode daemons. The default is 3 as shown in the following code snippet:

    <property> <name>dfs.replication</name> <value>3</value> </property>

  • dfs.blocksize: This specifies the block size. In the following example, the size is specified in bytes (134,217,728 bytes is 128 MB):

    <property> <name>dfs.blocksize</name> <value>134217728</value> </property>

  • fs.permissions.umask-mode: This specifies the umask value that will be used when creating files and directories in HDFS. This property is shown in the following code snippet:

    <property> <name>fs.permissions.umask-mode</name> <value>022</value> </property>

The read/write operational flow in HDFS

To get a better understanding of HDFS, we need to understand the flow of operations for the following two scenarios:

  • A file is written to HDFS

  • A file is read from HDFS

HDFS uses a single-write, multiple-read model, where the files are written once and read several times. The data cannot be altered once written. However, data can be appended to the file by reopening it. All files in the HDFS are saved as data blocks.

Writing files in HDFS

The following sequence of steps occur when a client tries to write a file to HDFS:

  1. The client informs the namenode daemon that it wants to write a file. The namenode daemon checks to see whether the file already exists.

  2. If it exists, an appropriate message is sent back to the client. If it does not exist, the namenode daemon makes a metadata entry for the new file.

  3. The file to be written is split into data packets at the client end and a data queue is built. The packets in the queue are then streamed to the datanodes in the cluster.

  4. The list of datanodes is given by the namenode daemon, which is prepared based on the data replication factor configured. A pipeline is built to perform the writes to all datanodes provided by the namenode daemon.

  5. The first packet from the data queue is then transferred to the first datanode daemon. The block is stored on the first datanode daemon and is then copied to the next datanode daemon in the pipeline. This process goes on till the packet is written to the last datanode daemon in the pipeline.

  6. The sequence is repeated for all the packets in the data queue. For every packet written on the datanode daemon, a corresponding acknowledgement is sent back to the client.

  7. If a packet fails to write onto one of the datanodes, the datanode daemon is removed from the pipeline and the remainder of the packets is written to the good datanodes. The namenode daemon notices the under-replication of the block and arranges for another datanode daemon where the block could be replicated.

  8. After all the packets are written, the client performs a close action, indicating that the packets in the data queue have been completely transferred.

  9. The client informs the namenode daemon that the write operation is now complete.

The following diagram shows the data block replication process across the datanodes during a write operation in HDFS:

Reading files in HDFS

The following steps occur when a client tries to read a file in HDFS:

  1. The client contacts the namenode daemon to get the location of the data blocks of the file it wants to read.

  2. The namenode daemon returns the list of addresses of the datanodes for the data blocks.

  3. For any read operation, HDFS tries to return the node with the data block that is closest to the client. Here, closest refers to network proximity between the datanode daemon and the client.

  4. Once the client has the list, it connects the closest datanode daemon and starts reading the data block using a stream.

  5. After the block is read completely, the connection to datanode is terminated and the datanode daemon that hosts the next block in the sequence is identified and the data block is streamed. This goes on until the last data block for that file is read.

The following diagram shows the read operation of a file in HDFS:

Understanding the namenode UI

Hadoop provides web interfaces for each of its services. The namenode UI or the namenode web interface is used to monitor the status of the namenode and can be accessed using the following URL:

http://<namenode-server>:50070/

The namenode UI has the following sections:

  • Overview: The general information section provides basic information of the namenode with options to browse the filesystem and the namenode logs.

    The following is the screenshot of the Overview section of the namenode UI:

    The Cluster ID parameter displays the identification number of the cluster. This number is same across all the nodes within the cluster.

    A block pool is a set of blocks that belong to a single namespace. The Block Pool Id parameter is used to segregate the block pools in case there are multiple namespaces configured when using HDFS federation. In HDFS federation, multiple namenodes are configured to scale the name service horizontally. These namenodes are configured to share datanodes amongst themselves. We will be exploring HDFS federation in detail a bit later.

  • Summary: The following is the screenshot of the cluster’s summary section from the namenode web interface:

    Under the Summary section, the first parameter is related to the security configuration of the cluster. If Kerberos (the authorization and authentication system used in Hadoop) is configured, the parameter will show as Security is on. If Kerberos is not configured, the parameter will show as Security is off.

    The next parameter displays information related to files and blocks in the cluster. Along with this information, the heap and non-heap memory utilization is also displayed. The other parameters displayed in the Summary section are as follows:

    • Configured Capacity: This displays the total capacity (storage space) of HDFS.

    • DFS Used: This displays the total space used in HDFS.

    • Non DFS Used: This displays the amount of space used by other files that are not part of HDFS. This is the space used by the operating system and other files.

    • DFS Remaining: This displays the total space remaining in HDFS.

    • DFS Used%: This displays the total HDFS space utilization shown as percentage.

    • DFS Remaining%: This displays the total HDFS space remaining shown as percentage.

    • Block Pool Used: This displays the total space utilized by the current namespace.

    • Block Pool Used%: This displays the total space utilized by the current namespace shown as percentage. As you can see in the preceding screenshot, in this case, the value matches that of the DFS Used% parameter. This is because there is only one namespace (one namenode) and HDFS is not federated.

    • DataNodes usages% (Min, Median, Max, stdDev): This displays the usages across all datanodes in the cluster. This helps administrators identify unbalanced nodes, which may occur when data is not uniformly placed across the datanodes. Administrators have the option to rebalance the datanodes using a balancer.

    • Live Nodes: This link displays all the datanodes in the cluster as shown in the following screenshot:

    • Dead Nodes: This link displays all the datanodes that are currently in a dead state in the cluster. A dead state for a datanode daemon is when the datanode daemon is not running or has not sent a heartbeat message to the namenode daemon. Datanodes are unable to send heartbeats if there exists a network connection issue between the machines that host the datanode and namenode daemons. Excessive swapping on the datanode machine causes the machine to become unresponsive, which also prevents the datanode daemon from sending heartbeats.

    • Decommissioning Nodes: This link lists all the datanodes that are being decommissioned.

    • Number of Under-Replicated Blocks: This represents the number of blocks that have not replicated as per the replication factor configured in the hdfs-site.xml file.

  • Namenode Journal Status: The journal status provides location information of the fsimage file and the state of the edits logfile. The following screenshot shows the NameNode Journal Status section:

  • NameNode Storage: The namenode storage table provides the location of the fsimage file along with the type of the location. In this case, it is IMAGE_AND_EDITS, which means the same location is used to store the fsimage file as well as the edits logfile. The other types of locations are IMAGE, which stores only the fsimage file and EDITS, which stores only the edits logfile. The following screenshot shows the NameNode Storage information:

LEAVE A REPLY

Please enter your comment!
Please enter your name here