Apache Hadoop is a leading Big Data platform for processing large datasets efficiently and at low cost. Other Big Data platforms include MongoDB, Cassandra, and CouchDB. This section describes the core concepts of Apache Hadoop and its ecosystem.
The following image shows core Hadoop components:
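At the core, Hadoop has two key components: the Hadoop Distributed File System (HDFS), which stores the data, and Hadoop MapReduce, which processes it.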
For example, say we need to store a large file of 1 TB in size and we only have some commodity servers each with limited storage. Hadoop Distributed File System can help here. We first install Hadoop, then we import the file, which gets split into several blocks that get distributed across all the nodes. Each block is replicated to ensure that there is redundancy. Now we are able to store and retrieve the 1 TB file.
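The splitting and replication described above can be sketched in a few lines of Python. This is only an illustration, not how HDFS is implemented: the 128 MB block size, the replication factor of 3, the node names, and the round-robin placement in the hypothetical `plan_blocks` function are all assumptions made for the example. Real HDFS placement also takes rack topology and free space into account.

```python
# Illustrative sketch only: block size, replication factor, node names,
# and round-robin placement are assumptions, not HDFS internals.
BLOCK_SIZE = 128 * 1024**2   # assumed block size: 128 MB
REPLICATION = 3              # assumed replication factor


def plan_blocks(file_size, nodes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Map each block index to the list of nodes that hold a replica of it."""
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    num_blocks = -(-file_size // block_size)  # integer ceiling division
    placement = {}
    for block in range(num_blocks):
        # Place replicas on consecutive nodes so no node holds a block twice.
        placement[block] = [
            nodes[(block + r) % len(nodes)] for r in range(replication)
        ]
    return placement


nodes = ["node1", "node2", "node3", "node4", "node5"]
plan = plan_blocks(1024**4, nodes)  # a 1 TB file

print(len(plan))   # number of blocks
print(plan[0])     # nodes holding replicas of the first block
```

Because every block lives on several nodes, losing a single server does not lose any part of the file.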
Now that we are able to save the large file, the next obvious need is to process it and get something useful out of it, such as a summary report. Processing such a large file sequentially would be difficult and slow. Hadoop MapReduce was designed to address exactly this problem: it processes data in parallel across several machines in a fault-tolerant manner. The MapReduce programming model uses simple key-value pairs for computation.
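To make the key-value model concrete, here is a minimal single-process sketch of a word count written in the MapReduce style. It does not use the Hadoop API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) are hypothetical and exist only for this illustration. On a real cluster, the map and reduce steps would run in parallel on many machines.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Collapse all values for one key into a single result."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle, and reduce in sequence within one process."""
    mapped = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


lines = ["big data big cluster", "data moves code moves"]
counts = run_job(lines)
print(counts)
```

Each map call looks at only one line and each reduce call at only one key, which is what lets the framework spread the work across machines.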
One distinct feature of Hadoop in comparison to other cluster or grid solutions is that it relies on a “shared nothing” architecture. This means that when a MapReduce program runs, it uses the data local to each node, thereby reducing network I/O and improving performance. Another way to look at this is that, when running MapReduce, we bring the code to the location where the data resides. The code moves, not the data.
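The idea of moving the code to the data can be illustrated with a small scheduling sketch. This is a simplified assumption of how a locality-aware scheduler might behave, not Hadoop's actual scheduler: the hypothetical `assign_tasks` function sends each task to the least-loaded node that already holds a replica of the block, so no block has to cross the network.

```python
def assign_tasks(block_locations):
    """Assign one task per block to a node that already stores that block.

    block_locations maps a block id to the list of nodes holding a replica.
    Among those nodes, the one with the fewest assigned tasks is chosen.
    """
    load = {}
    for replicas in block_locations.values():
        for node in replicas:
            load.setdefault(node, 0)

    assignment = {}
    for block, replicas in sorted(block_locations.items()):
        # Pick the least-loaded replica holder; break ties by node name.
        target = min(replicas, key=lambda node: (load[node], node))
        assignment[block] = target
        load[target] += 1
    return assignment


# Hypothetical replica layout for three blocks.
locations = {
    0: ["node1", "node2"],
    1: ["node2", "node3"],
    2: ["node1", "node3"],
}
tasks = assign_tasks(locations)
print(tasks)
```

Every task in the result runs on a node listed in `locations` for its block, so each one reads its input from local disk.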
HDFS and MapReduce together make a powerful combination and are the reason why there is so much interest in and momentum behind the Hadoop project.
Each Hadoop cluster has three special master nodes (also known as servers):
All other nodes of the Hadoop cluster are slaves and perform the following two functions:
The following image shows a typical Apache Hadoop cluster:
As Hadoop’s popularity has increased, several related projects have been created to make Hadoop easier to access and manage. They are organized here by their position in the stack, from top to bottom.
The following image shows the Hadoop ecosystem:
Data access
The following software is typically used to access data in Hadoop:
Data processing
The following are the key programming tools available for processing data in Hadoop:
The Hadoop data store
The following are the common data stores in Hadoop:
Management and integration
The following software is used for management and integration:
Apache Hadoop is open-source software that is repackaged and distributed by vendors offering enterprise support. The following is a list of popular distributions:
HDInsight is an enterprise-ready distribution of Hadoop that runs on Windows servers and on the Azure HDInsight cloud service. It is 100 percent compatible with Apache Hadoop. HDInsight was developed by Microsoft in partnership with Hortonworks. Enterprises can now harness the power of Hadoop on Windows servers and the Windows Azure cloud service.
The following are the key differentiators for the HDInsight distribution:
In this article, we reviewed the Apache Hadoop components and the ecosystem of projects that provide a cost-effective way to deal with Big Data problems. We then looked at how Microsoft HDInsight improves on the Apache Hadoop solution by simplifying management, integration, development, and reporting.