(For more resources related to this topic, see here.)

Apache Hadoop is the leading Big Data platform that allows to process large datasets efficiently and at low cost. Other Big Data 0platforms are MongoDB, Cassandra, and CouchDB. This section describes Apache Hadoop core concepts and its ecosystem.

Core components

The following image shows core Hadoop components:

hadoop-and-hdinsight-heartbeat-img-0

At the core, Hadoop has two key components:

Hadoop Distributed File System (HDFS)
Hadoop MapReduce (distributed computing for batch jobs)

For example, say we need to store a large file of 1 TB in size and we only have some commodity servers each with limited storage. Hadoop Distributed File System can help here. We first install Hadoop, then we import the file, which gets split into several blocks that get distributed across all the nodes. Each block is replicated to ensure that there is redundancy. Now we are able to store and retrieve the 1 TB file.

Now that we are able to save the large file, the next obvious need would be to process this large file and get something useful out of it, like a summary report. To process such a large file would be difficult and/or slow if handled sequentially. Hadoop MapReduce was designed to address this exact problem statement and process data in parallel fashion across several machines in a fault-tolerant mode. MapReduce programing models use simple key-value pairs for computation.

One distinct feature of Hadoop in comparison to other cluster or grid solutions is that Hadoop relies on the "share nothing" architecture. This means when the MapReduce program runs, it will use the data local to the node, thereby reducing network I/O and improving performance. Another way to look at this is when running MapReduce, we bring the code to the location where the data resides. So the code moves and not the data.

HDFS and MapReduce together make a powerful combination, and is the reason why there is so much interest and momentum with the Hadoop project.

Hadoop cluster layout

Each Hadoop cluster has three special master nodes (also known as servers):

NameNode: This is the master for the distributed filesystem and maintains a metadata. This metadata has the listing of all the files and the location of each block of a file, which are stored across the various slaves (worker bees). Without a NameNode HDFS is not accessible.
Secondary NameNode: This is an assistant to the NameNode. It communicates only with the NameNode to take snapshots of the HDFS metadata at intervals configured at cluster level.
JobTracker: This is the master node for Hadoop MapReduce. It determines the execution plan of the MapReduce program, assigns it to various nodes, monitors all tasks, and ensures that the job is completed by automatically relaunching any task that fails.

All other nodes of the Hadoop cluster are slaves and perform the following two functions:

DataNode: Each node will host several chunks of files known as blocks. It communicates with the NameNode.
TaskTracker: Each node will also serve as a slave to the JobTracker by performing a portion of the map or reduce task, as decided by the JobTracker.

The following image shows a typical Apache Hadoop cluster:

hadoop-and-hdinsight-heartbeat-img-1

The Hadoop ecosystem

As Hadoop's popularity has increased, several related projects have been created that simplify accessibility and manageability to Hadoop. I have organized them as per the stack, from top to bottom.

The following image shows the Hadoop ecosystem:

hadoop-and-hdinsight-heartbeat-img-2

Data access

The following software are typically used access mechanisms for Hadoop:

Hive: It is a data warehouse infrastructure that provides SQL-like access on HDFS. This is suitable for the ad hoc queries that abstract MapReduce.
Pig: It is a scripting language such as Python that abstracts MapReduce and is useful for data scientists.
Mahout: It is used to build machine learning and recommendation engines.
MS Excel 2013: With HDInsight, you can connect Excel to HDFS via Hive queries to analyze your data.

Data processing

The following are the key programming tools available for processing data in Hadoop:

MapReduce: This is the Hadoop core component that allows distributed computation across all the TaskTrackers
Oozie: It enables creation of workflow jobs to orchestrate Hive, Pig, and MapReduce tasks

The Hadoop data store

The following are the common data stores in Hadoop:

HBase: It is the distributed and scalable NOSQL (Not only SQL) database that provides a low-latency option that can handle unstructured data
HDFS: It is a Hadoop core component, which is the foundational distributed filesystem

Management and integration

The following are the management and integration software:

Zookeeper: It is a high-performance coordination service for distributed applications to ensure high availability
Hcatalog: It provides abstraction and interoperability across various data processing software such as Pig, MapReduce, and Hive
Flume: Flume is distributed and reliable software for collecting data from various sources for Hadoop
Sqoop: It is designed for transferring data between HDFS and any RDBMS

Hadoop distributions

Apache Hadoop is an open-source software and is repackaged and distributed by vendors offering enterprise support. The following is the listing of popular distributions:

Amazon Elastic MapReduce (cloud, http://aws.amazon.com/elasticmapreduce/)
Cloudera (http://www.cloudera.com/content/cloudera/en/home.html)
EMC PivitolHD (http://gopivotal.com/)
Hortonworks HDP (http://hortonworks.com/)
MapR (http://mapr.com/)
Microsoft HDInsight (cloud, http://www.windowsazure.com/)

HDInsight distribution differentiator

HDInsight is an enterprise-ready distribution of Hadoop that runs on Windows servers and on Azure HDInsight cloud service. It is 100 percent compatible with Apache Hadoop. HDInsight was developed in partnership with Hortonworks and Microsoft. Enterprises can now harness the power of Hadoop on Windows servers and Windows Azure cloud service.

The following are the key differentiators for HDInsight distribution:

Enterprise-ready Hadoop: HDInsight is backed by Microsoft support, and runs on standard Windows servers. IT teams can leverage Hadoop with Platform as a Service ( PaaS ) reducing the operations overhead.
Analytics using Excel: With Excel integration, your business users can leverage data in Hadoop and analyze using PowerPivot.
Integration with Active Directory: HDInsight makes Hadoop reliable and secure with its integration with Windows Active directory services.
Integration with .NET and JavaScript: .NET developers can leverage the integration, and write map and reduce code using their familiar tools.
Connectors to RDBMS: HDInsight has ODBC drivers to integrate with SQL Server and other relational databases.
Scale using cloud offering: Azure HDInsight service enables customers to scale quickly as per the project needs and have seamless interface between HDFS and Azure storage vault.
JavaScript console: It consists of easy-to-use JavaScript console for configuring, running, and post processing of Hadoop MapReduce jobs.

Summary

In this article, we reviewed the Apache Hadoop components and the ecosystem of projects that provide a cost-effective way to deal with Big Data problems. We then looked at how Microsoft HDInsight makes the Apache Hadoop solution better by simplified management, integration, development, and reporting.