Introduction to Hadoop

10 min read

In this article by Shiva Achari, author of the book Hadoop Essentials, you’ll get an introduction about Hadoop, its uses, and advantages

(For more resources related to this topic, see here.)

Hadoop

In big data, the most widely used system is Hadoop. Hadoop is an open source implementation of big data, which is widely accepted in the industry, and benchmarks for Hadoop are impressive and, in some cases, incomparable to other systems. Hadoop is used in the industry for large-scale, massively parallel, and distributed data processing. Hadoop is highly fault tolerant and configurable to as many levels as we need for the system to be fault tolerant, which has a direct impact to the number of times the data is stored across.

As we have already touched upon big data systems, the architecture revolves around two major components: distributed computing and parallel processing. In Hadoop, the distributed computing is handled by HDFS, and parallel processing is handled by MapReduce. In short, we can say that Hadoop is a combination of HDFS and MapReduce, as shown in the following image:

Hadoop history

Hadoop began from a project called Nutch, an open source crawler-based search, which processes on a distributed system. In 2003–2004, Google released Google MapReduce and GFS papers. MapReduce was adapted on Nutch. Doug Cutting and Mike Cafarella are the creators of Hadoop. When Doug Cutting joined Yahoo, a new project was created along the similar lines of Nutch, which we call Hadoop, and Nutch remained as a separate sub-project. Then, there were different releases, and other separate sub-projects started integrating with Hadoop, which we call a Hadoop ecosystem.

The following figure and description depicts the history with timelines and milestones achieved in Hadoop:

Description

2002.8: The Nutch Project was started
2003.2: The first MapReduce library was written at Google
2003.10: The Google File System paper was published
2004.12: The Google MapReduce paper was published
2005.7: Doug Cutting reported that Nutch now uses new MapReduce implementation
2006.2: Hadoop code moved out of Nutch into a new Lucene sub-project
2006.11: The Google Bigtable paper was published
2007.2: The first HBase code was dropped from Mike Cafarella
2007.4: Yahoo! Running Hadoop on 1000-node cluster
2008.1: Hadoop made an Apache Top Level Project
2008.7: Hadoop broke the Terabyte data sort Benchmark
2008.11: Hadoop 0.19 was released
2011.12: Hadoop 1.0 was released
2012.10: Hadoop 2.0 was alpha released
2013.10: Hadoop 2.2.0 was released
2014.10: Hadoop 2.6.0 was released

Advantages of Hadoop

Hadoop has a lot of advantages, and some of them are as follows:

Low cost—Runs on commodity hardware: Hadoop can run on average performing commodity hardware and doesn’t require a high performance system, which can help in controlling cost and achieve scalability and performance. Adding or removing nodes from the cluster is simple, as an when we require. The cost per terabyte is lower for storage and processing in Hadoop.
Storage flexibility: Hadoop can store data in raw format in a distributed environment. Hadoop can process the unstructured data and semi-structured data better than most of the available technologies. Hadoop gives full flexibility to process the data and we will not have any loss of data.
Open source community: Hadoop is open source and supported by many contributors with a growing network of developers worldwide. Many organizations such as Yahoo, Facebook, Hortonworks, and others have contributed immensely toward the progress of Hadoop and other related sub-projects.
Fault tolerant: Hadoop is massively scalable and fault tolerant. Hadoop is reliable in terms of data availability, and even if some nodes go down, Hadoop can recover the data. Hadoop architecture assumes that nodes can go down and the system should be able to process the data.
Complex data analytics: With the emergence of big data, data science has also grown leaps and bounds, and we have complex and heavy computation intensive algorithms for data analysis. Hadoop can process such scalable algorithms for a very large-scale data and can process the algorithms faster.

Uses of Hadoop

Some examples of use cases where Hadoop is used are as follows:

Searching/text mining
Log processing
Recommendation systems
Business intelligence/data warehousing
Video and image analysis
Archiving
Graph creation and analysis
Pattern recognition
Risk assessment
Sentiment analysis

Hadoop ecosystem

A Hadoop cluster can be of thousands of nodes, and it is complex and difficult to manage manually, hence there are some components that assist configuration, maintenance, and management of the whole Hadoop system. In this article, we will touch base upon the following components:

Layer	Utility/Tool name
Distributed filesystem	Apache HDFS
Distributed programming	Apache MapReduce
	Apache Hive
	Apache Pig
	Apache Spark
NoSQL databases	Apache HBase
Data ingestion	Apache Flume
	Apache Sqoop
	Apache Storm
Service programming	Apache Zookeeper
Scheduling	Apache Oozie
Machine learning	Apache Mahout
System deployment	Apache Ambari

All the components above are helpful in managing Hadoop tasks and jobs.

Apache Hadoop

The open source Hadoop is maintained by the Apache Software Foundation. The official website for Apache Hadoop is http://hadoop.apache.org/, where the packages and other details are described elaborately. The current Apache Hadoop project (version 2.6) includes the following modules:

Hadoop common: The common utilities that support other Hadoop modules
Hadoop Distributed File System (HDFS): A distributed filesystem that provides high-throughput access to application data
Hadoop YARN: A framework for job scheduling and cluster resource management
Hadoop MapReduce: A YARN-based system for parallel processing of large datasets

Apache Hadoop can be deployed in the following three modes:

Standalone: It is used for simple analysis or debugging.
Pseudo distributed: It helps you to simulate a multi-node installation on a single node. In pseudo-distributed mode, each of the component processes runs in a separate JVM. Instead of installing Hadoop on different servers, you can simulate it on a single server.
Distributed: Cluster with multiple worker nodes in tens or hundreds or thousands of nodes.

In a Hadoop ecosystem, along with Hadoop, there are many utility components that are separate Apache projects such as Hive, Pig, HBase, Sqoop, Flume, Zookeper, Mahout, and so on, which have to be configured separately. We have to be careful with the compatibility of subprojects with Hadoop versions as not all versions are inter-compatible.

Apache Hadoop is an open source project that has a lot of benefits as source code can be updated, and also some contributions are done with some improvements. One downside for being an open source project is that companies usually offer support for their products, not for an open source project. Customers prefer support and adapt Hadoop distributions supported by the vendors.

Let’s look at some Hadoop distributions available.

Hadoop distributions

Hadoop distributions are supported by the companies managing the distribution, and some distributions have license costs also. Companies such as Cloudera, Hortonworks, Amazon, MapR, and Pivotal have their respective Hadoop distribution in the market that offers Hadoop with required sub-packages and projects, which are compatible and provide commercial support. This greatly reduces efforts, not just for operations, but also for deployment, monitoring, and tools and utility for easy and faster development of the product or project.

For managing the Hadoop cluster, Hadoop distributions provide some graphical web UI tooling for the deployment, administration, and monitoring of Hadoop clusters, which can be used to set up, manage, and monitor complex clusters, which reduce a lot of effort and time.

Some Hadoop distributions which are available are as follows:

Cloudera: According to The Forrester Wave™: Big Data Hadoop Solutions, Q1 2014, this is the most widely used Hadoop distribution with the biggest customer base as it provides good support and has some good utility components such as Cloudera Manager, which can create, manage, and maintain a cluster, and manage job processing, and Impala is developed and contributed by Cloudera which has real-time processing capability.
Hortonworks: Hortonworks’ strategy is to drive all innovation through the open source community and create an ecosystem of partners that accelerates Hadoop adoption among enterprises. It uses an open source Hadoop project and is a major contributor to Hadoop enhancement in Apache Hadoop. Ambari was developed and contributed to Apache by Hortonworks. Hortonworks offers a very good, easy-to-use sandbox for getting started. Hortonworks contributed changes that made Apache Hadoop run natively on the Microsoft Windows platforms including Windows Server and Microsoft Azure.
MapR: MapR distribution of Hadoop uses different concepts than plain open source Hadoop and its competitors, especially support for a network file system (NFS) instead of HDFS for better performance and ease of use. In NFS, Native Unix commands can be used instead of Hadoop commands. MapR have high availability features such as snapshots, mirroring, or stateful failover.
Amazon Elastic MapReduce (EMR): AWS’s Elastic MapReduce (EMR) leverages its comprehensive cloud services, such as Amazon EC2 for compute, Amazon S3 for storage, and other services, to offer a very strong Hadoop solution for customers who wish to implement Hadoop in the cloud. EMR is much advisable to be used for infrequent big data processing. It might save you a lot of money.

Pillars of Hadoop

Hadoop is designed to be highly scalable, distributed, massively parallel processing, fault tolerant and flexible and the key aspect of the design are HDFS, MapReduce and YARN. HDFS and MapReduce can perform very large scale batch processing at a much faster rate. Due to contributions from various organizations and institutions Hadoop architecture has undergone a lot of improvements, and one of them is YARN. YARN has overcome some limitations of Hadoop and allows Hadoop to integrate with different applications and environments easily, especially in streaming and real-time analysis. One such example that we are going to discuss are Storm and Spark, they are well known in streaming and real-time analysis, both can integrate with Hadoop via YARN.

Data access components

MapReduce is a very powerful framework, but has a huge learning curve to master and optimize a MapReduce job. For analyzing data in a MapReduce paradigm, a lot of our time will be spent in coding. In big data, the users come from different backgrounds such as programming, scripting, EDW, DBA, analytics, and so on, for such users there are abstraction layers on top of MapReduce. Hive and Pig are two such layers, Hive has a SQL query-like interface and Pig has Pig Latin procedural language interface. Analyzing data on such layers becomes much easier.

Data storage component

HBase is a column store-based NoSQL database solution. HBase’s data model is very similar to Google’s BigTable framework. HBase can efficiently process random and real-time access in a large volume of data, usually millions or billions of rows. HBase’s important advantage is that it supports updates on larger tables and faster lookup. The HBase data store supports linear and modular scaling. HBase stores data as a multidimensional map and is distributed. HBase operations are all MapReduce tasks that run in a parallel manner.

Data ingestion in Hadoop

In Hadoop, storage is never an issue, but managing the data is the driven force around which different solutions can be designed differently with different systems, hence managing data becomes extremely critical. A better manageable system can help a lot in terms of scalability, reusability, and even performance. In a Hadoop ecosystem, we have two widely used tools: Sqoop and Flume, both can help manage the data and can import and export data efficiently, with a good performance. Sqoop is usually used for data integration with RDBMS systems, and Flume usually performs better with streaming log data.

Streaming and real-time analysis

Storm and Spark are the two new fascinating components that can run on YARN and have some amazing capabilities in terms of processing streaming and real-time analysis. Both of these are used in scenarios where we have heavy continuous streaming data and have to be processed in, or near, real-time cases. The example which we discussed earlier for traffic analyzer is a good example for use cases of Storm and Spark.

Summary

In this article, we explored a bit about Hadoop history, finally migrating to the advantages and uses of Hadoop.

Hadoop systems are complex to monitor and manage, and we have separate sub-projects’ frameworks, tools, and utilities that integrate with Hadoop and help in better management of tasks, which are called a Hadoop ecosystem.