Deploying Storm on Hadoop for Advertising Analysis



Establishing the architecture

The recent componentization within Hadoop allows any distributed system to use it for resource management. In Hadoop 1.0, resource management was embedded into the MapReduce framework as shown in the following diagram:

Hadoop 2.0 separates out resource management into YARN, allowing other distributed processing frameworks to run on the resources managed under the Hadoop umbrella. In our case, this allows us to run Storm on YARN as shown in the following diagram:

As shown in the preceding diagram, Storm fills the same role as MapReduce: it provides a framework for distributed computation. In this use case, we use Pig scripts to express the ETL and analysis that we want to perform on the data. We will convert such a script into a Storm topology that performs the same function, and then examine some of the intricacies involved in that conversion.
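To make "performs the same function" concrete, here is a minimal Python sketch, not taken from the article's Pig script or topology. It assumes a hypothetical list of ad events and counts impressions per campaign twice: once in batch style over the whole dataset, as a Pig job would, and once incrementally, one event at a time, as a Storm bolt would. The names `EVENTS`, `batch_impressions`, and `StreamingImpressions` are illustrative only.

```python
from collections import Counter, defaultdict

# Hypothetical ad events: (campaign, event_type). Not data from the article.
EVENTS = [
    ("campaign-a", "impression"),
    ("campaign-a", "click"),
    ("campaign-b", "impression"),
    ("campaign-a", "impression"),
]


def batch_impressions(events):
    """Batch style: the whole dataset is available before counting starts."""
    return dict(Counter(c for c, kind in events if kind == "impression"))


class StreamingImpressions:
    """Streaming style: state is updated as each event arrives."""

    def __init__(self):
        self.counts = defaultdict(int)

    def process(self, event):
        campaign, kind = event
        if kind == "impression":
            self.counts[campaign] += 1


stream = StreamingImpressions()
for event in EVENTS:
    stream.process(event)

# Both approaches should agree once the stream has seen every event.
print(batch_impressions(EVENTS))
print(dict(stream.counts))
```

The difference that matters in practice is when the answer becomes available: the batch version produces it only after reading everything, while the streaming version holds a current answer after every event.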

To understand this better, it is worth examining the nodes in a Hadoop cluster and the purpose of the processes running on those nodes. Assume that we have a cluster as depicted in the following diagram:

There are two different components/subsystems shown in the diagram. The first is YARN, which is the new resource management layer introduced in Hadoop 2.0. The second is HDFS. Let’s first delve into HDFS since that has not changed much since Hadoop 1.0.

Examining HDFS

HDFS is a distributed filesystem. It distributes blocks of data across a set of slave nodes. The NameNode is the catalog: it maintains the directory structure and the metadata indicating which nodes hold which blocks. The NameNode does not store any data itself; it only coordinates create, read, update, and delete (CRUD) operations across the distributed filesystem. Storage takes place on the slave nodes, each of which runs a DataNode process. The DataNode processes are the workhorses of the system. They communicate with each other to rebalance, replicate, move, and copy data, and they respond to the CRUD operations of clients.
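The division of labor between the NameNode and the DataNodes can be sketched as a toy model in Python. This is not the HDFS API: the classes, the tiny block size, and the round-robin placement are all assumptions made for illustration, and real HDFS pipelines replicas between DataNodes rather than having the client write each copy.

```python
BLOCK_SIZE = 4     # bytes per block; tiny so the example is easy to trace
REPLICATION = 2    # number of DataNodes that hold each block


class DataNode:
    """Stores block contents; knows nothing about files or directories."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block_id -> bytes


class NameNode:
    """Keeps metadata only: which blocks form a file and where they live."""

    def __init__(self, datanode_names):
        self.datanode_names = list(datanode_names)
        self.catalog = {}   # path -> [(block_id, [datanode name, ...]), ...]
        self._cursor = 0    # round-robin placement cursor (an assumption)

    def allocate(self, path, num_blocks):
        placements = []
        for index in range(num_blocks):
            targets = [
                self.datanode_names[(self._cursor + r) % len(self.datanode_names)]
                for r in range(REPLICATION)
            ]
            self._cursor += 1
            placements.append((f"{path}#{index}", targets))
        self.catalog[path] = placements
        return placements


def write_file(namenode, datanodes, path, data):
    chunks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    placements = namenode.allocate(path, len(chunks))
    for (block_id, targets), chunk in zip(placements, chunks):
        for name in targets:  # simplified: the client writes every replica
            datanodes[name].blocks[block_id] = chunk


def read_file(namenode, datanodes, path):
    # Ask the catalog where each block lives, then read the first replica.
    return b"".join(
        datanodes[targets[0]].blocks[block_id]
        for block_id, targets in namenode.catalog[path]
    )


datanodes = {name: DataNode(name) for name in ("dn1", "dn2", "dn3")}
namenode = NameNode(datanodes)
write_file(namenode, datanodes, "/ads/impressions.log", b"a,b,c,d,e,f")
print(namenode.catalog["/ads/impressions.log"])
```

Note that the `NameNode` object never touches the file's bytes; it only records placements, which mirrors the catalog role described above.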

Examining YARN

YARN is the resource management system. It monitors the load on each of the nodes and coordinates the distribution of new jobs to the slaves in the cluster. The ResourceManager collects status information from the NodeManagers. The ResourceManager also services job submissions from clients.

One additional abstraction within YARN is the concept of an ApplicationMaster. An ApplicationMaster manages resource and container allocation for a specific application. The ApplicationMaster negotiates with the ResourceManager for the assignment of resources. Once the resources are assigned, the ApplicationMaster coordinates with the NodeManagers to instantiate containers. The containers are logical holders for the processes that actually perform the work.
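The negotiation described above can be sketched as a small Python simulation. It is a toy model, not the YARN client API: the class names mirror the YARN roles, but the methods, the memory-only resource model, and the "most free memory first" placement rule are assumptions made for illustration.

```python
class NodeManager:
    """Runs on a worker node and hosts containers."""

    def __init__(self, name, capacity_mb):
        self.name = name
        self.free_mb = capacity_mb
        self.containers = []  # (container_id, memory_mb, command)

    def launch(self, container_id, memory_mb, command):
        # A container is a logical holder for the process doing the work.
        self.containers.append((container_id, memory_mb, command))


class ResourceManager:
    """Tracks free capacity per node and grants container allocations."""

    def __init__(self, node_managers):
        self.nodes = {nm.name: nm for nm in node_managers}
        self._next_id = 0

    def allocate(self, memory_mb, count):
        grants = []
        for _ in range(count):
            # Assumed policy: pick the node with the most free memory.
            node = max(self.nodes.values(), key=lambda nm: nm.free_mb)
            if node.free_mb < memory_mb:
                break  # cluster is full; grant what we could
            node.free_mb -= memory_mb
            self._next_id += 1
            grants.append((f"container_{self._next_id}", node.name))
        return grants


class ApplicationMaster:
    """Negotiates resources for one application, then starts containers."""

    def __init__(self, resource_manager, command):
        self.rm = resource_manager
        self.command = command

    def run(self, memory_mb, count):
        grants = self.rm.allocate(memory_mb, count)
        for container_id, node_name in grants:
            self.rm.nodes[node_name].launch(container_id, memory_mb, self.command)
        return grants


rm = ResourceManager([NodeManager("node-a", 2048), NodeManager("node-b", 1024)])
am = ApplicationMaster(rm, command="storm supervisor")
grants = am.run(memory_mb=1024, count=3)
print(grants)
```

The two-phase shape is the point of the sketch: the ApplicationMaster first obtains grants from the ResourceManager, and only then asks the individual NodeManagers to instantiate containers.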

The ApplicationMaster is a processing-framework-specific library. Storm-YARN provides the ApplicationMaster for running Storm processes on YARN. HDFS distributes the ApplicationMaster as well as the Storm framework itself. At present, Storm-YARN expects an external ZooKeeper. Nimbus starts up and connects to ZooKeeper when the application is deployed.

The following diagram depicts the Hadoop infrastructure running Storm via Storm-YARN:

As shown in the preceding diagram, YARN is used to deploy the Storm application framework. At launch, the Storm Application Master is started within a YARN container. That, in turn, creates an instance of Storm Nimbus and the Storm UI.

After that, Storm-YARN launches supervisors in separate YARN containers. Each of these supervisor processes can spawn workers within its container.

Both the Application Master and the Storm framework are distributed via HDFS. Storm-YARN provides command-line utilities to start the Storm cluster, launch supervisors, and configure Storm for topology deployment. We will see these facilities later in this article.

To complete the architectural picture, we need to layer in the batch and real-time processing mechanisms: Pig and Storm topologies, respectively. We also need to depict the actual data.

Often, a queuing mechanism such as Kafka is used to queue work for a Storm cluster. To simplify things, we will use data stored in HDFS instead. To fully realize the value of converting from Pig to Storm, we would convert the topology to consume from Kafka rather than HDFS. The following diagram depicts our use of Pig, Storm, YARN, and HDFS for this use case, omitting elements of the infrastructure for clarity:

As the preceding diagram depicts, our data will be stored in HDFS. The dashed lines depict the batch process for analysis, while the solid lines depict the real-time system. In each of the systems, the following steps take place:

| Step | Purpose | Pig Equivalent | Storm-YARN Equivalent |
|---|---|---|---|
| 1 | The processing frameworks are deployed | The MapReduce Application Master is deployed and started | Storm-YARN launches the Application Master and distributes the Storm framework |
| 2 | The specific analytics are launched | The Pig script is compiled to MapReduce jobs and submitted as a job | Topologies are deployed to the cluster |
| 3 | The resources are reserved | Map and reduce tasks are created in YARN containers | Supervisors are instantiated with workers |
| 4 | The analysis reads the data from storage and processes it | Pig reads the data out of HDFS | Storm reads the work, typically from Kafka; in this case, the topology reads it from a flat file |

Another analogy can be drawn between Pig and Trident. Pig scripts compile down into MapReduce jobs, while Trident topologies compile down into Storm topologies.

For more information on the Storm-YARN project, visit the following URL:

https://github.com/yahoo/storm-yarn

