In this article by Jagat Singh, the author of the book Apache Oozie Essentials, we will take a brief look at Oozie and its basic concepts.
Oozie is a workflow scheduler system for running Apache Hadoop jobs. Oozie workflow jobs are Directed Acyclic Graph (DAG) (https://en.wikipedia.org/wiki/Directed_acyclic_graph) representations of actions. Actions tell what to do in the job. Oozie supports running jobs of various types, such as Java, MapReduce, Pig, Hive, Sqoop, Spark, and DistCp. The output of one action can be consumed by the next action to create a chained sequence.
Oozie has a client-server architecture: we install the server, which stores our jobs, and we use the client to submit jobs to the server.
Let’s get an idea of a few basic concepts of Oozie.
Workflows tell Oozie ‘what’ to do.
A workflow is a collection of actions arranged in the required dependency graph. So, as part of the workflow definition, we write a set of actions and call them in a certain order.
Actions come in various types for the tasks we can do as part of a workflow: for example, Hadoop filesystem actions, Pig actions, Hive actions, MapReduce actions, Spark actions, and so on.
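To make this concrete, here is a minimal sketch of a workflow definition containing a single Pig action. All names here (the app name, script name, and the ${jobTracker} and ${nameNode} properties) are hypothetical placeholders, not values from the book:

```xml
<workflow-app name="demo-wf" xmlns="uri:oozie:workflow:0.5">
    <start to="pig-node"/>
    <!-- A Pig action: runs clean.pig, then transitions on success or failure -->
    <action name="pig-node">
        <pig>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <script>clean.pig</script>
        </pig>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Pig action failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
    </kill>
    <end name="end"/>
</workflow-app>
```

The <ok> and <error> transitions on each action are what turn individual actions into a dependency graph.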
Coordinators tell Oozie ‘when’ to do it.
Coordinators let us run inter-dependent workflows as data pipelines based on some starting criteria. Most Oozie jobs are triggered at a given scheduled time interval, or when an input dataset is present. The following are important definitions related to coordinators:
- Nominal time: The scheduled time at which the job should execute. For example, we process press releases every day at 8:00 PM.
- Actual time: The real time at which the job ran. In some cases, if the input data does not arrive, the job might start late. This type of data-dependent triggering is indicated by a done-flag (more on this later). The done-flag gives the signal to start the job execution.
The general skeleton template of coordinator is shown in the following figure:
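As a sketch of that skeleton, a minimal daily coordinator might look like the following. The dataset name, paths, and dates are hypothetical; note the done-flag element, which names the file whose presence signals that the input data is ready:

```xml
<coordinator-app name="daily-coord" frequency="${coord:days(1)}"
                 start="2015-10-01T20:00Z" end="2015-12-31T20:00Z"
                 timezone="UTC" xmlns="uri:oozie:coordinator:0.4">
    <datasets>
        <!-- One dataset instance is produced per day by the upstream loads -->
        <dataset name="input" frequency="${coord:days(1)}"
                 initial-instance="2015-10-01T20:00Z" timezone="UTC">
            <uri-template>${nameNode}/data/landing/${YEAR}${MONTH}${DAY}</uri-template>
            <done-flag>_SUCCESS</done-flag>
        </dataset>
    </datasets>
    <input-events>
        <!-- The workflow is not started until today's instance exists -->
        <data-in name="dailyInput" dataset="input">
            <instance>${coord:current(0)}</instance>
        </data-in>
    </input-events>
    <action>
        <workflow>
            <app-path>${nameNode}/apps/demo-wf</app-path>
        </workflow>
    </action>
</coordinator-app>
```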
Bundles tell Oozie which things to do together as a group. For example, a set of coordinators that can be run together to satisfy a given business requirement can be combined into a bundle.
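A bundle definition is comparatively small; it mainly points at the coordinators it groups. A minimal sketch, with made-up coordinator names and paths for illustration:

```xml
<bundle-app name="etl-bundle" xmlns="uri:oozie:bundle:0.2">
    <!-- Each coordinator element references a deployed coordinator app -->
    <coordinator name="file-load-coord">
        <app-path>${nameNode}/apps/coord/file-load</app-path>
    </coordinator>
    <coordinator name="sqoop-load-coord">
        <app-path>${nameNode}/apps/coord/sqoop-load</app-path>
    </coordinator>
</bundle-app>
```

Starting or killing the bundle starts or kills all of its coordinators as one unit.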
Book case study
One of the main use cases of Hadoop is ETL data processing.
Suppose that we work for a large consulting company and have won a project to set up a Big Data cluster inside a customer's data center. At a high level, the requirement is to set up an environment that will satisfy the following flow:
- We get data from various sources into Hadoop (file-based loads, Sqoop-based loads)
- We preprocess it with various scripts (Pig, Hive, MapReduce)
- We insert that data into Hive tables for use by analysts and data scientists
- Data scientists write machine learning models (Spark)
We will be using Oozie as our processing scheduling system to do all of the above. In our architecture, we have one landing server, which sits outside the cluster as its front door. All source systems send files to us via scp, and we regularly (for example, nightly, to keep it simple) push them to HDFS using the hadoop fs -copyFromLocal command. This script is cron driven. It has very simple business logic: it runs every night at 8:00 PM and moves all the files it sees on the landing server into HDFS.
The Oozie flow works as follows:
- Oozie picks up the files and cleans them using a Pig script that replaces all the delimiters from commas (,) to pipes (|). We will write the same code using both Pig and MapReduce.
- We then push those processed files into a Hive table.
- For a different source system, which is a database-backed MySQL table, we do a nightly Sqoop import when the load on the database is light. We extract all the records that were generated on the previous business day.
- We insert the output of that into Hive tables as well.
- Analysts and data scientists write their magical Hive scripts and Spark machine learning models on top of those Hive tables.
- We will use Oozie to schedule all of these regular tasks.
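The nightly flow above could be sketched as an Oozie workflow that chains a Pig cleanup action into a Hive load action. The script names and properties here are illustrative placeholders, not the book's actual code:

```xml
<workflow-app name="nightly-etl" xmlns="uri:oozie:workflow:0.5">
    <start to="clean-with-pig"/>
    <!-- Step 1: Pig script rewrites comma delimiters as pipes -->
    <action name="clean-with-pig">
        <pig>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <script>replace_delimiters.pig</script>
        </pig>
        <ok to="load-into-hive"/>
        <error to="fail"/>
    </action>
    <!-- Step 2: load the cleaned files into a Hive table -->
    <action name="load-into-hive">
        <hive xmlns="uri:oozie:hive-action:0.5">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <script>load_table.hql</script>
        </hive>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>ETL failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
    </kill>
    <end name="end"/>
</workflow-app>
```

The Sqoop-based load would be a sibling workflow of the same shape, with a Sqoop action in place of the Pig action.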
A workflow is composed of nodes; the logical DAG of nodes represents the ‘what’ part of the work done by Oozie. Each node does its specified work and, on success, transitions to one node or, on failure, to another. For example, on success it goes to the OK node and on failure it goes to the Kill node.
Nodes in an Oozie workflow are of the following types.
Control flow nodes
These nodes are responsible for defining the start, the end, and the control flow of what to do inside a workflow. They can be any of the following:
- Start node
- End node
- Kill node
- Decision node
- Fork and Join node
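A structural sketch showing how these control flow nodes fit together is given below. The action names task-a and task-b stand in for real actions, and the decision condition is a hypothetical example:

```xml
<workflow-app name="control-flow-demo" xmlns="uri:oozie:workflow:0.5">
    <start to="forking"/>
    <!-- Fork runs task-a and task-b in parallel -->
    <fork name="forking">
        <path start="task-a"/>
        <path start="task-b"/>
    </fork>
    <!-- task-a and task-b (definitions omitted) both transition to the join -->
    <join name="joining" to="check"/>
    <!-- Decision node: route based on an EL condition, here an HDFS check -->
    <decision name="check">
        <switch>
            <case to="end">${fs:exists(outputDir)}</case>
            <default to="fail"/>
        </switch>
    </decision>
    <kill name="fail">
        <message>Workflow killed</message>
    </kill>
    <end name="end"/>
</workflow-app>
```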
Action nodes
Action nodes represent the actual processing tasks, which are executed when called. These are of various types, for example, Pig actions, Hive actions, and MapReduce actions.
So, in this article, we looked at the concepts of Oozie in brief. We also learned about the types of nodes in Oozie workflows.