6 min read

In the first part of this two-part blog series, we learned that using statistical algorithms gives us a 95 percent accuracy rate for big data analytics, is faster, and is a lot more beneficial than waiting for the exact results. We also took a look at a few algorithms along with a quick introduction to Spark. Now let’s take a look at two tools in depth that are used with statistical algorithms: Apache Spark and Apache Pig.

Apache Spark

Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, and Python, as well as an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.

At its core, Spark provides a general programming model that enables developers to write applications by composing arbitrary operators, such as mappers, reducers, joins, group-bys, and filters. This composition makes it easy to express a wide array of computations, including iterative machine learning, streaming, complex queries, and batch processing. In addition, Spark keeps track of the data that each of the operators produces, and enables applications to reliably store this data in memory. This is the key to Spark’s performance, as it allows applications to avoid costly disk accesses. It would be wonderful to have one tool for everyone, and one architecture and language for investigative as well as operational analytics.

Spark’s ease of use comes from its general programming model, which does not constrain users to structure their applications into a bunch of map and reduce operations. Spark’s parallel programs look very much like sequential programs, which make them easier to develop and reason about. Finally, Spark allows users to easily combine batch, interactive, and streaming jobs in the same application. As a result, a Spark job can be up to 100 times faster and requires writing 210 times less code than an equivalent Hadoop job.

Spark allows users and applications to explicitly cache a dataset by calling the cache() operation. This means that your applications can now access data from RAM instead of disk, which can dramatically improve the performance of iterative algorithms that access the same dataset repeatedly. This use case covers an important class of applications, as all machine learning and graph algorithms are iterative in nature.

When constructing a complex pipeline of MapReduce jobs, the task of correctly parallelizing the sequence of jobs is left to you. Thus, a scheduler tool such as Apache Oozie is often required to carefully construct this sequence. With Spark, a whole series of individual tasks is expressed as a single program flow that is lazily evaluated so that the system has a complete picture of the execution graph. This approach allows the core scheduler to correctly map the dependencies across different stages in the application, and automatically parallelize the flow of operators without user intervention.

With a low-latency data analysis system at your disposal, it’s natural to extend the engine towards processing live data streams. Spark has an API for working with streams, providing exactly-once semantics and full recovery of stateful operators. It also has the distinct advantage of giving you the same Spark APIs to process your streams, including reuse of your regular Spark application code.

Pig on Spark

Pig on Spark combines the power and simplicity of Apache Pig on Apache Spark, making existing ETL pipelines 100 times faster than before. We do that via a unique mix of our operator toolkit, called DataDoctor, and Spark. The following are the primary goals for the project:

  1. Make data processing more powerful
  2. Make data processing more simple
  3. Make data processing 100 times faster than before

DataDoctor is a high-level operator DSL on top of Spark. It has frameworks for no-symmetrical joins, sorting, grouping, and embedding native Spark functions. It hides a lot of complexity and makes it simple to implement data operators used in applications like Pig and Apache Hive on Spark.

Pig operates in a similar manner to big data applications like Hive and Cascading. It has a query language quite akin to SQL that allows analysts and developers to design and write data flows. The query language is translated in to a “logical plan” that is further translated in to a “physical plan” containing operators. Those operators are then run on the designated execution engine (MapReduce, Apache Tez, and now Spark). There are a whole bunch of details around tracking progress, handling errors, and so on that I will skip here.

Query planning on Spark will vary significantly from MapReduce, as Spark handles data wrangling in a much more optimized way. Further query planning can benefit greatly from ongoing effort on Catalyst inside Spark. At this moment, we have simply introduced a SparkPlanner that will undertake the conversion from a logical to a physical plan for Pig. Databricks is working actively to enable Catalyst to handle much of the operator optimizations that will plug into SparkPlanner in the near future. Longer term, we plan to rely on Spark itself for logical plan generation. An early version of this integration has been prototyped in partnership with Databricks.

Pig Core hands off Spark execution to SparkLauncher with the physical plan. SparkLauncher creates a SparkContext providing all the Pig dependency JAR files and Pig itself. SparkLauncher gets an MR plan object created from the physical plan. At this point, we override all the Pig operators to DataDoctor operators recursively in the whole plan. Two iterations are performed over the plan — one that looks at the store operations and recursively travels down the execution tree, and a second iteration that does a breadth-first traversal over the plan and calls convert on each of the operators. The base class of converters in DataDoctor is a POConverter class and defines the abstract method convert, which is called during plan execution. More details of Pig on Spark can be found at PIG4059. As we merge with Apache Pig, we need to focus on the following enhancements to further improve the speed of Pig:

  • Cache operator: Adding a new operator to explicitly tell Spark to cache certain datasets for faster execution
  • Storage hints: Allowing the user to specify the storage location of datasets in Spark for better control of memory
  • YARN and Mesos support: Adding resource manager support for more global deployment and support


In many large-scale data applications, statistical perspectives provide us with fruitful analytics in many ways, including speed and efficiency.

About the author

Praveen Rachabattuni is a tech lead at Sigmoid Analytics, a company that provides a real-time streaming and ETL framework on Apache Spark. Praveen is also a committer to Apache Pig.


Please enter your comment!
Please enter your name here