In this article by Venkat Ankam, author of the book, Big Data Analytics with Spark and Hadoop, we will understand the features of Hadoop and Spark and how we can combine them.
This article is divided into the following subtopics:
- Introducing Apache Spark
- Why Hadoop + Spark?
Introducing Apache Spark
Hadoop and MapReduce have been around for 10 years and have proven to be a robust solution for processing massive data at scale. However, MapReduce performs poorly in iterative computing, where the output of one MapReduce job must be written to Hadoop Distributed File System (HDFS) before the next job can read it. Even within a single job, performance suffers from the drawbacks of the MapReduce framework.
Let’s take a look at the history of computing trends to understand how computing paradigms have changed over the last two decades.
When the network became cheaper (around 1990), the trend was to reference data by URI; when storage became cheaper (around 2000), it was to replicate data; and when memory became cheaper (around 2010), it became to recompute and cache data in memory, as shown in Figure 1:
Figure 1: Trends of computing
So, what really changed over this period?
Tape is dead, disk has become the new tape, and SSD has almost become the new disk. Caching data in RAM is the current trend.
Let’s understand why memory-based computing is important and how it provides significant performance benefits.
Figure 2 shows the data transfer rates from various media to the CPU: disk to CPU is about 100 MB/s, SSD to CPU is about 600 MB/s, and network to CPU ranges from 1 MB/s to 1 GB/s. RAM to CPU, however, is astonishingly fast at around 10 GB/s. The idea, then, is to cache all or part of the data in memory so that higher performance can be achieved.
Figure 2: Why memory?
Spark history
Spark started in 2009 as a research project in the UC Berkeley RAD Lab, which later became the AMPLab. The researchers in the lab had previously worked on Hadoop MapReduce and observed that MapReduce was inefficient for iterative and interactive computing jobs. Thus, from the beginning, Spark was designed to be fast for interactive queries and iterative algorithms, bringing in ideas such as support for in-memory storage and efficient fault recovery.
In 2011, AMPLab started to develop high-level components in Spark, such as Shark and Spark Streaming. These components are sometimes referred to as Berkeley Data Analytics Stack (BDAS).
Spark was first open sourced in March 2010 and moved to the Apache Software Foundation in June 2013.
In February 2014, it became a top-level project at the Apache Software Foundation. Spark has since grown into one of the largest open source communities in big data, with more than 250 contributors from over 50 organizations, and its user base has expanded tremendously, from small startups to Fortune 500 companies. Figure 3 shows the history of Apache Spark:
Figure 3: The history of Apache Spark
What is Apache Spark?
Let’s understand what Apache Spark is and what makes it a force to reckon with in big data analytics:
- Apache Spark is a fast, enterprise-grade engine for large-scale data processing that is interoperable with Apache Hadoop.
- It is written in Scala, which is both an object-oriented and functional programming language that runs in a JVM.
- Spark enables applications to reliably distribute data in memory during processing. This is the key to Spark’s performance, as it allows applications to avoid expensive disk access and to perform computations at memory speed.
- It is suitable for iterative algorithms by having every iteration access data through memory.
- Spark programs run up to 100 times faster than MapReduce in memory, or 10 times faster on disk (http://spark.apache.org/).
- It provides native support for Java, Scala, Python, and R, with interactive shells for Scala, Python, and R. Applications can be developed easily, often with 2 to 10 times less code.
- Spark powers a stack of libraries including Spark SQL and DataFrames for interactive analytics, MLlib for machine learning, GraphX for graph processing, and Spark Streaming for real-time analytics. You can combine these features seamlessly in the same application.
- Spark runs on Hadoop YARN, Apache Mesos, or its own standalone resource manager, on-premises hardware or in the cloud.
What Apache Spark is not
Hadoop provides us with HDFS for storage and MapReduce for compute. However, Spark does not provide any specific storage medium. Spark is mainly a compute engine, but you can store data in-memory or on Tachyon to process it.
Spark has the ability to create distributed datasets from any file stored in the HDFS or other storage systems supported by Hadoop APIs (including your local filesystem, Amazon S3, Cassandra, Hive, HBase, Elasticsearch, and so on).
It’s important to note that Spark is not Hadoop and does not require Hadoop to run. It simply has support for storage systems implementing Hadoop APIs. Spark supports text files, SequenceFiles, Avro, Parquet, and any other Hadoop InputFormat.
Can Spark replace Hadoop?
Spark is designed to interoperate with Hadoop. It is not a replacement for Hadoop, but for the MapReduce framework on Hadoop. Hadoop processing frameworks that use MapReduce as their engine (Sqoop, Hive, Pig, Mahout, Cascading, Crunch, and so on) can now use Spark as an additional processing engine.
MapReduce issues
MapReduce developers faced challenges with respect to performance and converting every business problem to a MapReduce problem. Let’s understand the issues related to MapReduce and how they are addressed in Apache Spark:
- MapReduce (MR) creates separate JVMs for every Mapper and Reducer. Launching JVMs takes time.
- MR code requires a significant amount of boilerplate. The programmer has to express every business problem in terms of Map and Reduce, which makes programming very difficult. One MR job is rarely enough for a complete computation; multiple jobs are usually needed, and the programmer has to design and keep track of optimizations across all of them.
- An MR job writes its data to disk between jobs, which makes it unsuitable for iterative processing.
- Higher-level abstractions, such as Cascading and Scalding, make programming MR jobs easier, but they do not provide any additional performance benefit.
- MR does not provide great APIs either.
MapReduce is slow because every job in a MapReduce workflow stores its data on disk. Multiple queries on the same dataset read the data separately, creating heavy disk I/O, as shown in Figure 4:
Figure 4: MapReduce versus Apache Spark
Spark takes the concept of MapReduce to the next level by storing intermediate data in memory and reusing it as many times as needed. This provides high performance at memory speed, as shown in Figure 4.
If I have only one MapReduce job, does it perform the same as Spark?
No. A Spark job outperforms a MapReduce job because of in-memory computation and shuffle improvements; Spark is faster than MapReduce even when the memory cache is disabled. A new shuffle implementation (sort-based instead of hash-based), a new network module (based on Netty instead of using the block manager to send shuffle data), and a new external shuffle service enabled Spark to set records in both petabyte-scale sorting (on 190 nodes with 46 TB of RAM) and the terabyte sort. Spark sorted 100 TB of data in 23 minutes using 206 EC2 i2.8xlarge machines; the previous world record of 72 minutes was set by a Hadoop MapReduce cluster of 2,100 nodes. This means Spark sorted the same data 3 times faster using 10 times fewer machines. All the sorting took place on disk (HDFS), without using Spark’s in-memory cache (https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html).
To summarize, here are the differences between MapReduce and Spark:
|  | MapReduce | Spark |
| --- | --- | --- |
| Ease of use | Not easy to code and use | Provides a good API; easy to code and use |
| Performance | Relatively poor compared with Spark | In-memory performance |
| Iterative processing | Every MR job writes data to disk, and the next iteration reads it back from disk | Caches data in memory |
| Fault tolerance | Achieved by replicating data in HDFS | Achieved through resilient distributed dataset (RDD) lineage |
| Runtime architecture | Every Mapper and Reducer runs in a separate JVM | Tasks run in preallocated executor JVMs |
| Shuffle | Stores data on disk | Stores data in memory and on disk |
| Operations | Map and Reduce | Map, Reduce, Join, Cogroup, and many more |
| Execution model | Batch | Batch, interactive, and streaming |
| Natively supported programming languages | Java | Java, Scala, Python, and R |
Spark’s stack
Spark’s stack components are Spark Core, Spark SQL and DataFrames, Spark Streaming, MLlib, and GraphX, as shown in Figure 5:
Figure 5: The Apache Spark ecosystem
Here is a comparison of Spark components versus Hadoop components:
| Spark | Hadoop |
| --- | --- |
| Spark Core | MapReduce, Apache Tez |
| Spark SQL and DataFrames | Apache Hive, Impala, Apache Tez, Apache Drill |
| Spark Streaming | Apache Storm |
| Spark MLlib | Apache Mahout |
| Spark GraphX | Apache Giraph |
To understand the framework at a higher level, let’s take a look at these core components of Spark and their integrations:
- Programming languages: Java, Scala, Python, and R, with Scala, Python, and R shells for quick development.
- Core execution engine:
  - Spark Core: the underlying general execution engine for the Spark platform; all other functionality is built on top of it. It provides Java, Scala, Python, and R APIs for ease of development.
  - Tungsten: provides memory management and binary processing, cache-aware computation, and code generation.
- Frameworks:
  - Spark SQL and DataFrames: a Spark module for structured data processing. It provides a programming abstraction called DataFrames and can also act as a distributed SQL query engine.
  - Spark Streaming: enables us to build scalable and fault-tolerant streaming applications. It integrates with a wide variety of data sources, including filesystems, HDFS, Flume, Kafka, and Twitter.
  - MLlib: a machine learning library for creating data products or extracting deep meaning from data. MLlib delivers high performance because of in-memory caching of data.
  - GraphX: a graph computation engine with graph algorithms for building graph applications.
- Off-heap storage: Tachyon provides reliable data sharing at memory speed within and across cluster frameworks/jobs. Spark’s default (experimental) OFF_HEAP storage is Tachyon.
- Cluster resource managers:
  - Standalone: by default, applications are submitted to the standalone mode cluster, and each application tries to use all available nodes and resources.
  - YARN: controls resource allocation and provides dynamic resource allocation capabilities.
  - Mesos: has two modes, coarse-grained and fine-grained. The coarse-grained approach allocates a static number of resources, just like the standalone resource manager; the fine-grained approach allocates resources dynamically, just like YARN.
- Storage: HDFS, S3, and other filesystems with support for Hadoop InputFormat.
- Database integrations: HBase, Cassandra, MongoDB, Neo4j, and RDBMS databases.
- Streaming source integrations: Flume, Kafka, Kinesis, Twitter, ZeroMQ, and file streams.
- Packages: http://spark-packages.org/ provides a list of third-party data source APIs and packages.
- Distributions: from Cloudera, Hortonworks, MapR, and DataStax.
The Spark ecosystem is a unified stack that provides you with the power of combining SQL, streaming, and machine learning in one program. The advantages of unification are as follows:
- No need to copy or ETL data between systems
- Combines processing types in one program
- Code reuse
- One system to learn
- One system to maintain
An example of unification is shown in Figure 6:
Figure 6: Unification of the Apache Spark ecosystem
Why Hadoop + Spark?
Apache Spark shines brightest when it is combined with Hadoop. To understand why, let’s take a look at the features of each.
Hadoop features
The Hadoop features are described as follows:
- Unlimited scalability: stores unlimited data by scaling out HDFS; manages cluster resources effectively with YARN; runs multiple applications alongside Spark; supports thousands of simultaneous users.
- Enterprise grade: provides security with Kerberos authentication and ACL-based authorization; data encryption; high reliability and integrity; multitenancy.
- Wide range of applications: files (structured, semi-structured, or unstructured); streaming sources such as Flume and Kafka; databases (any RDBMS or NoSQL database).
Spark features
The Spark features are described as follows:
- Easy development: no boilerplate coding; multiple native APIs (Java, Scala, Python, and R); REPLs for Scala, Python, and R.
- In-memory performance: RDDs, with a directed acyclic graph (DAG) to unify processing.
- Unification: batch, SQL, machine learning, streaming, and graph processing.
When both frameworks are combined, we get the power of enterprise-grade applications with in-memory performance, as shown in Figure 7:
Figure 7: Spark applications on the Hadoop platform
Frequently asked questions about Spark
The following are the frequent questions that practitioners raise about Spark:
- My dataset does not fit in-memory. How can I use Spark?
Spark’s operators spill data to the disk if it does not fit in-memory, allowing it to run well on data of any size. Likewise, cached datasets that do not fit in-memory are either spilled to the disk or recomputed on the fly when needed, as determined by the RDD’s storage level. By default, Spark will recompute the partitions that don’t fit in-memory. The storage level can be changed to MEMORY_AND_DISK to spill partitions to the disk.
Figure 8 shows the performance difference between fully cached and on-disk processing:
Figure 8: Spark performance: fully cached versus on disk
- How does fault recovery work in Spark?
Spark’s built-in fault tolerance, based on RDD lineage, automatically recovers from failures. Figure 9 shows the performance impact of a failure in the sixth iteration of a k-means algorithm:
Figure 9: Fault recovery performance
Summary
In this article, we introduced Apache Spark, examined the features of Hadoop and Spark, and discussed how we can combine them.