9 min read

[box type=”note” align=”” class=”” width=””]This article is an excerpt from a book by Muhammad Asif Abbasi titled Learning Apache Spark 2. In this book, author explains how to perform big data analytics using Spark streaming, machine learning techniques and more.[/box]

Today, we will learn about the basics of Spark architecture and its various components. We will also explore how to install Spark for running it in the local mode.

Apache Spark architecture overview

Apache Spark is an open source distributed data processing engine for clusters, which provides a unified programming model engine across different types data processing workloads and platforms.

Apache Spark

At the core of the project is a set of APIs for Streaming, SQL, Machine Learning (ML), and

Graph. Spark community supports the Spark project by providing connectors to various open source and proprietary data storage engines. Spark also has the ability to run on a variety of cluster managers like YARN and Mesos, in addition to the Standalone cluster manager which comes bundled with Spark for standalone installation. This is thus a marked difference from Hadoop ecosystem where Hadoop provides a complete platform in terms of storage formats, compute engine, cluster manager, and so on. Spark has been designed with the single goal of being an optimized compute engine. This therefore allows you to run Spark on a variety of cluster managers including being run standalone, or being plugged into YARN and Mesos. Similarly, Spark does not have its own storage, but it can connect to a wide number of storage engines.

Currently Spark APIs are available in some of the most common languages including Scala, Java, Python, and R.

Let’s start by going through various API’s available in Spark.

Spark-core

At the heart of the Spark architecture is the core engine of Spark, commonly referred to as spark-core, which forms the foundation of this powerful architecture. Spark-core provides services such as managing the memory pool, scheduling of tasks on the cluster (Spark works as a Massively Parallel Processing (MPP) system when deployed in cluster mode), recovering failed jobs, and providing support to work with a wide variety of storage systems such as HDFS, S3, and so on.

Note: Spark-Core provides a full scheduling component for Standalone Scheduling: Code is available at:

https://github.com/apache/spark/tree/master/core/src/main/scala/org/apache/spark/scheduler

Spark-Core abstracts the users of the APIs from lower-level technicalities of working on a cluster. Spark-Core also provides the RDD APIs which are the basis of other higher-level APIs, and are the core programming elements on Spark.

Note: MPP systems generally use a large number of processors (on separate hardware or virtualized) to perform a set of operations in parallel. The objective of the MPP systems is to divide work into smaller task pieces and running them in parallel to increase in throughput time.

Spark SQL


Spark SQL is one of the most popular modules of Spark designed for structured and semistructured data processing. Spark SQL allows users to query structured data inside Spark programs using SQL or the DataFrame and the Dataset API, which is usable in Java, Scala, Python, and R. Because of the fact that the DataFrame API provides a uniform way to access a variety of data sources, including Hive datasets, Avro, Parquet, ORC, JSON, and JDBC, users should be able to connect to any data source the same way, and join across these multiple sources together. The usage of Hive meta store by Spark SQL gives the user full compatibility with existing Hive data, queries, and UDFs. Users can seamlessly run their current Hive workload without modification on Spark. Spark SQL can also be accessed through spark-sql shell, and existing business tools can connect via standard JDBC and ODBC interfaces.

Spark streaming

More than 50% of users consider Spark Streaming to be the most important component of Apache Spark. Spark Streaming is a module of Spark that enables processing of data arriving in passive or live streams of data. Passive streams can be from static files that you choose to stream to your Spark cluster. This can include all sorts of data ranging from web server logs, social-media activity (following a particular Twitter hashtag), sensor data from your car/phone/home, and so on. Spark-streaming provides a bunch of APIs that help you to create streaming applications in a way similar to how you would create a batch job, with minor tweaks.

As of Spark 2.0, the philosophy behind Spark Streaming is not to reason about streaming and building data application as in the case of a traditional data source. This means the data from sources is continuously appended to the existing tables, and all the operations are run on the new window. A single API lets the users create batch or streaming applications, with the only difference being that a table in batch applications is finite, while the table for a streaming job is considered to be infinite.

MLlib

MLlib is Machine Learning Library for Spark, if you remember from the preface, iterative algorithms are one of the key drivers behind the creation of Spark, and most machine learning algorithms perform iterative processing in one way or another.

Note: Machine learning is a type of artificial intelligence (AI) that provides computers with the ability to learn without being explicitly programmed. Machine learning focuses on the development of computer programs that can teach themselves to grow and change when exposed to new data.

Spark MLlib allows developers to use Spark API and build machine learning algorithms by tapping into a number of data sources including HDFS, HBase, Cassandra, and so on. Spark is super fast with iterative computing and it performs 100 times better than MapReduce. Spark MLlib contains a number of algorithms and utilities including, but not limited to, logistic regression, Support Vector Machine (SVM), classification and regression trees, random forest and gradient-boosted trees, recommendation via ALS, clustering via K-Means, Principal Component Analysis (PCA), and many others.

GraphX

GraphX is an API designed to manipulate graphs. The graphs can range from a graph of web pages linked to each other via hyperlinks to a social network graph on Twitter connected by followers or retweets, or a Facebook friends list.

Graph theory is a study of graphs, which are mathematical structures used to model pairwise relations between objects. A graph is made up of vertices (nodes/points), which are connected by edges (arcs/lines).

                                                                                                                           – Wikipedia.org

Spark provides a built-in library for graph manipulation, which therefore allows the developers to seamlessly work with both graphs and collections by combining ETL, discovery analysis, and iterative graph manipulation in a single workflow. The ability to combine transformations, machine learning, and graph computation in a single system at high speed makes Spark one of the most flexible and powerful frameworks out there. The ability of Spark to retain the speed of computation with the standard features of fault-tolerance makes it especially handy for big data problems. Spark GraphX has a number of built-in graph algorithms including PageRank, Connected components, Label propagation, SVD++, and Triangle counter.

Spark deployment

Apache Spark runs on both Windows and Unix-like systems (for example, Linux and Mac OS). If you are starting with Spark you can run it locally on a single machine. Spark requires Java 7+, Python 2.6+, and R 3.1+. If you would like to use Scala API (the language in which Spark was written), you need at least Scala version 2.10.x.

Spark can also run in a clustered mode, using which Spark can run both by itself, and on several existing cluster managers. You can deploy Spark on any of the following cluster managers, and the list is growing everyday due to active community support:

  • Hadoop YARN.
  • Apache Mesos.
  • Standalone scheduler.
  • Yet Another Resource Negotiator (YARN) is one of the key features including a redesigned resource manager thus splitting out the scheduling and resource management capabilities from original MapReduce in Hadoop.
  • Apache Mesos is an open source cluster manager that was developed at the University of California, Berkeley. It provides efficient resource isolation and sharing across distributed applications, or frameworks.

Installing Apache Spark

As mentioned in the earlier pages, while Spark can be deployed on a cluster, you can also run it in local mode on a single machine.

In this chapter, we are going to download and install Apache Spark on a Linux machine and run it in local mode. Before we do anything we need to download Apache Spark from Apache’s web page for the Spark project:

  1. Use your recommended browser to navigate to http://spark.apache.org/downloads.html.
  2. Choose a Spark release. You’ll find all previous Spark releases listed here. We’ll go with release 2.0.0 (at the time of writing, only the preview edition was available).
  3. You can download Spark source code, which can be built for several versions of Hadoop, or download it for a specific Hadoop version. In this case, we are going to download one that has been pre-built for Hadoop 2.7 or later.
  4. You can also choose to download directly or from among a number of different Mirrors. For the purpose of our exercise we’ll use direct download and download it to our preferred location.
Note: If you are using Windows, please remember to use a pathname without any spaces.

5. The file that you have downloaded is a compressed TAR archive. You need to extract the archive.

Note: The TAR utility is generally used to unpack TAR files. If you don’t have TAR, you might want to download that from the repository or use 7-ZIP, which is also one of my favorite utilities.

6. Once unpacked, you will see a number of directories/files. Here’s what you would typically see when you list the contents of the unpacked directory:

The bin folder contains a number of executable shell scripts such as pypark, sparkR, spark-shell, spark-sql, and spark-submit. All of these executables are used to interact with Spark, and we will be using most if not all of these.

7. If you see my particular download of spark you will find a folder called yarn. The example below is a Spark that was built for Hadoop version 2.7 which comes with YARN as a cluster manager.

Spark folder contents

We’ll start by running Spark shell, which is a very simple way to get started with Spark and learn the API. Spark shell is a Scala Read-Evaluate-Print-Loop (REPL), and one of the few REPLs available with Spark which also include Python and R.

You should change to the Spark download directory and run the Spark shell as follows: /bin/spark-shell.

Starting Spark shell

We now have Spark running in standalone mode. We’ll discuss the details of the deployment architecture a bit later in this chapter, but now let’s kick start some basic Spark programming to appreciate the power and simplicity of the Spark framework.

We have gone through the Spark architecture overview and the steps to install Spark on your own personal machine.

To know more about Spark SQL, Spark Streaming, and Machine Learning with Spark, you can refer our book Learning Apache Spark 2.

Learning Apache Spark 2

 

 

Data Science fanatic. Cricket fan. Series Binge watcher. You can find me hooked to my PC updating myself constantly if I am not cracking lame jokes with my team.

LEAVE A REPLY

Please enter your comment!
Please enter your name here