
In this article by Rajanarayanan Thottuvaikkatumana, author of the book Apache Spark 2 for Beginners, you will get an overview of Spark. Data is one of the most important assets of any organization. The scale at which data is being collected and used in organizations is growing beyond imagination. The speed at which data is being ingested, the variety of the data types in use, and the amount of data being processed and stored are breaking all-time records every moment. It is very common these days, even in small-scale organizations, for data to grow from gigabytes to terabytes to petabytes. For the same reason, processing needs are also growing, demanding the capability to process data at rest as well as data in motion.


Take any organization; its success depends on the decisions made by its leaders, and sound decisions need the backing of good data and of the information generated by processing that data. This poses a big challenge: how to process the data in a timely and cost-effective manner so that the right decisions can be made. Data processing techniques have evolved since the early days of computers. Countless data processing products and frameworks came into the market and disappeared over the years. Most of these products and frameworks were not general purpose in nature. Most organizations relied on their own bespoke applications for their data processing needs, either in silos or in conjunction with specific products.

Large-scale Internet applications, popularly known as Internet of Things (IoT) applications, heralded the common need for open frameworks to process huge amounts of data, ingested at great speed and spanning various types of data. Large-scale websites, media streaming applications, and the huge batch processing needs of organizations made the need even more relevant. The open source community has also grown considerably along with the growth of the Internet, delivering production-quality software supported by reputed software companies. A huge number of companies started using open source software and deploying it in their production environments.

Apache Spark

Spark is a Java Virtual Machine (JVM) based distributed data processing engine that scales well, and it is fast compared to many other data processing frameworks. Spark was born at the University of California, Berkeley, and later became one of the top-level projects of the Apache Software Foundation. The research paper Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center talks about the philosophy behind the design of Spark. The research paper says:

“To test the hypothesis that simple specialized frameworks provide value, we identified one class of jobs that were found to perform poorly on Hadoop by machine learning researchers at our lab: iterative jobs, where a dataset is reused across a number of iterations. We built a specialized framework called Spark optimized for these workloads.”

The biggest claim from Spark on speed is that it can run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk. Spark can make this claim because it does the processing in the main memory of the worker nodes and prevents unnecessary I/O operations with the disks. The other advantage Spark offers is the ability to chain tasks at the application programming level without writing to the disks at all, or by minimizing the number of writes to the disks.

The Spark programming paradigm is very powerful and exposes a uniform programming model supporting application development in multiple programming languages. Spark supports programming in Scala, Java, Python, and R, even though there is no functional parity across all the supported languages. Apart from writing Spark applications in these programming languages, Spark has an interactive shell with Read, Evaluate, Print, and Loop (REPL) capabilities for Scala, Python, and R. At this moment, there is no REPL support for Java in Spark. The Spark REPL is a very versatile tool that can be used to try out and test Spark application code in an interactive fashion, enabling easy prototyping, debugging, and much more.

In addition to the core data processing engine, Spark comes with a powerful stack of domain-specific libraries that use the core Spark libraries and provide the functionality needed for various big data processing needs. The following table lists the supported libraries:

Library | Use | Supported languages
Spark SQL | Enables the use of SQL statements or the DataFrame API inside Spark applications | Scala, Java, Python, and R
Spark Streaming | Enables processing of live data streams | Scala, Java, and Python
Spark MLlib | Enables development of machine learning applications | Scala, Java, Python, and R
Spark GraphX | Enables graph processing and supports a growing library of graph algorithms | Scala

Understanding the Spark programming model

Spark became an instant hit in the market because of its ability to process huge amounts of data across a growing variety of data types, data sources, and data destinations. The most important and basic data abstraction Spark provides is the resilient distributed dataset (RDD). Spark supports distributed processing on a cluster of nodes. The moment there is a cluster of nodes, there is a good chance that some of the nodes will die while data processing is going on. When such failures happen, the framework should be capable of recovering from them. Spark is designed to do that, and that is what the resilient part of the RDD signifies. If there is a huge amount of data to be processed and there are nodes available in the cluster, the framework should have the capability to split the big dataset into smaller chunks and distribute them to be processed on more than one node in parallel. Spark is capable of doing that, and that is what the distributed part of the RDD signifies. In other words, Spark is designed from the ground up so that its basic dataset abstraction can be split into smaller pieces deterministically and distributed to more than one node in the cluster for parallel processing, while elegantly handling failures in the nodes.
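
To make this concrete, here is a minimal PySpark sketch, assuming a local test run, that creates an RDD from a small collection and lets Spark split it into partitions that could be processed on different nodes; the application name and partition count are illustrative choices, not something prescribed by Spark:

from pyspark import SparkConf, SparkContext

# Illustrative local configuration; on a real cluster the master URL would
# point to the cluster manager instead of local[*]
conf = SparkConf().setAppName("RDDBasics").setMaster("local[*]")
sc = SparkContext(conf=conf)

# Create an RDD from a Python collection, asking Spark to split it into
# four partitions that can be processed in parallel on different nodes
numbers = sc.parallelize(range(1, 101), 4)

print(numbers.getNumPartitions())   # 4: the dataset is split into smaller chunks
print(numbers.sum())                # 5050: the chunks are processed and combined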

A Spark RDD is immutable. Once an RDD is created, it cannot be changed, intentionally or unintentionally. This gives another insight into the construction of an RDD. There are some strong rules based on which an RDD is created, and because of that, when the nodes processing some part of an RDD die, the driver program can recreate those parts, assign the task of processing them to other nodes, and ultimately complete the data processing job successfully. Since the RDD is immutable, splitting a big one into smaller ones, distributing them to various worker nodes for processing, and finally compiling the results to produce the final result can be done safely, without worrying about the underlying data getting changed.

A Spark RDD is distributable. If Spark is run in cluster mode, where there are multiple worker nodes available to take tasks, all these nodes have different execution contexts. The individual tasks are distributed and run on different JVMs. All these activities of a big RDD getting divided into smaller chunks, getting distributed for processing to the worker nodes, and finally assembling the results back are completely hidden from the user. Spark has its own mechanism for recovering from system faults and other forms of errors that happen during data processing. Hence, this data abstraction is highly resilient.

A Spark RDD lives in memory (most of the time). Spark keeps all the RDDs in memory as much as it can. Only in rare situations, where Spark is running out of memory or the data size is growing beyond its capacity, is the data written to disk. Most of the processing on an RDD happens in memory, and that is the reason why Spark is able to process data at lightning-fast speed.

A Spark RDD is strongly typed. A Spark RDD can be created using any supported data type. These data types can be Scala/Java-supported intrinsic data types or custom-created data types such as your own classes. The biggest advantage coming out of this design decision is freedom from runtime errors: if something is going to break because of a data type issue, it will break at compile time.

Spark does its data processing using RDDs. Data is read from relevant data sources, such as text files and NoSQL data stores, to form the RDDs. On such an RDD, various data transformations are performed, and finally the result is collected. To be precise, Spark comes with Spark Transformations and Spark Actions that act upon RDDs. Whenever a Spark Transformation is applied to an RDD, a new RDD gets created, because RDDs are inherently immutable. The RDDs created at the end of each Spark Transformation can be saved for future reference, or they will eventually go out of scope. Spark Actions are used to return the computed values to the driver program. The process of creating one or more RDDs and applying transformations and actions on them is a very common usage pattern seen ubiquitously in Spark applications.
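
The following hedged PySpark sketch illustrates that pattern; the sample records and the word extraction are invented for illustration, and a SparkContext named sc is assumed to be available (for example, inside the PySpark REPL):

# Form an RDD; in a real application this would typically come from a
# data source such as a text file or a NoSQL data store
log_lines = sc.parallelize([
    "INFO starting job", "ERROR disk failure", "INFO job done", "ERROR timeout"
])

# Spark Transformations: each one returns a new, immutable RDD
errors = log_lines.filter(lambda line: line.startswith("ERROR"))
reasons = errors.map(lambda line: line.split(" ", 1)[1])

# Spark Actions: values are computed and returned to the driver program
print(errors.count())      # 2
print(reasons.collect())   # ['disk failure', 'timeout']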

Spark SQL

Spark SQL is a library built on top of Spark. It exposes an SQL interface and a DataFrame API. The DataFrame API supports Scala, Java, Python, and R. In programming languages such as R, there is a data frame abstraction used to store data tables in memory. The Python data analysis library named Pandas also has a similar data frame concept. Once that data structure is available in memory, programs can extract, slice, and dice the data as needed. The same data table concept is extended to Spark as the DataFrame, built on top of the RDD, and there is a very comprehensive API known as the DataFrame API in Spark SQL to process the data in a DataFrame. An SQL-like query language is also available on top of the DataFrame abstraction, catering to the needs of end users who want to query and process the underlying structured data. In summary, a DataFrame is a distributed data table organized in rows and columns, with a name for each column.

There is no doubt that SQL is the lingua franca for doing data analysis, and Spark SQL is the answer from the Spark family of toolsets for data analysis. So what does it provide? It provides the ability to run SQL on top of Spark. Whether the data is coming from CSV, Avro, Parquet, Hive, NoSQL data stores such as Cassandra, or even an RDBMS, Spark SQL can be used to analyze it and mix it in with Spark programs. Many of the data sources mentioned here are supported intrinsically by Spark SQL, and many others are supported by external packages. The most important aspect to highlight here is the ability of Spark SQL to deal with data from a very wide variety of data sources. Once the data is available as a DataFrame in Spark, Spark SQL can process it in a completely distributed way, combine DataFrames coming from various data sources, and query them as if the entire dataset came from a single source.
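
As a small sketch of what this looks like in practice, the following PySpark snippet reads a hypothetical CSV file into a DataFrame and then analyzes it both through the DataFrame API and through an SQL query; the file name and column names are assumptions made for illustration:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SparkSQLSketch").getOrCreate()

# Read a CSV file into a DataFrame; sales.csv and its columns are hypothetical
sales = spark.read.csv("sales.csv", header=True, inferSchema=True)

# DataFrame API style
sales.filter(sales.amount > 1000).groupBy("region").count().show()

# SQL style on the same data
sales.createOrReplaceTempView("sales")
spark.sql("SELECT region, SUM(amount) AS total FROM sales GROUP BY region").show()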

In the previous section, the RDD was discussed and introduced as the Spark programming model. Are the DataFrame API and the SQL dialect in Spark SQL replacing the RDD-based programming model? Definitely not! The RDD-based programming model is the generic and basic data processing model in Spark. RDD-based programming requires real programming techniques: the Spark Transformations and Spark Actions use a lot of functional programming constructs. Even though the amount of code required in the RDD-based programming model is less compared to Hadoop MapReduce or any other paradigm, there is still a need to write some amount of functional code. This is a barrier to entry for many data scientists, data analysts, and business analysts, who mainly perform exploratory data analysis or do some prototyping with the data. Spark SQL completely removes those constraints. Simple and easy-to-use domain-specific language (DSL) based methods to read and write data from data sources, an SQL-like language to select, filter, and aggregate, and the capability to read data from a wide variety of data sources make it easy for anybody who knows the data structure to use it.

Which is the best use case for RDDs and which is the best use case for Spark SQL? The answer is very simple: if the data is structured, can be arranged in tables, and each column can be given a name, then use Spark SQL. This doesn't mean that RDDs and DataFrames are two disparate entities. They interoperate very well, and conversions from RDD to DataFrame and vice versa are very much possible. Many of the Spark Transformations and Spark Actions typically applied to RDDs can also be applied to DataFrames. Interaction with the Spark SQL library is done mainly through two methods: one is through SQL-like queries and the other is through the DataFrame API.
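
A brief sketch of the RDD-to-DataFrame interoperability mentioned above, assuming the SparkSession named spark from the previous snippet and with invented Row contents:

from pyspark.sql import Row

# An RDD of Row objects converted to a DataFrame
people_rdd = spark.sparkContext.parallelize([
    Row(name="Alice", age=34), Row(name="Bob", age=29)
])
people_df = spark.createDataFrame(people_rdd)

# ...and back again: every DataFrame exposes its underlying RDD of Rows
print(people_df.rdd.map(lambda row: row.name).collect())   # ['Alice', 'Bob']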

The Spark programming paradigm has many abstractions to choose from when it comes to developing data processing applications. The fundamentals of Spark programming start with RDDs, which can easily deal with unstructured, semi-structured, and structured data. The Spark SQL library offers highly optimized performance when processing structured data, which makes the basic RDDs look inferior in terms of performance. To fill this gap, from Spark 1.6 onwards a new abstraction named Dataset was introduced that complements the RDD-based Spark programming model. It works pretty much the same way as an RDD when it comes to Spark Transformations and Spark Actions, while at the same time being highly optimized like Spark SQL. The Dataset API provides strong compile-time type safety when writing programs, and because of that it is available only in Scala and Java.

Too many choices confuse everybody, and the same problem is seen in the Spark programming model, although it is not as confusing as in many other programming paradigms. Whenever there is a need to process any kind of data with very high flexibility in terms of the data processing requirements and the lowest level of API control, such as for library development, the RDD-based programming model is ideal. Whenever there is a need to process structured data with flexibility for accessing and processing data, with optimized performance across all the supported programming languages, the DataFrame-based Spark SQL programming model is ideal. Whenever there is a need to process unstructured data with optimized performance as well as compile-time type safety, but without very complex Spark Transformations and Spark Actions usage requirements, the Dataset-based programming model is ideal. At the data processing application development level, if the programming language of choice permits, it is better to use Dataset and DataFrame to get better performance.

R on Spark

A base R installation cannot interact with Spark. The SparkR package, popularly known as R on Spark, exposes all the objects and functions required for R to talk to the Spark ecosystem. Compared to Scala, Java, and Python, Spark programming in R is different: the SparkR package mainly exposes an R API for DataFrame-based Spark SQL programming. At this moment, R cannot be used to manipulate Spark's RDDs directly. So for all practical purposes, the R API for Spark has access only to the Spark SQL abstractions.

How is SparkR going to help data scientists do better data processing? A base R installation mandates that all the data be stored on (or be accessible from) the computer where R is installed. The data processing happens on the single computer on which the R installation is available. Moreover, if the data size is greater than the main memory available on that computer, R will not be able to do the required processing. With the SparkR package, there is access to a whole new world of a cluster of nodes for data storage and for carrying out data processing. With the help of the SparkR package, R can be used to access Spark DataFrames as well as R DataFrames. It is very important to understand the distinction between the two types of data frames. An R DataFrame is completely local and is a data structure of the R language. A Spark DataFrame is a parallel collection of structured data managed by the Spark infrastructure. An R DataFrame can be converted to a Spark DataFrame, and a Spark DataFrame can be converted to an R DataFrame; when a Spark DataFrame is converted to an R DataFrame, it should fit in the available memory of the computer. This conversion is a great feature. By converting an R DataFrame to a Spark DataFrame, the data can be distributed and processed in parallel. By converting a Spark DataFrame to an R DataFrame, many computations, charting, and plotting done by other R functions can be carried out. In a nutshell, the SparkR package brings the power of distributed and parallel computing to R.

Many times when doing data processing with R, because of the sheer size of the data and the need to fit it into the main memory of the computer, the data processing is done in multiple batches and the results are consolidated to compute the final results. This kind of multibatch processing can be completely avoided if Spark with R is used to process the data.

Many times, reporting, charting, and plotting are done on aggregated and summarized raw data. The raw data can be huge and may not fit on one computer. In such cases, Spark with R can be used to process the entire raw data, and finally the aggregated and summarized data can be used to produce the reports, charts, or plots.

Because of R's inability to process huge amounts of data on its own, ETL tools are often used to do the preprocessing or transformations on the raw data, and the data analysis is done using R only in the final stage. Because of Spark's ability to process data at scale, Spark with R can replace the entire ETL pipeline and do the desired data analysis with R.

SparkR is yet another R package, and that does not stop anybody from using any of the R packages already in use. At the same time, it supplements the data processing capability of R manifold by making use of the huge data processing capabilities of Spark.

Spark data analysis with Python

The ultimate goal of processing data is to use the results for answering business questions. It is very important to understand the data that is being used to answer the business questions. To understand the data better, various tabulation methods, charting and plotting techniques are used. Visual representation of the data reinforces the understanding of the underlying data. Because of this, data visualization is used extensively in data analysis.

Different terms are used in various publications to mean the analysis of data for answering business questions: data analysis, data analytics, business intelligence, and so on are some of the ubiquitous terms floating around. This section is not going to delve into a discussion of the meanings, similarities, or differences of these terms. Instead, the focus is on how to bridge the gap between two major activities typically done by data scientists or data analysts: the first is data processing, and the second is the use of the processed data for analysis with the help of charting and plotting. Data analysis is the forte of data analysts and data scientists. This book focuses on the usage of Spark and Python to process the data and produce charts and plots.

In many data analysis use cases, a superset of data is processed and the reduced resultant dataset is used for the data analysis. This is specifically valid in the case of big data analysis, where a small set of processed data is used for analysis. Depending on the use case, appropriate data processing has to be done as a prerequisite for the various data analysis needs. Most of the use cases covered in this book fall into this model, where the first step deals with the necessary data processing and the second step deals with the charting and plotting required for the data analysis.

In typical data analysis use cases, the chain of activities involves an extensive, multi-staged Extract-Transform-Load (ETL) pipeline ending with a data analysis platform or application. The end result of this chain of activities includes, but is not limited to, tables of summary data and various visual representations of the data in the form of charts and plots. Since Spark can process data from heterogeneous distributed data sources very effectively, the huge ETL pipeline that existed in legacy data analysis applications can be consolidated into self-contained applications that do both the data processing and the data analysis.

Process data using Spark, analyze using Python

Python is a programming language heavily used by data analysts and data scientists these days. There are numerous scientific and statistical data processing libraries, as well as charting and plotting libraries, available for use in Python programs. Python is also a widely used programming language for developing data processing applications in Spark. This brings great flexibility in having a unified data processing and data analysis framework with Spark, Python, and the Python libraries, enabling scientific and statistical processing as well as charting and plotting. There are numerous such libraries that work with Python. Out of all of them, the NumPy and SciPy libraries are used here for numerical, statistical, and scientific data processing, and the matplotlib library is used for charting and plotting that produces 2D images.
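
A hedged sketch of this unified workflow is shown below: the heavy aggregation is done by Spark, only the small summarized result is brought to the driver as a pandas data structure, and matplotlib produces the chart. The file name, column names, and chart type are illustrative assumptions:

import matplotlib.pyplot as plt
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ChartingSketch").getOrCreate()

# Heavy lifting in Spark: the raw data can be arbitrarily large
trips = spark.read.csv("trips.csv", header=True, inferSchema=True)
summary = trips.groupBy("city").count()

# Only the small aggregated result is converted to a pandas DataFrame
pdf = summary.toPandas()

# Charting and plotting with the regular Python libraries
pdf.plot(kind="bar", x="city", y="count", legend=False)
plt.ylabel("Number of trips")
plt.savefig("trips_per_city.png")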

Processed data is used for data analysis, and that requires a deep understanding of the processed data. Charts and plots enhance the understanding of the characteristics of the underlying data. In essence, for a data analysis application, data processing, charting, and plotting are all essential. This book covers the usage of Spark with Python in conjunction with Python charting and plotting libraries for developing data analysis applications.

Spark Streaming

Data processing use cases can be mainly divided into two types. The first type covers use cases where the data is static and processed in its entirety as one unit of work, or by dividing it into smaller batches. While the data processing is going on, neither does the underlying dataset change nor do new datasets get added to the processing units. This is batch processing. The second type covers use cases where the data is generated as a stream and the processing is done as and when the data is generated. This is stream processing.

Data sources generate data like a stream, and many real-world use cases require it to be processed in real time. The meaning of real time can change from use case to use case. The main parameter that defines what real time means for a given use case is how soon the ingested data needs to be processed, or the interval at which all the data ingested since the last interval needs to be processed. For example, when a major sports event is happening, the application that consumes the score events and sends them to subscribed users should process the data as fast as it can. The faster the events can be sent, the better. But what is the definition of fast here? Is it fine to process the score data, say, an hour after the score event happened? Probably not. Is it fine to process the data a minute after the score event happened? That is definitely better than processing it after an hour. Is it fine to process the data a second after the score event happened? Probably yes, and much better than the earlier processing intervals.

In any data stream processing use cases, this time interval is very important. The data processing framework should have the capability to process the data stream in an appropriate time interval of choice to deliver good business value.

When processing stream data in regular intervals of choice, the data is collected from the beginning of the time interval to the end of the time interval, grouped into a micro batch, and data processing is done on that batch of data. Over an extended period of time, the data processing application will have processed many such micro batches of data. In this type of processing, the application has visibility only into the specific micro batch being processed at a given point in time. In other words, the application has no visibility into, or access to, the already processed micro batches of data.
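
One way this micro batch idea looks in Spark Streaming's DStream API is sketched below; the one-minute batch interval and the socket source are illustrative assumptions, and an existing SparkContext named sc is assumed:

from pyspark.streaming import StreamingContext

# Each micro batch covers a 60-second interval
ssc = StreamingContext(sc, 60)

# Every line received on this socket during an interval becomes part of
# that interval's micro batch, and only that micro batch
lines = ssc.socketTextStream("localhost", 9999)
lines.count().pprint()   # number of events in the current micro batch

ssc.start()
ssc.awaitTermination()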

Now, there is another dimension to this type of processing. Suppose a given use case mandates the need to process the data every minute, but at the same time, while processing the data of a given micro batch, there is a need to peek into the data that was already processed in the last 15 minutes. A fraud detection module of a retail banking transaction processing application is a good example of this particular business requirement. There is no doubt that retail banking transactions are to be processed within milliseconds of their occurrence. When processing an ATM cash withdrawal transaction, it is a good idea to see whether somebody is trying to continuously withdraw cash in quick succession and, if so, send a proper alert. For this, when processing a given cash withdrawal transaction, check whether any other cash withdrawals from the same ATM, using the same card, happened in the last 15 minutes. The business rule is to alert when there have been more than two such transactions in the last 15 minutes. In this use case, the fraud detection application should have visibility into all the transactions that happened in a window of 15 minutes.

A good stream data processing framework should have the capability to process the data in any given interval of time, with the ability to peek into the data ingested within a sliding window of time. The Spark Streaming library, which works on top of Spark, is one of the best data stream processing frameworks with both of these capabilities.
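
The sliding window capability can be sketched in the same DStream API, loosely following the ATM example above; the record format, the socket source, and the threshold are invented for illustration, and an existing SparkContext named sc is assumed:

from pyspark.streaming import StreamingContext

# Process every minute, but look back over a 15-minute window
ssc = StreamingContext(sc, 60)
withdrawals = ssc.socketTextStream("localhost", 9999)

# Each record is assumed to look like "cardNumber,atmId,amount"
card_events = withdrawals.map(lambda line: (line.split(",")[0], 1))

# Count withdrawals per card over the last 15 minutes, sliding every minute
counts = card_events.reduceByKeyAndWindow(
    lambda a, b: a + b, None, windowDuration=900, slideDuration=60)

# Flag cards with more than two withdrawals inside the window
counts.filter(lambda kv: kv[1] > 2).pprint()

ssc.start()
ssc.awaitTermination()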

Spark machine learning

Calculations based on formulae or algorithms have been very common since ancient times to find the output for a given input. But without knowing the formulae or algorithms, computer scientists and mathematicians devised methods to generate formulae or algorithms based on an existing dataset of inputs and outputs, and to predict the output for new input data based on the generated formulae or algorithms. Generally, this process of 'learning' from a dataset and making predictions based on the 'learning' is known as machine learning. It has its origin in the study of artificial intelligence in computer science. Practical machine learning has numerous applications that are consumed by laypeople on a daily basis. YouTube users get suggestions for the next items to be played in a playlist based on the video they are currently viewing. Popular movie rating sites give ratings and recommendations based on user preferences. Social media websites, such as Facebook, suggest a list of names of the users' friends for easy tagging of pictures. What Facebook does here is classify the pictures by the names already available in the albums and check whether the newly added picture has any similarity with the existing ones. If it finds a similarity, it suggests the name. The applications of this kind of picture identification are many. The way all these applications work is based on the huge amount of input/output data that has already been collected and the learning done on that data. When a new input dataset comes in, a prediction is made by making use of the 'learning' that the computer or machine has already done.

In traditional computing, input data is fed to a program to generate output. But in machine learning, input data and output data are fed to a machine learning algorithm to generate a function or program that can be used to predict the output for an input, according to the 'learning' done on the input/output dataset fed to the machine learning algorithm.

The data available in the wild may be classified into groups, form clusters, or fit into certain relationships. These are different kinds of machine learning problems. For example, if there is a databank of pre-owned car sale prices with their associated attributes or features, it is possible to predict the fair price of a car just by knowing its attributes or features. Regression algorithms are used to solve these kinds of problems. If there is a databank of spam and non-spam e-mails, then when a new mail comes in, it is possible to predict whether it is spam or non-spam. Classification algorithms are used to solve these kinds of problems. These are just a few machine learning algorithm types. But in general, when using a bank of data, if a machine learning algorithm is to be applied and predictions are to be made using the resulting model, then the data should be divided into features and outputs. Whichever machine learning algorithm is used, there will be a set of features and one or more outputs. Many books and publications use the term label for the output. In other words, the features are the input and the label is the output. Data comes in various shapes and forms. Depending on the machine learning algorithm used, the training data has to be preprocessed to have the features and labels in the right format to be fed to the machine learning algorithm, which in turn generates the appropriate hypothesis function that takes the features as input and produces the predicted label.
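
A small sketch of what features and a label look like on the Spark side, using the pre-owned car example above; the column names and the tiny in-memory dataset are invented, and an existing SparkSession named spark is assumed:

from pyspark.ml.feature import VectorAssembler

# Hypothetical used-car data: mileage and age are features, price is the label
cars = spark.createDataFrame(
    [(45000.0, 5.0, 8500.0), (12000.0, 1.0, 17300.0), (90000.0, 9.0, 4100.0)],
    ["mileage", "age", "price"])

# Assemble the feature columns into the single features vector column that
# Spark machine learning algorithms expect, next to a label column
assembler = VectorAssembler(inputCols=["mileage", "age"], outputCol="features")
training = assembler.transform(cars).select("features", cars.price.alias("label"))
training.show(truncate=False)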

Why Spark for machine learning?

The Spark machine learning library uses many Spark core functionalities as well as Spark libraries such as Spark SQL. It makes machine learning application development easy by combining data processing and machine learning algorithm implementations in a unified framework, with the ability to do data processing on a cluster of nodes and to read and write data in a variety of data formats.

Spark comes with two flavors of the machine learning library, spark.mllib and spark.ml. The first is developed on top of Spark's RDD abstraction, and the second is developed on top of Spark's DataFrame abstraction. It is recommended to use the spark.ml library for any new machine learning application development.
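
Continuing the hypothetical used-car sketch from the previous section, a minimal spark.ml example could look like the following; linear regression is just one possible algorithm choice for that kind of price prediction:

from pyspark.ml.regression import LinearRegression

# Fit a regression model on the (features, label) DataFrame prepared earlier
lr = LinearRegression(maxIter=10)
model = lr.fit(training)

# Use the learned model to predict labels for feature vectors
predictions = model.transform(training)
predictions.select("features", "label", "prediction").show(truncate=False)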

Spark graph processing

A graph is a mathematical concept and a data structure in computer science. It has huge applications in many real-world use cases. It is used to model pairwise relationships between entities. Here, an entity is known as a vertex, and two vertices are connected by an edge. A graph comprises a collection of vertices and the edges connecting them.

Conceptually, it is a deceptively simple abstraction, but when it comes to processing a huge number of vertices and edges, it is computationally intensive and consumes a lot of processing time and computing resources.

There are numerous application constructs that can be modeled as graphs. In a social networking application, the relationships between users can be modeled as a graph in which the users form the vertices and the relationships between users form the edges. In a multistage job scheduling application, the individual tasks form the vertices and the sequencing of the tasks forms the edges. In a road traffic modeling system, the towns form the vertices and the roads connecting the towns form the edges.

The edges of a graph have a very important property, namely the direction of the connection. In many use cases, the direction of the connection doesn't matter; connectivity between cities by roads is one such example. But if the use case is to produce driving directions within a city, the connectivity between traffic junctions has a direction. Take any two traffic junctions: there will be road connectivity, but it may be a one-way road, so it depends on which direction the traffic is flowing. If the road is open for traffic from traffic junction J1 to J2 but closed from J2 to J1, then the graph of driving directions will have connectivity from J1 to J2 and not from J2 to J1. In this case, the edge connecting J1 and J2 has a direction. If the traffic between J2 and J3 is open both ways, then the edge connecting J2 and J3 has no direction. A graph with all its edges having a direction is called a directed graph.

Many libraries are available for graph processing in the open source world. Giraph, Pregel, GraphLab, and Spark GraphX are some of them. Spark GraphX is one of the recent entrants in this space.

What is so special about Spark GraphX? It is a graph processing library built on top of the Spark data processing framework. Compared to other graph processing libraries, Spark GraphX has a real advantage: it can make use of all the data processing capabilities of Spark. In reality, the performance of the graph processing algorithms is not the only aspect that needs consideration.

In many applications, the data that needs to be modeled as a graph does not exist in that form naturally. In many use cases, more processor time and other computing resources are expended on getting the data into the right format for the graph processing algorithms than on the graph processing itself. This is the sweet spot where the combination of the Spark data processing framework and the Spark GraphX library delivers its greatest value. The data processing jobs needed to make the data ready for consumption by Spark GraphX can be easily done using the plethora of tools available in the Spark toolkit. In summary, the Spark GraphX library, which is part of the Spark family, combines the power of the core data processing capabilities of Spark with a very easy-to-use graph processing library.

The biggest limitation of the Spark GraphX library is that its API is not currently available in programming languages such as Python and R. But there is an external Spark package named GraphFrames that addresses this limitation. Since GraphFrames is a DataFrame-based library, once it matures, it will enable graph processing in all the programming languages supported by DataFrames. This external Spark package is definitely a potential candidate to be included as part of Spark itself.
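
As a hedged sketch of what graph processing from Python could look like with that external package, the following snippet builds a tiny directed graph from DataFrames and runs a couple of built-in GraphFrames operations; the vertices, edges, and package coordinates are illustrative assumptions, and an existing SparkSession named spark is assumed:

# Requires the external graphframes package, for example by launching
# PySpark with --packages graphframes:graphframes:<version>
from graphframes import GraphFrame

# Hypothetical traffic junctions as vertices and one-way roads as directed edges
vertices = spark.createDataFrame(
    [("J1", "Junction 1"), ("J2", "Junction 2"), ("J3", "Junction 3")],
    ["id", "name"])
edges = spark.createDataFrame(
    [("J1", "J2"), ("J2", "J3"), ("J3", "J2")],
    ["src", "dst"])

g = GraphFrame(vertices, edges)
g.inDegrees.show()                                            # incoming roads per junction
g.pageRank(resetProbability=0.15, maxIter=5).vertices.show()  # rank the junctions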

Summary

Any technology learned or taught has to be concluded with an application developed covering its salient features, and Spark is no different. This book culminates in an end-to-end application developed using the Lambda Architecture, with Spark as the data processing platform and its family of libraries built on top of it.
