
[box type="note" align="" class="" width=""]The following post is an excerpt from the book Learning Spark SQL, written by Aurobindo Sarkar. This book will help you design, implement, and deliver successful streaming applications, machine learning pipelines, and graph applications using the Spark SQL API.[/box]

Our article aims to give you an understanding of how exploratory data analysis is performed with Spark SQL.

What is Exploratory Data Analysis (EDA)?

Exploratory Data Analysis (EDA), or Initial Data Analysis (IDA), is an approach to data analysis that attempts to maximize insight into data. This includes assessing the quality and structure of the data, calculating summary or descriptive statistics, and plotting appropriate graphs. It can uncover underlying structures and suggest how the data should be modeled. Furthermore, EDA helps us detect outliers, errors, and anomalies in our data, and deciding what to do about such data is often more important than other, more sophisticated analysis. EDA enables us to test our underlying assumptions, discover clusters and other patterns in our data, and identify the possible relationships between various variables. A careful EDA process is vital to understanding the data and is sometimes sufficient to reveal such poor data quality that using a more sophisticated model-based analysis is not justified.

Typically, the graphical techniques used in EDA are simple, consisting of plotting the raw data and simple statistics. The focus is on the structures and models revealed by the data, or those that best fit the data. EDA techniques include scatter plots, box plots, histograms, probability plots, and so on. In most EDA techniques, we use all of the data, without making any underlying assumptions. The analyst builds intuition, or gets a “feel” for the Dataset, as a result of such exploration. More specifically, the graphical techniques allow us to efficiently select and validate appropriate models, test our assumptions, identify relationships, select estimators, detect outliers, and so on.

EDA involves a lot of trial and error, and several iterations. The best way is to start simple and then build in complexity as you go along. In modeling, there is a major trade-off between simple models and more accurate but complex ones. Simple models are much easier to interpret and understand, and they can get you to 90% accuracy very quickly, whereas a more complex model might take weeks or months to deliver an additional 2% improvement. For example, you should plot simple histograms and scatter plots to quickly start developing an intuition for your data.

Using Spark SQL for basic data analysis

Interactively processing and visualizing large data is challenging, as queries can take a long time to execute and the visual interface cannot accommodate as many pixels as there are data points. Spark supports in-memory computations and a high degree of parallelism to achieve interactivity with large distributed data. In addition, Spark is capable of handling petabytes of data and provides a set of versatile programming interfaces and libraries. These include SQL, Scala, Python, Java, and R APIs, as well as libraries for distributed statistics and machine learning.

For data that fits into a single computer, there are many good tools available, such as R, MATLAB, and others. However, if the data does not fit into a single machine, or if it is very complicated to get the data to that machine, or if a single computer cannot easily process the data, then this section will offer some good tools and techniques for data exploration. In this section, we will go through some basic data exploration exercises to understand a sample Dataset. We will use a Dataset that contains data related to direct marketing campaigns (phone calls) of a Portuguese banking institution. The marketing campaigns were based on phone calls to customers. We’ll use the bank-additional-full.csv file that contains 41,188 records and 20 input fields, ordered by date (from May 2008 to November 2010). The Dataset has been contributed by S. Moro, P. Cortez, and P. Rita, and can be downloaded from https://archive.ics.uci.edu/ml/datasets/Bank+Marketing.

1. As a first step, let’s define a schema and read in the CSV file to create a DataFrame. You can use the :paste command to paste the initial set of statements into your Spark shell session (press Ctrl+D to exit paste mode), as shown:

Data Extraction on Spark SQL
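A minimal sketch of this step might look like the following; the column names here are assumptions based on the UCI Dataset description (with underscores in place of the dots used in the original headers), and the file path is a placeholder you should point at your local copy:

import org.apache.spark.sql.types._

// Schema for the bank marketing Dataset: 20 input fields plus the outcome column y.
val recordSchema = new StructType()
  .add("age", IntegerType)
  .add("job", StringType)
  .add("marital", StringType)
  .add("education", StringType)
  .add("default", StringType)
  .add("housing", StringType)
  .add("loan", StringType)
  .add("contact", StringType)
  .add("month", StringType)
  .add("day_of_week", StringType)
  .add("duration", IntegerType)
  .add("campaign", IntegerType)
  .add("pdays", IntegerType)
  .add("previous", IntegerType)
  .add("poutcome", StringType)
  .add("emp_var_rate", DoubleType)
  .add("cons_price_idx", DoubleType)
  .add("cons_conf_idx", DoubleType)
  .add("euribor3m", DoubleType)
  .add("nr_employed", DoubleType)
  .add("y", StringType)

// The file is semicolon-delimited; the header row is skipped because we supply our own schema.
val df = spark.read
  .format("csv")
  .option("header", "true")
  .option("delimiter", ";")
  .schema(recordSchema)
  .load("bank-additional-full.csv")  // placeholder path: adjust to where you downloaded the file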

2. After the DataFrame has been created, we first verify the number of records:

EDA on Spark SQL
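Assuming the DataFrame from the previous step is named df, this check is a single call:

// The count should match the 41,188 records mentioned earlier.
df.count()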

3. We can also define a case class called Call for our input records, and then create a strongly-typed Dataset, as follows:

EDA on Spark SQL 2
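A sketch of this step, reusing the column names assumed above, could look like this:

import spark.implicits._

// Case class mirroring the columns in the schema defined earlier.
case class Call(age: Int, job: String, marital: String, education: String,
  default: String, housing: String, loan: String, contact: String,
  month: String, day_of_week: String, duration: Int, campaign: Int,
  pdays: Int, previous: Int, poutcome: String, emp_var_rate: Double,
  cons_price_idx: Double, cons_conf_idx: Double, euribor3m: Double,
  nr_employed: Double, y: String)

// Convert the untyped DataFrame into a strongly-typed Dataset[Call].
val ds = df.as[Call]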

Identifying missing data

Missing data can occur in Datasets due to reasons ranging from negligence to a refusal on the part of respondents to provide a specific data point. However, in all cases, missing data is a common occurrence in real-world Datasets. Missing data can create problems in data analysis and sometimes lead to wrong decisions or conclusions. Hence, it is very important to identify missing data and devise effective strategies to deal with it. In this section, we analyze the number of records with missing data fields in our sample Dataset. In order to simulate missing data, we edit our sample Dataset by replacing fields containing “unknown” values with empty strings. First, we create a DataFrame/Dataset from our edited file, as shown:

Data extraction on Spark SQL
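Assuming the edited copy of the file is saved under a hypothetical name such as bank-additional-full-missing.csv, and reusing the schema and case class sketched earlier, this step might look like:

// Read the edited file in which "unknown" values were replaced with empty strings.
val dfMissing = spark.read
  .format("csv")
  .option("header", "true")
  .option("delimiter", ";")
  .schema(recordSchema)
  .load("bank-additional-full-missing.csv")  // hypothetical name for the edited copy

val dsMissing = dfMissing.as[Call]

// Example check: count the records where the education field is missing (empty).
dsMissing.filter(call => call.education.isEmpty).count()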

In the next section, we will compute some basic statistics for our sample Dataset to improve our understanding of the data.

Computing basic statistics

Computing basic statistics is essential for a good preliminary understanding of our data.

First, for convenience, we create a case class and a Dataset containing a subset of fields from our original DataFrame. In the following example, we choose some of the numeric fields and the outcome field, that is, the “term deposit subscribed” field:

Data extraction Spark SQL 3
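A sketch of this, using a hypothetical case class named CallStats with the numeric columns assumed earlier plus the outcome column y, might be:

// Hypothetical case class holding a few numeric fields plus the outcome ("term deposit subscribed").
case class CallStats(age: Int, duration: Int, campaign: Int, pdays: Int, previous: Int, y: String)

// Project the DataFrame down to these columns and convert it to a typed Dataset.
val callStatsDS = df
  .select("age", "duration", "campaign", "pdays", "previous", "y")
  .as[CallStats]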

Next, we use describe() to compute the count, mean, standard deviation, min, and max values for the numeric columns in our Dataset. The describe() command gives us a way to do a quick sanity check on the data: for example, whether the row count for each of the selected columns matches the total number of records in the DataFrame (that is, there are no null or invalid rows), whether the average and range of values for the age column match your expectations, and so on. Based on the means and standard deviations, you can select certain data elements for deeper analysis. For example, assuming a normal distribution, the mean and standard deviation values for age suggest that most values of age lie in the range of 30 to 50 years; for other columns, the standard deviation values may be indicative of a skew in the data (as the standard deviation is greater than the mean).

Data extraction on Spark SQL 4
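Continuing with the hypothetical callStatsDS Dataset sketched above, the summary statistics can be computed as follows:

// describe() returns a DataFrame with count, mean, stddev, min, and max rows
// for the named numeric columns.
callStatsDS.describe("age", "duration", "campaign", "pdays", "previous").show()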

Identifying data outliers

An outlier or an anomaly is an observation that deviates significantly from the other observations in the Dataset. These erroneous outliers can be due to errors in data collection or to variability in measurement. They can impact the results significantly, so it is imperative to identify them during the EDA process. However, formal outlier-detection techniques typically define outliers as points that do not lie in clusters, or require the user to model the data points using statistical distributions, with the outliers identified depending on how they appear in relation to the underlying model. The main problem with these approaches is that, during EDA, the user typically does not have enough knowledge about the underlying data distribution.

EDA using a modeling and visualizing approach is a good way of achieving a deeper intuition of our data. Spark MLlib supports a large (and growing) set of distributed machine learning algorithms that make this task simpler. For example, we can apply a clustering algorithm and visualize the results to detect outliers in a combination of columns. In the following example, we use the last contact duration in seconds (duration), the number of contacts performed during this campaign for this client (campaign), the number of days that have passed since the client was last contacted in a previous campaign (pdays), and the number of contacts performed before this campaign for this client (previous) to compute two clusters in our data by applying the k-means clustering algorithm:

Data extraction on Spark SQL 5
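A sketch of this approach with Spark MLlib's DataFrame-based API, using the column names assumed earlier, might look like the following:

import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.VectorAssembler

// Assemble the selected numeric columns into a single feature vector column.
val assembler = new VectorAssembler()
  .setInputCols(Array("duration", "campaign", "pdays", "previous"))
  .setOutputCol("features")
val featureDF = assembler.transform(df)

// Fit a k-means model with two clusters and attach a cluster prediction to every record.
val kmeans = new KMeans().setK(2).setSeed(1L)
val model = kmeans.fit(featureDF)
val predictions = model.transform(featureDF)

// Inspect the cluster sizes and centers; a very small cluster, or points that sit far
// from both centers, are candidates for closer inspection as outliers.
predictions.groupBy("prediction").count().show()
model.clusterCenters.foreach(println)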

If you liked this article, please be sure to check out Learning Spark SQL, which will help you learn more useful techniques for data extraction and data analysis.

Learning Spark SQL

 

 
