14 min read

The Iris dataset is the simplest, yet the most famous data analysis task in the ML space. In this article, you will build a solution for data analysis & classification task from an Iris dataset using Scala.

This article is an excerpt taken from Modern Scala Projects written by Ilango Gurusamy.

The following diagrams together help in understanding the different components of this project. That said, this pipeline involves training (fitting), transformation, and validation operations. More than one model is trained and the best model (or mapping function) is selected to give us an accurate approximation predicting the species of an Iris flower (based on measurements of those flowers):

Project block diagram

A breakdown of the project block diagram is as follows:

  • Spark, which represents the Spark cluster and its ecosystem
  • Training dataset
  • Model
  • Dataset attributes or feature measurements
  • An inference process, that produces a prediction column

The following diagram represents a more detailed description of the different phases in terms of the functions performed in each phase. Later we will come to visualize pipeline in terms of its constituent stages.

For now, the diagram depicts four stages, starting with a data pre-processing phase, which is considered separate from the numbered phases deliberately. Think of the pipeline as a two-step process:

  1.  A data cleansing phase, or pre-processing phase. An important phase that could include a subphase of Exploratory Data Analysis (EDA) (not explicitly depicted in the latter diagram).
  2. A data analysis phase that begins with Feature Extraction, followed by Model Fitting, and Model validation, all the way to deployment of an Uber pipeline JAR into Spark:

Pipeline diagram

Referring to the preceding diagram, the first implementation objective is to set up Spark inside an SBT project. An SBT project is a self-contained application, which we can run on the command line to predict Iris labels. In the SBT project,  dependencies are specified in a build.sbt file and our application code will create its  own  SparkSession and SparkContext.

So that brings us to a listing of implementation objectives and these are as follows:

  1. Get the Iris dataset from the UCI Machine Learning Repository
  2. Conduct preliminary EDA in the Spark shell
  3. Create a new Scala project in IntelliJ, and carry out all implementation steps, until the evaluation of the Random Forest classifier
  4. Deploy the application to your local Spark cluster

Step 1# Getting the Iris dataset from the UCI Machine Learning Repository

Head over to the UCI Machine Learning Repository website at https://archive.ics.uci.edu/ml/datasets/iris and click on Download: Data Folder. Extract this folder someplace convenient and copy over iris.csv into the root of your project folder.

You may refer back to the project overview for an in-depth description of the Iris dataset. We depict the contents of the iris.csv file here, as follows:

A snapshot of the Iris dataset with 150 sets

You may recall that the iris.csv file is a 150-row file, with comma-separated values.

Now that we have the dataset, the first step will be performing EDA on it. The Iris dataset is multivariate, meaning there is more than one (independent) variable, so we will carry out a basic multivariate EDA on it. But we need DataFrame to let us do that. How we create a dataframe as a prelude to EDA is the goal of the next section.

Step 2# Preliminary EDA

Before we get down to building the SBT pipeline project, we will conduct a preliminary EDA in spark-shell. The plan is to derive a dataframe out of the dataset and then calculate basic statistics on it.

We have three tasks at hand for spark-shell:

  1. Fire up spark-shell
  2. Load the iris.csv file and build DataFrame
  3. Calculate the statistics

We will then port that code over to a Scala file inside our SBT project.

That said, let’s get down to loading the iris.csv file (inputting the data source) before eventually building DataFrame.

Step 3# Creating an SBT project

Lay out your SBT project in a folder of your choice and name it IrisPipeline or any name that makes sense to you. This will hold all of our files needed to implement and run the pipeline on the Iris dataset.

The structure of our SBT project looks like the following:

Project structure

We will list dependencies in the build.sbt file. This is going to be an SBT project. Hence, we will bring in the following key libraries:

  • Spark Core
  • Spark MLlib
  • Spark SQL

The following screenshot illustrates the build.sbt file:

The build.sbt file with Spark dependencies

The build.sbt file referenced in the preceding snapshot is readily available for you in the book’s download bundle. Drill down to the folder Chapter01 code under ModernScalaProjects_Code and copy the folder over to a convenient location on your computer.

Drop the iris.csv file that you downloaded in Step 1 – getting the Iris dataset from the UCI Machine Learning Repository into the root folder of our new SBT project. Refer to the earlier screenshot that depicts the updated project structure with the iris.csv file inside of it.

Step 4# Creating Scala files in SBT project

Step 4 is broken down into the following steps:

  1. Create the Scala file iris.scala in the com.packt.modern.chapter1 package.
  2. Up until now, we relied on SparkSession and SparkContext, which spark-shell gave us. This time around, we need to create SparkSession, which will, in turn, give us SparkContext.

What follows is how the code is laid out in the iris.scala file.

In iris.scala, after the package statement, place the following import statements:

import org.apache.spark.sql.SparkSession

Create SparkSession inside a trait, which we shall call IrisWrapper:

lazy val session: SparkSession = SparkSession.builder().getOrCreate()

Just one SparkSession is made available to all classes extending from IrisWrapper. Create val to hold the iris.csv file path:

val dataSetPath = "<<path to folder containing your iris.csv file>>\\iris.csv"

Create a method to build DataFrame. This method takes in the complete path to the Iris dataset path as String and returns DataFrame:

def buildDataFrame(dataSet: String): DataFrame = {
 The following is an example of a dataSet parameter string: "C:\\Your\\Path\\To\\iris.csv"

Import the DataFrame class by updating the previous import statement for SparkSession:

import org.apache.spark.sql.{DataFrame, SparkSession}

Create a nested function inside buildDataFrame to process the raw dataset. Name this function getRows. getRows which takes no parameters but returns Array[(Vector, String)]. The textFile method on the SparkContext variable processes the iris.csv into RDD[String]:

val result1: Array[String] = session.sparkContext.textFile(<<path to iris.csv represented by the dataSetPath variable>>)

The resulting RDD contains two partitions. Each partition, in turn, contains rows of strings separated by a newline character, '\n'. Each row in the RDD represents its original counterpart in the raw data.

In the next step, we will attempt several data transformation steps. We start by applying a flatMap operation over the RDD, culminating in the DataFrame creation. DataFrame is a view over Dataset, which happens to the fundamental data abstraction unit in the Spark 2.0 line.

Step 5# Preprocessing, data transformation, and DataFrame creation

We will get started by invoking flatMap, by passing a function block to it, and successive transformations listed as follows, eventually resulting in Array[(org.apache.spark.ml.linalg.Vector, String)]. A vector represents a row of feature measurements.

The Scala code to give us Array[(org.apache.spark.ml.linalg.Vector, String)] is as follows:

//Each line in the RDD is a row in the Dataset represented by a String, which we can 'split' along the new //line character
val result2: RDD[String] = result1.flatMap { partition => partition.split("\n").toList }

//the second transformation operation involves a split inside of each line in the dataset where there is a //comma separating each element of that line
val result3: RDD[Array[String]] = result2.map(_.split(“,”))

Next, drop the header column, but not before doing a collection that returns an Array[Array[String]]:

val result4: Array[Array[String]] = result3.collect.drop(1)

The header column is gone; now import the Vectors class:

import org.apache.spark.ml.linalg.Vectors

Now, transform Array[Array[String]] into Array[(Vector, String)]:

val result5 = result4.map(row => (Vectors.dense(row(1).toDouble, row(2).toDouble, row(3).toDouble, row(4).toDouble),row(5)))

Step 6# Creating, training, and testing data

Now, let’s split our dataset in two by providing a random seed:

val splitDataSet: Array[org.apache.spark.sql.Dataset
[org.apache.spark.sql.Row]] = dataSet.randomSplit(Array(0.85, 0.15), 98765L)

Now our new splitDataset contains two datasets:

  • Train dataset: A dataset containing Array[(Vector, iris-species-label-column: String)]
  • Test dataset: A dataset containing Array[(Vector, iris-species-label-column: String)]

Confirm that the new dataset is of size 2:

res48: Int = 2

Assign the training dataset to a variable, trainSet:

val trainDataSet = splitDataSet(0)
trainSet: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [iris-features-column: vector, iris-species-label-column: string]

Assign the testing dataset to a variable, testSet:

val testDataSet = splitDataSet(1)
testSet: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [iris-features-column: vector, iris-species-label-column: string]

Count the number of rows in the training dataset:

res12: Long = 14

Count the number of rows in the testing dataset:

res9: Long = 136

There are 150 rows in all.

Step 7# Creating a Random Forest classifier

In reference to Step 5 – DataFrame Creation. This DataFrame ‘dataFrame’ contains column names that corresponds to the columns present in the DataFrame produced in that step

The first step to create a classifier is to  pass into it (hyper) parameters. A fairly comprehensive list of parameters look like this:

  • From ‘dataFrame’ we need the Features column name – iris-features-column
  • From ‘dataFrame’ we also need the Indexed label column name – iris-species-label-column
  • The sqrt setting for featureSubsetStrategy
  • Number of features to be considered per split (we have 150 observations and four features that will make our max_features value 2)
  • Impurity settings—values can be gini and entropy
  • Number of trees to train (since the number of trees is greater than one, we set a tree maximum depth), which is a number equal to the number of nodes
  • The required minimum number of feature measurements (sampled observations), also known as the minimum instances per node

Look at the IrisPipeline.scala file for values of each of these parameters.

But this time, we will employ an exhaustive grid search-based model selection process based on combinations of parameters, where parameter value ranges are specified.

Create a randomForestClassifier instance. Set the features and featureSubsetStrategy:

val randomForestClassifier = new RandomForestClassifier()

Start building Pipeline, which has two stages, Indexer and Classifier:

val irisPipeline = new Pipeline().setStages(Array[PipelineStage](indexer) ++  Array[PipelineStage](randomForestClassifier))

Next, set the hyperparameter num_trees (number of trees) on the classifier to 15, a Max_Depth parameter, and an impurity with two possible values of gini and entropy.

Build a parameter grid with all three hyperparameters:

val finalParamGrid: Array[ParamMap] = gridBuilder3.build()

Step 8# Training the Random Forest classifier

Next, we want to split our training set into a validation set and a training set:

val validatedTestResults: DataFrame = new TrainValidationSplit()

On this variable, set Seed, set EstimatorParamMaps, set Estimator with irisPipeline, and set a training ratio to 0.8:

val validatedTestResults: DataFrame = new TrainValidationSplit().setSeed(1234567L).setEstimator(irisPipeline)

Finally, do a fit and a transform with our training dataset and testing dataset. Great! Now the classifier is trained. In the next step, we will apply this classifier to testing the data.

Step 9# Applying the Random Forest classifier to test data

The purpose of our validation set is to be able to make a choice between models. We want an evaluation metric and hyperparameter tuning. We will now create an instance of a validation estimator called TrainValidationSplit, which will split the training set into a validation set and a training set:

val validatedTestResults.setEvaluator(new MulticlassClassificationEvaluator())

Next, we fit this estimator over the training dataset to produce a model and a transformer that we will use to transform our testing dataset. Finally, we perform a validation for hyperparameter tuning by applying an evaluator for a metric.

The new ValidatedTestResults DataFrame should look something like this:

 |iris-features-column|iris-species-column|label| rawPrediction| probability|prediction|
 | [4.4,3.2,1.3,0.2]| Iris-setosa| 0.0| [40.0,0.0,0.0]| [1.0,0.0,0.0]| 0.0|
 | [5.4,3.9,1.3,0.4]| Iris-setosa| 0.0| [40.0,0.0,0.0]| [1.0,0.0,0.0]| 0.0|
 | [5.4,3.9,1.7,0.4]| Iris-setosa| 0.0| [40.0,0.0,0.0]| [1.0,0.0,0.0]| 0.0|

Let’s return a new dataset by passing in column expressions for prediction and label:

val validatedTestResultsDataset:DataFrame = validatedTestResults.select("prediction", "label")

In the line of code, we produced a new DataFrame with two columns:

  • An input label
  • A predicted label, which is compared with its corresponding value in the input label column

That brings us to the next step, an evaluation step. We want to know how well our model performed. That is the goal of the next step.

Step 10# Evaluate Random Forest classifier

In this section, we will test the accuracy of the model. We want to know how well our model performed. Any ML process is incomplete without an evaluation of the classifier.

That said, we perform an evaluation as a two-step process:

  1. Evaluate the model output
  2. Pass in three hyperparameters:
val modelOutputAccuracy: Double = new MulticlassClassificationEvaluator()

Set the label column, a metric name, the prediction column label, and invoke evaluation with the validatedTestResults dataset.

Note the accuracy of the model output results on the testing dataset from the modelOutputAccuracy variable.

The other metrics to evaluate are how close the predicted label value in the 'predicted' column is to the actual label value in the (indexed) label column.

Next, we want to extract the metrics:

val multiClassMetrics = new MulticlassMetrics(validatedRDD2)

Our pipeline produced predictions. As with any prediction, we need to have a healthy degree of skepticism. Naturally, we want a sense of how our engineered prediction process performed. The algorithm did all the heavy lifting for us in this regard. That said, everything we did in this step was done for the purpose of evaluation. Who is being evaluated here or what evaluation is worth reiterating? That said, we wanted to know how close the predicted values were compared to the actual label value. To obtain that knowledge, we decided to use the MulticlassMetrics class to evaluate metrics that will give us a measure of the performance of the model via two methods:

  • Accuracy
  • Weighted precision
val accuracyMetrics = (multiClassMetrics.accuracy, multiClassMetrics.weightedPrecision)
val accuracy = accuracyMetrics._1
val weightedPrecsion = accuracyMetrics._2

These metrics represent evaluation results for our classifier or classification model. In the next step, we will run the application as a packaged SBT application.

Step 11# Running the pipeline as an SBT application

At the root of your project folder, issue the sbt console command, and in the Scala shell, import the IrisPipeline object and then invoke the main method of IrisPipeline with the argument iris:

sbt console
import com.packt.modern.chapter1.IrisPipeline
Accuracy (precision) is 0.9285714285714286 Weighted Precision is: 0.9428571428571428

Step 12# Packaging the application

In the root folder of your SBT application, run:

sbt package

When SBT is done packaging, the Uber JAR can be deployed into our cluster, using spark-submit, but since we are in standalone deploy mode, it will be deployed into [local]:

The application JAR file

The package command created a JAR file that is available under the target folder. In the next section, we will deploy the application into Spark.

Step 13# Submitting the pipeline application to Spark local

At the root of the application folder, issue the spark-submit command with the class and JAR file path arguments, respectively.

If everything went well, the application does the following:

  1. Loads up the data.
  2. Performs EDA.
  3. Creates training, testing, and validation datasets.
  4. Creates a Random Forest classifier model.
  5. Trains the model.
  6. Tests the accuracy of the model. This is the most important part—the ML classification task.
  1. To accomplish this, we apply our trained Random Forest classifier model to the test dataset. This dataset consists of Iris flower data of so far not seen by the model. Unseen data is nothing but Iris flowers picked in the wild.
  2. Applying the model to the test dataset results in a prediction about the species of an unseen (new) flower.
  3. The last part is where the pipeline runs an evaluation process, which essentially is about checking if the model reports the correct species.
  4. Lastly, pipeline reports back on how important a certain feature of the Iris flower turned out to be. As a matter of fact, the petal width turns out to be more important than the sepal width in carrying out the classification task.

Thus we implemented an ML workflow or an ML pipeline. The pipeline combined several stages of data analysis into one workflow. We started by loading the data and from there on, we created training and test data, preprocessed the dataset, trained the RandomForestClassifier model, applied the Random Forest classifier to test data, evaluated the classifier, and computed a process that demonstrated the importance of each feature in the classification.

If you’ve enjoyed reading this post visit the book, Modern Scala Projects to build efficient data science projects that fulfill your software requirements.

Read Next

Deep Learning Algorithms: How to classify Irises using multi-layer perceptrons

Introducing Android 9 Pie, filled with machine learning and baked-in UI features

Paper in Two minutes: A novel method for resource efficient image classification

A Data science fanatic. Loves to be updated with the tech happenings around the globe. Loves singing and composing songs. Believes in putting the art in smart.


Please enter your comment!
Please enter your name here