Adding a Spark to R


Spark is written in Scala. It provides interfaces for Java and Python and, starting with version 1.4.0, for R as well. The R interface is called SparkR, which we describe in the next section. Spark offers four classes of libraries: SQL and DataFrames, Spark Streaming, MLlib (machine learning), and GraphX (graph algorithms). Currently, SparkR supports only SQL and DataFrames; the others are on the roadmap. Spark can be downloaded from the Apache project page at http://spark.apache.org/downloads.html. Starting with version 1.4.0, SparkR is included in Spark, so no separate download is required.


SparkR

Similar to RHadoop, SparkR is an R package that allows R users to use Spark APIs through the RDD class. For example, using SparkR, users can run Spark jobs from RStudio. To enable this, add the following lines to your .Rprofile file, which R reads at startup to initialize the environment:

# Set SPARK_HOME to the folder where the Spark download was unpacked
Sys.setenv(SPARK_HOME = "/.../spark-1.5.0-bin-hadoop2.6")
# Add the SparkR package bundled with Spark to the R library search path
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
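
A mistyped SPARK_HOME tends to surface only later, as a failed library(SparkR) call. As a minimal sketch, the check below verifies that the R/lib directory exists before it is added to the library path. The function name sparkr_lib_dir is hypothetical and is not part of SparkR:

```r
# Hypothetical helper (not part of SparkR): returns the SparkR library
# directory under a Spark installation, or stops with a clear message.
sparkr_lib_dir <- function(spark_home = Sys.getenv("SPARK_HOME")) {
  if (!nzchar(spark_home)) {
    stop("SPARK_HOME is not set")
  }
  lib_dir <- file.path(spark_home, "R", "lib")
  if (!dir.exists(lib_dir)) {
    stop("No R/lib directory found under ", spark_home)
  }
  lib_dir
}
```

With such a helper defined in .Rprofile, the library path could be extended with .libPaths(c(sparkr_lib_dir(), .libPaths())), which fails early if the path is wrong.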

Once this is done, start RStudio and enter the following commands to start using SparkR:

> library(SparkR)
> sc <- sparkR.init(master = "local")

As of version 1.5, the latest release at the time of writing, SparkR supports only a limited subset of R functionality. This mainly includes data slicing and dicing and summary statistics functions. The current version does not support contributed R packages; however, this is planned for a future release. For machine learning, SparkR currently supports the glm() function. We will work through an example in the next section.

Linear regression using SparkR

In the following example, we will illustrate how to use SparkR for machine learning.

> library(SparkR)
> sc <- sparkR.init(master = "local")
> sqlContext <- sparkRSQL.init(sc)

> # Importing data
> df <- read.csv("/Users/harikoduvely/Projects/Book/Data/ENB2012_data.csv", header = TRUE)
> # Keeping the first 768 rows (the rows after that contain mainly null values) and excluding variables X6, X8, and Y2
> df <- df[1:768, c(1, 2, 3, 4, 5, 7, 9)]
> # Converting to a SparkR DataFrame
> dfsr <- createDataFrame(sqlContext, df)
> # Fitting a linear regression (Gaussian family) on the SparkR DataFrame
> model <- glm(Y1 ~ X1 + X2 + X3 + X4 + X5 + X7, data = dfsr, family = "gaussian")
> summary(model)
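
Before running the model on Spark, it can help to try the same formula interface locally. The sketch below uses base R's glm() rather than SparkR, on a small synthetic data frame. The data and coefficients are made up for illustration and are not the ENB2012 data. It shows that a Gaussian family with the default identity link amounts to ordinary linear regression:

```r
# Synthetic data for illustration only; not the ENB2012 dataset.
set.seed(42)
n <- 200
toy <- data.frame(X1 = runif(n), X2 = runif(n))
toy$Y1 <- 3 + 2 * toy$X1 - 1.5 * toy$X2 + rnorm(n, sd = 0.1)

# Base R glm() with a Gaussian family (identity link by default)
fit_glm <- glm(Y1 ~ X1 + X2, data = toy, family = gaussian())

# Ordinary least squares via lm() for comparison
fit_lm <- lm(Y1 ~ X1 + X2, data = toy)

# Compare the coefficient estimates of the two fits
print(all.equal(coef(fit_glm), coef(fit_lm)))
print(summary(fit_glm))
```

The summary lists one coefficient per predictor plus the intercept, which is the same kind of information to look for in the SparkR model summary.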

Summary

In this article, we saw how to set up SparkR and how to run a linear regression with it. For more information on Spark, you can refer to:

https://www.packtpub.com/big-data-and-business-intelligence/spark-python-developers
https://www.packtpub.com/big-data-and-business-intelligence/spark-beginners