Spark is written in a language called Scala. It has interfaces for use from Java and Python, and, starting with version 1.4.0, it also supports R. This support is called SparkR, which we will describe in the next section. The four classes of libraries available in Spark are SQL and DataFrames, Spark Streaming, MLlib (machine learning), and GraphX (graph algorithms). Currently, SparkR supports only SQL and DataFrames; support for the others is on the roadmap. Spark can be downloaded from the Apache project page at http://spark.apache.org/downloads.html. Starting from version 1.4.0, SparkR is included in Spark and no separate download is required.


SparkR

Similar to RHadoop, SparkR is an R package that allows R users to use Spark APIs through the RDD class. For example, using SparkR, users can run jobs on Spark from RStudio. To enable this, include the following lines in the .Rprofile file that R uses at startup to initialize its environment:

Sys.setenv(SPARK_HOME = "/.../spark-1.5.0-bin-hadoop2.6")
# Provide the correct path to the folder where the Spark download is kept for SPARK_HOME
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))

Once this is done, start RStudio and enter the following commands to start using SparkR:

> library(SparkR)
> sc <- sparkR.init(master = "local")

As mentioned, as of version 1.5, the latest at the time of writing, SparkR supports a limited set of R functionalities. This mainly includes data slicing and dicing and summary statistics functions. The current version does not support the use of contributed R packages; however, this is planned for a future release. For machine learning, SparkR currently supports the glm() function. We will work through an example in the next section.
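To make the supported operations concrete, the following snippet sketches slicing, dicing, and summary statistics using the SparkR 1.5 DataFrame API. It uses R's built-in faithful dataset purely as an illustration and assumes a SparkContext sc created as shown above:

> sqlContext <- sparkRSQL.init(sc)
> df <- createDataFrame(sqlContext, faithful)
> # Slicing: select a single column
> head(select(df, df$eruptions))
> # Dicing: filter rows by a condition
> head(filter(df, df$waiting < 50))
> # Summary statistics: count observations per waiting time
> head(summarize(groupBy(df, df$waiting), count = n(df$waiting)))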

Linear regression using SparkR

In the following example, we will illustrate how to use SparkR for machine learning by fitting a linear regression model on the ENB2012 dataset.

> library(SparkR)
> sc <- sparkR.init(master = "local")
> sqlContext <- sparkRSQL.init(sc)

> # Importing data
> df <- read.csv("/Users/harikoduvely/Projects/Book/Data/ENB2012_data.csv", header = T)
> # Excluding variables Y2, X6, and X8, and keeping only the first 768 records,
> # since the records after that contain mainly null values
> df <- df[1:768, c(1, 2, 3, 4, 5, 7, 9)]
> # Converting to a SparkR DataFrame
> dfsr <- createDataFrame(sqlContext, df)
> model <- glm(Y1 ~ X1 + X2 + X3 + X4 + X5 + X7, data = dfsr, family = "gaussian")
> summary(model)
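
Though not part of the original listing, the fitted model can also be used for scoring. As a sketch, assuming the same session as above, SparkR's predict() returns a DataFrame containing a prediction column:

> # Score the training data with the fitted model (illustrative only)
> predictions <- predict(model, newData = dfsr)
> head(select(predictions, "Y1", "prediction"))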

Summary

In this article, we have seen how to set up SparkR and how to perform linear regression using SparkR. For more information on Spark, you can refer to:

  • https://www.packtpub.com/big-data-and-business-intelligence/spark-python-developers
  • https://www.packtpub.com/big-data-and-business-intelligence/spark-beginners
