
Adding a Spark to R


Spark is written in Scala. It has interfaces for Java and Python, and, starting with version 1.4.0, it also supports R. This R interface is called SparkR, which we will describe in the next section. The four classes of libraries available in Spark are SQL and DataFrames, Spark Streaming, MLlib (machine learning), and GraphX (graph algorithms). Currently, SparkR supports only SQL and DataFrames; the others are on the roadmap. Spark can be downloaded from the Apache project page at http://spark.apache.org/downloads.html. Starting with version 1.4.0, SparkR is included in Spark, and no separate download is required.


SparkR

Similar to RHadoop, SparkR is an R package that allows R users to use Spark APIs through the RDD class. For example, using SparkR, users can run jobs on Spark from RStudio. To enable this, include the following lines in the .Rprofile file that R uses at startup to initialize the environment:

Sys.setenv(SPARK_HOME = "/.../spark-1.5.0-bin-hadoop2.6")
# Provide the correct path to the folder where the Spark download is kept for SPARK_HOME
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
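
A wrong SPARK_HOME is an easy mistake to make here, so it can help to check the path before relying on the .Rprofile lines above. The following is a minimal base R sketch, not part of SparkR: the helper name sparkr_lib_path and the throwaway directory used in the demonstration are hypothetical and exist only for illustration.

```r
# Hypothetical helper (not part of SparkR): build the SparkR library path
# from a Spark installation directory and report whether it exists.
sparkr_lib_path <- function(spark_home = Sys.getenv("SPARK_HOME")) {
  if (!nzchar(spark_home)) {
    stop("SPARK_HOME is not set; point it at the unpacked Spark folder")
  }
  lib_path <- file.path(spark_home, "R", "lib")
  list(path = lib_path, exists = dir.exists(lib_path))
}

# Demonstration with a throwaway directory standing in for a real Spark folder
fake_home <- file.path(tempdir(), "spark-demo")
dir.create(file.path(fake_home, "R", "lib"), recursive = TRUE, showWarnings = FALSE)
check <- sparkr_lib_path(fake_home)
print(check$exists)
```

With a real installation, calling sparkr_lib_path() with no arguments reads SPARK_HOME from the environment. If exists is FALSE, library(SparkR) is unlikely to find the package.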

Once this is done, start RStudio and enter the following commands to start using SparkR:

> library(SparkR)
> sc <- sparkR.init(master = "local")

As mentioned, as of version 1.5, the latest at the time of writing, SparkR supports only a limited subset of R functionality. This mainly includes data slicing and dicing and summary statistics functions. The current version does not support the use of contributed R packages; however, this is planned for a future release. For machine learning, SparkR currently supports the glm() function. We will work through an example in the next section.

Linear regression using SparkR

In the following example, we will illustrate how to use SparkR for machine learning.

> library(SparkR)
> # Start a local Spark context and the SQL context needed for DataFrames
> sc <- sparkR.init(master = "local")
> sqlContext <- sparkRSQL.init(sc)

> # Importing data into a local R data frame
> df <- read.csv("/Users/harikoduvely/Projects/Book/Data/ENB2012_data.csv", header = TRUE)
> # Excluding variables Y2, X6, and X8, and dropping the records after row 768, which contain mainly null values
> df <- df[1:768, c(1, 2, 3, 4, 5, 7, 9)]
> # Converting to a SparkR DataFrame
> dfsr <- createDataFrame(sqlContext, df)
> # Fitting a linear regression (Gaussian family) of Y1 on the remaining predictors
> model <- glm(Y1 ~ X1 + X2 + X3 + X4 + X5 + X7, data = dfsr, family = "gaussian")
> summary(model)
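
To see why a glm() call with the Gaussian family amounts to linear regression, here is a small sketch that runs locally in base R, without Spark. It uses synthetic data generated on the spot (not the ENB2012 data set), and the variable names toy, fit_glm, and fit_lm are illustrative assumptions. It shows base R behavior only and makes no claim about how SparkR fits the model internally.

```r
# Synthetic data for illustration only; this is not the ENB2012 data set.
set.seed(42)
n <- 200
toy <- data.frame(X1 = runif(n), X2 = runif(n))
toy$Y1 <- 3 + 2 * toy$X1 - 1.5 * toy$X2 + rnorm(n, sd = 0.1)

# Gaussian-family GLM (identity link by default) in base R
fit_glm <- glm(Y1 ~ X1 + X2, data = toy, family = gaussian())

# Ordinary least squares on the same formula
fit_lm <- lm(Y1 ~ X1 + X2, data = toy)

# The two sets of coefficient estimates agree up to numerical tolerance
same_coefs <- isTRUE(all.equal(coef(fit_glm), coef(fit_lm)))
print(same_coefs)
print(round(coef(fit_glm), 2))
```

The estimates should land close to the values used to generate the data (3, 2, and -1.5), which is a quick sanity check to run on a small local sample before fitting the same formula on a SparkR DataFrame.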

Summary

In this article, we have seen how to start SparkR from RStudio and how to run a linear regression with it. For more information on Spark, you can refer to:

  • https://www.packtpub.com/big-data-and-business-intelligence/spark-python-developers
  • https://www.packtpub.com/big-data-and-business-intelligence/spark-beginners

Published by
Packt
