Getting Started with Packages in R

R is a powerful programming language for loading, manipulating, transforming, and visualizing data. Its extensibility, combined with the efforts of a highly active open source community, makes it more powerful still. This community constantly contributes to the language in the form of packages, which are, at their core, sets of thematically linked functions. By building on the work that has gone into these open source packages, an R user can substantially improve both the readability and the efficiency of their code.

In this post, you will learn how to install new packages to extend the functionality of R and how to load those packages into your session. We will also explore some of the most useful packages that have been contributed by the R community!

Installing Packages

There are a number of places where R packages can be stored, but the three most popular locations are CRAN, Bioconductor, and GitHub.

CRAN

The Comprehensive R Archive Network (CRAN) is the home of R. At the time of this writing, there are over 8,000 packages hosted on CRAN, all of which are free to download and use. If you want to use R in your field but are not sure where to begin, the CRAN task view for your field or area of interest is likely a good place to start. There you will find listings of relevant packages, along with short descriptions and links to source code.

Let's say you've entered the "Reproducible Research" task view and have decided that the package named knitr sounds useful. To install knitr from CRAN, you type this in your R console:

install.packages("knitr")
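A script that runs repeatedly does not need to reinstall a package each time. The sketch below wraps that check in a helper called install_if_missing, a hypothetical name introduced here for illustration and not part of base R. It relies only on requireNamespace() and install.packages(), and it is demonstrated on stats, which ships with R, so that running it does not trigger a download.

```r
# Hypothetical helper (not part of base R): install a CRAN package only when
# it cannot already be found in the current library paths.
install_if_missing <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # Report (invisibly) whether the package is available afterwards
  invisible(requireNamespace(pkg, quietly = TRUE))
}

# "stats" ships with R, so this call never downloads anything
install_if_missing("stats")

# For a CRAN package such as knitr, the equivalent calls would be:
# install_if_missing("knitr")
# library(knitr)  # attach the package so its functions can be called directly
```

Installing happens once per library, whereas library() has to be called in every new session in which you want the package attached.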

Bioconductor

Bioconductor is home to over 1,000 packages for R, with a focus on packages that can be used for bioinformatics research. One of the main differences between Bioconductor and CRAN is that Bioconductor has stricter guidelines for accepting packages than CRAN.

After finding a package on Bioconductor, such as EBImage, install it by running these commands:

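# BiocManager has replaced the older biocLite.R installer script
if (!requireNamespace("BiocManager", quietly=TRUE)) install.packages("BiocManager")
BiocManager::install("EBImage")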

It is possible to install from Bioconductor using install.packages, but this is not recommended for reasons discussed here.

GitHub

GitHub is a space where you can post the source code of your work to keep it under version control and to encourage and facilitate collaboration. Often, GitHub is where the truly bleeding-edge packages can be found, and where package updates appear first. Many of the packages on CRAN have a development version on GitHub, occasionally with features absent from the CRAN version. As you browse GitHub, you will likely find some packages that will never be put on CRAN or Bioconductor. Such packages have not passed the submission checks that those repositories apply, so exercise caution when using packages sourced from GitHub.

Should you find a package on GitHub and wish to install it, you must first install the devtools package from CRAN. You then have access to the install_github() function, whose argument is the name of the developer, followed by a slash, and then the name of the package:

install.packages("devtools")
# Install swirl! See: https://github.com/swirldev/swirl
devtools::install_github("swirldev/swirl")

The syntax devtools::xxxx() simply means "use the xxxx function from the devtools package". You could just as easily have called library(devtools) after installing and then simply typed install_github().
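To see the two calling styles side by side without installing anything, the sketch below uses tools, a package that ships with R but is not attached by default. The file name is made up for illustration.

```r
# Style 1: call one function through its namespace, without attaching the package
tools::file_ext("report.Rmd")   # "Rmd"

# Style 2: attach the package, then call the function by its bare name
library(tools)
file_ext("report.Rmd")          # "Rmd"
```

The :: form is handy when you need a single function once, or when two attached packages export functions with the same name and you want to be explicit about which one is used.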

The devtools package also includes a number of other functions for installing packages that are stored locally, on Bitbucket, or in an SVN repository. Try typing ??devtools::install_ to see a full list.

Some Popular Packages

Now that you know the basic commands for installing packages, let's take a very short look at some of the more popular and useful packages.

Visualizing data with ggplot2

ggplot2 is a package that is used to visualize data. It provides a method of chart-building that is intuitive (based on The Grammar of Graphics) and results in aesthetically pleasing graphics.

Here is an example of a graphic produced using ggplot2:

install.packages("ggplot2") # Install from CRAN
library(ggplot2)            # Load ggplot2
data(diamonds)              # Load diamonds data set 

# Create plot with carat on x axis, price on y,
# and color based on quality of cut
ggplot(data=diamonds, aes(x=carat, y=price, col=cut)) +
  geom_point(alpha=0.5) # Use points (dots) to represent data

[Figure: scatter plot of diamond price against carat, with points colored by cut]

Manipulating data with dplyr

dplyr provides a number of verbs for manipulating data (select, filter, mutate, arrange, summarize, and so on), each of which corresponds to a common task when working with data.

To see how dplyr can simplify your workflow, let's compare the base R and dplyr code for subsetting the diamonds data to only those gems with an Ideal cut that weigh more than 2 carats:

install.packages("dplyr") # Install dplyr from CRAN
library(dplyr)            # Load dplyr

BaseR <- diamonds[which(diamonds$cut == "Ideal" & diamonds$carat > 2),]
# vs:
Dplyr <- filter(diamonds, cut == "Ideal" & carat > 2)

Clearly the dplyr version is more succinct, more readable, and, most importantly, easier to write.
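The verbs become more useful when chained together. The following is a minimal sketch, assuming dplyr and ggplot2 are both installed. It uses the %>% pipe that dplyr exports, and the names ideal_summary, price_per_carat, and mean_ppc are chosen here purely for illustration.

```r
library(dplyr)

# diamonds ships with ggplot2; referring to it with :: avoids attaching ggplot2
ideal_summary <- ggplot2::diamonds %>%
  filter(cut == "Ideal") %>%                        # keep rows with an Ideal cut
  select(carat, color, price) %>%                   # keep only three columns
  mutate(price_per_carat = price / carat) %>%       # add a derived column
  group_by(color) %>%                               # one group per color grade
  summarize(mean_ppc = mean(price_per_carat)) %>%   # collapse each group to one row
  arrange(desc(mean_ppc))                           # highest average price per carat first

ideal_summary
```

Each step takes a data frame and returns a data frame, so the pipeline reads top to bottom in the order the operations are applied.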

Machine learning with caret

The caret package is a collection of functions that unify the syntax used by many of the most popular machine learning packages implemented in R. caret will allow you to quickly prepare your data, create predictive models, tune the model parameters, and interpret the results.

Here is a simple working example of training and tuning a k-nearest neighbors model with caret to predict the price of a diamond based on cut, color, and clarity:

install.packages("caret")
library(caret)

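# Split data into training and testing sets;
# p=0.01 places about 1% of the rows in the training set
inTrain <- createDataPartition(diamonds$price, p=0.01, list=FALSE)
training <- diamonds[inTrain,]
testing <- diamonds[-inTrain,]

# Train a k-nearest neighbors model; train() tunes the number of neighbors
knn_model <- train(price ~ cut + color + clarity, data=training, method="knn")
plot(knn_model) # Plot resampled RMSE against the number of neighbors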

[Figure: RMSE of the k-nearest neighbors model plotted against the number of neighbors]

You can see that increasing the number of neighbors in the model improves its accuracy: the RMSE (root mean squared error, a measure of the typical distance between predictions and observed values) decreases.
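To make the RMSE concrete, here is a small worked sketch in base R. The rmse helper and the three price values are made up for illustration; they are not taken from the diamonds data or from caret.

```r
# Hypothetical helper: root mean squared error between predictions and observations
rmse <- function(predicted, observed) {
  sqrt(mean((predicted - observed)^2))
}

# Made-up prices for illustration only
observed  <- c(500, 1200, 3000)
predicted <- c(450, 1300, 2900)

# Errors are -50, 100, -100; squared: 2500, 10000, 10000; mean: 7500
rmse(predicted, observed)   # sqrt(7500), about 86.6

# Assuming the knn_model and testing objects from the caret example exist,
# the same helper could score the held-out data:
# rmse(predict(knn_model, newdata = testing), testing$price)
```

Because the errors are squared before averaging, a few large misses raise the RMSE more than many small ones.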

Summary

In this post, you learned how to install and load packages from three different major sources: CRAN, Bioconductor, and GitHub. You also took a brief look at three popular packages: ggplot2 for visualization, dplyr for manipulation, and caret for machine learning.

About the author

Joel Carlson is a recent MSc graduate from Seoul National University, and current Data Science Fellow at Galvanize in San Francisco. He has contributed two R packages to CRAN (radiomics and RImagePalette). You can learn more or contact him at his personal website.