In this article by Viswa Viswanathan and Shanthi Viswanathan, the authors of the book R Data Analysis Cookbook, we discover how R can be used in various ways such as comparison, classification, applying different functions, and so on.
We will cover the following recipes:
In large datasets, we often gain good insights by examining how different segments behave. The similarities and differences can reveal interesting patterns. This recipe shows how to create graphs that enable such comparisons.
If you have not already done so, download the code files and save the daily-bike-rentals.csv file in your R working directory. Read the data into R using the following command:
> bike <- read.csv("daily-bike-rentals.csv")
> bike$season <- factor(bike$season, levels = c(1,2,3,4),
+    labels = c("Spring", "Summer", "Fall", "Winter"))
> attach(bike)
We base this recipe on the task of generating histograms to facilitate the comparison of bike rentals by season.
We first look at how to generate histograms of the count of daily bike rentals by season using R’s base plotting system:
> par(mfrow = c(2,2))
> spring <- subset(bike, season == "Spring")$cnt
> summer <- subset(bike, season == "Summer")$cnt
> fall <- subset(bike, season == "Fall")$cnt
> winter <- subset(bike, season == "Winter")$cnt
> hist(spring, prob=TRUE, xlab = "Spring daily rentals", main = "")
> lines(density(spring))

> hist(summer, prob=TRUE, xlab = "Summer daily rentals", main = "")
> lines(density(summer))

> hist(fall, prob=TRUE, xlab = "Fall daily rentals", main = "")
> lines(density(fall))

> hist(winter, prob=TRUE, xlab = "Winter daily rentals", main = "")
> lines(density(winter))
You get the following output that facilitates comparisons across the seasons:
Using the ggplot2 package, we can achieve much the same result in a single command:

> library(ggplot2)
> qplot(cnt, data = bike) + facet_wrap(~ season, nrow=2) +
+   geom_histogram(fill = "blue")
You can also combine all four into a single histogram and show the seasonal differences through coloring:
> qplot(cnt, data = bike, fill = season)
When you plot a single variable with qplot, you get a histogram by default. Adding a facet enables you to generate one histogram per level of the chosen faceting variable. By default, facet_wrap chooses the panel layout for you; use its nrow or ncol arguments (as we did with nrow = 2) to control the arrangement.
You can use ggplot2 to generate comparative boxplots as well.
Instead of the default histogram, you can get a boxplot with either of the following two approaches:
> qplot(season, cnt, data = bike, geom = c("boxplot"), fill = season)

> ggplot(bike, aes(x = season, y = cnt)) + geom_boxplot()
The preceding code produces the following output:
The second line of the preceding code produces the following plot:
You can use several R packages to build classification trees. This recipe uses rpart to build the tree, rpart.plot to draw it, and caret to partition the data; the underlying recursive-partitioning idea is the same across packages.
If you do not already have the rpart, rpart.plot, and caret packages, install them now. Download the data files and place the banknote-authentication.csv file in your R working directory.
This recipe shows you how you can use the rpart package to build classification trees and the rpart.plot package to generate nice-looking tree diagrams:
> library(rpart)
> library(rpart.plot)
> library(caret)
> bn <- read.csv("banknote-authentication.csv")
> set.seed(1000)
> train.idx <- createDataPartition(bn$class, p = 0.7, list = FALSE)
> mod <- rpart(class ~ ., data = bn[train.idx, ], method = "class",
+    control = rpart.control(minsplit = 20, cp = 0.01))
> mod
n= 961

node), split, n, loss, yval, (yprob)
      * denotes terminal node

 1) root 961 423 0 (0.55983351 0.44016649)
   2) variance>=0.321235 511 52 0 (0.89823875 0.10176125)
     4) curtosis>=-4.3856 482 29 0 (0.93983402 0.06016598)
       8) variance>=0.92009 413 10 0 (0.97578692 0.02421308) *
       9) variance< 0.92009 69 19 0 (0.72463768 0.27536232)
        18) entropy< -0.167685 52 6 0 (0.88461538 0.11538462) *
        19) entropy>=-0.167685 17 4 1 (0.23529412 0.76470588) *
     5) curtosis< -4.3856 29 6 1 (0.20689655 0.79310345)
      10) variance>=2.3098 7 1 0 (0.85714286 0.14285714) *
      11) variance< 2.3098 22 0 1 (0.00000000 1.00000000) *
   3) variance< 0.321235 450 79 1 (0.17555556 0.82444444)
     6) skew>=6.83375 76 18 0 (0.76315789 0.23684211)
      12) variance>=-3.4449 57 0 0 (1.00000000 0.00000000) *
      13) variance< -3.4449 19 1 1 (0.05263158 0.94736842) *
     7) skew< 6.83375 374 21 1 (0.05614973 0.94385027)
      14) curtosis>=6.21865 106 16 1 (0.15094340 0.84905660)
        28) skew>=-3.16705 16 0 0 (1.00000000 0.00000000) *
        29) skew< -3.16705 90 0 1 (0.00000000 1.00000000) *
      15) curtosis< 6.21865 268 5 1 (0.01865672 0.98134328) *
> prp(mod, type = 2, extra = 104, nn = TRUE, fallen.leaves = TRUE, faclen = 4, varlen = 8, shadow.col = "gray")
The following output is obtained as a result of the preceding command:
> # First see the cptable
> # !!Note!!: Your table can be different because of the
> # random aspect in cross-validation
> mod$cptable
          CP nsplit  rel error    xerror       xstd
1 0.69030733      0 1.00000000 1.0000000 0.03637971
2 0.09456265      1 0.30969267 0.3262411 0.02570025
3 0.04018913      2 0.21513002 0.2387707 0.02247542
4 0.01891253      4 0.13475177 0.1607565 0.01879222
5 0.01182033      6 0.09692671 0.1347518 0.01731090
6 0.01063830      7 0.08510638 0.1323877 0.01716786
7 0.01000000      9 0.06382979 0.1276596 0.01687712
> # Choose the highest CP value whose xerror is not
> # greater than minimum xerror + xstd
> # With the above data that happens to be
> # the fifth one, 0.01182033
> # Your values could be different because of random
> # sampling
> mod.pruned = prune(mod, mod$cptable[5, "CP"])
> prp(mod.pruned, type = 2, extra = 104, nn = TRUE, fallen.leaves = TRUE, faclen = 4, varlen = 8, shadow.col = "gray")
> pred.pruned <- predict(mod.pruned, bn[-train.idx,], type = "class")
> table(bn[-train.idx,]$class, pred.pruned, dnn = c("Actual", "Predicted"))
      Predicted
Actual   0   1
     0 213  11
     1  11 176
Steps 1 to 3 load the packages, read the data, and identify the cases in the training partition, respectively. In step 3, we set the random seed so that your results should match those that we display.
Step 4 builds the classification tree model:
> mod <- rpart(class ~ ., data = bn[train.idx, ], method = "class", control = rpart.control(minsplit = 20, cp = 0.01))
The rpart() function builds the tree model based on the following:
- A formula, class ~ ., that designates class as the outcome variable and all the remaining variables as predictors
- The training data, specified through data = bn[train.idx, ]
- method = "class", which requests a classification tree rather than a regression tree
- A control object from rpart.control(), which here requires at least 20 cases in a node before attempting a split (minsplit) and a minimum complexity improvement of 0.01 for a split to be retained (cp)
Step 5 produces a textual display of the results. Step 6 uses the prp() function of the rpart.plot package to produce a nice-looking plot of the tree:
> prp(mod, type = 2, extra = 104, nn = TRUE, fallen.leaves = TRUE, faclen = 4, varlen = 8, shadow.col = "gray")
Step 7 prunes the tree to reduce the chance that the model too closely fits the training data, that is, to reduce overfitting. Within this step, we first look at the complexity table generated through cross-validation. We then use the table to determine the cutoff complexity level: we pick the CP value of the simplest tree whose xerror (cross-validation error) is not greater than one standard deviation above the minimum cross-validation error.
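The selection rule above can also be computed programmatically. The following is a minimal sketch using the cptable values displayed earlier, hard-coded here so that the snippet stands alone; with a live model you would index mod$cptable directly:

```r
# One-standard-error rule sketch, using the cptable values shown above
cpt <- cbind(
  CP     = c(0.69030733, 0.09456265, 0.04018913, 0.01891253,
             0.01182033, 0.01063830, 0.01000000),
  xerror = c(1.0000000, 0.3262411, 0.2387707, 0.1607565,
             0.1347518, 0.1323877, 0.1276596),
  xstd   = c(0.03637971, 0.02570025, 0.02247542, 0.01879222,
             0.01731090, 0.01716786, 0.01687712)
)
min.row <- which.min(cpt[, "xerror"])                # row with minimum xerror
cutoff  <- cpt[min.row, "xerror"] + cpt[min.row, "xstd"]
best    <- which(cpt[, "xerror"] <= cutoff)[1]       # simplest qualifying row
cpt[best, "CP"]  # 0.01182033, the value passed to prune()
```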
Steps 8 through 10 display the pruned tree, use the pruned tree to predict the class for the validation partition, and generate the error matrix for the validation partition, respectively.
Below, we discuss an important variation on predictions using classification trees.
We can generate probabilities in place of classifications by specifying type="prob":
> pred.pruned <- predict(mod, bn[-train.idx,], type = "prob")
Using the preceding raw probabilities and the class labels, we can generate an ROC chart with the ROCR package:

> library(ROCR)
> pred <- prediction(pred.pruned[,2], bn[-train.idx,"class"])
> perf <- performance(pred, "tpr", "fpr")
> plot(perf)
In this recipe, we look at various features to create and plot time-series objects. We will consider data with both a single and multiple time series.
If you have not already downloaded the data files, do it now and ensure that the files are in your R working directory.
> s <- read.csv("ts-example.csv")
> s.ts <- ts(s)
> class(s.ts)
[1] "ts"
> plot(s.ts)
> s.ts.a <- ts(s, start = 2002)
> s.ts.a
Time Series:
Start = 2002
End = 2101
Frequency = 1
     sales
[1,]    51
[2,]    56
[3,]    37
[4,]   101
[5,]    66
(output truncated)
> plot(s.ts.a)
> # results show that R treated this as an annual
> # time series with 2002 as the starting year
The result of the preceding commands is seen in the following graph:
To create a monthly time series, run the following command:
> # Create a monthly time series
> s.ts.m <- ts(s, start = c(2002,1), frequency = 12)
> s.ts.m
     Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
2002  51  56  37 101  66  63  45  68  70 107  86 102
2003  90 102  79  95  95 101 128 109 139 119 124 116
2004 106 100 114 133 119 114 125 167 149 165 135 152
2005 155 167 169 192 170 180 175 207 164 204 180 203
2006 215 222 205 202 203 209 200 199 218 221 225 212
2007 250 219 242 241 267 249 253 242 251 279 298 260
2008 269 257 279 273 275 314 288 286 290 288 304 291
2009 314 290 312 319 334 307 315 321 339 348 323 342
2010 340 348 354 291
> plot(s.ts.m) # note x axis on plot
The following plot can be seen as a result of the preceding commands:
> # Specify frequency = 4 for quarterly data
> s.ts.q <- ts(s, start = 2002, frequency = 4)
> s.ts.q
     Qtr1 Qtr2 Qtr3 Qtr4
2002   51   56   37  101
2003   66   63   45   68
2004   70  107   86  102
2005   90  102   79   95
2006   95  101  128  109
(output truncated)
> plot(s.ts.q)
> # When does the series start?
> start(s.ts.m)
[1] 2002    1
> # When does it end?
> end(s.ts.m)
[1] 2010    4
> # What is the frequency?
> frequency(s.ts.m)
[1] 12
> prices <- read.csv("prices.csv")
> prices.ts <- ts(prices, start = c(1980,1), frequency = 12)
> plot(prices.ts)
The plot in two separate panels appears as follows:
> # Plot both series in one panel with suitable legend
> plot(prices.ts, plot.type = "single", col = 1:2)
> legend("topleft", colnames(prices.ts), col = 1:2, lty = 1)
Two series plotted in one panel appear as follows:
Step 1 reads the data.
Step 2 uses the ts function to generate a time series object based on the raw data.
Step 3 uses the plot function to generate a line plot of the time series. We see that the time axis does not provide much information. Time series objects can represent time in more friendly terms.
Step 4 shows how to create time series objects with a better notion of time. It shows how we can treat a data series as an annual, monthly, or quarterly time series. The start and frequency parameters help us to control these data series.
Although the time series we provide is just a list of sequential values, in reality our data can have an implicit notion of time attached to it. For example, the data can be annual numbers, monthly numbers, or quarterly ones (or something else, such as 10-second observations of something). Given just the raw numbers (as in our data file, ts-example.csv), the ts function cannot figure out the time aspect and by default assumes no secondary time interval at all.
We can use the frequency parameter to tell ts how to interpret the time aspect of the data. The frequency parameter controls how many secondary time intervals there are in one major time interval. If we do not explicitly specify it, by default frequency takes on a value of 1. Thus, the following code treats the data as an annual sequence, starting in 2002:
> s.ts.a <- ts(s, start = 2002)
The following code, on the other hand, treats the data as a monthly time series starting in January 2002. If we specify the start parameter as a plain number, R treats the series as starting at the first subperiod, if any, of that period. When frequency differs from 1, the start parameter can instead be a vector specifying the major period and the subperiod where the series starts; c(2002,1) represents January 2002:
> s.ts.m <- ts(s, start = c(2002,1), frequency = 12)
Similarly, the following code treats the data as a quarterly sequence, starting in the first quarter of 2002:
> s.ts.q <- ts(s, start = 2002, frequency = 4)
The frequency values of 12 and 4 have a special meaning—they represent monthly and quarterly time sequences.
We can supply both start and end, just one of them, or neither. If we specify neither, R treats the start as 1 and figures out the end from the number of data points. If we supply one, R figures out the other from the number of data points.
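As a small illustration of this inference (with synthetic data, not the recipe's file), supplying only end for 24 monthly observations makes R derive the start:

```r
# R derives the missing endpoint from the data length
x <- ts(1:24, end = c(2003, 12), frequency = 12)
start(x)  # [1] 2002    1
```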
While start and end do not play a role in computations, frequency plays a big role in determining seasonality, which captures periodic fluctuations.
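To sketch how frequency enables seasonal analysis, the following builds a synthetic monthly series (unrelated to ts-example.csv) and decomposes it; decompose() works here only because the object declares frequency = 12:

```r
set.seed(42)
# four years of monthly data with a built-in annual cycle plus noise
x <- ts(10 + sin(2 * pi * (1:48) / 12) + rnorm(48, sd = 0.1),
        start = c(2002, 1), frequency = 12)
d <- decompose(x)
# d$seasonal holds the estimated monthly pattern; d$trend the smoothed level
```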
If we have some other specialized time series, we can specify the frequency parameter appropriately. Here are two examples: for weekly observations that repeat over an annual cycle, use frequency = 52; for observations taken every 10 seconds that repeat over a minute, use frequency = 6.
Step 5 shows the use of the functions start, end, and frequency to query time series objects.
Steps 6 and 7 show that R can handle data files that contain multiple time series.
The tapply function applies a function to each partition of the dataset. Hence, when we need to evaluate a function over subsets of a vector defined by a factor, tapply comes in handy.
Download the files and store the auto-mpg.csv file in your R working directory. Read the data and create factors for the cylinders variable:
> auto <- read.csv("auto-mpg.csv", stringsAsFactors=FALSE)
> auto$cylinders <- factor(auto$cylinders, levels = c(3,4,5,6,8),
+    labels = c("3cyl", "4cyl", "5cyl", "6cyl", "8cyl"))
To apply functions to subsets of a vector, follow these steps:
> tapply(auto$mpg, auto$cylinders, mean)
    3cyl     4cyl     5cyl     6cyl     8cyl
20.55000 29.28676 27.36667 19.98571 14.96311
> tapply(auto$mpg, list(cyl = auto$cylinders), mean)
cyl
    3cyl     4cyl     5cyl     6cyl     8cyl
20.55000 29.28676 27.36667 19.98571 14.96311
In step 1, the mean function is applied to the auto$mpg vector grouped according to the auto$cylinders factor. The grouping factor must have the same length as the input vector so that each element of the vector can be associated with a group.
The tapply function creates groups of the first argument based on each element’s group affiliation as defined by the second argument and passes each group to the user-specified function.
Step 2 shows that we can actually group by several factors specified as a list. In this case, tapply applies the function to each unique combination of the specified factors.
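A tiny self-contained example (synthetic values, not the auto-mpg data) shows the shape of the result with two grouping factors:

```r
# tapply over a list of two factors returns a matrix of group results
g1 <- factor(c("a", "a", "b", "b"))
g2 <- factor(c("x", "y", "x", "y"))
v  <- c(1, 2, 3, 4)
m  <- tapply(v, list(g1, g2), mean)
m["b", "x"]  # 3
```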
The by function is similar to tapply in that it applies a function to groups of rows in a dataset, but it passes the entire data frame for each group (rather than a single vector) to the function. The following example clarifies this.
In the following example, we find the correlation between mpg and weight for each cylinder type:
> by(auto, auto$cylinders, function(x) cor(x$mpg, x$weight))
auto$cylinders: 3cyl
[1] 0.6191685
---------------------------------------------------
auto$cylinders: 4cyl
[1] -0.5430774
---------------------------------------------------
auto$cylinders: 5cyl
[1] -0.04750808
---------------------------------------------------
auto$cylinders: 6cyl
[1] -0.4634435
---------------------------------------------------
auto$cylinders: 8cyl
[1] -0.5569099
Being an extensible system, R spreads its functionality across numerous packages, each exposing a large number of functions, and even experienced users cannot remember all the details. In this article, we went through a few techniques that R provides to analyze data and visualize the results.