In this article by Viswa Viswanathan and Shanthi Viswanathan, the authors of the book R Data Analysis Cookbook, we discover how R can be used in various ways such as comparison, classification, applying different functions, and so on.
We will cover the following recipes:
In large datasets, we often gain good insights by examining how different segments behave. The similarities and differences can reveal interesting patterns. This recipe shows how to create graphs that enable such comparisons.
If you have not already done so, download the code files and save the daily-bike-rentals.csv file in your R working directory. Read the data into R using the following command:
> bike <- read.csv("daily-bike-rentals.csv")
> bike$season <- factor(bike$season, levels = c(1,2,3,4),
+    labels = c("Spring", "Summer", "Fall", "Winter"))
> attach(bike)
We base this recipe on the task of generating histograms to facilitate the comparison of bike rentals by season.
We first look at how to generate histograms of the count of daily bike rentals by season using R’s base plotting system:
> par(mfrow = c(2,2))
> spring <- subset(bike, season == "Spring")$cnt
> summer <- subset(bike, season == "Summer")$cnt
> fall <- subset(bike, season == "Fall")$cnt
> winter <- subset(bike, season == "Winter")$cnt
> hist(spring, prob=TRUE, xlab = "Spring daily rentals", main = "")
> lines(density(spring))

> hist(summer, prob=TRUE, xlab = "Summer daily rentals", main = "")
> lines(density(summer))

> hist(fall, prob=TRUE, xlab = "Fall daily rentals", main = "")
> lines(density(fall))

> hist(winter, prob=TRUE, xlab = "Winter daily rentals", main = "")
> lines(density(winter))
You get the following output that facilitates comparisons across the seasons:
Using the ggplot2 package, we can achieve much the same result in a single command:

> library(ggplot2)
> qplot(cnt, data = bike) + facet_wrap(~ season, nrow=2) +
+   geom_histogram(fill = "blue")
You can also combine all four into a single histogram and show the seasonal differences through coloring:
> qplot(cnt, data = bike, fill = season)
When you plot a single variable with qplot, you get a histogram by default. Adding a facet enables you to generate one histogram per level of the chosen faceting variable. By default, facet_wrap chooses the panel layout for you; use its nrow or ncol arguments (as we did with nrow = 2) to control the arrangement.
You can use ggplot2 to generate comparative boxplots as well.
Instead of the default histogram, you can get a boxplot with either of the following two approaches:
> qplot(season, cnt, data = bike, geom = c("boxplot"), fill = season)

> ggplot(bike, aes(x = season, y = cnt)) + geom_boxplot()
The preceding code produces the following output:
The second line of the preceding code produces the following plot:
You can use several R packages to build classification trees. This recipe uses rpart to build the tree, rpart.plot to draw it, and caret to partition the data; the underlying recursive-partitioning idea is the same across packages.
If you do not already have the rpart, rpart.plot, and caret packages, install them now. Download the data files and place the banknote-authentication.csv file in your R working directory.
This recipe shows you how you can use the rpart package to build classification trees and the rpart.plot package to generate nice-looking tree diagrams:
> library(rpart)
> library(rpart.plot)
> library(caret)
> bn <- read.csv("banknote-authentication.csv")
> set.seed(1000)
> train.idx <- createDataPartition(bn$class, p = 0.7, list = FALSE)
> mod <- rpart(class ~ ., data = bn[train.idx, ], method = "class",
+    control = rpart.control(minsplit = 20, cp = 0.01))
> mod
n= 961

node), split, n, loss, yval, (yprob)
      * denotes terminal node

 1) root 961 423 0 (0.55983351 0.44016649)
   2) variance>=0.321235 511 52 0 (0.89823875 0.10176125)
     4) curtosis>=-4.3856 482 29 0 (0.93983402 0.06016598)
       8) variance>=0.92009 413 10 0 (0.97578692 0.02421308) *
       9) variance< 0.92009 69 19 0 (0.72463768 0.27536232)
        18) entropy< -0.167685 52 6 0 (0.88461538 0.11538462) *
        19) entropy>=-0.167685 17 4 1 (0.23529412 0.76470588) *
     5) curtosis< -4.3856 29 6 1 (0.20689655 0.79310345)
      10) variance>=2.3098 7 1 0 (0.85714286 0.14285714) *
      11) variance< 2.3098 22 0 1 (0.00000000 1.00000000) *
   3) variance< 0.321235 450 79 1 (0.17555556 0.82444444)
     6) skew>=6.83375 76 18 0 (0.76315789 0.23684211)
      12) variance>=-3.4449 57 0 0 (1.00000000 0.00000000) *
      13) variance< -3.4449 19 1 1 (0.05263158 0.94736842) *
     7) skew< 6.83375 374 21 1 (0.05614973 0.94385027)
      14) curtosis>=6.21865 106 16 1 (0.15094340 0.84905660)
        28) skew>=-3.16705 16 0 0 (1.00000000 0.00000000) *
        29) skew< -3.16705 90 0 1 (0.00000000 1.00000000) *
      15) curtosis< 6.21865 268 5 1 (0.01865672 0.98134328) *
> prp(mod, type = 2, extra = 104, nn = TRUE, fallen.leaves = TRUE, faclen = 4, varlen = 8, shadow.col = "gray")
The following output is obtained as a result of the preceding command:
> # First see the cptable
> # !!Note!!: Your table can be different because of the
> # random aspect in cross-validation
> mod$cptable
          CP nsplit  rel error    xerror       xstd
1 0.69030733      0 1.00000000 1.0000000 0.03637971
2 0.09456265      1 0.30969267 0.3262411 0.02570025
3 0.04018913      2 0.21513002 0.2387707 0.02247542
4 0.01891253      4 0.13475177 0.1607565 0.01879222
5 0.01182033      6 0.09692671 0.1347518 0.01731090
6 0.01063830      7 0.08510638 0.1323877 0.01716786
7 0.01000000      9 0.06382979 0.1276596 0.01687712
> # Choose the highest CP value whose xerror is not
> # greater than minimum xerror + xstd
> # With the above data that happens to be
> # the fifth one, 0.01182033
> # Your values could be different because of random
> # sampling
> mod.pruned = prune(mod, mod$cptable[5, "CP"])
> prp(mod.pruned, type = 2, extra = 104, nn = TRUE, fallen.leaves = TRUE, faclen = 4, varlen = 8, shadow.col = "gray")
> pred.pruned <- predict(mod.pruned, bn[-train.idx,], type = "class")
> table(bn[-train.idx,]$class, pred.pruned, dnn = c("Actual", "Predicted"))
      Predicted
Actual   0   1
     0 213  11
     1  11 176
Steps 1 to 3 load the packages, read the data, and identify the cases in the training partition, respectively. In step 3, we set the random seed so that your results should match those that we display.
Step 4 builds the classification tree model:
> mod <- rpart(class ~ ., data = bn[train.idx, ], method = "class", control = rpart.control(minsplit = 20, cp = 0.01))
The rpart() function builds the tree model based on the following:
- A formula, class ~ ., that designates class as the outcome variable and all the remaining variables as predictors
- The training data, specified through data = bn[train.idx, ]
- method = "class", which requests a classification tree rather than a regression tree
- A control object from rpart.control(), which here requires at least 20 cases in a node before attempting a split (minsplit) and a minimum complexity improvement of 0.01 for a split to be retained (cp)
Step 5 produces a textual display of the results. Step 6 uses the prp() function of the rpart.plot package to produce a nice-looking plot of the tree:
> prp(mod, type = 2, extra = 104, nn = TRUE, fallen.leaves = TRUE, faclen = 4, varlen = 8, shadow.col = "gray")
Step 7 prunes the tree to reduce the chance that the model too closely fits the training data, that is, to reduce overfitting. Within this step, we first look at the complexity table generated through cross-validation. We then use the table to determine the cutoff complexity level: we pick the CP value of the simplest tree whose xerror (cross-validation error) is not greater than one standard deviation above the minimum cross-validation error.
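The selection rule above can also be computed programmatically. The following is a minimal sketch using the cptable values displayed earlier, hard-coded here so that the snippet stands alone; with a live model you would index mod$cptable directly:

```r
# One-standard-error rule sketch, using the cptable values shown above
cpt <- cbind(
  CP     = c(0.69030733, 0.09456265, 0.04018913, 0.01891253,
             0.01182033, 0.01063830, 0.01000000),
  xerror = c(1.0000000, 0.3262411, 0.2387707, 0.1607565,
             0.1347518, 0.1323877, 0.1276596),
  xstd   = c(0.03637971, 0.02570025, 0.02247542, 0.01879222,
             0.01731090, 0.01716786, 0.01687712)
)
min.row <- which.min(cpt[, "xerror"])                # row with minimum xerror
cutoff  <- cpt[min.row, "xerror"] + cpt[min.row, "xstd"]
best    <- which(cpt[, "xerror"] <= cutoff)[1]       # simplest qualifying row
cpt[best, "CP"]  # 0.01182033, the value passed to prune()
```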
Steps 8 through 10 display the pruned tree, use the pruned tree to predict the class for the validation partition, and generate the error matrix for the validation partition, respectively.
Below, we discuss an important variation on predictions using classification trees.
We can generate probabilities in place of classifications by specifying type="prob":
> pred.pruned <- predict(mod, bn[-train.idx,], type = "prob")
Using the preceding raw probabilities and the class labels, we can generate an ROC chart with the ROCR package:

> library(ROCR)
> pred <- prediction(pred.pruned[,2], bn[-train.idx,"class"])
> perf <- performance(pred, "tpr", "fpr")
> plot(perf)
In this recipe, we look at various features to create and plot time-series objects. We will consider data with both a single and multiple time series.
If you have not already downloaded the data files, do it now and ensure that the files are in your R working directory.
> s <- read.csv("ts-example.csv")
> s.ts <- ts(s)
> class(s.ts)
[1] "ts"
> plot(s.ts)
> s.ts.a <- ts(s, start = 2002)
> s.ts.a
Time Series:
Start = 2002
End = 2101
Frequency = 1
     sales
[1,]    51
[2,]    56
[3,]    37
[4,]   101
[5,]    66
(output truncated)
> plot(s.ts.a)
> # results show that R treated this as an annual
> # time series with 2002 as the starting year
The result of the preceding commands is seen in the following graph:
To create a monthly time series, run the following command:
> # Create a monthly time series
> s.ts.m <- ts(s, start = c(2002,1), frequency = 12)
> s.ts.m
     Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
2002  51  56  37 101  66  63  45  68  70 107  86 102
2003  90 102  79  95  95 101 128 109 139 119 124 116
2004 106 100 114 133 119 114 125 167 149 165 135 152
2005 155 167 169 192 170 180 175 207 164 204 180 203
2006 215 222 205 202 203 209 200 199 218 221 225 212
2007 250 219 242 241 267 249 253 242 251 279 298 260
2008 269 257 279 273 275 314 288 286 290 288 304 291
2009 314 290 312 319 334 307 315 321 339 348 323 342
2010 340 348 354 291
> plot(s.ts.m) # note x axis on plot
The following plot can be seen as a result of the preceding commands:
> # Specify frequency = 4 for quarterly data
> s.ts.q <- ts(s, start = 2002, frequency = 4)
> s.ts.q
     Qtr1 Qtr2 Qtr3 Qtr4
2002   51   56   37  101
2003   66   63   45   68
2004   70  107   86  102
2005   90  102   79   95
2006   95  101  128  109
(output truncated)
> plot(s.ts.q)
> # When does the series start?
> start(s.ts.m)
[1] 2002    1
> # When does it end?
> end(s.ts.m)
[1] 2010    4
> # What is the frequency?
> frequency(s.ts.m)
[1] 12
> prices <- read.csv("prices.csv")
> prices.ts <- ts(prices, start = c(1980,1), frequency = 12)
> plot(prices.ts)
The plot in two separate panels appears as follows:
> # Plot both series in one panel with suitable legend
> plot(prices.ts, plot.type = "single", col = 1:2)
> legend("topleft", colnames(prices.ts), col = 1:2, lty = 1)
Two series plotted in one panel appear as follows:
Step 1 reads the data.
Step 2 uses the ts function to generate a time series object based on the raw data.
Step 3 uses the plot function to generate a line plot of the time series. We see that the time axis does not provide much information. Time series objects can represent time in more friendly terms.
Step 4 shows how to create time series objects with a better notion of time. It shows how we can treat a data series as an annual, monthly, or quarterly time series. The start and frequency parameters help us to control these data series.
Although the time series we provide is just a list of sequential values, in reality our data can have an implicit notion of time attached to it. For example, the data can be annual numbers, monthly numbers, or quarterly ones (or something else, such as 10-second observations of something). Given just the raw numbers (as in our data file, ts-example.csv), the ts function cannot figure out the time aspect and by default assumes no secondary time interval at all.
We can use the frequency parameter to tell ts how to interpret the time aspect of the data. The frequency parameter controls how many secondary time intervals there are in one major time interval. If we do not explicitly specify it, by default frequency takes on a value of 1. Thus, the following code treats the data as an annual sequence, starting in 2002:
> s.ts.a <- ts(s, start = 2002)
The following code, on the other hand, treats the data as a monthly time series starting in January 2002. If we specify the start parameter as a plain number, R treats the series as starting at the first subperiod, if any, of that period. When frequency differs from 1, the start parameter can instead be a vector specifying the major period and the subperiod where the series starts; c(2002,1) represents January 2002:
> s.ts.m <- ts(s, start = c(2002,1), frequency = 12)
Similarly, the following code treats the data as a quarterly sequence, starting in the first quarter of 2002:
> s.ts.q <- ts(s, start = 2002, frequency = 4)
The frequency values of 12 and 4 have a special meaning—they represent monthly and quarterly time sequences.
We can supply both start and end, just one of them, or neither. If we specify neither, R treats the start as 1 and figures out the end from the number of data points. If we supply one, R figures out the other from the number of data points.
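As a small illustration of this inference (with synthetic data, not the recipe's file), supplying only end for 24 monthly observations makes R derive the start:

```r
# R derives the missing endpoint from the data length
x <- ts(1:24, end = c(2003, 12), frequency = 12)
start(x)  # [1] 2002    1
```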
While start and end do not play a role in computations, frequency plays a big role in determining seasonality, which captures periodic fluctuations.
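To sketch how frequency enables seasonal analysis, the following builds a synthetic monthly series (unrelated to ts-example.csv) and decomposes it; decompose() works here only because the object declares frequency = 12:

```r
set.seed(42)
# four years of monthly data with a built-in annual cycle plus noise
x <- ts(10 + sin(2 * pi * (1:48) / 12) + rnorm(48, sd = 0.1),
        start = c(2002, 1), frequency = 12)
d <- decompose(x)
# d$seasonal holds the estimated monthly pattern; d$trend the smoothed level
```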
If we have some other specialized time series, we can specify the frequency parameter appropriately. Here are two examples: for weekly observations that repeat over an annual cycle, use frequency = 52; for observations taken every 10 seconds that repeat over a minute, use frequency = 6.
Step 5 shows the use of the functions start, end, and frequency to query time series objects.
Steps 6 and 7 show that R can handle data files that contain multiple time series.
The tapply function applies a function to each partition of the dataset. Hence, when we need to evaluate a function over subsets of a vector defined by a factor, tapply comes in handy.
Download the files and store the auto-mpg.csv file in your R working directory. Read the data and create factors for the cylinders variable:
> auto <- read.csv("auto-mpg.csv", stringsAsFactors=FALSE)
> auto$cylinders <- factor(auto$cylinders, levels = c(3,4,5,6,8),
+    labels = c("3cyl", "4cyl", "5cyl", "6cyl", "8cyl"))
To apply functions to subsets of a vector, follow these steps:
> tapply(auto$mpg, auto$cylinders, mean)
    3cyl     4cyl     5cyl     6cyl     8cyl
20.55000 29.28676 27.36667 19.98571 14.96311
> tapply(auto$mpg, list(cyl = auto$cylinders), mean)
cyl
    3cyl     4cyl     5cyl     6cyl     8cyl
20.55000 29.28676 27.36667 19.98571 14.96311
In step 1, the mean function is applied to the auto$mpg vector grouped according to the auto$cylinders factor. The grouping factor must have the same length as the input vector so that each element of the vector can be associated with a group.
The tapply function creates groups of the first argument based on each element’s group affiliation as defined by the second argument and passes each group to the user-specified function.
Step 2 shows that we can actually group by several factors specified as a list. In this case, tapply applies the function to each unique combination of the specified factors.
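A tiny self-contained example (synthetic values, not the auto-mpg data) shows the shape of the result with two grouping factors:

```r
# tapply over a list of two factors returns a matrix of group results
g1 <- factor(c("a", "a", "b", "b"))
g2 <- factor(c("x", "y", "x", "y"))
v  <- c(1, 2, 3, 4)
m  <- tapply(v, list(g1, g2), mean)
m["b", "x"]  # 3
```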
The by function is similar to tapply in that it applies a function to groups of rows in a dataset, but it passes the entire data frame for each group (rather than a single vector) to the function. The following example clarifies this.
In the following example, we find the correlation between mpg and weight for each cylinder type:
> by(auto, auto$cylinders, function(x) cor(x$mpg, x$weight))
auto$cylinders: 3cyl
[1] 0.6191685
---------------------------------------------------
auto$cylinders: 4cyl
[1] -0.5430774
---------------------------------------------------
auto$cylinders: 5cyl
[1] -0.04750808
---------------------------------------------------
auto$cylinders: 6cyl
[1] -0.4634435
---------------------------------------------------
auto$cylinders: 8cyl
[1] -0.5569099
Being an extensible system, R spreads its functionality across numerous packages, each exposing a large number of functions, and even experienced users cannot remember all the details. In this article, we went through a few techniques that R provides to analyze data and visualize the results.