
Data analysis is a combination of art and science. The art lies in data exploration and visualization, which is best done with strong intuition and a good understanding of the data. The science lies in statistical analysis, which relies on concrete knowledge of statistics and solid analytic skills. Both parts of serious research, however, require proper tools and the skills to work with them.

R is exactly the right tool for data analysis. In this article by Kun Ren, author of the book Learning R Programming, we will discuss how R and the data.table package make it easy to transform data and, thus, greatly boost our productivity.


Loading data as data frames

The most basic data structures in R are atomic vectors, such as numeric, logical, character, and complex vectors, and lists. An atomic vector stores elements of the same type, while a list is allowed to store elements of different types.

The most commonly used data structure for storing real-world data in R is the data frame. A data frame stores data in tabular form. In essence, a data frame is a list of vectors of equal length but possibly different types.
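As a minimal illustration (with made-up values, not the product data used below), a data frame can be constructed from vectors of equal length but different types:

# Three vectors of the same length but different types form one data frame.
df <- data.frame(
  id = c("A1", "A2", "A3"),       # character
  count = c(10L, 25L, 3L),        # integer
  passed = c(TRUE, FALSE, TRUE),  # logical
  stringsAsFactors = FALSE
)
str(df)  # one observation per row, one column per vector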

Most of the code in this article is based on a set of fictitious data about some products (you can download the data at https://gist.github.com/renkun-ken/ba2d33f21efded23db66a68240c20c92). We will use the readr package to load the data for better handling of column types. If you don't have this package installed, please run install.packages("readr").

library(readr)

product_info <- read_csv("data/product-info.csv")

product_info

##    id      name  type   class released

## 1 T01    SupCar   toy vehicle      yes

## 2 T02  SupPlane   toy vehicle       no

## 3 M01     JeepX model vehicle      yes

## 4 M02 AircraftX model vehicle      yes

## 5 M03    Runner model  people      yes

## 6 M04    Dancer model  people       no

Once the data is loaded into memory as a data frame, we can take a look at its column types, shown as follows:

sapply(product_info, class)

##          id        name        type       class    released

## "character" "character" "character" "character" "character"
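If we prefer to declare the column types up front instead of relying on readr's guessing, read_csv() also accepts a col_types specification, for example as a compact string in which "c" stands for character; a minimal sketch with the same file:

# Explicitly read all five columns as character ("c" = character).
product_info <- read_csv("data/product-info.csv", col_types = "ccccc")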

Using built-in functions to manipulate data frames

Although a data frame is essentially a list of vectors, we can access it like a matrix because all column vectors have the same length. To select rows that meet certain conditions, we supply a logical vector as the first argument of [] and leave the second empty.

For example, we can take out all rows of toy type, shown as follows:

product_info[product_info$type == "toy", ]

##    id     name type   class released

## 1 T01   SupCar  toy vehicle      yes

## 2 T02 SupPlane  toy vehicle       no

Or, we can take out all rows that are not released.

product_info[product_info$released == "no", ]

##    id     name  type   class released

## 2 T02 SupPlane   toy vehicle       no

## 6 M04   Dancer model  people       no

To filter columns, we can supply a character vector as the second argument and leave the first empty, which is exactly the same as how we subset a matrix.

product_info[1:3, c("id", "name", "type")]

##    id     name  type

## 1 T01   SupCar   toy

## 2 T02 SupPlane   toy

## 3 M01    JeepX model

Alternatively, we can filter the data frame by treating it as a list: we supply a single character vector of column names in [].

product_info[c("id", "name", "class")]

##    id      name   class

## 1 T01    SupCar vehicle

## 2 T02  SupPlane vehicle

## 3 M01     JeepX vehicle

## 4 M02 AircraftX vehicle

## 5 M03    Runner  people

## 6 M04    Dancer  people

To filter a data frame by both row and column, we can supply a vector as the first argument to select rows and a vector as the second to select columns.

product_info[product_info$type == "toy", c("name", "class", "released")]

##       name   class released

## 1   SupCar vehicle      yes

## 2 SupPlane vehicle       no

If the row filtering condition is based on values of certain columns, the preceding code can be very redundant, especially when the condition gets more complicated. Another built-in function to simplify code is subset, as introduced previously.

subset(product_info,

  subset = type == "model" & released == "yes",

  select = name:class)

##        name  type   class

## 3     JeepX model vehicle

## 4 AircraftX model vehicle

## 5    Runner model  people

The subset function uses nonstandard evaluation: the expressions are evaluated in the context of the data frame, so we can refer to its columns directly without typing product_info over and over.
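For comparison, the same query written with ordinary subsetting has to repeat the data frame name in every condition:

product_info[product_info$type == "model" & product_info$released == "yes",
  c("name", "type", "class")]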

Similarly, we can use with to evaluate an expression in the context of the data frame, that is, the columns of the data frame can be used as symbols in the expression without repeatedly specifying the data frame.

with(product_info, name[released == "no"])

## [1] "SupPlane" "Dancer"

The expression can be more than simple subsetting. We can also summarize the data by counting the occurrences of each possible value of a vector. For example, we can tabulate the types of the released products.

with(product_info, table(type[released == "yes"]))

##

## model   toy

##     3     1

In addition to the table of product information, we also have a table of product statistics that describe some properties of each product.

product_stats <- read_csv("data/product-stats.csv")

product_stats

##    id material size weight

## 1 T01    Metal  120   10.0

## 2 T02    Metal  350   45.0

## 3 M01 Plastics   50     NA

## 4 M02 Plastics   85    3.0

## 5 M03     Wood   15     NA

## 6 M04     Wood   16    0.6

Now, how can we get the names of the products with the three largest sizes? One way is to sort the records in product_stats by size in descending order, take the id values of the top three records, and use these values to filter the rows of product_info by id.

top_3_id <- product_stats[order(product_stats$size, decreasing = TRUE), "id"][1:3]

product_info[product_info$id %in% top_3_id, ]

##    id      name  type   class released

## 1 T01    SupCar   toy vehicle      yes

## 2 T02  SupPlane   toy vehicle       no

## 4 M02 AircraftX model vehicle      yes

This approach looks quite redundant. Note that product_info and product_stats actually describe the same set of products from different perspectives. The connection between the two tables is the id column: each id is unique and refers to the same product in both tables. To access both sets of information, we can put the two tables together into one data frame. The simplest way to do this is to use merge:

product_table <- merge(product_info, product_stats, by = "id")

product_table

##    id      name  type   class released material size weight

## 1 M01     JeepX model vehicle      yes Plastics   50     NA

## 2 M02 AircraftX model vehicle      yes Plastics   85    3.0

## 3 M03    Runner model  people      yes     Wood   15     NA

## 4 M04    Dancer model  people       no     Wood   16    0.6

## 5 T01    SupCar   toy vehicle      yes    Metal  120   10.0

## 6 T02  SupPlane   toy vehicle       no    Metal  350   45.0

The result is a new data frame that combines product_info and product_stats by their shared id column. In fact, even if you reorder the records in the second table, the two tables can still be merged correctly.
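For instance, merging a row-shuffled copy of product_stats gives the same combined table, because merge() matches records by id rather than by position:

# Shuffle the rows of product_stats; merge() still pairs rows by id,
# so the result is the same combined table as before.
shuffled_stats <- product_stats[sample(nrow(product_stats)), ]
merge(product_info, shuffled_stats, by = "id")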

With the combined table, we can do things more easily. For example, we can sort the data frame by a column that comes from either of the original tables without having to manually align the other.

product_table[order(product_table$size), ]

##    id      name  type   class released material size weight

## 3 M03    Runner model  people      yes     Wood   15     NA

## 4 M04    Dancer model  people       no     Wood   16    0.6

## 1 M01     JeepX model vehicle      yes Plastics   50     NA

## 2 M02 AircraftX model vehicle      yes Plastics   85    3.0

## 5 T01    SupCar   toy vehicle      yes    Metal  120   10.0

## 6 T02  SupPlane   toy vehicle       no    Metal  350   45.0

To answer the earlier question about the three largest products, we can work directly on the merged table and get the same answer.

product_table[order(product_table$size, decreasing = TRUE), "name"][1:3]

## [1] "SupPlane"  "SupCar"    "AircraftX"

The merged data frame allows us to sort the records by a column from one of the original tables and filter them by a column from the other. For example, we can first sort the product records by weight in descending order and then keep only the records of model type. Note that the filtering condition must be evaluated on the reordered data frame, so we store the sorted table first:

sorted_table <- product_table[order(product_table$weight, decreasing = TRUE), ]

sorted_table[sorted_table$type == "model", ]

##    id      name  type   class released material size weight

## 2 M02 AircraftX model vehicle      yes Plastics   85    3.0

## 4 M04    Dancer model  people       no     Wood   16    0.6

## 1 M01     JeepX model vehicle      yes Plastics   50     NA

## 3 M03    Runner model  people      yes     Wood   15     NA

Sometimes, column values are stored as text but can be converted to standard R data structures that better represent the data. For example, the released column in product_table only takes the values yes and no, which can be better represented by a logical vector. We could use <- to modify the column values, as we learned previously. However, it is usually better to create a new data frame with the existing columns properly adjusted and new columns added, without polluting the original data. To do this, we can use transform:

transform(product_table,

  released = ifelse(released == "yes", TRUE, FALSE),

  density = weight / size)

##    id      name  type   class released material size weight

## 1 M01     JeepX model vehicle     TRUE Plastics   50     NA

## 2 M02 AircraftX model vehicle     TRUE Plastics   85    3.0

## 3 M03    Runner model  people     TRUE     Wood   15     NA

## 4 M04    Dancer model  people    FALSE     Wood   16    0.6

## 5 T01    SupCar   toy vehicle     TRUE    Metal  120   10.0

## 6 T02  SupPlane   toy vehicle    FALSE    Metal  350   45.0

##      density

## 1         NA

## 2 0.03529412

## 3         NA

## 4 0.03750000

## 5 0.08333333

## 6 0.12857143

The result is a new data frame with released converted to a logical vector and a new density column added. You can easily verify that product_table is not modified at all.
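A quick check confirms this: the original table still stores released as a character vector and has no density column.

class(product_table$released)

## [1] "character"

"density" %in% names(product_table)

## [1] FALSE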

Additionally, note that transform is like subset, as both functions use nonstandard evaluation to allow direct use of data frame columns as symbols in the arguments so that we don’t have to type product_table$ all the time.

Now, we will load another table into R: the test results of the quality and durability of each product. We store the data in product_tests.

product_tests <- read_csv("data/product-tests.csv")

product_tests

##    id quality durability waterproof

## 1 T01      NA         10         no

## 2 T02      10          9         no

## 3 M01       6          4        yes

## 4 M02       6          5        yes

## 5 M03       5         NA        yes

## 6 M04       6          6        yes

Note that the values in both quality and durability contain missing values (NA). To exclude all rows with missing values, we can use na.omit():

na.omit(product_tests)

##    id quality durability waterproof

## 2 T02      10          9         no

## 3 M01       6          4        yes

## 4 M02       6          5        yes

## 6 M04       6          6        yes

Another way is to use complete.cases() to get a logical vector indicating which rows are complete, that is, which rows contain no missing values:

complete.cases(product_tests)

## [1] FALSE  TRUE  TRUE  TRUE FALSE  TRUE

Then, we can use this logical vector to filter the data frame. For example, we can get the id column of all complete rows as follows:

product_tests[complete.cases(product_tests), "id"]

## [1] "T02" "M01" "M02" "M04"

Or, we can get the id column of all incomplete rows:

product_tests[!complete.cases(product_tests), "id"]

## [1] "T01" "M03"

Note that product_info, product_stats, and product_tests all share an id column, so we can merge them all into one table. Unfortunately, there is no built-in function that merges an arbitrary number of data frames at once; merge() only combines two at a time, so we either merge step by step or fold the merge over a list (as sketched after the following example).

merge(product_table, product_tests, by = "id")

##    id      name  type   class released material size weight

## 1 M01     JeepX model vehicle      yes Plastics   50     NA

## 2 M02 AircraftX model vehicle      yes Plastics   85    3.0

## 3 M03    Runner model  people      yes     Wood   15     NA

## 4 M04    Dancer model  people       no     Wood   16    0.6

## 5 T01    SupCar   toy vehicle      yes    Metal  120   10.0

## 6 T02  SupPlane   toy vehicle       no    Metal  350   45.0

##   quality durability waterproof

## 1       6          4        yes

## 2       6          5        yes

## 3       5         NA        yes

## 4       6          6        yes

## 5      NA         10         no

## 6      10          9         no
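If we want all three tables combined in one step, one option is to fold merge() over a list of data frames with base R's Reduce(); a minimal sketch:

# Merge the tables pairwise from left to right, joining on id each time.
product_full <- Reduce(function(x, y) merge(x, y, by = "id"),
  list(product_info, product_stats, product_tests))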

Data wrangling with data.table

In the previous section, we had an overview of how built-in functions can be used to work with data frames. Built-in functions work, but they are usually verbose. In this section, we will use data.table, an enhanced version of data.frame, and see how it makes data manipulation much easier. Run install.packages("data.table") to install the package. Once the package is installed, we can load it and use fread() to read the data files as data.table objects.

library(data.table)

product_info <- fread("data/product-info.csv")

product_stats <- fread("data/product-stats.csv")

product_tests <- fread("data/product-tests.csv")

toy_tests <- fread("data/product-toy-tests.csv")

It is extremely easy to filter data in data.table. To select the first two rows, we just use [1:2]; note that for a data.frame, the same expression would instead select the first two columns (see the comparison after the output below).

product_info[1:2]

##     id     name type   class released

## 1: T01   SupCar  toy vehicle      yes

## 2: T02 SupPlane  toy vehicle       no
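For comparison, converting back to a plain data.frame shows the difference in behavior: the same [1:2] now selects the first two columns (id and name) rather than the first two rows.

as.data.frame(product_info)[1:2]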

To filter by logical conditions, we can type column names directly as variables, without quotation marks, because the expression is evaluated within the context of product_info:

product_info[type == "model" & class == "people"]

##     id   name  type  class released

## 1: M03 Runner model people      yes

## 2: M04 Dancer model people       no

It is easy to select or transform columns.

product_stats[, .(id, material, density = size / weight)]

##     id material   density

## 1: T01    Metal 12.000000

## 2: T02    Metal  7.777778

## 3: M01 Plastics        NA

## 4: M02 Plastics 28.333333

## 5: M03     Wood        NA

## 6: M04     Wood 26.666667

A data.table also supports subsetting by key, which can be much faster than filtering with ==. We can set a column as the key of each data.table:

setkey(product_info, id)

setkey(product_stats, id)

setkey(product_tests, id)

Then, we can use a value to directly select rows.

product_info["M02"]

##     id      name  type   class released

## 1: M02 AircraftX model vehicle      yes
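To get a feeling for the speed difference, here is a rough benchmark sketch on a made-up table of a million rows (the data and column names here are hypothetical): subsetting by key uses binary search, whereas == scans every row.

# Build a large table of artificial ids, then compare a vector scan
# with a binary search on the key (exact timings depend on your machine).
n <- 1e6
big <- data.table(key_id = sprintf("ID%07d", 1:n), value = rnorm(n))
system.time(big[key_id == "ID0500000"])  # vector scan over all rows
setkey(big, key_id)
system.time(big["ID0500000"])            # binary search via the key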

We can also set multiple columns as the key, so that a combination of values can be used for subsetting.

setkey(toy_tests, id, date)

toy_tests[.("T02", 20160303)]

##     id     date sample quality durability

## 1: T02 20160303     75       8          8

If two data.table objects share the same key, we can join them easily:

product_info[product_tests]

##     id      name  type   class released quality durability

## 1: M01     JeepX model vehicle      yes       6          4

## 2: M02 AircraftX model vehicle      yes       6          5

## 3: M03    Runner model  people      yes       5         NA

## 4: M04    Dancer model  people       no       6          6

## 5: T01    SupCar   toy vehicle      yes      NA         10

## 6: T02  SupPlane   toy vehicle       no      10          9

##    waterproof

## 1:        yes

## 2:        yes

## 3:        yes

## 4:        yes

## 5:         no

## 6:         no

Instead of creating a new data.table, in-place modification is also supported. The := operator sets the values of a column in place, without the overhead of making copies, and is thus much faster than using <-.

product_info[, released := (released == "yes")]

##     id      name  type   class released

## 1: M01     JeepX model vehicle     TRUE

## 2: M02 AircraftX model vehicle     TRUE

## 3: M03    Runner model  people     TRUE

## 4: M04    Dancer model  people    FALSE

## 5: T01    SupCar   toy vehicle     TRUE

## 6: T02  SupPlane   toy vehicle    FALSE

product_info

##     id      name  type   class released

## 1: M01     JeepX model vehicle     TRUE

## 2: M02 AircraftX model vehicle     TRUE

## 3: M03    Runner model  people     TRUE

## 4: M04    Dancer model  people    FALSE

## 5: T01    SupCar   toy vehicle     TRUE

## 6: T02  SupPlane   toy vehicle    FALSE
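Because := modifies the data.table by reference, the original object itself changes. If we want to keep it intact, we can apply := to an explicit copy instead; a minimal sketch, adding a hypothetical note column purely for illustration:

# copy() makes a full copy, so := on the copy leaves product_info untouched.
product_info_scratch <- copy(product_info)
product_info_scratch[, note := "draft"]  # "note" is a made-up column for illustration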

Another important argument when subsetting a data.table is by, which splits the data into groups and evaluates the second argument (j) for each group. The simplest usage of by is counting the records in each group. In the following code, we count the numbers of released and unreleased products:

product_info[, .N, by = released]

##    released N

## 1:     TRUE 4

## 2:    FALSE 2

The group can be defined by more than one variable. For example, a tuple of type and class can be a group, and for each group, we can count the number of records, as follows:

product_info[, .N, by = .(type, class)]

##     type   class N

## 1: model vehicle 2

## 2: model  people 2

## 3:   toy vehicle 2

We can also perform the following statistical calculations for each group:

product_tests[, .(mean_quality = mean(quality, na.rm = TRUE)),

  by = .(waterproof)]

##    waterproof mean_quality

## 1:        yes         5.75

## 2:         no        10.00

We can also chain multiple [] operations. In the following example, we first join product_info and product_tests by their shared key id and then calculate the mean quality and durability for each combination of type and class among the released products.

product_info[product_tests][released == TRUE,

  .(mean_quality = mean(quality, na.rm = TRUE),

    mean_durability = mean(durability, na.rm = TRUE)),

  by = .(type, class)]

##     type   class mean_quality mean_durability

## 1: model vehicle            6             4.5

## 2: model  people            5             NaN

## 3:   toy vehicle          NaN            10.0

Note that the values of the by columns are unique in the resulting data.table. We can use keyby instead of by to ensure that these columns are automatically set as the key of the result, which we verify with key() after the output below.

product_info[product_tests][released == TRUE,

  .(mean_quality = mean(quality, na.rm = TRUE),

    mean_durability = mean(durability, na.rm = TRUE)),

  keyby = .(type, class)]

##     type   class mean_quality mean_durability

## 1: model  people            5             NaN

## 2: model vehicle            6             4.5

## 3:   toy vehicle          NaN            10.0
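We can verify that the grouping columns have become the key of the result by calling data.table's key() function on it:

result <- product_info[product_tests][released == TRUE,
  .(mean_quality = mean(quality, na.rm = TRUE),
    mean_durability = mean(durability, na.rm = TRUE)),
  keyby = .(type, class)]
key(result)  # expected: "type" "class"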

The data.table package also provides functions for very fast reshaping of data. For example, we can use dcast() to spread the id values into columns and align the quality values against all possible date values in rows.

toy_quality <- dcast(toy_tests, date ~ id, value.var = "quality")

toy_quality

##        date T01 T02

## 1: 20160201   9   7

## 2: 20160302  10  NA

## 3: 20160303  NA   8

## 4: 20160403  NA   9

## 5: 20160405   9  NA

## 6: 20160502   9  10

Although a test is conducted for each product every month, the exact dates do not always match. This results in missing values whenever one product has a value on a given day but the other does not.

One way to fix this is to use the year and month instead of the exact date. In the following code, we will create a new ym column from the first six characters of the date column in toy_tests. For example, substr(20160101, 1, 6) results in "201601".

toy_tests[, ym := substr(date, 1, 6)]

##     id     date sample quality durability     ym

## 1: T01 20160201    100       9          9 201602

## 2: T01 20160302    150      10          9 201603

## 3: T01 20160405    180       9         10 201604

## 4: T01 20160502    140       9          9 201605

## 5: T02 20160201     70       7          9 201602

## 6: T02 20160303     75       8          8 201603

## 7: T02 20160403     90       9          8 201604

## 8: T02 20160502     85      10          9 201605

toy_tests$ym

## [1] "201602" "201603" "201604" "201605" "201602" "201603"

## [7] "201604" "201605"

This time, we will use ym for alignment instead of date:

toy_quality <- dcast(toy_tests, ym ~ id, value.var = "quality")

toy_quality

##        ym T01 T02

## 1: 201602   9   7

## 2: 201603  10   8

## 3: 201604   9   9

## 4: 201605   9  10

Now that the missing values are gone, the quality scores of both products in each month line up naturally.

Sometimes, we need to combine several columns into one column that indicates the measure and another that stores the corresponding value. For example, the following code uses melt() to combine the two measures (quality and durability) of the original data into a measure column and a value column.

toy_tests2 <- melt(toy_tests, id.vars = c("id", "ym"),

  measure.vars = c("quality", "durability"),

  variable.name = "measure")

toy_tests2

##      id     ym    measure value

##  1: T01 201602    quality     9

##  2: T01 201603    quality    10

##  3: T01 201604    quality     9

##  4: T01 201605    quality     9

##  5: T02 201602    quality     7

##  6: T02 201603    quality     8

##  7: T02 201604    quality     9

##  8: T02 201605    quality    10

##  9: T01 201602 durability     9

## 10: T01 201603 durability     9

## 11: T01 201604 durability    10

## 12: T01 201605 durability     9

## 13: T02 201602 durability     9

## 14: T02 201603 durability     8

## 15: T02 201604 durability     8

## 16: T02 201605 durability     9

The variable names are now contained in the data itself, which some packages can use directly. For example, we can plot data in this format with ggplot2. The following code produces a scatter plot with a facet grid over the combinations of id and measure.

library(ggplot2)

ggplot(toy_tests2, aes(x = ym, y = value)) +

  geom_point() +

  facet_grid(id ~ measure)

The generated graph is a scatter plot of value against ym, with a facet grid of id (rows) by measure (columns).

The plot can be manipulated easily because the grouping factor (measure) is stored in the data rather than spread across columns, which matches how ggplot2 expects data to be represented.

ggplot(toy_tests2, aes(x = ym, y = value, color = id)) +

  geom_point() +

  facet_grid(. ~ measure)

The generated graph is a scatter plot of value against ym, with points colored by id and a facet grid over measure.

Summary

In this article, we used both built-in functions and the data.table package to perform simple data manipulation tasks. Using built-in functions can be verbose, while data.table makes the same tasks easier and faster. However, real-world data analysis tasks can be much more complex than the examples demonstrated here, and they require stronger R programming skills. It is helpful to have a good understanding of how nonstandard evaluation makes data.table so easy to work with, how environments work, and how scoping rules apply to make your code predictable. A consistent understanding of how R works will give you the confidence to write R code for data and enable you to learn new packages quickly.
