
In this article by Simon Walkowiak, author of the book Big Data Analytics with R, we will have the opportunity to learn some of the most important R functions from the base R installation, as well as from well-known third-party packages used for data crunching, transformation, and analysis. More specifically, in this article you will:

  • Understand the landscape of available R data structures
  • Be guided through a number of R operations allowing you to import data from standard and proprietary data formats
  • Carry out essential data cleaning and processing activities such as subsetting, aggregating, creating contingency tables, and so on
  • Inspect the data by implementing a selection of Exploratory Data Analysis techniques such as descriptive statistics
  • Apply basic statistical methods to estimate correlation between two variables (Pearson’s r) or more (multiple regression), and to find differences between means for two groups (t-tests) or more (analysis of variance, ANOVA)
  • Be introduced to more advanced data modeling tasks like logistic and Poisson regressions


Learning R

This book assumes that you have been previously exposed to the R programming language, so this article serves more as a revision, and an overview, of the most essential operations, rather than a thorough handbook on R. The goal of this work is to present you with specific R applications related to Big Data, and the ways you can combine R with your existing Big Data analytics workflows, rather than to teach you the basics of data processing in R. There is a substantial number of great introductory and beginner-level books on R available at IT-specialized bookstores or online, directly from Packt Publishing and other respected publishers, as well as on the Amazon store. Some recommendations include the following:

  • R in Action: Data Analysis and Graphics with R by Robert Kabacoff (2015), 2nd edition, Manning Publications
  • R Cookbook by Paul Teetor (2011), O’Reilly
  • Discovering Statistics Using R by Andy Field, Jeremy Miles, and Zoe Field (2012), SAGE Publications
  • R for Data Science by Dan Toomey (2014), Packt Publishing

An alternative route to the acquisition of good practical R skills is through a large number of online resources, or more traditional tutor-led in-class training courses. The first option offers you an almost limitless choice of websites, blogs, and online guides. A good starting point is the main and previously mentioned Comprehensive R Archive Network (CRAN) page (https://cran.r-project.org/), which, apart from the R core software, contains several well-maintained manuals and Task Views—community-run indexes of R packages dealing with specific statistical or data management issues. R-bloggers, on the other hand (http://www.r-bloggers.com/), delivers regular news on R in the form of R-related blog posts or tutorials prepared by R enthusiasts and data scientists.

However, it is very likely that after some initial reading, and several months of playing with R, your most frequent destinations to seek further R-related information and obtain help on more complex use cases for specific functions will become StackOverflow (http://stackoverflow.com/) and, even better, StackExchange (http://stackexchange.com/). StackExchange is in fact a network of support and question-and-answer community-run websites, which address many problems related to statistical, mathematical, biological, and other methods or concepts, whereas StackOverflow, which is currently one of the sub-sites under the StackExchange label, focuses more on applied programming issues and provides users with coding hints and solutions in most (if not all) programming languages known to developers. Both tend to be very popular amongst R users, and as of late December 2015, there were almost 120,000 R-tagged questions asked on StackOverflow. The http://stackoverflow.com/tags/r/info page also contains numerous links and further references to free interactive R learning resources, online books and manuals, and many others.

Another good idea is to start your R adventure with user-friendly online training courses available through online-learning providers like Coursera (https://www.coursera.org), DataCamp (https://www.datacamp.com), edX (https://www.edx.org), or CodeSchool (https://www.codeschool.com). Of course, owing to the nature of such courses, successful acquisition of R skills is somewhat subjective; however, in recent years they have grown enormously in popularity, and they have also gained rather positive reviews from employers and recruiters alike. Online courses may then be very suitable, especially for those who, for various reasons, cannot attend a traditional university degree with R components, or who simply prefer to learn R at their own leisure or around their working hours.

Before we move on to the practical part, whichever strategy you are going to use to learn R, please do not be discouraged by the first difficulties. R, like any other programming language, or should I say, like any other language (including foreign languages), needs time, patience, long hours of practice, and a large number of varied exercises to let you explore many different dimensions and complexities of its syntax and rich libraries of functions. If you are still struggling with your R skills, however, I am sure the next section will get them off the ground.

Revisiting R basics

In the following section we will present a short revision of the most useful and frequently applied R functions and statements. We will start with a quick R and RStudio installation guide and then proceed to creating R data structures, data manipulation and transformation techniques, and basic methods used in Exploratory Data Analysis (EDA). Although the R code listed in this book has been tested extensively, as always in such cases, please make sure that your equipment is not faulty, and be aware that you run all the following scripts at your own risk.

Getting R and RStudio ready

Depending on your operating system (Mac OS X, Windows, or Linux) you can download and install specific base R files directly from https://cran.r-project.org/. If you prefer to use the RStudio IDE, you still need to install the R core from the CRAN website first, and then download and run the installer of the most recent version of RStudio IDE specific to your platform from https://www.rstudio.com/products/rstudio/download/.

Personally, I prefer to use RStudio, owing to its practical add-ons such as code highlighting and a more user-friendly GUI; however, there is no particular reason why you can’t use just the simple R core installation if you want to. Having said that, in this book we will be using RStudio in most of the examples.

All code snippets have been executed and run on a MacBook Pro laptop with the Mac OS X (Yosemite) operating system, a 2.3 GHz Intel Core i5 processor, a 1 TB solid-state drive, and 16 GB of RAM, but you should also be fine with a much weaker configuration. In this article we won’t be using any large data, and even in the remaining parts of this book the data sets used are limited to approximately 100 MB to 130 MB in size each. You are also provided with links and references to the full Big Data sets whenever possible.

If you would like to follow the practical parts of this book you are advised to download and unzip the R code and data for each article from the web page created for this book by Packt Publishing. If you use this book in PDF format it is not advisable to copy the code and paste it into the R console. When printed, some characters (like quotation marks ” “) may be encoded differently than in R and the execution of such commands may result in errors being returned by the R console.

Once you have downloaded both R core and RStudio installation files, follow the on-screen instructions for each installer. When you have finished installing them, open your RStudio software. Upon initialization of the RStudio you should see its GUI with a number of windows distributed on the screen. The largest one is the console in which you input and execute the code, line by line. You can also invoke the editor panel (it is recommended) by clicking on the white empty file icon in the top left corner of the RStudio software or alternatively by navigating to File | New File | R Script. If you have downloaded the R code from the book page of the Packt Publishing website, you may also just click on the Open an existing file (Ctrl + O) (a yellow open folder icon) and locate the downloaded R code on your computer’s hard drive (or navigate to File | Open File…).

Now your RStudio session is open and we can adjust some of the most essential settings. First, you need to set your working directory to the location on your hard drive where your data files are stored. If you know the specific location, you can just type the setwd() command with the full and exact path to the location of your data, as follows:

> setwd("/Users/simonwalkowiak/Desktop/data")

Of course your actual path will differ from mine, shown in the preceding code; however, please mind that if you copy the path from the Windows Explorer address bar you will need to change the backslashes to forward slashes / (or to double backslashes \\). Also, the path needs to be kept within the quotation marks “…”. Alternatively, you can set your working directory by navigating to Session | Set Working Directory | Choose Directory… to manually select the folder in which you store the data for this session.
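For instance, assuming a purely hypothetical Windows location of the data folder (the path below is made up for illustration), both of the following forms would be valid:

> setwd("C:/Users/simonwalkowiak/Desktop/data")
> setwd("C:\\Users\\simonwalkowiak\\Desktop\\data")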

Apart from the ones we have already described, there are other ways to set your working directory correctly. In fact most of the operations, and even more complex data analysis and processing activities, can be achieved in R in numerous ways. For obvious reasons, we won’t be presenting all of them, but we will just focus on the frequently used methods and some tips and hints applicable to special or difficult scenarios.

You can check whether your working directory has been set correctly by invoking the following line:

> getwd()
[1] "/Users/simonwalkowiak/Desktop/data"

As you can see, the getwd() function returned the correct destination of my previously defined working directory.

Setting the URLs to R repositories

It is always good practice to check whether your R repositories are set correctly. R repositories are servers located at various institutes and organizations around the world, which store recent updates and new versions of third-party R packages. It is recommended that you set the URL of your default repository to the CRAN server and choose a mirror that is located relatively close to you. To set the repositories you may use the following code:

> setRepositories(addURLs = c(CRAN = "https://cran.r-project.org/"))

You can check your current, or default, repository URLs by invoking the following function:

> getOption("repos")

The output will confirm your URL selection:              

                         CRAN
"https://cran.r-project.org/"

You will be able to choose specific mirrors when you install a new package for the first time during the session, or you may navigate to Tools | Global Options… | Packages. In the Package management section of the window you can alter the default CRAN mirror location—click on the Change… button to adjust it.

Once your repository URLs and working directory are set, you can go on to create data structures that are typical for R programming language.

R data structures

The concept of data structures in various programming languages is extremely important and cannot be overlooked. Similarly in R, the available data structures allow you to hold any type of data and use it for further processing and analysis. The kind of data structure you use puts certain constraints on how you can access and process the data stored in it, and on which manipulation techniques you can use. This section will briefly guide you through a number of basic data structures available in the R language.

Vectors

Whenever I teach statistical computing courses, I always start by introducing R learners to vectors as the first data structure they should get familiar with. Vectors are one-dimensional structures that can hold numeric, character, or logical data. In simple terms, a vector is a sequence of values of some sort (for example numeric, character, logical, and many more) of a specified length. The most important thing to remember is that an atomic vector may contain only one type of data.

Let’s then create a vector with 10 random deviates from a standard normal distribution, and store all its elements in an object which we will call vector1. In your RStudio console (or its editor) type the following:

> vector1 <- rnorm(10)

Let’s now see the contents of our newly created vector1:

> vector1
[1] -0.37758383 -2.30857701  2.97803059 -0.03848892  1.38250714 
[6] 0.13337065 -0.51647388 -0.81756661 0.75457226 -0.01954176

As we drew random values, your vector most likely contains different elements to the ones shown in the preceding example. Let’s then make sure that my new vector (vector2) is the same as yours. In order to do this we need to set a seed from which we will be drawing the values:

> set.seed(123)
> vector2 <- rnorm(10, mean=3, sd=2)
> vector2
[1] 1.8790487 2.5396450 6.1174166 3.1410168 3.2585755 6.4301300 
[7] 3.9218324 0.4698775 1.6262943 2.1086761

In the preceding code we’ve set the seed to an arbitrary number (123) in order to allow you to replicate the values of elements stored in vector2 and we’ve also used some optional parameters of the rnorm() function, which enabled us to specify two characteristics of our data, that is the arithmetic mean (set to 3) and standard deviation (set to 2). If you wish to inspect all available arguments of the rnorm() function, its default settings, and examples of how to use it in practice, type ?rnorm to view help and information on that specific function.

However, probably the most common way in which you will be creating a vector of data is by using the c() function (c stands for concatenate) and then explicitly passing the values of each element of the vector:

> vector3 <- c(6, 8, 7, 1, 2, 3, 9, 6, 7, 6)
> vector3
[1] 6 8 7 1 2 3 9 6 7 6

In the preceding example we’ve created vector3 with 10 numeric elements. You can use the length() function on any data structure to inspect the number of elements:

> length(vector3)
[1] 10

The class() and mode() functions allow you to determine how to handle the elements of vector3 and how the data are stored in vector3 respectively.

> class(vector3)
[1] "numeric"
> mode(vector3)
[1] "numeric"

The subtle difference between the two functions becomes clearer if we create a vector that holds the levels of a categorical variable (known as a factor in R) with character values:

> vector4 <- c("poor", "good", "good", "average", "average", "good", "poor", "good", "average", "good")
> vector4
[1] "poor"    "good"   "good"   "average" "average" "good"   "poor"    
[8] "good"   "average" "good"
> class(vector4)
[1] "character"
> mode(vector4)
[1] "character"
> levels(vector4)
NULL

In the preceding example, both the class() and mode() outputs of our character vector are the same, as we still haven’t set it to be treated as a categorical variable, and we haven’t defined its levels (the output of the levels() function is empty—NULL). In the following code we will explicitly set the vector to be recognized as categorical, with three levels:

> vector4 <- factor(vector4, levels = c("poor", "average", "good"))
> vector4
[1] poor    good    good    average average good    poor    good   
[8] average good   
Levels: poor average good

The sequence of levels doesn’t imply that our vector is ordered. We can order the levels of factors in R using the ordered() command. For example, you may want to arrange the levels of vector4 in reverse order, starting from “good”:

> vector4.ord <- ordered(vector4, levels = c("good", "average", "poor"))
> vector4.ord
[1] poor    good    good    average average good    poor    good   
[8] average good
Levels: good < average < poor

You can see from the output that R has now properly recognized the order of our levels, which we had defined. We can now apply class() and mode() functions on the vector4.ord object:

> class(vector4.ord)
[1] "ordered" "factor"
> mode(vector4.ord)
[1] "numeric"

You may very likely be wondering why the mode() function returned “numeric” type instead of “character”. The answer is simple. By setting the levels of our factor, R has assigned values 1, 2, and 3 to “good”, “average” and “poor” respectively, exactly in the same order as we had defined them in the ordered() function. You can check this using levels() and str() functions:

> levels(vector4.ord)
[1] "good"    "average" "poor"
> str(vector4.ord)
 Ord.factor w/ 3 levels "good"<"average"<..: 3 1 1 2 2 1 3 1 2 1

Just to finalize the subject of vectors, let’s create a logical vector, which contains only TRUE and FALSE values:

> vector5 <- c(TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE)
> vector5
[1]  TRUE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE

Similarly, for all the other vectors presented so far, feel free to check their structure, class, mode, and length using the appropriate functions shown in this section. What outputs did those commands return?
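For instance, for the logical vector5 created above, these checks should return outputs along the following lines:

> str(vector5)
 logi [1:10] TRUE FALSE TRUE FALSE FALSE FALSE ...
> class(vector5)
[1] "logical"
> mode(vector5)
[1] "logical"
> length(vector5)
[1] 10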

Scalars

The reason why I always start with vectors is that scalars seem trivial once they follow vectors. To simplify things even more, think of scalars as one-element vectors, which are traditionally used to hold some constant values, for example:

> a1 <- 5
> a1
[1] 5

Of course you may use scalars in computations, and also assign any one-element output of a mathematical or statistical operation to another, arbitrarily named scalar, for example:

> a2 <- 4
> a3 <- a1 + a2
> a3
[1] 9

In order to complete this short subsection on scalars, create two separate scalars which will hold a character and a logical value.
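One possible solution (the object names below are arbitrary) might look as follows:

> b1 <- "exercise"
> b2 <- TRUE
> class(b1)
[1] "character"
> class(b2)
[1] "logical"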

Matrices

A matrix is a two-dimensional R data structure in which each element must be of the same type—numeric, character, or logical. As matrices consist of rows and columns, their shape resembles a table. In fact, when creating a matrix, you can specify how you want to distribute values across its rows and columns, for example:

> y <- matrix(1:20, nrow=5, ncol=4)
> y
     [,1] [,2] [,3] [,4]
[1,]    1    6   11   16
[2,]    2    7   12   17
[3,]    3    8   13   18
[4,]    4    9   14   19
[5,]    5   10   15   20

In the preceding example we have allocated a sequence of 20 values (from 1 to 20) into five rows and four columns, and by default they have been distributed by column. We may now create another matrix in which we will distribute the values by rows and give names to rows and columns using the dimnames argument (dimnames stands for names of dimensions) in the matrix() function:

> rows <- c("R1", "R2", "R3", "R4", "R5")
> columns <- c("C1", "C2", "C3", "C4")
> z <- matrix(1:20, nrow=5, ncol=4, byrow=TRUE, dimnames=list(rows, columns))
> z
   C1 C2 C3 C4
R1  1  2  3  4
R2  5  6  7  8
R3  9 10 11 12
R4 13 14 15 16
R5 17 18 19 20

As we are talking about matrices, it’s hard not to mention how to extract specific elements stored in a matrix. This skill will turn out to be very useful when we get to subsetting real data sets. Looking at the matrix y, for which we didn’t define any row or column names, notice how R denotes them. The rows come in the format [r, ], where r is the consecutive number of a row, whereas the columns are identified by [, c], where c is the consecutive number of a column. If you then wished to extract the value stored in the fourth row of the second column of our matrix y, you could use the following code to do so:

> y[4,2]
[1] 9

In case you wanted to extract the whole column number three from our matrix y, you could type the following:

> y[,3]
[1] 11 12 13 14 15

As you can see, we don’t even need to leave an empty space before the comma for this short script to work. Let’s now imagine you would like to extract the three values stored in the second, third, and fifth rows of the first column of our matrix z with named rows and columns. In this case, you may still use the previously shown numeric notation; you do not need to refer explicitly to the dimension names of matrix z. Additionally, notice that to extract several values we have to specify their row locations as a vector—hence we put their row coordinates inside the c() function, which we previously used to create vectors:

> z[c(2, 3, 5), 1]
R2 R3 R5 
 5  9 17

Similar rules of extracting data will apply to other data structures in R such as arrays, lists, and data frames, which we are going to present next.

Arrays

Arrays are very similar to matrices, with one exception: they may contain more than two dimensions. However, just like matrices or vectors, they may only hold one type of data. In the R language, arrays are created using the array() function:

> array1 <- array(1:20, dim=c(2,2,5))
> array1
, , 1

     [,1] [,2]
[1,]    1    3
[2,]    2    4

, , 2

     [,1] [,2]
[1,]    5    7
[2,]    6    8

, , 3

     [,1] [,2]
[1,]    9   11
[2,]   10   12

, , 4

     [,1] [,2]
[1,]   13   15
[2,]   14   16

, , 5

     [,1] [,2]
[1,]   17   19
[2,]   18   20

The dim argument, which was used within the array() function, specifies how you want to distribute your data across the dimensions. As we had 20 values (from 1 to 20) we had to make sure that our array could hold all 20 elements, therefore we decided to arrange them into two rows, two columns, and five layers (2 x 2 x 5 = 20). You can check the dimensionality of your multi-dimensional R objects with the dim() command:

> dim(array1)
[1] 2 2 5 

As with matrices, you can use standard rules for extracting specific elements from your arrays. The only difference is that now you have additional dimensions to take care of. Let’s assume you would want to extract a specific value located in the second row of the first column in the fourth dimension of our array1:

> array1[2, 1, 4]
[1] 14

Also, if you need to find a location of a specific value, for example 11, within the array, you can simply type the following line:

> which(array1==11, arr.ind=TRUE)
     dim1 dim2 dim3
[1,]    1    2    3

Here, the which() function returns indices of the array (arr.ind=TRUE), where the sought value equals 11 (hence ==). As we had only one instance of value 11 in our array, there is only one row specifying its location in the output. If we had more instances of 11, additional rows would be returned indicating indices for each element equal to 11.
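For example, searching for all elements of array1 that leave a remainder of 1 when divided by 10 (that is, the values 1 and 11) returns two rows of indices, one per matching element:

> which(array1 %% 10 == 1, arr.ind=TRUE)
     dim1 dim2 dim3
[1,]    1    1    1
[2,]    1    2    3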

Data frames

The following two short subsections concern two of probably the most widely used R data structures. Data frames are very similar to matrices, but they may contain different types of data. Here you might have suddenly thought of a typical rectangular data set with rows and columns or observations and variables. In fact you are correct. Most of the data sets are indeed imported into R as data frames. You can also create a simple data frame manually with the data.frame() function, but as each column in the data frame may be of a different type, we must first create vectors which will hold data for specific columns:

> subjectID <- c(1:10)
> age <- c(37,23,42,25,22,25,48,19,22,38)
> gender <- c("male", "male", "male", "male", "male", "female", "female", "female", "female", "female")
> lifesat <- c(9,7,8,10,4,10,8,7,8,9)
> health <- c("good", "average", "average", "good", "poor", "average", "good", "poor", "average", "good")
> paid <- c(T, F, F, T, T, T, F, F, F, T)
> dataset <- data.frame(subjectID, age, gender, lifesat, health, paid)
> dataset
   subjectID age gender lifesat  health  paid
1          1  37   male       9    good  TRUE
2          2  23   male       7 average FALSE
3          3  42   male       8 average FALSE
4          4  25   male      10    good  TRUE
5          5  22   male       4    poor  TRUE
6          6  25 female      10 average  TRUE
7          7  48 female       8    good FALSE
8          8  19 female       7    poor FALSE
9          9  22 female       8 average FALSE
10        10  38 female       9    good  TRUE

The preceding example presents a simple data frame which contains some dummy imaginary data, possibly a sample from a basic psychological experiment, which measured subjects’ life satisfaction (lifesat) and their health status (health) and also collected other socio-demographic information such as age and gender, and whether the participant was a paid subject or a volunteer. As we deal with various types of data, the elements for each column had to be amalgamated into a single structure of a data frame using the data.frame() command, and specifying the names of objects (vectors) in which we stored all values. You can inspect the structure of this data frame with the previously mentioned str() function:

> str(dataset)
'data.frame':	10 obs. of  6 variables:
 $ subjectID: int  1 2 3 4 5 6 7 8 9 10
 $ age      : num  37 23 42 25 22 25 48 19 22 38
 $ gender   : Factor w/ 2 levels "female","male": 2 2 2 2 2 1 1 1 1 1
 $ lifesat  : num  9 7 8 10 4 10 8 7 8 9
 $ health   : Factor w/ 3 levels "average","good",..: 2 1 1 2 3 1 2 3 1 2
 $ paid     : logi  TRUE FALSE FALSE TRUE TRUE TRUE ...

The output of str() gives you some basic insights into the shape and format of your data in the dataset object, for example, number of observations and variables, names of variables, types of data they hold, and examples of values for each variable.

While discussing data frames, it may also be useful to introduce you to another way of creating subsets. As presented earlier, you may apply standard extraction rules to subset data of your interest. For example, suppose you want to print only those columns which contain age, gender, and life satisfaction information from our dataset data frame. You may use the following two alternatives (the output not shown to save space, but feel free to run it):

> dataset[,2:4] #or
> dataset[, c("age", "gender", "lifesat")]

Both lines of code produce exactly the same results. The subset() function, however, gives you the additional capability of defining conditional statements that filter the data based on the output of logical operators. You can replicate the preceding output using subset() in the following way:

> subset(dataset, select = c("age", "gender", "lifesat"))

Assume now that you want to create a subset with all subjects who are over 30 years old, and with a score of greater than or equal to eight on the life satisfaction scale (lifesat). The subset() function comes very handy:

> subset(dataset, age > 30 & lifesat >= 8)
   subjectID age gender lifesat  health  paid
1          1  37   male       9    good  TRUE
3          3  42   male       8 average FALSE
7          7  48 female       8    good FALSE
10        10  38 female       9    good  TRUE

Or you want to produce an output with two socio-demographic variables of age and gender, of only these subjects who were paid to participate in this experiment:

> subset(dataset, paid==TRUE, select=c("age", "gender"))
   age gender
1   37   male
4   25   male
5   22   male
6   25 female
10  38 female

We will perform much more thorough and complex data transformations on real data frames in the second part of this article.

Lists

A list in R is a data structure that is a collection of other objects. For example, in a list you can store vectors, scalars, matrices, arrays, data frames, and even other lists. In fact, lists in R are vectors, but they differ from the atomic vectors introduced earlier in this section in that lists can hold many different types of data. In the following example, we will construct a simple list (using the list() function) which will include a variety of other data structures:

> simple.vector1 <- c(1, 29, 21, 3, 4, 55)
> simple.matrix <- matrix(1:24, nrow=4, ncol=6, byrow=TRUE)
> simple.scalar1 <- 5
> simple.scalar2 <- "The List"
> simple.vector2 <- c("easy", "moderate", "difficult")
> simple.list <- list(name=simple.scalar2, matrix=simple.matrix, vector=simple.vector1, scalar=simple.scalar1, difficulty=simple.vector2)
> simple.list
$name
[1] "The List"

$matrix
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    1    2    3    4    5    6
[2,]    7    8    9   10   11   12
[3,]   13   14   15   16   17   18
[4,]   19   20   21   22   23   24

$vector
[1]  1 29 21  3  4 55

$scalar
[1] 5

$difficulty
[1] "easy"      "moderate"  "difficult"
> str(simple.list)
List of 5
 $ name      : chr "The List"
 $ matrix    : int [1:4, 1:6] 1 7 13 19 2 8 14 20 3 9 ...
 $ vector    : num [1:6] 1 29 21 3 4 55
 $ scalar    : num 5
 $ difficulty: chr [1:3] "easy" "moderate" "difficult"

Looking at the preceding output, you can see that we have assigned names to each component in our list and the str() function prints them as if they were variables of a standard rectangular data set.

In order to extract specific elements from a list, you first need to use the double square bracket notation [[x]] to identify component x within the list. For example, assuming you want to print the element stored in the first row and third column of the second component, you may use the following line in R:

> simple.list[[2]][1,3]
[1] 3

Owing to their flexibility, lists are commonly used as preferred data structures in the outputs of statistical functions. It is then important for you to know how you can deal with lists and what sort of methods you can apply to extract and process data stored in them.
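As the components of simple.list are named, you can also extract them with the $ operator, or by passing a component name to the double square brackets, for example:

> simple.list$matrix[1, 3]
[1] 3
> simple.list[["difficulty"]][2]
[1] "moderate"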

Once you are familiar with the basic features of data structures available in R, you may wish to visit Hadley Wickham’s online book at http://adv-r.had.co.nz/ in which he explains various more advanced concepts related to each native data structure in R language, and different techniques of subsetting data, depending on the way they are stored.

Exporting R data objects

In the previous section we created numerous objects, which you can inspect in the Environment tab window in RStudio. Alternatively, you may use the ls() function to list all objects stored in your global environment:

> ls()

If you’ve followed the article along and run the script for this book line by line, the output of the ls() function should return 27 objects:

 [1] "a1"             "a2"             "a3"            
 [4] "age"            "array1"         "columns"       
 [7] "dataset"        "gender"         "health"        
[10] "lifesat"        "paid"           "rows"          
[13] "simple.list"    "simple.matrix"  "simple.scalar1"
[16] "simple.scalar2" "simple.vector1" "simple.vector2"
[19] "subjectID"      "vector1"        "vector2"       
[22] "vector3"        "vector4"        "vector4.ord"   
[25] "vector5"        "y"              "z"

In this section we will present various methods of saving the created objects to your local drive and exporting their contents to a number of the most commonly used file formats.

Sometimes, for various reasons, it may happen that you need to leave your project and exit RStudio or shut your PC down. If you do not save your created objects, you will lose all of them the moment you close RStudio. Remember that R stores created data objects in the RAM of your machine, and when your session ends without the workspace being saved, these objects are freed from memory, which simply means that they get deleted. Of course this might turn out to be quite costly, especially if you had not saved your original R script, which would have enabled you to replicate all the steps of your data processing activities in a new R session. In order to prevent the objects from being deleted, you can save all, or selected ones, as .RData files on your hard drive. In the first case, you may use the save.image() function, which saves your whole current workspace, with all objects, to your current working directory:

> save.image(file = "workspace.RData")

If you are dealing with large objects, first make sure you have enough storage space available on your drive (this is normally not a problem any longer), or alternatively you can reduce the size of the saved objects using one of the compression methods available. For example, the above workspace.RData file was 3,751 bytes in size without compression, but when xz compression was applied the size of the resulting file decreased to 3,568 bytes.

> save.image(file = "workspace2.RData", compress = "xz")

Of course, the difference in sizes in the presented example is minuscule, as we are dealing with very small objects, however it gets much more significant for bigger data structures. The trade-off of applying one of the compression methods is the time it takes for R to save and load .RData files.
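You can verify both sides of this trade-off yourself: file.info() reports the size of each saved file, and system.time() measures how long a compressed save takes (the file names below match the earlier examples):

> file.info("workspace.RData")$size
> file.info("workspace2.RData")$size
> system.time(save.image(file = "workspace2.RData", compress = "xz"))

For small workspaces the timing differences will be negligible, but for multi-gigabyte objects the elapsed time reported by system.time() can grow substantially with the stronger compression methods.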

If you prefer to save only chosen objects (for example dataset data frame and simple.list list) you can achieve this with the save() function:

> save(dataset, simple.list, file = "two_objects.RData")

You may now test whether the above solutions worked by cleaning your global environment of all objects, and then loading one of the created files, for example:

> rm(list=ls())
> load("workspace2.RData")

As an additional exercise, feel free to explore other functions which allow you to write text representations of R objects, for example dump() or dput(). More specifically, run the following commands and compare the returned outputs:

> dump(ls(), file = "dump.R", append = FALSE)
> dput(dataset, file = "dput.txt")

The save.image() and save() functions only create images of your workspace or selected objects on the hard drive. It is an entirely different story if you want to export some of the objects to data files of specified formats, for example, comma-separated, tab-delimited, or proprietary formats like Microsoft Excel, SPSS, or Stata.

The easiest way to export R objects to generic file formats like CSV, TXT, or TAB is through the cat() function, but it only works on atomic vectors:

> cat(age, file="age.txt", sep=",", fill=TRUE, labels=NULL, append=TRUE)
> cat(age, file="age.csv", sep=",", fill=TRUE, labels=NULL, append=TRUE)

The preceding code creates two files, one as a text file and another one in comma-separated format, both of which contain the values from the age vector that we had previously created for the dataset data frame. The sep argument is a character vector of strings to append after each element; the fill option is a logical argument that controls whether the output is automatically broken into lines (if set to TRUE); the labels parameter allows you to add a character vector of labels for each printed line of data in the file; and the append logical argument enables you to append the output of the call to an already existing file of the same name instead of overwriting it.

In order to export vectors and matrices to TXT, CSV, or TAB formats you can use the write() function, which writes out a matrix or a vector in a specified number of columns for example:

> write(age, file="agedata.csv", ncolumns=2, append=TRUE, sep=",")
> write(y, file="matrix_y.tab", ncolumns=2, append=FALSE, sep="\t")

Another method of exporting matrices is provided by the MASS package (make sure to install it with the install.packages("MASS") function) through the write.matrix() command:

> library(MASS)
> write.matrix(y, file="ymatrix.txt", sep=",")

For large matrices, the write.matrix() function allows users to specify the size of blocks in which the data are written through the blocksize argument.
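As a minimal sketch of this option (the bigmatrix object below is hypothetical, created here purely for illustration), a large matrix could be written out in blocks of 1,000 rows at a time:

> bigmatrix <- matrix(rnorm(1000000), ncol = 10)
> write.matrix(bigmatrix, file = "bigmatrix.txt", sep = ",", blocksize = 1000)

Writing in blocks keeps memory usage down for very large matrices, although note that when blocksize is supplied each block is formatted separately, so column widths may vary from block to block in the output file.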

Probably the most common R data structure that you are going to export to different file formats will be a data frame. The generic write.table() function gives you an option to save your processed data frame objects to standard data formats for example TAB, TXT, or CSV:

> write.table(dataset, file="dataset1.txt", append=TRUE, sep=",", na="NA", col.names=TRUE, row.names=FALSE, dec=".")

The append and sep arguments should already be clear to you, as they were explained earlier. In the na option you may specify an arbitrary string to use for missing values in the data. The logical parameter col.names allows users to include the column names in the output file, and the dec parameter sets the string used for decimal points, which must be a single character. In the example, we set row.names to FALSE, as the names of the rows in the data are the same as the values of the subjectID column. However, it is very likely that in other data sets the ID variable may differ from the names (or numbers) of rows, so you may want to control it depending on the characteristics of your data.

Two similar functions write.csv() and write.csv2() are just convenience wrappers for saving CSV files, and they only differ from the generic write.table() function by default settings of some of their parameters, for example sep and dec. Feel free to explore these subtle differences at your leisure.
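To illustrate the main difference (the file names below are arbitrary): write.csv() writes a comma-separated file with a period as the decimal point, whereas write.csv2() writes a semicolon-separated file with a comma as the decimal point, a convention common in many European locales:

> write.csv(dataset, file = "dataset_en.csv", row.names = FALSE)
> write.csv2(dataset, file = "dataset_eu.csv", row.names = FALSE)

Note that, unlike write.table(), these wrappers do not let you override sep or dec, which is precisely what makes them convenient and consistent.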

To complete this section of the article we need to present how to export your R data frames to third-party formats. Amongst several frequently used methods, at least four of them are worth mentioning here. First, if you wish to write a data frame to a proprietary Microsoft Excel format, such as XLS or XLSX, you should probably use the WriteXLS package (please use install.packages("WriteXLS") if you have not done so yet) and its WriteXLS() function:

> library(WriteXLS)
> WriteXLS("dataset", "dataset1.xlsx", SheetNames=NULL, 
row.names=FALSE, col.names=TRUE, AdjWidth=TRUE, 
envir=parent.frame())

The WriteXLS() command offers users a number of interesting options, for instance you can set the names of the worksheets (SheetNames argument), adjust the widths of columns depending on the number of characters of the longest value (AdjWidth), or even freeze rows and columns just as you do it in Excel (FreezeRow and FreezeCol parameters).
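For instance, a call that names the worksheet and freezes the header row might look as follows (the sheet name here is an arbitrary choice; FreezeRow = 1 keeps the first row visible while scrolling):

> WriteXLS("dataset", "dataset2.xlsx", SheetNames = "Survey data",
    AdjWidth = TRUE, FreezeRow = 1)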

Please note that in order for the WriteXLS package to work, you need to have Perl installed on your machine. The package creates Excel files using Perl scripts called WriteXLS.pl for Excel 2003 (XLS) files, and WriteXLSX.pl for Excel 2007 and later version (XLSX) files. If Perl is not present on your system, please make sure to download and install it from https://www.perl.org/get.html. After the Perl installation, you may have to restart your R session and load the WriteXLS package again to apply the changes. For solutions to common Perl issues please visit the following websites: https://www.perl.org/docs.html, http://www.ahinea.com/en/tech/perl-unicode-struggle.html, and http://www.perl.com/pub/2012/04/perlunicook-standard-preamble.html, or search StackOverflow and similar websites for specific problems related to R and Perl.

Another very useful way of writing R objects to the XLSX format is provided by the openxlsx package through the write.xlsx() function, which, apart from data frames, also allows lists to be easily written to Excel spreadsheets. Please note that Windows users may need to install the Rtools toolchain in order to use openxlsx functionalities. The write.xlsx() function gives you a large choice of possible options to set, including a custom style to apply to column names (through the headerStyle argument), the color of cell borders (borderColour), or even their line style (borderStyle). The following example utilizes only the most common and minimal arguments required to write a list to an XLSX file, but you are encouraged to explore the other options offered by this very flexible function:

> library(openxlsx)
> write.xlsx(simple.list, file = "simple_list.xlsx")
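As a sketch of the styling options mentioned above (the specific colors and style values here are arbitrary choices), a bold, shaded header row with thin borders drawn around all cells could be requested as follows:

> library(openxlsx)
> header <- createStyle(textDecoration = "bold", fgFill = "#DCE6F1")
> write.xlsx(dataset, file = "dataset_styled.xlsx", headerStyle = header,
    borders = "all", borderColour = "grey40", borderStyle = "thin")

The createStyle() function builds a reusable style object, which write.xlsx() then applies to the column-name row of the output worksheet.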

The foreign package (one of the recommended packages distributed with R) makes it possible to write data frames to formats used by well-known statistical tools such as SPSS, Stata, or SAS. When creating files, the write.foreign() function requires users to specify the names of both the data and code files. Data files hold raw data, whereas code files contain scripts with the data structure and metadata (value and variable labels, variable formats, and so on) written in the proprietary syntax. In the following example, the code writes the dataset data frame to the SPSS format:

> library(foreign)
> write.foreign(dataset, "datafile.txt", "codefile.txt", package="SPSS")

Finally, another package called rio contains only three functions, allowing users to quickly import(), export(), and convert() data between a large array of file formats (for example TSV, CSV, RDS, RData, JSON, DTA, SAV, and many more). The package, in fact, depends on a number of other R libraries, some of which, for example foreign and openxlsx, have already been presented in this article. The rio package does not introduce any new functionalities apart from the default arguments characteristic of the underlying export functions, so you still need to be familiar with the original functions and their parameters if you require more advanced exporting capabilities. But if you are only looking for a no-fuss general export function, the rio package is definitely a good shortcut to take:

> export(dataset, format = "stata")
> export(dataset, "dataset1.csv", col.names = TRUE, na = "NA")
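The convert() function is equally terse: it simply re-saves a file from one supported format to another, inferring both formats from the file extensions. For example (assuming the dataset1.csv file created earlier in this section), a CSV file could be turned into a Stata data file in a single call:

> library(rio)
> convert("dataset1.csv", "dataset1.dta")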

Summary

In this article, we have provided you with quite a bit of theory, and hopefully a lot of practical examples of the data structures available to R users. You've created several objects of different types, and you've become familiar with the variety of data and file formats R has to offer. We then showed you how to save R objects held in your R workspace to external files on your hard drive, and how to export them to various standard and proprietary file formats.
