
In this article by Jen Stirrup, the author of the book Advanced Analytics with R and Tableau, we will cover, with examples, the core essentials of R programming: variables, and data structures such as matrices, factors, vectors, and data frames. We will also focus on control mechanisms in R (relational operators, logical operators, conditional statements, loops, functions, and apply) and how to execute these commands in R, so that you can get to grips with them before proceeding to articles that rely heavily on these concepts for scripting complex analytical operations.


Core essentials of R programming

One of the reasons for R’s success is its use of variables. Variables are used in all aspects of R programming. For example, variables can hold data, strings to access a database, whole models, queries, and test results. Variables are a key part of the modeling process, and their selection has a fundamental impact on the usefulness of the models. Therefore, variables are an important place to start since they are at the heart of R programming.

Variables

In the following sections, we will look at how to create variables and how to work with them.

Creating variables

It is very simple to create variables in R, and to save values in them. To create a variable, you simply need to give the variable a name, and assign a value to it.

In many other languages, such as SQL, it’s necessary to specify the type of value that the variable will hold. So, for example, if the variable is designed to hold an integer or a string, then this is specified at the point at which the variable is created.

R, by contrast, does not require that you specify the type of the variable before it is created. Instead, R works out the type for itself, by looking at the data that is assigned to the variable.

In R, we assign values to variables using the assignment operator, which is a less than sign (<) followed by a hyphen (-). Put together, the assignment operator looks like so: <-
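As a minimal sketch of assignment in practice, the following creates three variables and lets R work out the type of each. The variable names are purely illustrative and do not come from the book.

```r
# Illustrative variable names; none of these appear in the original text.
life_expectancy <- 71.5        # a number
country <- "Central Europe"    # a character string
is_recent <- TRUE              # a logical value

# class() reports the type that R inferred from the assigned value.
class(life_expectancy)   # "numeric"
class(country)           # "character"
class(is_recent)         # "logical"
```

No type was declared anywhere: each class was inferred from the value on the right-hand side of the assignment.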

Working with variables

It is important to understand what is contained in the variables. It is easy to list the variables that currently exist using the ls command. If you need more details of the variables, then the ls.str command will also show you the structure of each one.

If you need to remove variables, then you can use the rm function.
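The three commands can be tried together as follows. This is only a sketch: the variables year and country are assumptions made for illustration, and the output of ls() will also include anything else already in your workspace.

```r
# Two illustrative variables (names are assumptions, not from the book).
year <- 2013
country <- "Central Europe"

ls()        # lists the names of the variables in the workspace
ls.str()    # lists each variable together with its type and value

rm(year)    # removes the variable called year

"year" %in% ls()      # FALSE: year no longer exists
"country" %in% ls()   # TRUE: country is still there
```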

Data structures in R

The power of R resides in its ability to analyze data, and this ability is largely derived from its powerful data types. Fundamentally, R is a vectorized programming language. Vectors are the foundation from which the other data structures in R are constructed, which means that R’s operations are optimized to work with vectors.

Vector

The vector is a core component of R. It is a fundamental data type. Essentially, a vector is an ordered sequence of values that are all of the same type. For example, they could all be strings, or all numbers. Note that vectors cannot contain mixed data types: if you combine values of different types, R converts them to a common type.

R uses the c() function to take a series of items and turn them into a vector.
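The following sketch shows c() building vectors, and what happens when the items are of mixed types. The variable names are illustrative assumptions.

```r
# A numeric vector and a character vector (illustrative names).
life_expectancy <- c(71, 72, 76)
countries <- c("Arab World", "Caribbean States", "Central Europe")

# Mixed inputs are converted to a single common type, here character.
mixed <- c(2013, "Arab World", TRUE)
class(mixed)          # "character"

# Operations apply to every element of a vector at once.
life_expectancy + 1   # 72 73 77

# Square brackets select elements by position, counting from 1.
life_expectancy[2]    # 72
```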

Lists

R contains two types of lists: a basic list, and a named list. A basic list is created using the list() function. In a named list, every item in the list has a name as well as a value. Named lists are a good mapping structure to help map data between R and Tableau. In R, the items of a named list are accessed using the $ operator. Note, however, that the item names are case sensitive.
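A short sketch of both kinds of list follows. The item names Year, Country, and LifeExpectancy are chosen for illustration only.

```r
# A basic list: items are identified by position only.
basic <- list(2013, "Arab World", 71)
basic[[2]]                  # "Arab World"

# A named list: every item has a name as well as a value.
named <- list(Year = 2013, Country = "Arab World", LifeExpectancy = 71)
named$Country               # "Arab World"
named[["LifeExpectancy"]]   # 71

# Names are case sensitive, so a lowercase 'country' finds nothing.
is.null(named$country)      # TRUE
```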

Matrices

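Matrices are two-dimensional structures that have rows and columns. It can help to picture a matrix as a stack of rows, although R actually stores it as a single vector with row and column dimensions attached. It’s important to note that every cell in a matrix has the same type.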
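As a minimal sketch, a matrix can be created with the matrix() function. The variable name m is an assumption made for illustration.

```r
# Build a 2 x 3 matrix from the numbers 1 to 6.
# By default, matrix() fills the cells column by column.
m <- matrix(1:6, nrow = 2, ncol = 3)
m

dim(m)      # 2 3 (rows, then columns)
m[2, 3]     # 6: the cell in row 2, column 3
m[1, ]      # 1 3 5: the whole first row
```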

Factors

A factor holds the values of a categorical variable. Each value is chosen from a specified set of possible values, known as levels, which are stored as strings. Factors are sometimes known as categorical variables. In dimensional modeling terminology, a factor is equivalent to a dimension, and the levels represent different attributes of the dimension. Note that factors are variables that can only contain a limited number of different values.
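The relationship between a factor and its levels can be sketched as follows. The data and the name sizes are invented for illustration.

```r
# Four observations drawn from three possible levels (illustrative data).
sizes <- factor(c("small", "large", "small", "medium"),
                levels = c("small", "medium", "large"))

levels(sizes)       # "small" "medium" "large"
nlevels(sizes)      # 3
table(sizes)        # counts the observations in each level

# Internally, each value is stored as the position of its level.
as.integer(sizes)   # 1 3 1 2
```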

Data frames

The data frame is the main data structure in R. It’s possible to envisage the data frame as a table of data, with rows and columns. Unlike a matrix, the data frame can contain a different type of data in each column. In R, we use the data.frame() command in order to create a data frame.

The data frame is extremely flexible for working with structured data, and it can ingest data from many different data sources. There are two main ways to ingest data into data frames. The first is to use one of the many data connectors, which connect to data sources such as databases. The second is to use a command such as read.table(), which reads in tabular data, for example from a text file.
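As a hedged sketch of read.table(), the example below reads a small table from a string by using the function's text argument, so that it does not depend on any file being present. The data and the names raw_text and df_from_text are assumptions made for illustration; with a real file you would pass the file name instead.

```r
# A tiny whitespace-separated table held in a string (illustrative data).
# Single quotes keep the two-word country names together.
raw_text <- "Year Country LifeExpectancy
2013 'Arab World' 71
2013 'Central Europe' 76"

# header = TRUE tells read.table() that the first line holds column names.
df_from_text <- read.table(text = raw_text, header = TRUE,
                           stringsAsFactors = FALSE)

str(df_from_text)   # shows the type that R chose for each column
```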

Data Frame Structure

Here is an example of a populated data frame. There are three columns and three rows. The top of the data frame is the header. Each horizontal line afterwards holds a data row, which starts with the name of the row, followed by the data itself. Each data member of a row is called a cell.

Example Data Frame Structure

df <- data.frame(
  Year = c(2013, 2013, 2013),
  Country = c("Arab World", "Caribbean States", "Central Europe"),
  LifeExpectancy = c(71, 72, 76))

As always, we should read back at least some of the data frame so we can double-check that it was set correctly. The data frame was assigned to the df variable, so we can print the contents by simply typing in the variable name at the command prompt:

df

To obtain the data held in a cell, we enter the row and column co-ordinates of the cell, and surround them with square brackets []. In this example, if we wanted to obtain the value of the second cell in the second row, which is in the Country column, then we would use the following:

df[2, "Country"]
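The same square-bracket notation extends to whole rows, whole columns, and filtered rows. The sketch below rebuilds the example data frame so that it can be run on its own.

```r
# Rebuild the example data frame so this block is self-contained.
df <- data.frame(
  Year = c(2013, 2013, 2013),
  Country = c("Arab World", "Caribbean States", "Central Europe"),
  LifeExpectancy = c(71, 72, 76))

df[2, "Country"]      # a single cell: row 2, Country column
df[2, ]               # the whole second row
df$LifeExpectancy     # a whole column, returned as a vector

# A logical condition in the row position keeps only the matching rows.
df[df$LifeExpectancy > 71, ]
```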

We can also compute summary statistics for our data frame. For example, if we use the following command:

summary(df)

Then we obtain the summary statistics of the data. The example output is as follows:

You’ll notice that the summary command has summarized each of the columns differently. It has treated Year as a number, and produced the min, quartiles, mean, and max for it, even though these statistics are not meaningful for a year. The Country column has simply been listed, because it does not contain any numeric values. LifeExpectancy is summarized correctly.

We can change the Year column to a factor, using the following command:

df$Year <- as.factor(df$Year)

Then, we can rerun the summary command:

summary(df)

On this occasion, the summary command treats Year as a category and reports a count for each value, which is the result that we expect:
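The effect of the conversion can be checked directly with class() and levels(). This sketch uses a cut-down copy of the example data frame, named df_copy here purely for illustration, so that it does not alter df.

```r
# A cut-down copy of the example data (df_copy is an illustrative name).
df_copy <- data.frame(Year = c(2013, 2013, 2013),
                      LifeExpectancy = c(71, 72, 76))

class(df_copy$Year)     # "numeric" before the conversion

df_copy$Year <- as.factor(df_copy$Year)

class(df_copy$Year)     # "factor" after the conversion
levels(df_copy$Year)    # "2013": the only distinct year in the data

# To get the original numbers back, convert to character first.
as.numeric(as.character(df_copy$Year))   # 2013 2013 2013
```

Converting through as.character() matters because calling as.numeric() directly on a factor returns the internal level positions rather than the original values.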

As we proceed throughout this book, we will be building on more useful features that will help us to analyze data using data structures, and visualize the data in interesting ways using R.

Control structures in R

R has the appearance of a procedural programming language. However, it is built on another language, known as S. S leans towards functional programming. It also has some object-oriented characteristics. This means that there are many complexities in the way that R works.

In this section, we will look at some of the fundamental building blocks that make up key control structures in R, and then we will move on to looping and vectorized operations.

Logical operators

Relational and logical operators allow values to be compared, and the results of comparisons to be combined:

Operator     Description
<            less than
<=           less than or equal to
>            greater than
>=           greater than or equal to
==           exactly equal to
!=           not equal to
!x           NOT x
x | y        x OR y
x & y        x AND y
isTRUE(x)    test if x is TRUE
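The operators in the table can be tried out on a small vector. In this sketch, x is an illustrative variable; note that the comparisons are applied to each element in turn.

```r
x <- c(1, 5, 10)   # illustrative data

x < 5                  # TRUE FALSE FALSE
x >= 5                 # FALSE TRUE TRUE
x != 5                 # TRUE FALSE TRUE

(x > 1) & (x < 10)     # FALSE TRUE FALSE
(x == 1) | (x == 10)   # TRUE FALSE TRUE
!(x > 1)               # TRUE FALSE FALSE

# isTRUE() is only TRUE for a single TRUE value, never for a longer vector.
isTRUE(x[1] == 1)      # TRUE
isTRUE(x == 1)         # FALSE
```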

For loops and vectorization in R

Specifically, we will look at the constructs involved in loops. Note, however, that it is often more efficient to use vectorized operations rather than loops, because R is vector-based. We investigate loops here because they are a good first step in understanding how R works, and then we can build on this understanding by focusing on vectorized alternatives that are more efficient.

More information about control flow can be obtained by executing the following command at the command line:

?Control

The control flow commands choose between alternative actions and repeat actions. The main looping constructs are for, while, and repeat.
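Before turning to for loops in detail, here is a minimal sketch of the other two constructs, while and repeat. The counters i and total are illustrative names.

```r
# while: keep looping for as long as the condition is TRUE.
i <- 1
while (i <= 3) {
  print(i)
  i <- i + 1
}

# repeat: loop indefinitely until break is called explicitly.
total <- 0
repeat {
  total <- total + 10
  if (total >= 30) {
    break
  }
}
total   # 30
```

A repeat loop has no condition of its own, so it needs an if statement with break somewhere inside it in order to stop.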

For loops

Let’s look at a for loop in more detail. For this exercise, we will use the Fisher iris dataset, which is installed along with R by default. We are going to produce summary statistics for each species of iris in the dataset.

You can see some of the iris data by typing in the following command at the command prompt:

head(iris)

We can divide the iris dataset so that the data is split by species. To do this, we use the split command, and we assign it to the variable called IrisBySpecies:

IrisBySpecies <- split(iris,iris$Species)

Now, we can use a for loop to process the data and summarize it by species.

Firstly, we will set up a variable called output, and set it to an empty list. For each species held in the IrisBySpecies variable, we calculate the minimum, maximum, and mean petal length, along with the total number of cases, and store them in output as a one-row data frame. The rows are then combined into a data frame called output.df, which is printed out to the screen:

output <- list()

for (n in names(IrisBySpecies)) {
  # The rows of the iris data for the current species
  ListData <- IrisBySpecies[[n]]
  output[[n]] <- data.frame(species = n,
                            MinPetalLength = min(ListData$Petal.Length),
                            MaxPetalLength = max(ListData$Petal.Length),
                            MeanPetalLength = mean(ListData$Petal.Length),
                            NumberofSamples = nrow(ListData))
}

# Combine the per-species rows once, after the loop has finished
output.df <- do.call(rbind, output)

print(output.df)

The output is as follows:

We used a for loop here, but loops can be expensive in terms of processing. We can achieve the same end by using a function called tapply, which processes data in groups. The tapply function takes three main arguments: the vector of data, the factor that defines the groups, and a function. It works by splitting the data into groups, and then applying the function to each of the groups. Then, it returns a vector with one result per group. We can see an example of tapply here, using the same dataset:

output <- data.frame(
  MinPetalLength = tapply(iris$Petal.Length, iris$Species, min),
  MaxPetalLength = tapply(iris$Petal.Length, iris$Species, max),
  MeanPetalLength = tapply(iris$Petal.Length, iris$Species, mean),
  NumberofSamples = tapply(iris$Petal.Length, iris$Species, length))

print(output)

This time, we get the same summary values as previously, although the species now appear as row names rather than in a separate column. The main difference is that, by using tapply, we have more concise code that avoids an explicit loop.
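As a quick check that the two approaches agree, the following sketch recomputes just the mean petal length both ways and compares the results. The names by_species, loop_means, and tapply_means are illustrative and do not appear in the original text.

```r
by_species <- split(iris, iris$Species)

# Loop version: one mean per species, stored under the species name.
loop_means <- numeric(0)
for (n in names(by_species)) {
  loop_means[n] <- mean(by_species[[n]]$Petal.Length)
}

# tapply version: the same calculation in a single call.
tapply_means <- tapply(iris$Petal.Length, iris$Species, mean)

# Compare the values only, ignoring names and other attributes.
all.equal(unname(loop_means), as.vector(tapply_means))   # TRUE
```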

To summarize, R is extremely flexible and it’s possible to achieve the same objective in a number of different ways. As we move forward through this book, we will make recommendations about the optimal method to select, and the reasons for the recommendation.

Functions

R has many functions that are included as part of the installation.

In the first instance, let’s look to see how we can work smart by finding out what functions are available by default.

In our last example, we used the split() function. To find out more about the split function, we can simply use the following command:

?split

Or we can use:

help(split)

It’s possible to get an overview of the arguments required for a function. To do this, simply use the args command:

args(split)

Fortunately, it’s also possible to see examples of each function by using the following command:

example(split)

If you need more information than the documented help file about each function, you can use the following command. It will go and search through all the documentation for instances of the keyword:

help.search("split")

If you want to search the R project site from within RStudio, you can use the RSiteSearch command. For example:

RSiteSearch("split")

Summary

In this article, we have looked at various essential structures in working with R. We have looked at the data structures that are fundamental to using R optimally. We have also taken the view that structures such as for loops can often be done better as vectorized operations. Finally, we have looked at how to find out more about the functions that R provides in order to simplify code.
