In this blog post, you will follow along to produce a line chart using the ggplot2 package for R. The ggplot2 package is highly customizable and extensible, and its intuitive plotting syntax lets you create an incredibly diverse range of plots.
Before getting started, let’s compare ggplot2 with the base R plotting functions.
In general, the base R plotting system is more verbose and harder to understand and produces plots that are less attractive than their ggplot2 equivalents. To illustrate, let’s build a plot using data on the growth of five trees from the “datasets” package. This is just a demonstration, so don’t worry too much about the structure of the data or the details of the plotting syntax. Take a look at the following:
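As a quick way to peek at the data, the first few rows can be printed with head(). This is only a sketch for orientation; Orange is the dataset named later in the post, and the explicit data() call is optional because the datasets package is attached by default.

```r
# Orange ships with R in the datasets package
data("Orange", package = "datasets")

# Show the first six rows: one row per tree per measurement date
head(Orange)
```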
The goal is to plot the growth of the trees as a line chart where each line corresponds to a different tree over time. Consider the following code to produce this chart using the base R plotting system:
# Adapted from: http://www.statmethods.net/graphs/line.html ntrees
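For reference, a base R version along these lines might look like the sketch below. It is a reconstruction in the spirit of the adapted source, not the exact listing: the trees copy, the plotting symbols, and the "Tree Growth (base R)" title are assumptions made for illustration.

```r
# A sketch of a base R line chart for Orange (not the original listing).
# Work on a plain data.frame copy so the built-in dataset is left untouched.
trees <- as.data.frame(Orange)
trees$Tree <- as.numeric(as.character(trees$Tree))  # tree labels "1".."5" as numbers
ntrees <- max(trees$Tree)

# Axis ranges must be computed by hand
xrange <- range(trees$age)
yrange <- range(trees$circumference)

# Set up an empty plot with the right axes
plot(xrange, yrange, type = "n",
     xlab = "Age (days)", ylab = "Circumference (mm)")

colors <- rainbow(ntrees)

# Draw one line per tree inside a loop
for (i in seq_len(ntrees)) {
  tree <- trees[trees$Tree == i, ]
  lines(tree$age, tree$circumference, type = "b", lwd = 1.5,
        lty = i, col = colors[i], pch = 17 + i)
}

# Title and legend are added as separate steps (title text is an assumption)
title("Tree Growth (base R)")
legend(xrange[1], yrange[2], legend = seq_len(ntrees), cex = 0.8,
       col = colors, lty = seq_len(ntrees), pch = 17 + seq_len(ntrees),
       title = "Tree")
```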
The code is verbose, difficult to extend or change (for example, if you want to change the lines to points, you would need to change a number of variables), and the chart produced is not particularly attractive.
The following is an equivalent chart using ggplot2:
Using ggplot2, you can produce this plot with fewer lines of code that are both more readable and extensible. You will also avoid the ugly “for” loop used to produce the lines. By the end of this post, you will have built this plot from the ground up using ggplot2!
For this post, you will first need to make sure that ggplot2 is installed via the following command:
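A minimal sketch of the installation step follows. The requireNamespace() guard and the explicit CRAN mirror are additions for convenience, not part of the original post; a plain install.packages("ggplot2") call works just as well in an interactive session.

```r
# Install ggplot2 from CRAN only if it is not already available.
# The mirror URL is an assumption; any CRAN mirror will do.
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2", repos = "https://cloud.r-project.org")
}
```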
Once the package is installed, load it into the session using:
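Loading is a single call to library(), sketched here for completeness:

```r
# Attach ggplot2 so that ggplot(), aes(), geom_line() and friends are available
library(ggplot2)
```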
The dataset used in this post is already in the “tidy data” format, as described here. If your data is not in the tidy format, consider using the dplyr and/or tidyr packages to shape it into the correct format.
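As a hedged illustration of that reshaping step, suppose the measurements had arrived in a hypothetical wide layout with one column per tree. The wide data frame and its tree_1 and tree_2 column names are invented for this sketch; pivot_longer() from tidyr is one way to bring such data into a tidy shape with one row per observation.

```r
library(tidyr)

# Hypothetical wide data: one circumference column per tree
wide <- data.frame(
  age    = c(118, 484, 664),
  tree_1 = c(30, 58, 87),
  tree_2 = c(33, 69, 111)
)

# Gather the per-tree columns into a Tree / circumference pair
long <- pivot_longer(
  wide,
  cols = starts_with("tree_"),
  names_to = "Tree",
  names_prefix = "tree_",
  values_to = "circumference"
)

long
```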
This post uses a very small dataset called Orange, which, as the preceding plots show, contains the growth patterns of five trees over several years. The data consists of 35 rows and three columns and is found in the datasets package. The structure of the data is as follows:
str(Orange)
'data.frame': 35 obs. of 3 variables:
 $ Tree : Ord.factor w/ 5 levels "1"
You will now begin building up the previous plot using principles described in “The Grammar of Graphics”, upon which ggplot2 is based. To build a plot using ggplot2, think about it in terms of aesthetic mappings and geometries, which are used to create the layers that make up the plot. Calling ggplot() without any aesthetics or geometries defined provides an empty canvas.
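The empty canvas can be seen with a sketch like the following, where the variable name p is just a convenient placeholder:

```r
library(ggplot2)

# No data, no aesthetics, no geoms: printing this draws a blank panel
p <- ggplot()
p
```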
Aesthetics are the visual properties (for example, size, shape, color, fill, and so on) of the geometries present in the graph. In this context, a geometry refers to objects that directly represent data points (that is, rows in a data frame), such as dots, lines, or bars. In ggplot2, create aesthetics using the aes() function.
Inside aes(), you define which variables map to which aesthetics in the plot. Here, you map the “age” variable to the x-axis aesthetic, the “circumference” variable to the y-axis aesthetic, and the “Tree” factor variable to the color aesthetic, with each factor level represented by a different color, as follows:
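A sketch of that mapping, using the same argument names as the final listing at the end of the post (col is an accepted alias for the color aesthetic):

```r
library(ggplot2)

# Map age -> x, circumference -> y, Tree -> color; no geometry yet
p <- ggplot(data = Orange, aes(x = age, y = circumference, col = Tree))
p
```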
If you run the code after defining only the aesthetics, you will see that there is nothing on the plot except the axes:
This is because although you have mapped aesthetics to data, you have yet to represent these mappings with geometries (or geoms).
To create this representation, add a layer to the plot by calling the line geometry function, geom_line(), as follows:
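Adding the layer is a matter of chaining geom_line() onto the existing call with the + operator, as in this sketch:

```r
library(ggplot2)

# The geom inherits the aesthetic mappings defined in ggplot()
p <- ggplot(data = Orange, aes(x = age, y = circumference, col = Tree)) +
  geom_line()
p
```

Swapping geom_line() for geom_point() in the same call would draw points instead of lines, with no other changes needed.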
Take a look at the full listing of geoms that can be used here.
With the structure of the plot in place, polish the plot by:
- Editing the axis labels
- Adding a title
- Moving the legend
You can create/change the axis labels of the plot using labs(), as follows:
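A sketch of the labelling step, using the label text from the final listing:

```r
library(ggplot2)

# labs() overrides the default axis labels taken from the variable names
p <- ggplot(data = Orange, aes(x = age, y = circumference, col = Tree)) +
  geom_line() +
  labs(x = "Age (days)", y = "Circumference (mm)")
p
```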
You can also add a title using ggtitle(), as follows:
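Continuing the same chain, a sketch of the title step looks like this:

```r
library(ggplot2)

# ggtitle() adds the main title above the panel
p <- ggplot(data = Orange, aes(x = age, y = circumference, col = Tree)) +
  geom_line() +
  labs(x = "Age (days)", y = "Circumference (mm)") +
  ggtitle("Tree Growth (ggplot2)")
p
```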
Moving the legend
To move the legend, use the theme() function and set its legend.justification and legend.position arguments via the following code:
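A sketch of the legend placement, matching the values in the final listing, which anchor the top-left corner of the legend to the top-left corner of the plot:

```r
library(ggplot2)

# c(0, 1) means x = 0 (left) and y = 1 (top) for both the anchor point
# of the legend box and its position on the plot
p <- ggplot(data = Orange, aes(x = age, y = circumference, col = Tree)) +
  geom_line() +
  theme(legend.justification = c(0, 1), legend.position = c(0, 1))
p
```

Note that recent ggplot2 releases (3.5.0 and later) deprecate numeric values for legend.position; there, the same placement is written as legend.position = "inside" together with legend.position.inside = c(0, 1). The numeric form shown above still works but may emit a deprecation warning.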
The justification for the legend is laid out as a grid, where (0,0) is lower-left and (1,1) is upper-right. The legend.position parameter can also take values such as “top”, “bottom”, “left”, “right”, or “none” (which removes the legend entirely).
The theme() function is very powerful and allows very fine-grained control over the plot. You can find a listing of all the available parameters in the documentation here.
The plot is now identical to the plot used to motivate the article! The final code is as follows:
ggplot(data=Orange, aes(x=age, y=circumference, col=Tree)) +
  geom_line() +
  labs(x="Age (days)", y="Circumference (mm)") +
  ggtitle("Tree Growth (ggplot2)") +
  theme(legend.justification=c(0,1), legend.position=c(0,1))
Clearly, the code is more readable, and I think you would agree that the plot is more attractive than the equivalent plot using base R. Good luck and happy plotting!
About the author
Joel Carlson is a recent MSc graduate from Seoul National University and a current Data Science Fellow at Galvanize in San Francisco. He has contributed two R packages to CRAN (radiomics and RImagePalette). You can learn more about him or get in touch at his personal website.