In this article by Donato Teutonico, author of the book ggplot2 Essentials, we are going to explore different plotting environments in R and subsequently learn about the package, ggplot2. R provides a complete series of options available for realizing graphics, which make this software quite advanced concerning data visualization. The core of the graphics visualization in R is within the package grDevices, which provides the basic structure of data plotting, as for instance the colors and fonts used in the plots. Such graphic engine was then used as starting point in the development of more advanced and sophisticated packages for data visualization; the most commonly used being graphics and grid.
(For more resources related to this topic, see here.)
The graphics package is often referred to as the base or traditional graphics environment, since historically it was already available among the default packages delivered with the base installation of R and it provides functions that allow to the generation of complete plots.
The grid package developed by Paul Murrell, on the other side, provides an alternative set of graphics tools. This package does not provide directly functions that generate complete plots, so it is not frequently used directly for generating graphics, but it was used in the development of advanced data visualization packages. Among the grid-based packages, the most widely used are lattice and ggplot2, although they are built implementing different visualization approaches. In fact lattice was build implementing the Trellis plots, while ggplot2 was build implementing the grammar of graphics. A diagram representing the connections between the tools just mentioned is represented in the Figure 1.
Figure 1: Overview of the most widely used R packages for graphics
Just keep in mind that this is not a complete overview of the packages available, but simply a small snapshot on the main packages used for data visualization in R, since many other packages are built on top of the tools just mentioned. If you would like to get a more complete overview of the graphics tools available in R, you may have a look at the web page of the R project summarizing such tools, http://cran.r-project.org/web/views/Graphics.html.
ggplot2 and the Grammar of Graphics
The package ggplot2 was developed by Hadley Wickham by implementing a completely different approach to statistical plots. As in the case of lattice, this package is also based on grid, providing a series of high-level functions which allow the creation of complete plots. The ggplot2 package provides an interpretation and extension of the principles of the book The Grammar of Graphics by Leland Wilkinson. Briefly, the Grammar of Graphics assumes that a statistical graphic is a mapping of data to aesthetic attributes and geometric objects used to represent the data, like points, lines, bars, and so on. Together with the aesthetic attributes, the plot can also contain statistical transformation or grouping of the data. As in lattice, also in ggplot2 we have the possibility to split data by a certain variable obtaining a representation of each subset of data in an independent sub-plot; such representation in ggplot2 is called faceting.
In a more formal way, the main components of the grammar of graphics are: the data and their mapping, the aesthetic, the geometric objects, the statistical transformations, scales, coordinates and faceting. A more detailed description of these elements is provided along the book ggplot2 Essentials, but this is a summary of the general principles
- The data that must be visualized are mapped to aesthetic attributes which define how the data should be perceived
- The geometric objects describe what is actually represented on the plot like lines, points, or bars; the geometric objects basically define which kind of plot you are going to draw
- The statistical transformations are transformations which are applied to the data to group them; an example of statistical transformations would be, for instance, the smooth line or the regression lines of the previous examples or the binning of the histograms.
- Scales represent the connection between the aesthetic spaces with the actual values which should be represented. Scales maybe also be used to draw legends
- The coordinates represent the coordinate system in which the data are drawn
- The faceting, which we have already mentioned, is a grouping of data in subsets defined by a value of one variable
In ggplot2 there are two main high-level functions, capable of creating directly creating a plot, qplot() and ggplot(); qplot() stands for quick plot and it is a simple function with serve a similar purpose to the plot() function in graphics. The function ggplot() on the other side is a much more advanced function which allow the user to have a deep control of the plot layout and details. In this article we will see some examples of qplot() in order to provide you with a taste of the typical plots which can be realized with ggplot2, but for more advanced data visualization the function ggplot(), is much more flexible. If you have a look on the different forums of R programming, there is quite some discussion about which of these two functions would be more convenient to use. My general recommendation would be that it depends on the type of graph you are drawing more frequently. For simple and standard plot, where basically only the data should be represented and some minor modification of standard layout, the qplot() function will do the job. On the other side, if you would need to apply particular transformations to the data or simply if you would like to keep the freedom of controlling and defining the different details of the plot layout, I would recommend to focus in learning the code of ggplot().
In the code below you will see an example of plot realized with ggplot2 where you can identify some of the components of the grammar of graphics. The example is realized with the function ggplot() which allow a more direct comparison with the grammar, but just below you may also find the corresponding code for the use of qplot(). Both codes generate the graph depicted on Figure 2.
require(ggplot2) ## Load ggplot2 data(Orange) # Load the data ggplot(data=Orange, ## Data used aes(x=circumference,y=age, color=Tree))+ ##mapping to aesthetic geom_point()+ ##Add geometry (plot with data points) stat_smooth(method="lm",se=FALSE) ##Add statistics(linear regression) ### Corresponding code with qplot() qplot(circumference,age,data=Orange, ## Data used color=Tree, ## Aestetic mapping geom=c("point","smooth"),method="lm",se=FALSE)
This simple example can give you an idea of the role of each portion of code in a ggplot2 graph; you have seen how the main function body create the connection between the data and the aesthetic we are interested to represent and how, on top of this, you add the components of the plot like in this case the geometry element of points and the statistical element of regression. You can also notice how the components which need to be added to the main function call are included using the + sign. One more thing worth to mention at this point, is the if you run just the main body function in the ggplot() function, you will get an error message. This is because this call is not able to generate an actual plot. The step during which the plot is actually created is when you include the geometric attributes, in this case geom_point(). This is perfectly in line with the grammar of graphics, since as we have seen the geometry represent the actual connection between the data and what is represented on the plot. Is in fact at this stage that we specify we are interested in having points representing the data, before that nothing was specified yet about which plot we were interested in drawing.
Figure 2: Example of plot of Orange dataset with ggplot2
The qplot() function
The qplot (quick plot) function is a basic high level function of ggplot2. The general syntax that you should use with this function is the following
qplot(x, y, data, colour, shape, size, facets, geom, stat)
- x and y represent the variables to plot (y is optional with a default value NULL)
- data define the dataset containing the variables
- colour, shape and size are the aesthetic arguments that can be mapped on additional variables
- facets define the optional faceting of the plot based on one variable contained in the dataset
- geom allows you to select the actual visualization of the data, which basically will represent the plot which will be generated. Possible values are point, line or boxplot, but we will see several different examples in the next pages
- stat define the statistics to be used on the data
These options represents the most important options available in qplot(). You may find a descriptions of the other function arguments in the help page of the function accessible with ?qplot, or on the ggplot2 website under the following link http://docs.ggplot2.org/0.9.3/qplot.html.
Most of the options just discussed can be applied to different types of plots, since most of the concepts of the grammar of graphics, embedded in the code, may be translated from one plot to the other. For instance, you may use the argument colour to do an aesthetics mapping to one variable; these same concepts can in example be applied to scatterplots as well as histograms. Exactly the same principle would be applied to facets, which can be used for splitting plots independently on the type of plot considered.
Histograms and density plots
Histograms are plots used to explore how one (or more) quantitative variables are distributed. To show some examples of histograms we will use the iris data. This dataset contains measurements in centimetres of the variables sepal length and width and petal length and width for 50 flowers from each of three species of the flower iris: iris setosa, versicolor, and virginica. You may find more details running ?iris.
The geometric attribute used to produce histograms is simply by specifying geom=”histogram” in the qplot() function. This default histogram will represent the variable specified on the x axis while the y axis will represent the number of elements in each bin. One other very useful way of representing distributions is to look at the kernel density function, which will basically produce a sort of continuous histogram instead of different bins by estimating the probability density function.
For example let’s plot the petal length of all the three species of iris as histogram and density plot.
data(iris) ## Load data qplot(Petal.Length, data=iris, geom="histogram") ## Histogram qplot(Petal.Length, data=iris, geom="density") ## Density plot
The output of this code is showed in Figure 3.
Figure 3: Histogram (left) and density plot (right)
As you can see in both plots of Figure 3, it appears that the data are not distributed homogenously, but there are at least two distinct distribution clearly separated. This is very reasonably due to a different distribution for one of the iris species. To try to verify if the two distributions are indeed related to specie differences, we could generate the same plot using aesthetic attributes and have a different colour for each subtype of iris.
To do this, we can simply map the fill to the Species column in the dataset; also in this case we can do that for the histogram and the density plot too. Below you may see the code we built, and in Figure 4 the resulting output.
qplot(Petal.Length, data=iris, geom="histogram", colour=Species, fill=Species) qplot(Petal.Length, data=iris, geom="density", colour=Species, fill=Species)
Figure 4: Histogram (left) and density plot (right) with aesthetic attribute for colour and fill
In the distribution we can see that the lower data are coming from the Setosa species, while the two other distributions are partly overlapping.
Scatterplots are probably the most common plot, since they are usually used to display the relationship between two quantitative variables. When two variables are provided, ggplot2 will make a scatterplot by default.
For our example on how to build a scatterplot, we will use a dataset called ToothGrowth, which is available in the base R installation. In this dataset are reported measurements of teeth length of 10 guinea pig for three different doses of vitamin C (0.5, 1, and 2 mg) delivered in two different ways, as orange juice or as ascorbic acid (a compound having vitamin C activity). You can find, as usual, details on these data on the dataset help page at ?ToothGrowth.
We are interested in seeing how the length of the teeth changed for the different doses. We are not able to distinguish among the different guinea pigs, since this information is not contained in the data, so for the moment we will plot just all the data we have. So let’s load the dataset and do a basic plot of the dose vs. length.
require(ggplot2) data(ToothGrowth) qplot(dose, len, data=ToothGrowth, geom="point") ##Alternative coding qplot(dose, len, data=ToothGrowth)
The resulting plot is reproduced in Figure 5. As you have seen, the default plot generated, also without a geom argument, is the scatter plot, which is the default bivariate plot type. In this plot we may have an idea of the tendency the data have, for instance we see that the teeth length increase by increasing the amount of vitamin C intake. On the other side, we know that there are two different subgroups in our data, since the vitamin C was provided in two different ways, as orange juice or as ascorbic acid, so it could be interesting to check if these two groups behave differently.
Figure 5: Scatterplot of length vs. dose of ToothGrowth data
The first approach could be to have the data in two different colours. To do that we simply need to assign the colour attribute to the column sup in the data, which defines the way of vitamin intake. The resulting plot is in Figure 6.
qplot(dose, len,data=ToothGrowth, geom="point", col=supp)
We now can distinguish from which intake route come each data in the plot and it looks like the data from orange juice shown are a little higher compared to ascorbic acid, but to differentiate between them it is not really easy. We could then try with the facets, so that the data will be completely separated in two different sub-plots. So let´s see what happens.
Figure 6: Scatterplot of length vs. dose of ToothGrowth with data in different colours depending on vitamin intake.
qplot(dose, len,data=ToothGrowth, geom="point", facets=.~supp)
In this new plot, showed in Figure 7, we definitely have a better picture of the data, since we can see how the tooth growth differs for the different intakes.
As you have seen, in this simple example, you will find that the best visualization may be different depending on the data you have. In some cases grouping a variable with colours or dividing the data with faceting may give you a different idea about the data and their tendency. For instance in our case with the plot in Figure 7 we can see that the way how the tooth growth increase with dose seems to be different for the different intake routes.
Figure 7: Scatterplot of length vs. dose of ToothGrowth with faceting
One approach to see the general tendency of the data could be to include a smooth line to the graph. In this case in fact we can see that the growth in the case of the orange juice does not looks really linear, so a smooth line could be a nice way to catch this. In order to do that we simply add a smooth curve to the vector of geometry components in the qplot() function.
qplot(dose, len,data=ToothGrowth, geom=c("point","smooth"), facets=.~supp)
As you can see from the plot obtained (Figure 8) we now see not only clearly the different data thanks to the faceting, but we can also see the tendency of the data with respect to the dose administered. As you have seen, requiring for the smooth line in ggplot2 will also include a confidence interval in the plot. If you would like to not to have the confidence interval you may simply add the argument se=FALSE.
Figure 8: Scatterplot of length vs. dose of ToothGrowth with faceting and smooth line
In this short article we have seen some basic concept of ggplot2, ranging from the basic principles in comparison with the other R packages for graphics, up to some basic plots as for instance histograms, density plots or scatterplots. In this case we have limited our example to the use of qplot(), which enable us to obtain plots with some easy commands, but on the other side, in order to have a full control of plot appearance as well as data representation the function ggplot() will provide you with much more advanced functionalities. You can find a more detailed description of these functions as well as of the different features of ggplot2 together illustrated in various examples in the book ggplot2 Essentials.
Resources for Article:
- Data Analysis Using R [article]
- Data visualization [article]
- Using R for Statistics, Research, and Graphics [article]