In this post, we will learn about data visualization using ggplot2, an R package for data exploration and visualization. It produces polished graphics that are easy to interpret. ggplot2 is mainly used in exploratory analysis, and it is an important element of a data scientist’s toolkit. The ease with which complex graphs can be built is probably its most attractive feature, and it also lets you slice and dice data in many different ways. ggplot2 implements the ideas in A Layered Grammar of Graphics by Hadley Wickham, the package’s author and one of the most influential R developers.

Installation

Installing packages in R is very easy. Just type the following command at the R prompt.

install.packages("ggplot2")

Load the package in your R code.

library(ggplot2)

qplot

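We will start with qplot() (short for "quick plot"), the simplest plotting function in ggplot2. It is very similar to the generic plot() function in base R. We will learn how to draw basic statistical and exploratory plots with it. Note that ggplot2 3.4.0 and later mark qplot() as deprecated: it still runs, but it emits a warning, and ggplot() is the recommended interface.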

We will use the iris dataset, which ships with R in the datasets package (and with every other data mining package that I know of). It consists of measurements of four phenotypic traits (sepal length, sepal width, petal length, and petal width) for three species of iris. In R, iris is a data frame with 150 rows and 5 columns.

The head() function prints the first 6 rows of the data.

head(iris)
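
Before plotting, it can help to confirm the shape and column types described above. The following is a minimal sketch that uses only base R functions (dim(), str(), and table() are not part of ggplot2), and the variable name species_counts is just an illustrative choice:

```r
# iris ships with R in the datasets package, so nothing extra is installed.
data(iris)

# Number of rows and columns.
dim(iris)

# Column names and types: four numeric measurements and one factor.
str(iris)

# Number of observations per species.
species_counts <- table(iris$Species)
species_counts
```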

The general syntax for the qplot function is:

qplot(x,y, data=data.frame)
  • We will plot sepal length against petal length:
    qplot(Sepal.Length, Petal.Length, data = iris)

  • We can color the points by adding a color argument. Here, each point is colored by the species it belongs to.
    qplot(Sepal.Length, Petal.Length, data = iris, color = Species)

    An observant reader will notice that coloring by species shows how the observations cluster.

  • Also, we can change the size of each point by adding a size argument. Let the size correspond to the petal width.
    qplot(Sepal.Length, Petal.Length, data = iris, color = Species, size = Petal.Width)

    Thus we have a visualization for four-dimensional data.

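  • The alpha argument makes the points semi-transparent. Wrapping the value in I() tells qplot to use it as a fixed setting rather than map it as a variable.
    qplot(Sepal.Length, Petal.Length, data = iris, color = Species,
    size = Petal.Width, alpha = I(0.7))

    This makes overlapping points easier to tell apart and so reduces the effect of over-plotting.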

  • Label the graph using xlab, ylab, and main arguments.
    qplot(Sepal.Length, Petal.Length, data = iris, color = Species,
    xlab = "Sepal Length", ylab = "Petal Length",
    main = "Sepal vs. Petal Length in Fisher's Iris data")

    All the above graphs were scatterplots. We can use the geom argument to draw other types of graphs.

  • Histogram
    qplot(Sepal.Length, data = iris, geom = "histogram", binwidth = 0.2)

  • Line chart
    qplot(Sepal.Length, Petal.Length, data = iris, geom = "line",
    color = Species)
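
The same geom argument covers other summaries as well. As a hedged sketch, the example below assumes a box plot of sepal length per species, which is not one of the plots shown above. The variable name box is illustrative, and suppressWarnings() is there only because recent ggplot2 releases may emit a deprecation warning for qplot().

```r
library(ggplot2)

# Species on the x-axis (discrete), sepal length on the y-axis;
# geom = "boxplot" draws one box per species.
# suppressWarnings() hides the deprecation warning that newer
# ggplot2 versions may emit when qplot() is called.
box <- suppressWarnings(
  qplot(Species, Sepal.Length, data = iris, geom = "boxplot", fill = Species)
)
print(box)
```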
    

ggplot

Now we’ll move to the ggplot() function, which has a much broader range of graphing techniques. We’ll start with the basic plots similar to what we did with qplot().

First things first, load the library:

library(ggplot2)

As before, we will use the iris dataset.

For ggplot(), we generate aesthetic mappings that describe how variables in the data are mapped to visual properties. This is specified by the aes function.
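
The difference between mapping an aesthetic to a variable and setting it to a fixed value is easy to miss. The following is a minimal sketch in which the variable names (mapping, p_mapped, p_fixed) and the color "steelblue" are illustrative choices:

```r
library(ggplot2)

# aes() only records which columns feed which visual properties;
# nothing is drawn until a geom is added.
mapping <- aes(x = Sepal.Length, y = Petal.Length, color = Species)
names(mapping)

# Mapped: color varies with the Species column and a legend is drawn.
p_mapped <- ggplot(iris, mapping) + geom_point()

# Fixed: color is set outside aes(), so every point gets the same color.
p_fixed <- ggplot(iris, aes(x = Sepal.Length, y = Petal.Length)) +
  geom_point(color = "steelblue")

print(p_mapped)
print(p_fixed)
```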

  • Scatterplot
    ggplot(iris, aes(x = Sepal.Length, y = Petal.Length)) +
    geom_point()
    

    This is exactly what we got for qplot().

    The syntax is a bit unintuitive, but is very consistent. The basic structure is:

    ggplot(data.frame, aes(x=, y=, ...)) + geom_*(.) + ....
  • Add colors to the scatterplot.
    ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, color = Species)) +
    geom_point()
    

  • Other geoms

    We can use other geom_*() functions to create different types of graphs, for example, a line chart:

    ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, color=Species)) +
    geom_line() + ggtitle("Plot of sepal length vs. petal length")
    
  • Histogram
    ggplot(iris, aes(x = Sepal.Length)) +
    geom_histogram(binwidth = .2)
    

  • Histogram with color

    Use the fill aesthetic to color the bars by species:

    ggplot(iris, aes(x = Sepal.Length, fill=Species)) +
    geom_histogram(binwidth = .2)
    

     

  • The position argument fine-tunes how overlapping bars are placed. For example, position = "dodge" puts the bars for each species side by side instead of stacking them.
    ggplot(iris, aes(x = Sepal.Length, fill = Species)) +
    geom_histogram(binwidth = .2, position = "dodge")
    

     

  • Labeling
    ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, color=Species)) +
    geom_point() + ggtitle("Plot of sepal length vs. petal length")
    
  • Size and alpha: size is mapped to a variable inside aes(), while alpha is set to a constant inside geom_point().
    ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, color=Species, size=Petal.Width)) +
    geom_point(alpha=0.7) + ggtitle("Plot of sepal length vs. petal length")
    

  • We can also transform variables directly inside aes().
    ggplot(iris, aes(x = log(Sepal.Length), y = Petal.Length/Petal.Width, color=Species)) +
    geom_point()

     

  • ggplot allows slicing of data. Faceting splits the data by a variable and draws one panel per subset; here, facet_wrap() breaks the plot into one panel per species.
    ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, color=Species)) +
    geom_point() + facet_wrap(~Species) +
    ggtitle("Plot of sepal length vs. petal length")
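
Because every + returns a new plot object, the pieces shown in the list above can be stored and combined step by step. This is a sketch under that assumption: the variable names are illustrative, and labs() is used here as an alternative to ggtitle() that also sets the axis labels.

```r
library(ggplot2)

# Data and aesthetic mappings only; no layers yet.
base <- ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, color = Species))

# Add a point layer with a fixed transparency.
scatter <- base + geom_point(alpha = 0.7)

# Title and axis labels in one call.
labelled <- scatter +
  labs(title = "Plot of sepal length vs. petal length",
       x = "Sepal Length", y = "Petal Length")

# One panel per species.
faceted <- labelled + facet_wrap(~Species)
print(faceted)
```

Keeping the intermediate objects makes it easy to try a different geom or facet without retyping the data and mappings.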
    

     

Themes

We can use a whole range of themes for ggplot using the R package ggthemes.

install.packages('ggthemes', dependencies = TRUE)
library(ggthemes)

Essentially, you add a theme_*() call to the plot with +.

  • The Economist theme: For someone like me who reads The Economist regularly and might work there one day (ask for my resume if you know someone!), it is fun and useful to try to reproduce the style of the graphs they publish. We may not have their data, but we do have the theme.
    ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, color = Species)) +
    geom_point() + theme_economist()

  • Five Thirty Eight theme: Nate Silver is probably the most famous statistician (data scientist). FiveThirtyEight, the site he founded, is a must-read for any data scientist. The good folks at ggthemes have created a 538 theme as well.
    ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, color = Species)) +
    geom_point() + theme_fivethirtyeight()
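
ggplot2 also ships a few complete themes of its own, such as theme_bw() and theme_minimal(), and its theme() function adjusts individual elements. The following is a minimal sketch that does not require ggthemes; the choice of theme_minimal() and the bottom legend position are illustrative, not a recommendation.

```r
library(ggplot2)

p <- ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, color = Species)) +
  geom_point()

# Swap the default grey theme for a built-in minimal one,
# then move the legend below the plot.
styled <- p + theme_minimal() + theme(legend.position = "bottom")
print(styled)
```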

     

About the author

Janu Verma is a researcher in the IBM T.J. Watson Research Center, New York. His research interests are mathematics, machine learning, information visualization, computational biology and healthcare analytics. He has held research positions at Cornell University, Kansas State University, Tata Institute of Fundamental Research, Indian Institute of Science, and Indian Statistical Institute.  He has written papers for IEEE Vis, KDD, International Conference on HealthCare Informatics, Computer Graphics and Applications, Nature Genetics, IEEE Sensors Journals, and so on.  His current focus is on the development of visual analytics systems for prediction and understanding. He advises start-ups and companies on data science and machine learning in the Delhi-NCR area.
