Humans are visual creatures: we have evolved to grasp meaning quickly when information is presented in ways that turn on the light bulb of insight. With the right tools, this “aha” often arrives far more quickly from a picture than from tedious numerical analysis.

Tools for data analysis, such as pandas, let you take data, process it, and visualize its meaning quickly and iteratively. Often, much of what you will do with pandas is massaging your data so that it can be shown in one or more visual patterns, in an attempt to get to “aha” by simply glancing at the visual representation of the information.

In this article by Michael Heydt, author of the book Learning pandas, we will cover common patterns in visualizing data with pandas. It is not meant to be exhaustive in coverage. The goal is to give you the required knowledge to create beautiful data visualizations on pandas data quickly and with very few lines of code.

This article is presented in three sections. The first introduces you to the general concepts of programming visualizations with pandas, emphasizing the process of creating time-series charts. We will also dive into techniques for labeling axes and creating legends, colors, line styles, and markers.

The second part of the article will then focus on the many types of data visualizations commonly used in pandas programs and data science, including:

  • Bar plots
  • Histograms
  • Box and whisker charts
  • Area plots
  • Scatter plots
  • Density plots
  • Scatter plot matrices
  • Heatmaps

The final section will briefly look at creating composite plots by dividing plots into subparts and drawing multiple plots within a single graphical canvas.

Setting up the IPython notebook

The first step in plotting pandas data is to import the appropriate libraries, primarily matplotlib. The examples in this article are all based on the following imports, where the plotting capabilities come from matplotlib.pyplot, which is aliased as plt:

In [1]:
# import pandas, numpy and datetime
import numpy as np
import pandas as pd
# needed for representing dates and times
# (a separate "import datetime" would be shadowed by this import)
from datetime import datetime
# Set some pandas options for controlling output
pd.set_option('display.notebook_repr_html', False)
pd.set_option('display.max_columns', 10)
pd.set_option('display.max_rows', 10)
# used for seeding random number sequences
seedval = 111111
# matplotlib
import matplotlib as mpl
# matplotlib plotting functions
import matplotlib.pyplot as plt
# we want our plots inline
%matplotlib inline

The %matplotlib inline line is the statement that tells matplotlib to produce inline graphics. This will make the resulting graphs appear either inside your IPython notebook or IPython session.

All examples will seed the random number generator with 111111, so that the graphs remain the same every time they run.

Plotting basics with pandas

The pandas library itself performs data manipulation. It does not provide data visualization capabilities itself. The visualization of data in pandas data structures is handed off by pandas to other robust visualization libraries that are part of the Python ecosystem, most commonly, matplotlib, which is what we will use in this article.

All of the visualizations and techniques covered in this article can be performed without pandas. These techniques are all available independently in matplotlib. pandas tightly integrates with matplotlib, and by doing this, it is very simple to go directly from pandas data to a matplotlib visualization without having to work with intermediate forms of data.

pandas does not draw the graphs, but it will tell matplotlib how to draw graphs using pandas data, taking care of many details on your behalf, such as automatically selecting Series for plots, labeling axes, creating legends, and defaulting color. Therefore, you often have to write very little code to create stunning visualizations.
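
To make this division of labor concrete, here is a minimal sketch. The sample values and the variable names are ours, chosen only for illustration. It shows that the object handed back by .plot() is an ordinary matplotlib Axes, so anything matplotlib can do to an Axes is still available after pandas has drawn into it:

```python
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.axes import Axes

# a tiny, made-up series used only for this illustration
s = pd.Series([1.0, 3.0, 2.0, 5.0, 4.0])

# create a matplotlib figure and Axes, then let pandas draw into it
fig, ax = plt.subplots()
returned = s.plot(ax=ax)

# pandas hands back the very same matplotlib Axes object
same_axes = returned is ax
is_matplotlib_axes = isinstance(returned, Axes)

# the line pandas drew is a regular matplotlib Line2D holding our data
drawn_values = list(ax.get_lines()[0].get_ydata())
print(same_axes, is_matplotlib_axes, drawn_values)
```

Because the result is a plain Axes, the matplotlib calls used later in this article (legends, ticks, formatters) can be applied to it directly.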

Creating time-series charts with .plot()

One of the most common data visualizations is the time-series chart. Visualizing a time series in pandas is as simple as calling .plot() on a DataFrame or Series object. To demonstrate, the following creates a time series representing a random walk of values over time, akin to the movements in the price of a stock:

In [2]:
# generate a random walk time-series
np.random.seed(seedval)
s = pd.Series(np.random.randn(1096),
index=pd.date_range('2012-01-01',
'2014-12-31'))
walk_ts = s.cumsum()
# this plots the walk - just that easy :)
walk_ts.plot();

The ; character at the end suppresses the generation of an IPython out tag, as well as the trace information.

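It is a common practice to execute the following statement to produce plots that have a richer visual style, with a shaded background and what is considered a slightly more pleasing look. When this article was written, this was done by setting a pandas option; that option has since been removed from pandas, so on recent versions a matplotlib style sheet is used instead:

In [3]:
# tells plots to use a style which has a background fill
# older pandas versions used the following option, since removed:
# pd.options.display.mpl_style = 'default'
# on recent versions, a matplotlib style sheet gives a similar look
plt.style.use('ggplot')
walk_ts.plot();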

The .plot() method on pandas objects is a wrapper around matplotlib's plot() function. It makes plots of pandas data very easy to create. It is coded to know how to use the data in the pandas objects to create the appropriate plots for the data, handling many of the details of plot generation, such as selecting series, labeling, and axes generation. In this situation, the .plot() method determines that because the Series has dates for its index, the x axis should be formatted as dates, and it selects a default color for the data.

This example used a single series, and the result would be the same using a DataFrame with a single column. As an example, the following produces the same graph with one small difference: it adds a legend. Charts generated from a DataFrame object have a legend by default, even if there is only one series of data:

In [4]:
# a DataFrame with a single column will produce
# the same plot as plotting the Series it is created from
walk_df = pd.DataFrame(walk_ts)
walk_df.plot();

The .plot() function is smart enough to know when a DataFrame has multiple columns. In that case, it creates multiple lines/series in the plot, includes a key for each in the legend, and selects a distinct color for each line. This is demonstrated with the following example:

In [5]:
# generate two random walks, one in each of
# two columns in a DataFrame
np.random.seed(seedval)
df = pd.DataFrame(np.random.randn(1096, 2),
index=walk_ts.index, columns=list('AB'))
walk_df = df.cumsum()
walk_df.head()
Out [5]:
A B
2012-01-01 -1.878324 1.362367
2012-01-02 -2.804186 1.427261
2012-01-03 -3.241758 3.165368
2012-01-04 -2.750550 3.332685
2012-01-05 -1.620667 2.930017
In [6]:
# plot the DataFrame, which will plot a line
# for each column, with a legend
walk_df.plot();

If you want to use one column of DataFrame as the labels on the x axis of the plot instead of the index labels, you can use the x and y parameters to the .plot() method, giving the x parameter the name of the column to use as the x axis and y parameter the names of the columns to be used as data in the plot. The following recreates the random walks as columns ‘A’ and ‘B’, creates a column ‘C’ with sequential values starting with 0, and uses these values as the x axis labels and the ‘A’ and ‘B’ columns values as the two plotted lines:

In [7]:
# copy the walk
df2 = walk_df.copy()
# add a column C which is 0 .. 1095
df2['C'] = pd.Series(np.arange(0, len(df2)), index=df2.index)
# instead of dates on the x axis, use the 'C' column,
# which will label the axis with 0..1000
df2.plot(x='C', y=['A', 'B']);

The .plot() functions, provided by pandas for the Series and DataFrame objects, take care of most of the details of generating plots. However, if you want to modify characteristics of the generated plots beyond their capabilities, you can directly use the matplotlib functions or one or more of the many optional parameters of the .plot() method.

Adorning and styling your time-series plot

The built-in .plot() method has many options that you can use to change the content in the plot. We will cover several of the common options used in most plots.

Adding a title and changing axes labels

The title of the chart can be set using the title parameter of the .plot() method. Axes labels are not set with .plot(), but by directly using the plt.ylabel() and plt.xlabel() functions after calling .plot():

In [8]:
# create a time-series chart with a title and specific
# x and y axes labels
# the title is set in the .plot() method as a parameter
walk_df.plot(title='Title of the Chart')
# explicitly set the x and y axes labels after the .plot()
plt.xlabel('Time')
plt.ylabel('Money');

The labels in this plot were added after the call to .plot(). A question that may be asked is: if the plot is generated in the call to .plot(), then how are they changed on the plot?

The answer is that plots in matplotlib are not displayed until either .show() is called or the code reaches the end of its execution and returns to the interactive prompt. At either of these points, any plot generated by plot commands will be flushed out to the display. In this example, although .plot() is called, the plot is not rendered until the IPython notebook cell finishes executing, so the changes to the labels and title are included in the plot.
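
The following sketch illustrates the same point outside the notebook. It uses made-up data (not the walk_df object from above) and our own variable names. The labels are applied to the Axes after .plot() has already returned, and they are simply stored on the Axes until the figure is rendered:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# two short random walks, made up for this illustration
steps = np.random.RandomState(111111).randn(50, 2)
demo_df = pd.DataFrame(steps.cumsum(axis=0), columns=['A', 'B'])

fig, ax = plt.subplots()
demo_df.plot(ax=ax, title='Title of the Chart')

# .plot() has returned, but nothing is on screen yet; the Axes
# object can still be modified before the figure is rendered
ax.set_xlabel('Time')
ax.set_ylabel('Money')

print(ax.get_title(), '|', ax.get_xlabel(), '|', ax.get_ylabel())
```

Using the Axes methods (ax.set_xlabel() and ax.set_ylabel()) instead of plt.xlabel() and plt.ylabel() makes it explicit which plot is being changed, which helps once several plots exist at the same time.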

Specifying the legend content and position

To change the text used in the legend (the default is the column name from DataFrame), you can use the ax object returned from the .plot() method to modify the text using its .legend() method. The ax object is an AxesSubplot object, which is a representation of the elements of the plot, that can be used to change various aspects of the plot before it is generated:

In [9]:
# change the legend items to be different
# from the names of the columns in the DataFrame
ax = walk_df.plot(title='Title of the Chart')
# this sets the legend labels
ax.legend(['1', '2']);

The location of the legend can be set using the loc parameter of the .legend() method. By default, pandas sets the location to ‘best’, which tells matplotlib to examine the data and determine the best place to put the legend. However, you can also specify any of the following to position the legend more specifically (you can use either the string or the numeric code):

Text              Code
‘best’            0
‘upper right’     1
‘upper left’      2
‘lower left’      3
‘lower right’     4
‘right’           5
‘center left’     6
‘center right’    7
‘lower center’    8
‘upper center’    9
‘center’          10

In our last chart, the ‘best’ option actually had the legend overlap the line from one of the series. We can reposition the legend in the upper center of the chart, which will prevent this and create a better chart of this data:

In [10]:
# change the position of the legend
ax = walk_df.plot(title='Title of the Chart')
# put the legend in the upper center of the chart
ax.legend(['1', '2'], loc='upper center');

Legends can also be turned off with the legend parameter:

In [11]:
# omit the legend by using legend=False
walk_df.plot(title='Title of the Chart', legend=False);

There are more possibilities for locating and actually controlling the content of the legend, but we leave that for you to do some more experimentation.
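
As a starting point for that experimentation, here is a small sketch of a few further legend options. The data and the label text are made up for illustration; title, ncol, and frameon are standard keyword arguments of matplotlib's legend() method:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# two short random walks, made up for this illustration
steps = np.random.RandomState(111111).randn(50, 2)
demo_df = pd.DataFrame(steps.cumsum(axis=0), columns=['A', 'B'])

fig, ax = plt.subplots()
demo_df.plot(ax=ax)

# give the legend a title, lay the entries out in two columns,
# and draw it without a surrounding frame
legend = ax.legend(['Walk 1', 'Walk 2'], loc='lower right',
                   title='Series', ncol=2, frameon=False)

legend_texts = [text.get_text() for text in legend.get_texts()]
legend_title = legend.get_title().get_text()
print(legend_title, legend_texts)
```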

Specifying line colors, styles, thickness, and markers

pandas automatically sets the color of each series on any chart. If you would like to specify your own color, you can do so by supplying a style code to the style parameter of the plot function. There are a number of built-in single-character codes for colors (they come from matplotlib), several of which are listed here:

  • b: Blue
  • g: Green
  • r: Red
  • c: Cyan
  • m: Magenta
  • y: Yellow
  • k: Black
  • w: White

It is also possible to specify the color using a hexadecimal RGB code of the #RRGGBB format. To demonstrate both options, the following example sets the color of the first series to green using a single-character code and the second series to red using the hexadecimal code:

In [12]:
# change the line colors on the plot
# use character code for the first line,
# hex RGB for the second
walk_df.plot(style=['g', '#FF0000']);

Line styles can be specified using a line style code. These can be used in combination with the color style codes, following the color code. The following are examples of several useful line style codes:

  • '-' = solid
  • '--' = dashed
  • ':' = dotted
  • '-.' = dot-dashed
  • '.' = points

The following plot demonstrates these five line styles by drawing five data series, each with one of these styles. Notice how each style item now consists of a color symbol and a line style code:

In [13]:
# show off different line styles
t = np.arange(0., 5., 0.2)
legend_labels = ['Solid', 'Dashed', 'Dotted',
'Dot-dashed', 'Points']
line_style = pd.DataFrame({0 : t,
1 : t**1.5,
2 : t**2.0,
3 : t**2.5,
4 : t**3.0})
# generate the plot, specifying color and line style for each line
ax = line_style.plot(style=['r-', 'g--', 'b:', 'm-.', 'k.'])
# set the legend
ax.legend(legend_labels, loc='upper left');

The thickness of lines can be specified using the lw parameter of .plot(). This can be passed a thickness for multiple lines, by passing a list of widths, or a single width that is applied to all lines. The following redraws the graph with a line width of 3, making the lines a little more pronounced:

In [14]:
# regenerate the plot, specifying color and line style
# for each line and a line width of 3 for all lines
ax = line_style.plot(style=['r-', 'g--', 'b:', 'm-.', 'k.'], lw=3)
ax.legend(legend_labels, loc='upper left');

Markers on a line can also be specified using abbreviations in the style code. There are quite a few marker types provided and you can see them all at http://matplotlib.org/api/markers_api.html. We will examine five of them in the following chart by having each series use a different marker from the following: circles, triangles, stars, diamonds, and points. The type of marker is also specified using a code at the end of the style:

In [15]:
# redraw, adding markers to the lines
ax = line_style.plot(style=['r-o', 'g--^', 'b:*',
'm-.D', 'k:.'], lw=3)
ax.legend(legend_labels, loc='upper left');

Specifying tick mark locations and tick labels

Every plot we have seen to this point has used the default tick marks and tick labels that pandas decides are appropriate for the plot. These can also be customized using various matplotlib functions.

We will demonstrate how ticks are handled by first examining a simple DataFrame. We can retrieve the locations of the ticks that were generated on the x axis using the plt.xticks() function. This function returns two values: the tick locations and the actual labels:

In [16]:
# a simple plot to use to examine ticks
ticks_data = pd.DataFrame(np.arange(0,5))
ticks_data.plot()
ticks, labels = plt.xticks()
ticks

Out [16]:
array([ 0. , 0.5, 1. , 1.5, 2. , 2.5, 3. , 3.5, 4. ])

This array contains the locations of the ticks in units of the values along the x axis. pandas has decided that a range of 0 through 4 (the min and max) and an interval of 0.5 is appropriate. If we want to use other locations, we can provide these by passing them to plt.xticks() as a list. The following demonstrates this using the integers from -1 to 5, which both changes the extents of the axis and removes the non-integral labels:

In [17]:
# resize x axis to (-1, 5), and draw ticks
# only at integer values
ticks_data = pd.DataFrame(np.arange(0,5))
ticks_data.plot()
plt.xticks(np.arange(-1, 6));

Also, we can specify new labels at these locations by passing them as the second parameter. Just as an example, we can change the y axis ticks and labels to integral values and consecutive alpha characters using the following:

In [18]:
# rename y axis tick labels to A, B, C, D, and E
ticks_data = pd.DataFrame(np.arange(0,5))
ticks_data.plot()
plt.yticks(np.arange(0, 5), list("ABCDE"));
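
The plt.xticks() and plt.yticks() functions act on whichever plot is current. As a hedged alternative, the same changes can be made through the Axes object itself. The sketch below uses our own variable names and assumes only the standard set_xticks(), set_yticks(), and set_yticklabels() methods of a matplotlib Axes:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
pd.DataFrame(np.arange(0, 5)).plot(ax=ax)

# x axis: ticks at the integers -1 .. 5
ax.set_xticks(np.arange(-1, 6))

# y axis: ticks at 0 .. 4, relabeled as A .. E
ax.set_yticks(np.arange(0, 5))
ax.set_yticklabels(list('ABCDE'))

x_tick_positions = list(ax.get_xticks())
y_tick_labels = [label.get_text() for label in ax.get_yticklabels()]
print(x_tick_positions, y_tick_labels)
```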

Formatting axes tick date labels using formatters

The formatting of axes labels whose underlying data type is datetime is performed using locators and formatters. Locators control the position of the ticks, and the formatters control the formatting of the labels.

To help with locating ticks and formatting labels based on dates, matplotlib provides several classes in matplotlib.dates:

  • MinuteLocator, HourLocator, DayLocator, WeekdayLocator, MonthLocator, and YearLocator: These are specific locators coded to determine where ticks for each type of date field will be found on the axis
  • DateFormatter: This is a class that can be used to format date objects into labels on the axis

By default, the locator and formatter are AutoDateLocator and AutoDateFormatter, respectively. You can change these by passing different objects to the appropriate methods of the specific axis object.
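
A formatter can also be tried on its own, which is a convenient way to check a format string before attaching it to an axis. The following minimal sketch (variable names are ours) converts one date to matplotlib's numeric date representation and formats it with two different patterns. The expected text assumes the default English/C locale for day and month names:

```python
from datetime import datetime
import matplotlib.dates as mdates

# 6 January 2014 was a Monday
monday = datetime(2014, 1, 6)

# matplotlib stores dates on an axis as floating-point numbers
x = mdates.date2num(monday)

# day of month on one line, abbreviated weekday name on the next
minor_formatter = mdates.DateFormatter('%d\n%a')
# abbreviated month name followed by the four-digit year
major_formatter = mdates.DateFormatter('%b %Y')

minor_label = minor_formatter(x)
major_label = major_formatter(x)
print(repr(minor_label), repr(major_label))
```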

To demonstrate, we will use a subset of the random walk data from earlier, which represents just the data from January through February of 2014. Plotting this gives us the following output:

In [19]:
# plot January-February 2014 from the random walk
walk_df.loc['2014-01':'2014-02'].plot();

The labels on the x axis of this plot have two series of labels, the minor and the major. The minor labels in this plot contain the day of the month, and the major contains the year and month (the year only for the first month). We can set locators and formatters for each of the minor and major levels.

This will be demonstrated by changing the minor labels to be located at the Monday of each week and to contain the date and day of the week (right now, the chart uses weekly and only Friday’s date—without the day name). On the major labels, we will use the monthly location and always include both the month name and the year:

In [20]:
# this import style helps us type less
from matplotlib.dates import WeekdayLocator, DateFormatter, MonthLocator
# plot Jan-Feb 2014
ax = walk_df.loc['2014-01':'2014-02'].plot()
# do the minor labels
weekday_locator = WeekdayLocator(byweekday=(0), interval=1)
ax.xaxis.set_minor_locator(weekday_locator)
ax.xaxis.set_minor_formatter(DateFormatter("%d\n%a"))
# do the major labels
ax.xaxis.set_major_locator(MonthLocator())
ax.xaxis.set_major_formatter(DateFormatter('\n\n\n%b\n%Y'));

This is almost what we wanted. However, note that the year is being reported as 45. This, unfortunately, seems to be an issue between pandas and the matplotlib representation of values for the year. The best reference I have on this is the following link from Stack Overflow (http://stackoverflow.com/questions/12945971/pandas-timeseries-plot-setting-x-axis-major-and-minor-ticks-and-labels).

So, it appears that to create a plot with custom date-based labels, we need to avoid the pandas .plot() method and drop all the way down to matplotlib. Fortunately, this is not too hard. The following changes the code slightly and renders what we wanted:

In [21]:
# this gets around the pandas / matplotlib year issue
# need to reference the subset twice, so let's make a variable
walk_subset = walk_df['2014-01':'2014-02']
# this gets the plot so we can use it, we can ignore fig
fig, ax = plt.subplots()
# inform matplotlib that we will use the following as dates
# note we need to convert the index to a pydatetime series
ax.plot_date(walk_subset.index.to_pydatetime(), walk_subset, '-')
# do the minor labels
weekday_locator = WeekdayLocator(byweekday=(0), interval=1)
ax.xaxis.set_minor_locator(weekday_locator)
ax.xaxis.set_minor_formatter(DateFormatter('%d\n%a'))
# do the major labels
ax.xaxis.set_major_locator(MonthLocator())
ax.xaxis.set_major_formatter(DateFormatter('\n\n\n%b\n%Y'));

To add grid lines for the minor axis ticks, you can use the .grid() method of the x axis object of the plot. The first parameter specifies whether the lines are drawn, and the second parameter specifies the minor or major set of ticks. The following replots this graph without the major grid lines and with the minor grid lines:

In [22]:
# this gets the plot so we can use it, we can ignore fig
fig, ax = plt.subplots()
# inform matplotlib that we will use the following as dates
# note we need to convert the index to a pydatetime series
ax.plot_date(walk_subset.index.to_pydatetime(), walk_subset, '-')
# do the minor labels
weekday_locator = WeekdayLocator(byweekday=(0), interval=1)
ax.xaxis.set_minor_locator(weekday_locator)
ax.xaxis.set_minor_formatter(DateFormatter('%d\n%a'))
ax.xaxis.grid(True, "minor") # turn on minor tick grid lines
ax.xaxis.grid(False, "major") # turn off major tick grid lines
# do the major labels
ax.xaxis.set_major_locator(MonthLocator())
ax.xaxis.set_major_formatter(DateFormatter('\n\n\n%b\n%Y'));

The last demonstration of formatting will use only the major labels but on a weekly basis and using a YYYY-MM-DD format. However, because these would overlap, we will specify that they should be rotated to prevent the overlap. This is done using the fig.autofmt_xdate() function:

In [23]:
# this gets the plot so we can use it, we can ignore fig
fig, ax = plt.subplots()
# inform matplotlib that we will use the following as dates
# note we need to convert the index to a pydatetime series
ax.plot_date(walk_subset.index.to_pydatetime(), walk_subset, '-')
ax.xaxis.grid(True, "major") # turn on major tick grid lines
# do the major labels
ax.xaxis.set_major_locator(weekday_locator)
ax.xaxis.set_major_formatter(DateFormatter('%Y-%m-%d'));
# informs to rotate date labels
fig.autofmt_xdate();

Common plots used in statistical analyses

Having seen how to create, lay out, and annotate time-series charts, we will now look at creating a number of charts other than time series that are commonplace in presenting statistical information.

Bar plots

Bar plots are useful for visualizing the relative differences in values of non-time-series data. Bar plots can be created using the kind='bar' parameter of the .plot() method:

In [24]:
# make a bar plot
# create a small series of 10 random values centered at 0.0
np.random.seed(seedval)
s = pd.Series(np.random.rand(10) - 0.5)
# plot the bar chart
s.plot(kind='bar');

If the data being plotted consists of multiple columns, a multiple series bar plot will be created:

In [25]:
# draw a multiple series bar chart
# generate 4 columns of 10 random values
np.random.seed(seedval)
df2 = pd.DataFrame(np.random.rand(10, 4),
columns=['a', 'b', 'c', 'd'])
# draw the multi-series bar chart
df2.plot(kind='bar');

If you would prefer stacked bars, you can use the stacked parameter, setting it to True:

In [26]:
# vertical stacked bar chart
df2.plot(kind='bar', stacked=True);

If you want the bars to be horizontally aligned, you can use kind='barh':

In [27]:
# horizontal stacked bar chart
df2.plot(kind='barh', stacked=True);

Histograms

Histograms are useful for visualizing distributions of data. The following shows a histogram of 1000 values generated from the normal distribution:

In [28]:
# create a histogram
np.random.seed(seedval)
# 1000 random numbers
dfh = pd.DataFrame(np.random.randn(1000))
# draw the histogram
dfh.hist();

The resolution of a histogram can be controlled by specifying the number of bins to allocate to the graph. The default is 10, and increasing the number of bins gives finer detail to the histogram. The following increases the number of bins to 100:

In [29]:
# histogram again, but with more bins
dfh.hist(bins = 100);
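
To see what the bins parameter actually changes, it can help to compute the bin counts without drawing anything. The sketch below uses NumPy's histogram() function on similar made-up data (the variable names are ours). The same 1000 values are simply divided among more, narrower bins:

```python
import numpy as np

# 1000 normally distributed values, as in the histogram examples
values = np.random.RandomState(111111).randn(1000)

# np.histogram returns the count in each bin and the bin edges
counts_10, edges_10 = np.histogram(values, bins=10)
counts_100, edges_100 = np.histogram(values, bins=100)

# n bins always have n + 1 edges, and every value lands in a bin
print(len(counts_10), len(edges_10), counts_10.sum())
print(len(counts_100), len(edges_100), counts_100.sum())
```

With more bins, each bar holds fewer values, so the outline is more detailed but also noisier.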

If the data has multiple series, the histogram function will automatically generate multiple histograms, one for each series:

In [30]:
# generate a multiple histogram plot
# create DataFrame with 4 columns of 1000 random values
np.random.seed(seedval)
dfh = pd.DataFrame(np.random.randn(1000, 4),
columns=['a', 'b', 'c', 'd'])
# draw the chart. There are four columns so pandas draws
# four histograms
dfh.hist();

If you want to overlay multiple histograms on the same graph (to give a quick visual difference of distribution), you can call the pyplot.hist() function multiple times before .show() is called to render the chart:

In [31]:
# directly use pyplot to overlay multiple histograms
# generate two distributions, each with a different
# mean and standard deviation
np.random.seed(seedval)
x = [np.random.normal(3,1) for _ in range(400)]
y = [np.random.normal(4,2) for _ in range(400)]
# specify the bins (-10 to 10 with 100 bins)
bins = np.linspace(-10, 10, 100)
# generate plot x using plt.hist, 50% transparent
plt.hist(x, bins, alpha=0.5, label='x')
# generate plot y using plt.hist, 50% transparent
plt.hist(y, bins, alpha=0.5, label='y')
plt.legend(loc='upper right');

Box and whisker charts

Box plots come from descriptive statistics and are a useful way of graphically depicting the distribution of data using quartiles. Each box represents the values between the first and third quartiles of the data, with a line across the box at the median. Each whisker reaches out to show the extent of the data lying within 1.5 interquartile ranges below the first quartile and above the third quartile:

In [32]:
# create a box plot
# generate the series
np.random.seed(seedval)
dfb = pd.DataFrame(np.random.randn(10,5))
# generate the plot
dfb.boxplot(return_type='axes');
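
The numbers behind a single box can be worked out by hand. The following sketch uses a small made-up data set with one obvious outlier (all names are ours) and assumes the 1.5 interquartile range rule that matplotlib applies by default:

```python
import numpy as np

# made-up data: nine ordinary values and one obvious outlier
data = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 50.0])

# the box spans the first to the third quartile, with a line at the median
q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1

# fences sit 1.5 interquartile ranges beyond the box
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# each whisker ends at the most extreme data point inside its fence
whisker_low = data[data >= lower_fence].min()
whisker_high = data[data <= upper_fence].max()

# anything beyond the fences would be drawn as an individual point
outliers = data[(data < lower_fence) | (data > upper_fence)]
print(q1, median, q3, whisker_low, whisker_high, outliers)
```

Here the upper whisker stops at 9 rather than 50, because 50 lies beyond the upper fence.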

There are ways to overlay dots and show outliers, but for brevity, they will not be covered in this text.

Area plots

Area plots are used to represent cumulative totals over time, to demonstrate the change in trends over time among related attributes. They can also be “stacked” to demonstrate representative totals across all variables.

Area plots are generated by specifying kind='area'. A stacked area chart is the default:

In [33]:
# create a stacked area plot
# generate a 4-column data frame of random data
np.random.seed(seedval)
dfa = pd.DataFrame(np.random.rand(10, 4),
columns=['a', 'b', 'c', 'd'])
# create the area plot
dfa.plot(kind='area');

To produce an unstacked plot, specify stacked=False:

In [34]:
# do not stack the area plot
dfa.plot(kind='area', stacked=False);

By default, unstacked plots have an alpha value of 0.5, so that it is possible to see how the data series overlaps.
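
One way to think about the difference is that, in a stacked area plot, the top edge of each band is the running total of the columns up to and including it. The sketch below uses a tiny made-up DataFrame (names and values are ours) and computes those running totals with cumsum() across the columns:

```python
import pandas as pd
import matplotlib.pyplot as plt

# tiny, made-up data so the totals are easy to check by eye
demo_area = pd.DataFrame({'a': [1.0, 2.0, 3.0],
                          'b': [2.0, 2.0, 2.0],
                          'c': [0.5, 1.0, 1.5]})

# running totals across the columns: the height reached by the
# top of each band in a stacked area plot
stacked_tops = demo_area.cumsum(axis=1)
print(stacked_tops)

# the stacked plot itself (stacking is the default for kind='area')
fig, ax = plt.subplots()
demo_area.plot(kind='area', ax=ax)
```

In the unstacked version, each area instead starts from zero, which is why transparency is needed to see the overlaps.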

Scatter plots

A scatter plot displays the correlation between a pair of variables. A scatter plot can be created from a DataFrame using .plot() and specifying kind='scatter', as well as specifying the x and y columns from the DataFrame source:

In [35]:
# generate a scatter plot of two series of normally
# distributed random values
# we would expect this to cluster around 0,0
np.random.seed(111111)
sp_df = pd.DataFrame(np.random.randn(10000, 2),
columns=['a', 'b'])
sp_df.plot(kind='scatter', x='a', y='b')
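
What the eye judges in a scatter plot can also be put into a number with the .corr() method of a DataFrame. The following sketch builds made-up columns (the names x, unrelated, and related are ours): one independent of x, and one constructed from x plus a little noise:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.RandomState(111111)
x = rng.randn(1000)
noise = rng.randn(1000)

demo_sp = pd.DataFrame({'x': x,
                        'unrelated': noise,               # independent of x
                        'related': 2.0 * x + 0.1 * noise})  # mostly x

# pairwise Pearson correlation coefficients
corr = demo_sp.corr()
print(corr)

# the strongly correlated pair forms a tight diagonal band
fig, ax = plt.subplots()
demo_sp.plot(kind='scatter', x='x', y='related', ax=ax)
```

The independent pair gives a coefficient near 0 and a round cloud like the one above, while the related pair gives a coefficient near 1.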

We can easily create more elaborate scatter plots by dropping down a little lower into matplotlib. The following code gets Google stock data for the year 2011, calculates the day-to-day change in the closing price, and renders each day's change against the next day's change as bubbles whose sizes are derived from the trading volume:

In [36]:
# get Google stock data from 1/1/2011 to 12/31/2011
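# pandas.io.data was later moved out of pandas into the separate
# pandas-datareader package; the Yahoo source may no longer be available
from pandas_datareader.data import DataReader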
stock_data = DataReader("GOOGL", "yahoo",
datetime(2011, 1, 1),
datetime(2011, 12, 31))
# % change per day
delta = np.diff(stock_data["Adj Close"])/stock_data["Adj Close"][:-1]
# this calculates size of markers
volume = (15 * stock_data.Volume[:-2] / stock_data.Volume.iloc[0])**2
close = 0.003 * stock_data.Close[:-2] / 0.003 * stock_data.Open[:-2]
# generate scatter plot
fig, ax = plt.subplots()
ax.scatter(delta[:-1], delta[1:], c=close, s=volume, alpha=0.5)
# add some labels and style
ax.set_xlabel(r'$\Delta_i$', fontsize=20)
ax.set_ylabel(r'$\Delta_{i+1}$', fontsize=20)
ax.set_title('Volume and percent change')
ax.grid(True);

Note the LaTeX-style notation used for the x and y axes labels, which creates a nice mathematical style for the labels.

Density plot

You can create kernel density estimation plots using the .plot() method and setting the kind='kde' parameter. A kernel density estimate plot, instead of being a purely empirical representation of the data, attempts to estimate the true distribution of the data, and hence smooths it into a continuous plot. The following generates a normally distributed set of numbers, displays it as a histogram, and overlays the kde plot:

In [37]:
# create a kde density plot
# generate a series of 1000 random numbers
np.random.seed(seedval)
s = pd.Series(np.random.randn(1000))
# generate the plot
# density=True replaces the older normed=True, which matplotlib removed
s.hist(density=True) # shows the bars
s.plot(kind='kde');
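
To give a feel for what a kernel density estimate computes, here is a small hand-rolled sketch in NumPy. It is not the implementation pandas uses (pandas relies on SciPy for kde plots); the function name, the grid, and the rule-of-thumb bandwidth are all assumptions made for this illustration. The idea is to place one small Gaussian bump on every sample and average them:

```python
import numpy as np

def gaussian_kde_sketch(samples, grid, bandwidth):
    """Average one Gaussian bump per sample, evaluated on grid."""
    # z[i, j] is the scaled distance from grid point i to sample j
    z = (grid[:, None] - samples[None, :]) / bandwidth
    bumps = np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi)
    return bumps.mean(axis=1) / bandwidth

samples = np.random.RandomState(111111).randn(1000)
grid = np.linspace(-6.0, 6.0, 601)

# a normal-reference rule of thumb for the bandwidth (our assumption)
bandwidth = 1.06 * samples.std() * len(samples) ** (-1.0 / 5.0)

density = gaussian_kde_sketch(samples, grid, bandwidth)

# a density estimate should integrate to roughly 1
area = density.sum() * (grid[1] - grid[0])
peak_location = grid[density.argmax()]
print(round(area, 3), round(peak_location, 2))
```

A smaller bandwidth follows the individual samples more closely, and a larger one gives a smoother curve.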

The scatter plot matrix

The final composite graph we’ll look at in this article is one that is provided by pandas in its plotting tools subcomponent: the scatter plot matrix. A scatter plot matrix is a popular way of determining whether there is a linear correlation between multiple variables. The following creates a scatter plot matrix with random values, which then shows a scatter plot for each combination, as well as a kde graph for each variable:

In [38]:
# create a scatter plot matrix
# import this class
# (pandas.tools.plotting was renamed to pandas.plotting in later versions)
from pandas.plotting import scatter_matrix
# generate DataFrame with 4 columns of 1000 random numbers
np.random.seed(111111)
df_spm = pd.DataFrame(np.random.randn(1000, 4),
columns=['a', 'b', 'c', 'd'])
# create the scatter matrix
scatter_matrix(df_spm, alpha=0.2, figsize=(6, 6), diagonal='kde');

Heatmaps

A heatmap is a graphical representation of data, where values within a matrix are represented by colors. This is an effective means to show relationships of values that are measured at the intersection of two variables, at each intersection of the rows and the columns of the matrix. A common scenario is to have the values in the matrix normalized to 0.0 through 1.0 and have the intersections between a row and column represent the correlation between the two variables. Values with less correlation (0.0) are the darkest, and those with the highest correlation (1.0) are white.

Heatmaps are easily created with pandas and matplotlib using the .imshow() function:

In [39]:
# create a heatmap
# start with data for the heatmap
s = pd.Series([0.0, 0.1, 0.2, 0.3, 0.4],
['V', 'W', 'X', 'Y', 'Z'])
heatmap_data = pd.DataFrame({'A' : s + 0.0,
'B' : s + 0.1,
'C' : s + 0.2,
'D' : s + 0.3,
'E' : s + 0.4,
'F' : s + 0.5,
'G' : s + 0.6
})
heatmap_data
Out [39]:
A B C D E F G
V 0.0 0.1 0.2 0.3 0.4 0.5 0.6
W 0.1 0.2 0.3 0.4 0.5 0.6 0.7
X 0.2 0.3 0.4 0.5 0.6 0.7 0.8
Y 0.3 0.4 0.5 0.6 0.7 0.8 0.9
Z 0.4 0.5 0.6 0.7 0.8 0.9 1.0
In [40]:
# generate the heatmap
plt.imshow(heatmap_data, cmap='hot', interpolation='none')
plt.colorbar() # add the scale of colors bar
# set the labels
plt.xticks(range(len(heatmap_data.columns)), heatmap_data.columns)
plt.yticks(range(len(heatmap_data)), heatmap_data.index);
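
The correlation scenario described above can be sketched as follows. The data and the column names a, b, and c are made up for illustration: b is built from a, and c is independent. The absolute correlations are taken so that all values fall between 0.0 and 1.0:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.RandomState(111111)
base = rng.randn(500)
demo_hm = pd.DataFrame({'a': base,
                        'b': base + 0.5 * rng.randn(500),  # related to a
                        'c': rng.randn(500)})              # independent

# absolute pairwise correlations, all between 0.0 and 1.0
corr = demo_hm.corr().abs()

fig, ax = plt.subplots()
image = ax.imshow(corr.values, cmap='hot', interpolation='none',
                  vmin=0.0, vmax=1.0)
fig.colorbar(image, ax=ax)  # add the scale of colors bar

# label the rows and columns with the variable names
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(list(corr.columns))
ax.set_yticks(range(len(corr.index)))
ax.set_yticklabels(list(corr.index))
print(corr.round(2))
```

The diagonal is white because every variable correlates perfectly with itself; the a/b cells are bright, and the cells involving c are dark.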

Multiple plots in a single chart

It is often useful to contrast data by displaying multiple plots next to each other. This is actually quite easy to do when using matplotlib.

To draw multiple subplots on a grid, we can make multiple calls to plt.subplot2grid(), each time passing the size of the grid the subplot is to be located on (shape=(height, width)) and the location on the grid of the upper-left section of the subplot (loc=(row, column)). Each call to plt.subplot2grid() returns a different AxesSubplot object that can be used to reference the specific subplot and direct the rendering into.

The following demonstrates this, by creating a plot with two subplots based on a two row by one column grid (shape=(2,1)). The first subplot, referred to by ax1, is located in the first row (loc=(0,0)), and the second, referred to as ax2, is in the second row (loc=(1,0)):

In [41]:
# create two sub plots on the new plot using a 2x1 grid
# ax1 is the upper row
ax1 = plt.subplot2grid(shape=(2,1), loc=(0,0))
# and ax2 is in the lower row
ax2 = plt.subplot2grid(shape=(2,1), loc=(1,0))

The subplots have been created, but we have not drawn into either yet.

The size of any subplot can be specified using the rowspan and colspan parameters in each call to plt.subplot2grid(). This actually feels a lot like placing content in HTML tables.

The following demonstrates a more complicated layout of five plots, specifying different row and column spans for each:

In [42]:
# layout sub plots on a 4x4 grid
# ax1 on top row, 4 columns wide
ax1 = plt.subplot2grid((4,4), (0,0), colspan=4)
# ax2 is row 2, leftmost and 2 columns wide
ax2 = plt.subplot2grid((4,4), (1,0), colspan=2)
# ax3 is 2 cols wide and 2 rows high, starting
# on second row and the third column
ax3 = plt.subplot2grid((4,4), (1,2), colspan=2, rowspan=2)
# ax4 is 1 high and 1 wide, in row 3, column 1
ax4 = plt.subplot2grid((4,4), (2,0))
# ax5 is 1 high and 1 wide, in row 3, column 2
ax5 = plt.subplot2grid((4,4), (2,1));

To draw into a specific subplot using the pandas .plot() method, you can pass the specific axes into the plot function via the ax parameter. The following demonstrates this by extracting each series from the random walk we created at the beginning of this article, and drawing each into different subplots:

In [43]:
# demonstrating drawing into specific sub-plots
# generate a layout of 2 rows 1 column
# create the subplots, one on each row
ax5 = plt.subplot2grid((2,1), (0,0))
ax6 = plt.subplot2grid((2,1), (1,0))
# plot column A of walk_df into top row of the grid
walk_df[['A']].plot(ax = ax5)
# and column B of walk_df into bottom row
walk_df[['B']].plot(ax = ax6);
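
For simple regular grids, a hedged alternative to plt.subplot2grid() is matplotlib's plt.subplots() function, which creates the figure and all of its Axes in one call. The sketch below uses made-up random walks and our own variable names, and passes each Axes to pandas through the same ax parameter:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# two short random walks, made up for this illustration
steps = np.random.RandomState(111111).randn(100, 2)
demo_walk = pd.DataFrame(steps.cumsum(axis=0), columns=['A', 'B'])

# a 2 row by 1 column grid whose plots share the same x axis
fig, (top, bottom) = plt.subplots(nrows=2, ncols=1, sharex=True)

# draw one column into each subplot
demo_walk['A'].plot(ax=top, title='A')
demo_walk['B'].plot(ax=bottom, title='B')

print(len(fig.axes), top.get_title(), bottom.get_title())
```

plt.subplot2grid() remains the better fit when subplots need to span several rows or columns.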

Using this technique, we can perform combinations of different series of data, such as a stock close versus volume graph. Given the data we read during a previous example for Google, the following will plot the volume versus the closing price:

In [44]:
# draw the close on the top chart
top = plt.subplot2grid((4,4), (0, 0), rowspan=3, colspan=4)
top.plot(stock_data.index, stock_data['Close'], label='Close')
plt.title('Google Closing Stock Price 2011')
# draw the volume chart on the bottom
bottom = plt.subplot2grid((4,4), (3,0), rowspan=1, colspan=4)
bottom.bar(stock_data.index, stock_data['Volume'])
plt.title('Google Trading Volume')
# set the size of the plot
plt.gcf().set_size_inches(15,8)

Summary

Visualizing your data is one of the best ways to quickly understand the story that is being told with the data. Python, pandas, and matplotlib (and a few other libraries) provide a means of very quickly, and with a few lines of code, getting the gist of what you are trying to discover, as well as the underlying message (and displaying it beautifully too).

In this article, we examined many of the most common means of visualizing data from pandas. There are also a lot of interesting visualizations that were not covered, and indeed, the concept of data visualization with pandas and/or Python is the subject of entire texts, but I believe this article provides a much-needed reference to get up and going with the visualizations that provide most of what is needed.
