
Visualizing univariate distribution in Seaborn

November 16, 2017 - 12:00 am


6 min read

Note: This article is an excerpt from Matplotlib 2.x By Example, a book by Allen Chi Shing Yu, Claire Yik Lok Chung, and Aldrin Kay Yuen Yim.

Seaborn, by Michael Waskom, is a statistical visualization library built on top of Matplotlib. It comes with handy functions for visualizing categorical variables, univariate distributions, and bivariate distributions. In this article, we will visualize univariate distributions in Seaborn.

Visualizing univariate distribution

Seaborn makes the task of visualizing the distribution of a dataset much easier. In this example, we are going to use the annual population summary published by the Department of Economic and Social Affairs, United Nations, in 2015. The dataset also includes projected population figures up to 2100. Let’s see how the population of the United States is distributed across age groups in 2017 by plotting a bar chart. The code assumes that the dataset has already been loaded into a pandas DataFrame named population_df:

import seaborn as sns
import matplotlib.pyplot as plt

# Extract USA population data in 2017
current_population = population_df[(population_df.Location == 'United States of America') &
                                   (population_df.Time == 2017) &
                                   (population_df.Sex != 'Both')]

# Population bar chart
sns.barplot(x="AgeGrp", y="Value", hue="Sex", data=current_population)

# Use Matplotlib functions to label axes and rotate tick labels
ax = plt.gca()
ax.set(xlabel="Age Group", ylabel="Population (thousands)")
ax.set_xticklabels(ax.xaxis.get_majorticklabels(), rotation=45)
plt.title("Population Barchart (USA)")

# Show the figure
plt.show()

Bar chart in Seaborn

The seaborn.barplot() function shows a series of data points as rectangular bars. If multiple points per group are available, the bar height shows their mean by default, and confidence intervals are drawn on top of the bars to indicate the uncertainty of the point estimates. Like most other Seaborn functions, it supports various input data formats, such as Python lists, NumPy arrays, pandas Series, and pandas DataFrames.
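To make the aggregation concrete, the sketch below reproduces by hand roughly what a bar and its error bar represent when a group holds several points: the bar height is the group mean, and the interval is a bootstrap confidence interval for that mean. The sample values, the seed, and the helper name bootstrap_ci are assumptions made for illustration; Seaborn's own resampling may differ in detail.

```python
import numpy as np

def bootstrap_ci(values, n_boot=1000, level=95, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    # Resample with replacement and record the mean of each resample
    samples = rng.choice(values, size=(n_boot, values.size), replace=True)
    boot_means = samples.mean(axis=1)
    lower = (100 - level) / 2
    return np.percentile(boot_means, [lower, 100 - lower])

# Invented measurements: several points per group
groups = {"A": [4.0, 5.0, 6.0, 5.5], "B": [8.0, 7.0, 9.5, 8.5]}

for name, values in groups.items():
    height = np.mean(values)            # what the bar height shows
    low, high = bootstrap_ci(values)    # what the error bar spans
    print(f"{name}: mean={height:.2f}, 95% CI=({low:.2f}, {high:.2f})")
```

With only one point per group, as in the population data above, there is nothing to resample, so the bars simply show the raw values.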

A more traditional way to show the population structure is through the use of a population pyramid.

So what is a population pyramid? As its name suggests, it is a pyramid-shaped plot that shows the age distribution of a population. Population pyramids can be roughly classified into three classes, namely constrictive, stationary, and expansive, for populations that are undergoing negative, stable, and rapid growth respectively. For instance, constrictive populations have a lower proportion of young people, so the pyramid base appears to be constricted. Stationary populations have a more or less similar number of people in the young and middle-aged groups. Expansive populations, on the other hand, have a large proportion of young people, resulting in pyramids with enlarged bases.
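As a rough illustration of the three classes, the hypothetical helper below compares the size of a young cohort with that of a middle-aged cohort. The function name, the choice of cohorts, and the 20% tolerance are arbitrary assumptions made for this sketch, not a demographic standard.

```python
def classify_pyramid(young, middle, tolerance=0.2):
    """Classify a population by comparing young and middle-aged cohort sizes.

    `young` and `middle` are population counts for two cohorts of equal width.
    `tolerance` is an arbitrary threshold chosen for this sketch.
    """
    ratio = young / middle
    if ratio > 1 + tolerance:
        return "expansive"      # enlarged base: many young people
    if ratio < 1 - tolerance:
        return "constrictive"   # narrow base: few young people
    return "stationary"         # young and middle-aged groups are similar

# Invented cohort sizes, in thousands
print(classify_pyramid(young=1800, middle=900))   # expansive
print(classify_pyramid(young=5000, middle=8500))  # constrictive
print(classify_pyramid(young=2000, middle=2100))  # stationary
```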

We can build a population pyramid by plotting two bar charts on two subplots with a shared y axis:

import seaborn as sns
import matplotlib.pyplot as plt

# Extract USA population data in 2017
current_population = population_df[(population_df.Location == 'United States of America') &
                                   (population_df.Time == 2017) &
                                   (population_df.Sex != 'Both')]

# Change the age group to descending order
current_population = current_population.iloc[::-1]

# Create two subplots with shared y-axis
fig, axes = plt.subplots(ncols=2, sharey=True)

# Bar chart for male
sns.barplot(x="Value", y="AgeGrp", color="darkblue", ax=axes[0],
            data=current_population[(current_population.Sex == 'Male')])
# Bar chart for female
sns.barplot(x="Value", y="AgeGrp", color="darkred", ax=axes[1],
            data=current_population[(current_population.Sex == 'Female')])

# Use Matplotlib function to invert the first chart
axes[0].invert_xaxis()

# Use Matplotlib function to show tick labels in the middle
axes[0].yaxis.tick_right()

# Use Matplotlib functions to label the axes and titles
axes[0].set_title("Male")
axes[1].set_title("Female")
axes[0].set(xlabel="Population (thousands)", ylabel="Age Group")
axes[1].set(xlabel="Population (thousands)", ylabel="")
fig.suptitle("Population Pyramid (USA)")

# Show the figure
plt.show()

Since Seaborn is built on top of the solid foundations of Matplotlib, we can customize the plot easily using built-in functions of Matplotlib. In the preceding example, we used matplotlib.axes.Axes.invert_xaxis() to flip the male population plot horizontally, followed by changing the location of the tick labels to the right-hand side using matplotlib.axis.YAxis.tick_right(). We further customized the titles and axis labels for the plot using a combination of matplotlib.axes.Axes.set_title(), matplotlib.axes.Axes.set(), and matplotlib.figure.Figure.suptitle().
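The effect of these Matplotlib calls can be checked on a toy figure. The sketch below uses made-up numbers and the non-interactive Agg backend (an assumption, so that it runs without a display), then inspects the state of the axes:

```python
import matplotlib
matplotlib.use("Agg")  # assumption: render off-screen, no display needed
import matplotlib.pyplot as plt

fig, axes = plt.subplots(ncols=2, sharey=True)

# Invented values standing in for the male and female populations
axes[0].barh([0, 1, 2], [3, 5, 4], color="darkblue")
axes[1].barh([0, 1, 2], [4, 5, 3], color="darkred")

axes[0].invert_xaxis()        # bars on the left panel now grow leftwards
axes[0].yaxis.tick_right()    # ticks and labels move to the right edge

axes[0].set_title("Male")
axes[1].set_title("Female")
fig.suptitle("Toy pyramid")

print(axes[0].xaxis_inverted())  # True
print(axes[1].xaxis_inverted())  # False
```

Only the left panel is inverted because the two subplots share their y axis, not their x axis.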

Let’s try to plot the population pyramids for Cambodia and Japan as well by changing the line population_df.Location == 'United States of America' to population_df.Location == 'Cambodia' or population_df.Location == 'Japan'. Can you classify the pyramids into one of the three population pyramid classes?
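One way to avoid editing the filter by hand for each country is to wrap it in a small helper. The function name select_population is hypothetical, and the tiny data frame is invented so that the sketch runs on its own; with the real population_df, only the helper would be needed.

```python
import pandas as pd

def select_population(df, location, year=2017):
    """Return the male and female rows for one location and year."""
    mask = (df.Location == location) & (df.Time == year) & (df.Sex != "Both")
    return df[mask]

# Invented stand-in rows with the same columns as population_df
population_df = pd.DataFrame({
    "Location": ["Cambodia", "Cambodia", "Japan", "Japan", "Japan"],
    "Time": [2017, 2017, 2017, 2017, 2017],
    "Sex": ["Male", "Female", "Male", "Female", "Both"],
    "AgeGrp": ["0-4"] * 5,
    "Value": [0.9, 0.8, 2.7, 2.6, 5.3],
})

for country in ["Cambodia", "Japan"]:
    current_population = select_population(population_df, country)
    print(country, len(current_population), "rows")
```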

To see how Seaborn simplifies the code for relatively complex plots, let’s see how a similar plot can be achieved using vanilla Matplotlib.

First, as in the previous Seaborn-based example, we create two subplots with a shared y axis:

fig, axes = plt.subplots(ncols=2, sharey=True)

Next, we plot horizontal bar charts using matplotlib.axes.Axes.barh() and set the location and labels of ticks, followed by adjusting the subplot spacing:

# Get a list of tick positions according to the data bins
y_pos = range(len(current_population.AgeGrp.unique()))

# Horizontal barchart for male
axes[0].barh(y_pos, current_population[(current_population.Sex == 'Male')].Value, color="darkblue")

# Horizontal barchart for female
axes[1].barh(y_pos, current_population[(current_population.Sex == 'Female')].Value, color="darkred")

# Show tick for each data point, and label with the age group
axes[0].set_yticks(y_pos)
axes[0].set_yticklabels(current_population.AgeGrp.unique())

# Increase spacing between subplots to avoid clipping of ytick labels
plt.subplots_adjust(wspace=0.3)
 
Finally, we use the same code to further customize the look and feel of the figure:
 
# Invert the first chart
axes[0].invert_xaxis()

# Show tick labels in the middle
axes[0].yaxis.tick_right()

# Label the axes and titles
axes[0].set_title("Male")
axes[1].set_title("Female")
axes[0].set(xlabel="Population (thousands)", ylabel="Age Group")
axes[1].set(xlabel="Population (thousands)", ylabel="")
fig.suptitle("Population Pyramid (USA)")

# Show the figure
plt.show()

Compared to the Seaborn-based code, the pure Matplotlib implementation requires extra lines to define the tick positions, tick labels, and subplot spacing. For other Seaborn plot types that include extra statistical calculations, such as linear regression and Pearson correlation, the code reduction is even more dramatic. In this sense, Seaborn is a “batteries-included” statistical visualization package that allows users to write less verbose code.
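To give a sense of the statistics that such plot types compute internally, the sketch below fits a straight line and computes the Pearson correlation by hand with NumPy. The data points are invented, and this only approximates the kind of work involved; it is not Seaborn's actual implementation.

```python
import numpy as np

# Invented data lying exactly on the line y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

# Least-squares fit of a degree-1 polynomial: returns slope, then intercept
slope, intercept = np.polyfit(x, y, deg=1)

# Pearson correlation coefficient, read from the 2x2 correlation matrix
r = np.corrcoef(x, y)[0, 1]

print(f"slope={slope:.2f}, intercept={intercept:.2f}, r={r:.2f}")
# slope=2.00, intercept=1.00, r=1.00
```

In plain Matplotlib, these values would still have to be turned into a fitted line and an annotation by hand.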

Histogram and distribution fitting in Seaborn

In the population example, the raw data was already binned into different age groups. What if the data is not binned (for example, the Big Mac Index data)? It turns out that seaborn.distplot can process the data into bins and show us a histogram as a result. Let’s look at this example, which assumes that the Big Mac Index data has already been loaded into a pandas DataFrame named bigmac_df:

import seaborn as sns
import matplotlib.pyplot as plt

# Get the BigMac index in 2017
current_bigmac = bigmac_df[(bigmac_df.Date == "2017-01-31")]

# Plot the histogram
ax = sns.distplot(current_bigmac.dollar_price)
plt.show()

The seaborn.distplot function expects a pandas Series, a one-dimensional NumPy array, or a Python list as input. It then determines the size of the bins according to the Freedman-Diaconis rule, and finally fits a kernel density estimate (KDE) over the histogram. Note that seaborn 0.11 and later deprecate distplot in favor of histplot and displot.
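The Freedman-Diaconis rule sets the bin width to h = 2 × IQR × n^(-1/3), where IQR is the interquartile range and n is the number of observations. The sketch below applies the rule by hand to synthetic prices (generated here only for illustration, not the Big Mac data) and compares the result with NumPy's built-in "fd" estimator. How Seaborn turns this width into a final bin count may differ in detail.

```python
import numpy as np

# Synthetic stand-in for a price column; not the real Big Mac data
rng = np.random.default_rng(42)
prices = rng.normal(loc=3.5, scale=1.0, size=60)

# Freedman-Diaconis bin width: 2 * IQR * n^(-1/3)
q75, q25 = np.percentile(prices, [75, 25])
width = 2.0 * (q75 - q25) * prices.size ** (-1.0 / 3.0)

# Number of equal-width bins needed to cover the data range
n_bins = int(np.ceil((prices.max() - prices.min()) / width))

# NumPy implements the same rule under the name "fd"
edges = np.histogram_bin_edges(prices, bins="fd")

print(f"width={width:.3f}, bins={n_bins}, numpy bins={len(edges) - 1}")
```

Because the width depends on the interquartile range rather than the full range, a few extreme values change the bin width very little.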

KDE is a non-parametric method used to estimate the distribution of a variable. We can also supply a parametric distribution, such as beta, gamma, or normal distribution, to the fit argument.
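A Gaussian KDE can be written in a few lines: place a normal bump on every observation and average them. The sketch below does this with NumPy on invented data. The helper name gaussian_kde_1d and the rule-of-thumb bandwidth (standard deviation times n^(-1/5)) are assumptions for this example, and Seaborn's default bandwidth may differ. As a sanity check, the estimated density should integrate to roughly 1.

```python
import numpy as np

def gaussian_kde_1d(data, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate of `data` on `grid`."""
    data = np.asarray(data, dtype=float)
    # Standardized distance from every grid point to every observation
    z = (grid[:, None] - data[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    # Average the kernels and rescale by the bandwidth
    return kernels.mean(axis=1) / bandwidth

# Invented observations
data = np.array([2.1, 2.4, 3.0, 3.3, 3.9, 4.2, 5.0])

# Rule-of-thumb bandwidth (an assumption): sample std times n^(-1/5)
bandwidth = data.std(ddof=1) * data.size ** (-1 / 5)

grid = np.linspace(data.min() - 5, data.max() + 5, 1000)
density = gaussian_kde_1d(data, grid, bandwidth)

# Riemann-sum approximation of the area under the curve
area = density.sum() * (grid[1] - grid[0])
print(f"bandwidth={bandwidth:.3f}, area={area:.3f}")
```

A smaller bandwidth gives a bumpier curve that follows individual points, while a larger one smooths them into a single broad hump.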

In this example, we are going to fit the normal distribution from the scipy.stats package over the Big Mac Index dataset:

from scipy import stats
 
ax = sns.distplot(current_bigmac.dollar_price, kde=False, fit=stats.norm) 
plt.show()
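What the fit argument does can be approximated by hand: estimate the distribution's parameters from the data, then evaluate its probability density. The sketch below uses scipy.stats.norm.fit and scipy.stats.norm.pdf on synthetic values (invented here, not the Big Mac data). For a normal distribution, the fitted parameters are simply the sample mean and the standard deviation.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for current_bigmac.dollar_price
rng = np.random.default_rng(0)
prices = rng.normal(loc=3.5, scale=1.0, size=50)

# Maximum likelihood estimates of the mean and standard deviation
mu, sigma = stats.norm.fit(prices)

# Evaluate the fitted density on a grid, ready to be drawn over a histogram
grid = np.linspace(prices.min(), prices.max(), 200)
fitted_pdf = stats.norm.pdf(grid, loc=mu, scale=sigma)

print(f"mu={mu:.3f}, sigma={sigma:.3f}")
```

The same two-step pattern applies to other scipy.stats distributions such as beta or gamma, although their fit methods return additional shape parameters.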


You are now equipped to visualize univariate data in Seaborn using bar charts, histograms, and distribution fitting. To have more fun visualizing data with Seaborn and Matplotlib, check out the book Matplotlib 2.x By Example, from which this excerpt is taken.
