[box type=”note” align=”” class=”” width=””]This article has been taken from the book Principles of Data Science, written by Sinan Ozdemir. It aims to practically introduce you to the different ways in which you can communicate or visualize your data to tell stories effectively.[/box]
Communication matters
Being able to conduct experiments and manipulate data in a coding language is not enough to conduct practical and applied data science. This is because data science is, generally, only as good as how it is used in practice. For instance, a medical data scientist might be able to predict the chance of a tourist contracting malaria in developing countries with >98% accuracy. However, if these results are published in a poorly marketed journal and online mentions of the study are minimal, the groundbreaking results that could potentially prevent deaths would never see the light of day.
For this reason, communication of results through data storytelling is arguably as important as the results themselves. A famous example of poorly distributed results is the case of Gregor Mendel. Mendel is widely recognized as one of the founders of modern genetics. However, his results (including data and charts) were not widely accepted until after his death. Mendel even sent them to Charles Darwin, who largely ignored Mendel’s papers, which were published in obscure Moravian journals.
Generally, there are two ways of presenting results: verbal and visual. Of course, both the verbal and visual forms of communication can be broken down into dozens of subcategories, including slide decks, charts, journal papers, and even university lectures. However, we can find common elements of data presentation that can make anyone in the field more aware and effective in their communication skills.
Let’s dive right into effective (and ineffective) forms of communication, starting with visuals. We’ll look at five basic types of graphs: scatter plots, line graphs, bar charts, histograms, and box plots.
Scatter plots
A scatter plot is probably one of the simplest graphs to create. It is made by creating two quantitative axes and using data points to represent observations. The main goal of a scatter plot is to highlight relationships between two variables and, if possible, reveal a correlation.
For example, we can look at two variables: average hours of TV watched in a day and a 0-100 scale of work performance (0 being very poor performance and 100 being excellent performance). The goal here is to find a relationship (if it exists) between watching TV and average work performance.
The following code simulates a survey in which a few people reported how much television they watch, on average, in a day, paired with a company-standard work performance metric:
import pandas as pd
import matplotlib.pyplot as plt  # used later for axis labels
hours_tv_watched = [0, 0, 0, 1, 1.3, 1.4, 2, 2.1, 2.6, 3.2, 4.1, 4.4, 4.4, 5]
This line of code is creating 14 sample survey results of people answering the question of how many hours of TV they watch in a day.
work_performance = [87, 89, 92, 90, 82, 80, 77, 80, 76, 85, 80, 75, 73, 72]
This line of code is creating 14 new sample survey results of the same people being rated on their work performance on a scale from 0 to 100.
For example, the first person watched 0 hours of TV a day and was rated 87/100 on their work, while the last person watched, on average, 5 hours of TV a day and was rated 72/100:
df = pd.DataFrame({'hours_tv_watched': hours_tv_watched, 'work_performance': work_performance})
Here, we are creating a DataFrame to ease our exploratory data analysis and make it easier to draw a scatter plot:
df.plot(x='hours_tv_watched', y='work_performance', kind='scatter')
Now, we are actually making our scatter plot. In the following plot, we can see that our axes represent the number of hours of TV watched in a day and the person’s work performance metric:
Each point on a scatter plot represents a single observation (in this case, a person), and its location shows where that observation stands on each variable. This scatter plot does seem to show a relationship: people who watch more TV in a day tend to have lower work performance scores.
Of course, as we are now experts in statistics from the last two chapters, we know that this relationship might not be causal. A scatter plot can only reveal a correlation or an association between two variables, not causation. Advanced statistical tests, such as the ones we saw in Chapter 8, Advanced Statistics, might work to reveal causation. Later on in this chapter, we will see the damaging effects that trusting correlation might have.
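To put a number on the relationship the scatter plot suggests, one option is the Pearson correlation coefficient. The following is a minimal sketch that rebuilds the same survey lists and uses the pandas Series.corr method. It measures linear association only and says nothing about causation:

```python
import pandas as pd

hours_tv_watched = [0, 0, 0, 1, 1.3, 1.4, 2, 2.1, 2.6, 3.2, 4.1, 4.4, 4.4, 5]
work_performance = [87, 89, 92, 90, 82, 80, 77, 80, 76, 85, 80, 75, 73, 72]

df = pd.DataFrame({'hours_tv_watched': hours_tv_watched,
                   'work_performance': work_performance})

# Pearson correlation: -1 is a perfect negative linear relationship,
# 0 is no linear relationship, and +1 is a perfect positive one
r = df['hours_tv_watched'].corr(df['work_performance'])
print(round(r, 2))
```

A negative value here matches the downward drift of the points in the plot.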
Line graphs
Line graphs are, perhaps, one of the most widely used graphs in data communication. A line graph simply uses lines to connect data points and usually represents time on the x axis. Line graphs are a popular way to show changes in variables over time. The line graph, like the scatter plot, is used to plot quantitative variables.
As a great example, many of us wonder about the possible links between what we see on TV and our behavior in the world. A friend of mine once took this thought to an extreme: he wondered if he could find a relationship between the TV show The X-Files and the number of UFO sightings in the U.S. He found the number of UFO sightings per year and plotted them over time. He then added a quick graphic to ensure that readers would be able to identify the point in time when The X-Files was released:
It appears to be clear that right after 1993, the year of The X-Files premiere, the number of UFO sightings started to climb drastically.
This graphic, albeit light-hearted, is an excellent example of a simple line graph. We are told what each axis measures, we can quickly see a general trend in the data, and we can identify with the author’s intent, which is to show a relationship between the number of UFO sightings and The X-Files premiere.
On the other hand, the following is a less impressive line chart:
This line graph attempts to highlight the change in the price of gas by plotting three points in time. At first glance, it is not much different from the previous graph: we have time on the bottom x axis and a quantitative value on the vertical y axis. The (not so) subtle difference here is that the three points are equally spaced out on the x axis; however, if we read their actual time indications, they are not equally spaced out in time. A year separates the first two points, whereas a mere 7 days separate the last two points.
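One way to avoid this distortion is to put real dates on the x axis so that the spacing between points reflects elapsed time. The sketch below assumes Matplotlib is available. The three dates and prices are hypothetical placeholders chosen only to mirror the one-year and seven-day gaps described above:

```python
from datetime import date

import matplotlib
matplotlib.use("Agg")  # draw off-screen so the sketch runs without a display
import matplotlib.pyplot as plt

# Hypothetical values for illustration only (not the chart's real data)
dates = [date(2014, 1, 1), date(2015, 1, 1), date(2015, 1, 8)]
prices = [3.50, 3.00, 2.90]

fig, ax = plt.subplots()
# Passing date objects makes Matplotlib place points by actual elapsed time
ax.plot(dates, prices, marker="o")
ax.set_xlabel("Date")
ax.set_ylabel("Price per gallon (USD)")
fig.autofmt_xdate()

# The gaps the x axis now has to represent, in days
gaps_in_days = [(later - earlier).days
                for earlier, later in zip(dates, dates[1:])]
print(gaps_in_days)
```

With dates on the axis, the last two points sit almost on top of each other, which is an honest picture of a seven-day gap next to a one-year gap.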
Bar charts
We generally turn to bar charts when trying to compare variables across different groups. For example, we can plot the number of countries per continent using a bar chart. Note how the x axis does not represent a quantitative variable. In fact, when using a bar chart, the x axis is generally a categorical variable, while the y axis is quantitative.
Note that, for this code, I am using the World Health Organization’s report on alcohol consumption around the world by country:
drinks = pd.read_csv('data/drinks.csv')
drinks.continent.value_counts().plot(kind='bar', title='Countries per Continent')
plt.xlabel('Continent')
plt.ylabel('Count')
The following graph shows us a count of the number of countries in each continent. We can see the continent code at the bottom of the bars, and the bar height represents the number of countries we have in each continent. For example, we see that Africa has the most countries represented in our survey, while South America has the fewest:
In addition to the count of countries, we can also plot the average beer servings per continent using a bar chart, as shown:
drinks.groupby('continent').beer_servings.mean().plot(kind='bar')
Note how a scatter plot or a line graph would not be able to support this data because they can only handle quantitative variables; bar graphs have the ability to demonstrate categorical values.
We can also use bar charts to graph variables that change over time, like a line graph.
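The bar charts above are drawn from small summary Series: one value per category. The following sketch shows those Series on a few hypothetical rows standing in for data/drinks.csv (the column names match the code above, but the values are made up). It also flags one assumption worth checking in your own file: if North America is encoded as NA, pandas reads that string as a missing value by default, and keep_default_na=False is one way to keep it as text:

```python
import io

import pandas as pd

# Hypothetical toy rows standing in for data/drinks.csv (not the WHO figures)
toy_csv = (
    "country,beer_servings,continent\n"
    "A,25,AF\n"
    "B,89,AF\n"
    "C,245,EU\n"
    "D,295,EU\n"
    "E,249,NA\n"
)

# By default, pandas reads the string "NA" as a missing value
default_read = pd.read_csv(io.StringIO(toy_csv))
missing_continents = int(default_read['continent'].isna().sum())

# keep_default_na=False keeps "NA" as an ordinary string
toy = pd.read_csv(io.StringIO(toy_csv), keep_default_na=False)

# These two Series hold the bar heights: one value per continent
counts = toy['continent'].value_counts()
mean_beer = toy.groupby('continent')['beer_servings'].mean()

print(missing_continents)
print(counts)
print(mean_beer)
```

Calling .plot(kind='bar') on either Series would then draw one bar per continent code.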
Histograms
Histograms show the frequency distribution of a single quantitative variable by splitting up the data, by range, into equal-width bins and plotting the raw count of observations in each bin. A histogram is effectively a bar chart where the x axis is a bin (subrange) of values and the y axis is a count. As an example, I will import a store’s daily number of unique customers, as shown:
rossmann_sales = pd.read_csv('data/rossmann.csv')
rossmann_sales.head()
Note how we have data for multiple stores (identified by the first column, Store). Let’s subset this data for only the first store, as shown:
first_rossmann_sales = rossmann_sales[rossmann_sales['Store']==1]
Now, let's plot a histogram of the first store's customer count:
first_rossmann_sales['Customers'].hist(bins=20)
plt.xlabel('Customer Bins')
plt.ylabel('Count')
The x axis is now categorical in that each category is a selected range of values; for example, 600-620 customers would potentially be a category. The y axis, as in a bar chart, plots the number of observations in each category. In this graph, for example, one might take away the fact that most of the time, the number of customers on any given day will fall between 500 and 700.
Altogether, histograms are used to visualize the distribution of values that a quantitative variable can take on.
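The binning step behind a histogram can be sketched without drawing anything. The example below uses NumPy's histogram function on a handful of hypothetical daily customer counts (not the Rossmann data) to show the equal-width bin edges and the raw count in each bin:

```python
import numpy as np

# Hypothetical daily customer counts for illustration only
customers = [0, 0, 480, 510, 530, 560, 610, 700, 1000]

# Split the range 0-1000 into 5 equal-width bins and count observations
counts, edges = np.histogram(customers, bins=5)

# Each bin includes its left edge; only the last bin includes its right edge
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:.0f}-{right:.0f}: {count}")
```

The counts are what the bar heights of the histogram represent, and the edges mark where one bar ends and the next begins.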
Box plots
Box plots are also used to show a distribution of values. They are created by plotting the five number summary, as follows:
- The minimum value
- The first quartile (the number that separates the 25% lowest values from the rest)
- The median
- The third quartile (the number that separates the 25% highest values from the rest)
- The maximum value
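As a small illustration, the five numbers can be computed directly with pandas. The sketch below reuses the work performance scores from the scatter plot example. Quartile values depend on the interpolation method, and this uses the pandas default (linear):

```python
import pandas as pd

# Work performance scores from the scatter plot example
scores = pd.Series([87, 89, 92, 90, 82, 80, 77, 80, 76, 85, 80, 75, 73, 72])

five_number_summary = {
    'min': scores.min(),
    'q1': scores.quantile(0.25),   # first quartile
    'median': scores.median(),
    'q3': scores.quantile(0.75),   # third quartile
    'max': scores.max(),
}
print(five_number_summary)
```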
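In pandas, when we create box plots, the colored line inside the box (red in the plots shown here; the color may vary with the plotting style and library version) denotes the median, the top of the box (or the right if it is horizontal) is the third quartile, and the bottom (left) part of the box is the first quartile.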
The following is a series of box plots showing the distribution of beer consumption according to continents:
drinks.boxplot(column='beer_servings', by='continent')
Now, we can clearly see the distribution of beer consumption across the continents and how they differ. Africa and Asia have a much lower median of beer consumption than Europe or North America.
Box plots also have the added bonus of being able to show outliers much better than a histogram. This is because the minimum and maximum are parts of the box plot.
Getting back to the customer data, let’s look at the same store customer numbers, but using a box plot:
first_rossmann_sales.boxplot(column='Customers', vert=False)
This is the exact same data as plotted earlier in the histogram; however, now it is shown as a box plot. For the purpose of comparison, I will show you both the graphs one after the other:
Note how the x axes of the two graphs are the same, ranging from 0 to 1,200. The box plot is much quicker at giving us the center of the data (the red line is the median), while the histogram works much better at showing us how spread out the data is and where the biggest bins are. For example, the histogram reveals that there is a very large bin of zero people. This means that for a little over 150 days of data, there were zero customers.
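A reading like this from the histogram can be confirmed with a quick count. The sketch below uses a short hypothetical customers Series in place of first_rossmann_sales['Customers'], so the number it prints is not the real figure:

```python
import pandas as pd

# Hypothetical stand-in for first_rossmann_sales['Customers']
customers = pd.Series([0, 0, 555, 625, 0, 498])

# Comparing to 0 gives a True/False mask; summing it counts the True values
zero_days = int((customers == 0).sum())
print(zero_days)
```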
Note that we can get the exact numbers to construct a box plot using the describe feature in Pandas, as shown:
first_rossmann_sales['Customers'].describe()
min 0.000000
25% 463.000000
50% 529.000000
75% 598.750000
max 1130.000000
There we have it! We just learned data storytelling through various techniques like scatter plots, line graphs, bar charts, histograms and box plots. Now you’ve got the power to be creative in the way you tell tales of your data!
If you found our article useful, you can check out Principles of Data Science for more interesting Data Science tips and techniques.