
Thinking Probabilistically


In this article by Osvaldo Martin, the author of the book Bayesian Analysis with Python, we will learn that Bayesian statistics has been developing for more than 250 years now. During this time, it has enjoyed as much recognition and appreciation as disdain and contempt. In the last few decades, it has gained an increasing amount of attention from people in the field of statistics and almost all the other sciences, engineering, and even outside the walls of the academic world. This revival has been possible due to theoretical and computational developments; modern Bayesian statistics is mostly computational statistics. The necessity for flexible and transparent models and more intuitive interpretation of the results of a statistical analysis has only contributed to the trend.


Here, we will adopt a pragmatic approach to Bayesian statistics and we will not care too much about other statistical paradigms and their relationship with Bayesian statistics. The aim of this book is to learn how to do Bayesian statistics with Python; philosophical discussions are interesting but they have already been discussed elsewhere in a much richer way than we could discuss in these pages. We will use a computational and modeling approach, and we will learn to think in terms of probabilistic models and apply Bayes’ theorem to derive the logical consequences of our models and data. Models will be coded using Python and PyMC3, a great library for Bayesian statistics that hides most of the mathematical details of Bayesian analysis from the user. Bayesian statistics is theoretically grounded in probability theory, and hence it is no wonder that many books about Bayesian statistics are full of mathematical formulas requiring a certain level of mathematical sophistication. Nevertheless, programming allows us to learn and do Bayesian statistics with only modest mathematical knowledge. This is not to say that learning the mathematical foundations of statistics is useless; don’t get me wrong, that could certainly help you build better models and gain an understanding of problems, models, and results.

In this article, we will cover the following topics:

  • Statistical modeling
  • Probabilities and uncertainty

Statistical modeling

Statistics is about collecting, organizing, analyzing, and interpreting data, and hence statistical knowledge is essential for data analysis. Another useful skill when analyzing data is knowing how to write code in a programming language such as Python. Manipulating data is usually necessary given that we live in a messy world with even messier data, and coding helps to get things done. Even if your data is clean and tidy, programming will still be very useful since, as we will see, modern Bayesian statistics is mostly computational statistics.

Most introductory statistical courses, at least for non-statisticians, are taught as a collection of recipes that more or less go like this: go to the statistical pantry, pick one can and open it, add data to taste, and stir until obtaining a consistent p-value, preferably under 0.05 (if you don't know what a p-value is, don't worry; we will not use them in this book). The main goal in this type of course is to teach you how to pick the proper can. We will take a different approach: we will also learn some recipes, but this will be home-made rather than canned food; we will learn how to mix fresh ingredients that will suit different gastronomic occasions. But before we can cook, we must learn some statistical vocabulary and also some concepts.

Exploratory data analysis

Data is an essential ingredient of statistics. Data comes from several sources, such as experiments, computer simulations, surveys, field observations, and so on. If we are the ones who will generate or gather the data, it is always a good idea to first think carefully about the questions we want to answer and which methods we will use, and only then proceed to get the data. In fact, there is a whole branch of statistics dealing with data collection known as experimental design. In the era of data deluge, we can sometimes forget that getting data is not always cheap. For example, while it is true that the Large Hadron Collider (LHC) produces hundreds of terabytes a day, its construction took years of manual and intellectual effort. In this book we will assume that we have already collected the data and that it is clean and tidy, something rarely true in the real world. We will make these assumptions in order to focus on the subject of this book. If you want to learn how to use Python for cleaning and manipulating data, as well as a primer on statistics and machine learning, you should probably read Python Data Science Handbook by Jake VanderPlas.

OK, so let’s assume we have our dataset; usually, a good idea is to explore and visualize it in order to get some idea of what we have in our hands. This can be achieved through what is known as Exploratory Data Analysis (EDA), which basically consists of the following:

  • Descriptive statistics
  • Data visualization

The first one, descriptive statistics, is about how to use some measures (or statistics) to summarize or characterize the data in a quantitative manner. You probably already know that you can describe data using the mean, mode, standard deviation, interquartile ranges, and so forth. The second one, data visualization, is about visually inspecting the data; you are probably familiar with representations such as histograms, scatter plots, and others. While EDA was originally thought of as something you apply to data before doing any complex analysis, or even as an alternative to complex model-based analysis, throughout the book we will learn that EDA is also applicable to understanding, interpreting, checking, summarizing, and communicating the results of Bayesian analysis.
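
As a minimal sketch of both ingredients of EDA, the snippet below computes a few descriptive statistics and plots a histogram; the dataset is synthetic, invented purely so there is something to summarize and visualize:

import matplotlib.pyplot as plt
import numpy as np

np.random.seed(123)
data = np.random.normal(loc=5, scale=2, size=200)  # stand-in for a real dataset

# Descriptive statistics: a few numbers that characterize the data
print("mean: {:.2f}".format(data.mean()))
print("standard deviation: {:.2f}".format(data.std()))
print("quartiles:", np.percentile(data, [25, 50, 75]))

# Data visualization: inspect the same data with a histogram
plt.hist(data, bins=20)
plt.xlabel('value')
plt.ylabel('count')
plt.show()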

Inferential statistics

Sometimes, plotting our data and computing simple numbers, such as the average, is all we need. Other times, we want to go beyond our data to understand the underlying mechanism that could have generated it, or maybe we want to make predictions for future data, or we need to choose among several competing explanations for the same data. That's the job of inferential statistics. To do inferential statistics we will rely on probabilistic models. There are many types of models, and most of science, and I would add most of our understanding of the real world, is done through models. The brain is just a machine that models reality (whatever reality might be): http://www.tedxriodelaplata.org/videos/m%C3%A1quina-construye-realidad.

What are models? Models are simplified descriptions of a given system (or process). Those descriptions are purposely designed to capture only the most relevant aspects of the system, and hence most models do not pretend to be able to explain everything; on the contrary, if we have a simple and a complex model and both explain the data well, we will generally prefer the simpler one.

Model building, no matter which type of model you are building, is an iterative process following more or less the same basic rules. We can summarize the Bayesian modeling process using three steps:

  1. Given some data and some assumptions on how this data could have been generated, we will build models. Most of the time, models will be crude approximations, but that is often all we need.
  2. Then we will use Bayes’ theorem to add data to our models and derive the logical consequences of mixing the data and our assumptions. We say we are conditioning the model on our data.
  3. Lastly, we will check that the model makes sense according to different criteria, including our data and our expertise on the subject we are studying.

In general, we will find ourselves performing these three steps in a non-linear iterative fashion. Sometimes we will retrace our steps at any given point: maybe we made a silly programming mistake, maybe we found a way to change the model and improve it, maybe we need to add more data.
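
To make these three steps concrete, here is a minimal sketch of the classic coin-flipping problem using PyMC3, the library we will use throughout the book. The tosses and the uniform Beta(1, 1) prior are illustrative assumptions made for this sketch, not something fixed by the text above:

import numpy as np
import pymc3 as pm

# Step 1: build a model from assumptions about how the data could have been generated
data = np.array([1, 0, 0, 1, 1, 0, 1, 1, 0, 1])  # made-up coin tosses, 1 = heads

with pm.Model() as coin_model:
    theta = pm.Beta('theta', alpha=1, beta=1)      # prior over the bias of the coin
    y = pm.Bernoulli('y', p=theta, observed=data)  # likelihood, conditioned on the data
    # Step 2: apply Bayes' theorem (here via sampling) to get the posterior
    trace = pm.sample(1000, tune=1000)

# Step 3: check that the result makes sense given our data and domain knowledge
print(pm.summary(trace))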

Bayesian models are also known as probabilistic models because they are built using probabilities. Why probabilities? Because probabilities are the correct mathematical tool for dealing with uncertainty in our data and models, so let’s take a walk through the garden of forking paths.

Probabilities and uncertainty

While probability theory is a mature and well-established branch of mathematics, there is more than one interpretation of what probabilities are. To a Bayesian, a probability is a measure that quantifies the uncertainty level of a statement. If we know nothing about coins and we do not have any data about coin tosses, it is reasonable to think that the probability of a coin landing heads could take any value between 0 and 1; that is, in the absence of information, all values are equally likely and our uncertainty is maximum. If we know instead that coins tend to be balanced, we may say that the probability of a coin landing heads is exactly 0.5, or maybe around 0.5 if we admit that the balance is not perfect. If we collect data, we can update these prior assumptions and hopefully reduce our uncertainty about the bias of the coin.

Under this definition of probability, it is totally valid and natural to ask about the probability of life on Mars, the probability of the mass of the electron being 9.1 x 10^-31 kg, or the probability of the 9th of July of 1816 being a sunny day. Notice, for example, that life on Mars either exists or it does not; the outcome is binary, but what we are really asking is how likely it is to find life on Mars given our data and what we know about biology and the physical conditions on that planet. The statement is about our state of knowledge and not, directly, about a property of nature. We are using probabilities because we cannot be sure about the events, not because the events are necessarily random. Since this definition of probability is about our epistemic state of mind, it is sometimes referred to as the subjective definition of probability, which explains the slogan of subjective statistics often attached to the Bayesian paradigm.

Nevertheless, this definition does not mean that all statements should be treated as equally valid and anything goes; it is about acknowledging that our understanding of the world is imperfect and conditioned on the data and models we have. There is no such thing as a model-free or theory-free understanding of the world; even if it were possible to free ourselves from our social preconditioning, we would end up with a biological limitation: our brain, shaped by the evolutionary process, is wired with models of the world. We are doomed to think like humans and we will never think like bats or anything else! Moreover, the universe is an uncertain place and all we can do is make probabilistic statements about it. Notice that it does not matter whether the underlying reality of the world is deterministic or stochastic; we are using probability as a tool to quantify uncertainty.
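
Going back to the coin, here is a tiny numeric sketch of that updating process; the conjugate Beta prior and the toss counts are illustrative assumptions chosen only to show uncertainty shrinking as data arrives:

import numpy as np
from scipy import stats

# Knowing nothing: a uniform prior over the bias theta (all values equally likely)
prior = stats.beta(1, 1)

# Hypothetical data: 9 heads out of 12 tosses
heads, tosses = 9, 12

# For a Beta prior and coin-toss data the updated (posterior) distribution is again a Beta
posterior = stats.beta(1 + heads, 1 + (tosses - heads))

print("prior sd: {:.3f}".format(prior.std()))          # large: maximum uncertainty
print("posterior sd: {:.3f}".format(posterior.std()))  # smaller: the data reduced our uncertainty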

Logic is about thinking without making mistakes. In Aristotelian or classical logic, we can only have statements that are true or false. In the Bayesian definition of probability, certainty is just a special case: a true statement has a probability of 1 and a false one has a probability of 0. We would assign a probability of 1 to the statement that there is life on Mars only after having conclusive data indicating that something is growing, reproducing, and doing the other activities we associate with living organisms. Notice, however, that assigning a probability of 0 is harder, because we can always think that there is some Martian spot left unexplored, or that we have made mistakes in some experiment, or of several other reasons that could lead us to falsely believe life is absent on Mars when it is not. Interestingly enough, Cox mathematically proved that if we want to extend logic to contemplate uncertainty, we must use probabilities and probability theory, from which Bayes' theorem is just a logical consequence, as we will see soon. Hence, another way of thinking about Bayesian statistics is as an extension of logic for dealing with uncertainty, something that clearly has nothing to do with subjective reasoning in the pejorative sense. Now that we know the Bayesian interpretation of probability, let's see some of the mathematical properties of probabilities. For a more detailed study of probability theory, you can read Introduction to Probability by Joseph K. Blitzstein and Jessica Hwang.

Probabilities are numbers in the interval [0, 1], that is, numbers between 0 and 1, including both extremes. Probabilities follow some rules; one of these rules is the product rule:

p(A, B) = p(A | B) p(B)

We read this as follows: the probability of A and B is equal to the probability of A given B, multiplied by the probability of B. The expression p(A|B) is used to indicate a conditional probability; the name refers to the fact that the probability of A is conditioned on knowing B. For example, the probability that the pavement is wet is different from the probability that the pavement is wet given that it is raining. A conditional probability can be larger than, smaller than, or equal to the unconditioned probability. If knowing B does not provide us with information about A, then p(A|B) = p(A); that is, A and B are independent of each other. On the contrary, if knowing B gives us useful information about A, then p(A|B) ≠ p(A).

Conditional probabilities are a key concept in statistics, and understanding them is crucial to understanding Bayes' theorem, as we will see soon. Let's try to understand them from a different perspective. If we reorder the equation for the product rule, we get the following:

p(A | B) = p(A, B) / p(B)

Hence, p(A|B) is the probability that both A and B happen, relative to the probability of B happening. Why do we divide by p(B)? Knowing B is equivalent to saying that we have restricted the space of possible events to B; thus, to find the conditional probability, we take the favorable cases (those where both A and B occur) and divide them by the total of that restricted space, that is, by p(B).
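
A quick simulation can make the product rule and this "restriction" reading tangible. The rain and wet-pavement probabilities below are invented purely for illustration:

import numpy as np

rng = np.random.default_rng(0)
n = 100_000
raining = rng.random(n) < 0.3                      # hypothetical: it rains 30% of the time
# The pavement is more likely to be wet when it rains (made-up probabilities)
wet = np.where(raining, rng.random(n) < 0.9, rng.random(n) < 0.1)

p_rain = raining.mean()
p_wet_and_rain = (wet & raining).mean()
p_wet_given_rain = p_wet_and_rain / p_rain         # p(A | B) = p(A, B) / p(B)

print("p(wet) ~ {:.3f}".format(wet.mean()))
print("p(wet | rain) ~ {:.3f}".format(p_wet_given_rain))
# Product rule check: p(A, B) should match p(A | B) * p(B)
print("{:.3f} ~ {:.3f}".format(p_wet_and_rain, p_wet_given_rain * p_rain))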

It is important to realize that all probabilities are indeed conditional; there is no such thing as an absolute probability floating in a vacuum. There is always some model, assumption, or condition in play, even if we don't notice or know it. The probability of rain is not the same if we are talking about Earth, Mars, or some other place in the Universe, in the same way that the probability of a coin landing heads or tails depends on our assumptions about the coin being biased in one way or another. Now that we are more familiar with the concept of probability, let's jump to the next topic, probability distributions.

Probability distributions

A probability distribution is a mathematical object that describes how likely different events are. In general, these events are restricted somehow to a set of possible events. A common and useful conceptualization in statistics is to think that data was generated from some probability distribution with unobserved parameters. Since the parameters are unobserved and we only have data, we will use Bayes’ theorem to invert the relationship, that is, to go from the data to the parameters. Probability distributions are the building blocks of Bayesian models; by combining them in proper ways we can get useful complex models.

We will meet several probability distributions throughout the book; every time we discover one we will take a moment to try to understand it. Probably the most famous of all of them is the Gaussian or normal distribution. A variable x follows a Gaussian distribution if its values are dictated by the following formula:

p(x | μ, σ) = 1 / (σ √(2π)) exp(−(x − μ)² / (2σ²))

In the formula, μ and σ are the parameters of the distribution. The first one can take any real value, that is, μ ∈ ℝ, and dictates the mean of the distribution (and also the median and mode, which are all equal). The second one is the standard deviation, which can only be positive and dictates the spread of the distribution. Since there are an infinite number of possible combinations of μ and σ values, there is an infinite number of instances of the Gaussian distribution and all of them belong to the same Gaussian family. Mathematical formulas are concise and unambiguous, and some people say even beautiful, but we must admit that meeting them can be intimidating; a good way to break the ice is to use Python to explore them. Let's see what the Gaussian distribution family looks like:

import matplotlib.pyplot as plt
import numpy as np
from scipy import stats
import seaborn as sns  # used for its plotting style (applied on import in older seaborn versions)

# Means and standard deviations to explore
mu_params = [-1, 0, 1]
sd_params = [0.5, 1, 1.5]
x = np.linspace(-7, 7, 100)

# One subplot per (mu, sd) combination, sharing both axes
f, ax = plt.subplots(len(mu_params), len(sd_params), sharex=True, sharey=True)
for i in range(3):
    for j in range(3):
        mu = mu_params[i]
        sd = sd_params[j]
        y = stats.norm(mu, sd).pdf(x)  # evaluate the Gaussian pdf on the grid
        ax[i, j].plot(x, y)
        ax[i, j].set_ylim(0, 1)
        # Invisible point whose only purpose is to show the parameter values in the legend
        ax[i, j].plot(0, 0, alpha=0,
                      label="$\\mu$ = {:3.2f}\n$\\sigma$ = {:3.2f}".format(mu, sd))
        ax[i, j].legend()
ax[2, 1].set_xlabel('$x$')
ax[1, 0].set_ylabel('$pdf(x)$')
plt.show()

The output of the preceding code is a 3x3 grid of Gaussian curves: within each row the mean μ is fixed while the standard deviation σ increases from left to right, widening the curve, and within each column σ is fixed while μ changes from row to row, shifting the curve along the x axis.

A variable, such as x, that comes from a probability distribution is called a random variable. It is not that the variable can take any possible value; on the contrary, the values are strictly dictated by the probability distribution, and the randomness arises from the fact that we cannot predict which value the variable will take, only the probability of observing those values. A common notation used to say that a variable is distributed as a Gaussian or normal distribution with parameters μ and σ is as follows:

x ~ N(μ, σ)

The symbol ~ is read as "is distributed as".

There are two types of random variables: continuous and discrete. Continuous variables can take any value from some interval (we can use Python floats to represent them), while discrete variables can take only certain values (we can use Python integers to represent them).
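
As a brief sketch of the difference, we can draw values from a continuous and a discrete distribution with SciPy; the particular distributions and parameters here are just examples:

from scipy import stats

# Continuous random variable: a Gaussian; samples are floats
continuous_samples = stats.norm(0, 1).rvs(size=5, random_state=42)
print(continuous_samples)

# Discrete random variable: a Poisson; samples are integers
discrete_samples = stats.poisson(3).rvs(size=5, random_state=42)
print(discrete_samples)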

Many models assume that successive values of a random variable are all sampled from the same distribution and that those values are independent of each other. In such a case, we say that the variables are independently and identically distributed, or iid variables for short. Using mathematical notation, two variables are independent if, for every value of x and y:

p(x, y) = p(x) p(y)

A common example of non-iid variables is time series, where the temporal dependency of the random variable is a key feature that should be taken into account.
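
To see the difference, we can compare iid draws with a simple random walk, a toy time series; both series below are synthetic and used only to illustrate the contrast:

import numpy as np

rng = np.random.default_rng(1)

# iid: each value is drawn from the same Gaussian, independently of the others
iid = rng.normal(0, 1, size=1000)

# Not iid: each value depends on the previous one (a random walk)
walk = np.cumsum(rng.normal(0, 1, size=1000))

# Correlation between consecutive values: near 0 for iid draws, near 1 for the walk
print("iid lag-1 correlation: {:.2f}".format(np.corrcoef(iid[:-1], iid[1:])[0, 1]))
print("walk lag-1 correlation: {:.2f}".format(np.corrcoef(walk[:-1], walk[1:])[0, 1]))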

Summary

In this article, we took a practical approach to Bayesian statistics and saw how to get started with Bayesian analysis in Python. We learned to think about problems in terms of probabilities and uncertainty, and to apply Bayes' theorem to derive the logical consequences of our models and data.
