It’s time for us to put descriptive statistics down for the time being. It was fun for a while, but we’re no longer content just determining the properties of observed data; now we want to start making deductions about data we haven’t observed. This leads us to the realm of inferential statistics.

In data analysis, probability is used to quantify uncertainty of our deductions about unobserved data. In the land of inferential statistics, probability reigns queen. Many regard her as a harsh mistress, but that’s just a rumor.


# Basic probability

Probability measures the likeliness that a particular event will occur. When mathematicians (us, for now!) speak of an event, we are referring to a set of potential outcomes of an experiment, or trial, to which we can assign a probability of occurrence.

Probabilities are expressed as a number between 0 and 1 (or as a percentage out of 100). An event with a probability of 0 denotes an impossible outcome, and a probability of 1 describes an event that is certain to occur.

The canonical example of probability at work is a coin flip. In the coin flip event, there are two outcomes: the coin lands on heads, or the coin lands on tails. Pretending that coins never land on their edge (they almost never do), those two outcomes are the only ones possible. The sample space (the set of all possible outcomes), therefore, is {heads, tails}. Since the entire sample space is covered by these two outcomes, they are said to be collectively exhaustive.

The probabilities of collectively exhaustive events that cannot occur together always sum to 1. In this example, the probability that the coin flip will yield heads or yield tails is 1; it is certain that the coin will land on one of those. With a fair, correctly balanced coin, each of those two outcomes is equally likely. Therefore, we split the probability equally among the outcomes: in the event of a coin flip, the probability of obtaining heads is 0.5, and the probability of tails is 0.5 as well. This is usually denoted as follows:

$$P(\text{heads}) = 0.5$$

The probability of a coin flip yielding either heads or tails looks like this:

$$P(\text{heads} \cup \text{tails}) = 1$$

And the probability of a coin flip yielding both heads and tails is denoted as follows:

$$P(\text{heads} \cap \text{tails}) = 0$$

The two outcomes, in addition to being collectively exhaustive, are also mutually exclusive. This means that they can never co-occur. This is why the probability of heads and tails is 0; it just can’t happen.

The next obligatory application of beginner probability theory is in the case of rolling a standard six-sided die. In the event of a die roll, the sample space is *{1, 2, 3, 4, 5, 6}*. With every roll of the die, we are sampling from this space. In this event, too, each outcome is equally likely, except now we have to divide the probability across six outcomes. In the following equation, we denote the probability of rolling a 1 as P(1):

$$P(1) = \frac{1}{6}$$

Rolling a 1 and rolling a 2 are not collectively exhaustive (we can still roll a 3, 4, 5, or 6), but they are mutually exclusive; we can’t roll a 1 and a 2. If we want to calculate the probability of either one of two mutually exclusive events occurring, we add the probabilities:

$$P(1 \cup 2) = P(1) + P(2) = \frac{1}{6} + \frac{1}{6} = \frac{1}{3}$$

While rolling a 1 and rolling a 2 aren’t collectively exhaustive, rolling a 1 and not rolling a 1 are. This is usually denoted in this manner:

$$P(1 \cup \neg 1) = 1$$

These two events—and all events that are both collectively exhaustive and mutually exclusive—are called complementary events.
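The addition and complement rules for the die can be checked numerically. The following is a minimal sketch in R, assuming a fair die; the variable names are illustrative only.

```r
# Sample space of a fair six-sided die; each outcome has probability 1/6.
die <- 1:6
p <- rep(1 / length(die), length(die))
names(p) <- die  # names become "1" ... "6"

# Addition rule for mutually exclusive events: P(1 or 2) = P(1) + P(2)
p_1_or_2 <- unname(p["1"] + p["2"])

# Complement rule: P(not 1) = 1 - P(1)
p_not_1 <- unname(1 - p["1"])

print(p_1_or_2)  # 1/3
print(p_not_1)   # 5/6
print(sum(p))    # the whole sample space has probability 1
```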

Our last pedagogical example in basic probability theory uses a deck of cards. Our deck has 52 cards: 4 for each number from 2 to 10 and 4 each of Jack, Queen, King, and Ace (no Jokers!). Each of these 4 cards belongs to a different suit, either a Heart, Club, Spade, or Diamond. There are, therefore, 13 cards in each suit. Further, every Heart and Diamond card is colored red, and every Spade and Club card is black. From this, we can deduce the following probabilities for the outcome of randomly choosing a card:

$$P(\text{Ace}) = \frac{4}{52}, \quad P(\text{Black}) = \frac{26}{52}, \quad P(\text{Red}) = \frac{26}{52}, \quad P(\text{Heart}) = \frac{13}{52}$$

What, then, is the probability of getting a black card and an Ace? Well, these events are *independent*, meaning that knowing that one has occurred does not change the probability of the other. In cases like these, the probability of event A and event B is the product of the probability of A and the probability of B. Therefore:

$$P(\text{Black} \cap \text{Ace}) = P(\text{Black}) \, P(\text{Ace}) = \frac{26}{52} \times \frac{4}{52} = \frac{2}{52}$$

Intuitively, this makes sense, because there are two black Aces out of a possible 52.
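To make the counting argument concrete, here is a small sketch in R that builds the deck described above and compares the joint probability with the product of the two marginal probabilities. The data frame layout and column names are assumptions made for illustration.

```r
# Build a standard 52-card deck: 13 ranks x 4 suits.
ranks <- c(2:10, "Jack", "Queen", "King", "Ace")
suits <- c("Heart", "Diamond", "Spade", "Club")
deck <- expand.grid(rank = ranks, suit = suits, stringsAsFactors = FALSE)
deck$color <- ifelse(deck$suit %in% c("Heart", "Diamond"), "red", "black")

# Every card is equally likely, so P(event) is the fraction of cards in it.
p_black <- mean(deck$color == "black")  # 26/52
p_ace <- mean(deck$rank == "Ace")       # 4/52
p_black_and_ace <- mean(deck$color == "black" & deck$rank == "Ace")  # 2/52

# For independent events, the joint probability equals the product.
print(p_black_and_ace)
print(p_black * p_ace)
```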

What about the probability that we choose a red card and a Heart? These two outcomes are not independent, because knowing that the card is red has a bearing on the likelihood that the card is also a Heart. In cases like these, the probability of event A and B is denoted as follows:

$$P(A \cap B) = P(A) \, P(B \mid A) \quad \text{or} \quad P(A \cap B) = P(B) \, P(A \mid B)$$

Where *P(A|B)* means *the probability of A given B*. For example, if we represent A as drawing a Heart and B as drawing a red card, *P(A | B)* means *what’s the probability of drawing a heart if we know that the card we drew was red?*. Since a red card is equally likely to be a Heart or a Diamond, *P(A|B)* is 0.5. Therefore:

$$P(\text{Heart} \cap \text{Red}) = P(\text{Red}) \, P(\text{Heart} \mid \text{Red}) = \frac{1}{2} \times \frac{1}{2} = \frac{1}{4}$$

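In the preceding equation, we used the form *P(B) P(A|B)*. Had we used the form *P(A) P(B|A)*, we would have got the same answer:

$$P(\text{Heart} \cap \text{Red}) = P(\text{Heart}) \, P(\text{Red} \mid \text{Heart}) = \frac{1}{4} \times 1 = \frac{1}{4}$$

So, these two forms are equivalent:

$$P(B) \, P(A \mid B) = P(A) \, P(B \mid A)$$

For kicks, let’s divide both sides of the equation by P(B). That yields the following equivalence:

$$P(A \mid B) = \frac{P(A) \, P(B \mid A)}{P(B)}$$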

This equation is known as *Bayes’ Theorem*. It is very easy to derive, but its meaning and influence are profound. In fact, it is one of the most famous equations in all of mathematics.
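The Heart and red-card example gives a quick way to check this identity by counting. The sketch below, in R, represents the deck only by its suits (an illustrative simplification) and computes each conditional probability by restricting attention to the conditioning event.

```r
# One entry per card; only the suit matters for this example.
suits <- rep(c("Heart", "Diamond", "Spade", "Club"), each = 13)
is_heart <- suits == "Heart"
is_red <- suits %in% c("Heart", "Diamond")

p_heart <- mean(is_heart)  # 13/52
p_red <- mean(is_red)      # 26/52

# Conditional probabilities: look only at the cards where the condition holds.
p_heart_given_red <- mean(is_heart[is_red])  # 0.5
p_red_given_heart <- mean(is_red[is_heart])  # 1

# Both factorizations of the joint probability agree.
joint_via_red <- p_red * p_heart_given_red
joint_via_heart <- p_heart * p_red_given_heart

# Bayes' Theorem recovers P(Heart | Red) from P(Red | Heart).
bayes <- p_red_given_heart * p_heart / p_red
print(c(joint_via_red, joint_via_heart, bayes))
```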

Bayes’ Theorem has been applied to and proven useful in an enormous number of different disciplines and contexts. It was used to help crack the German Enigma code during World War II, saving the lives of millions. It was also used recently, and famously, by Nate Silver to help correctly predict the voting patterns of 49 states in the 2008 US presidential election.

At its core, Bayes’ Theorem tells us how to update the probability of a hypothesis in light of new evidence. Because of this, the following formulation of Bayes’ Theorem is often more intuitive:

$$P(H \mid E) = \frac{P(E \mid H) \, P(H)}{P(E)}$$

where *H* is the hypothesis and *E* is the evidence.

Let’s see an example of Bayes’ Theorem in action!

There’s a hot new recreational drug on the scene called *Allighate* (or *Ally* for short). It’s named as such because it makes its users go wild and act like an alligator. Since the effect of the drug is so deleterious, very few people actually take the drug. In fact, only about 1 in every thousand people (0.1%) take it.

Frightened by fear-mongering late-night news, Daisy Girl, Inc., a technology consulting firm, ordered an Allighate testing kit for all of its 200 employees so that it could offer treatment to any employee who has been using it. Not sparing any expense, they bought the best kit on the market; it had 99% sensitivity and 99% specificity. This means that it correctly identified drug users 99 out of 100 times, and only falsely identified a non-user as a user once in every 100 times.

When the results finally came back, two employees tested positive. Though the two denied using the drug, their supervisor, Ronald, was ready to send them off to get help. Just as Ronald was about to send them off, Shanice, a clever employee from the statistics department, came to their defense.

Ronald incorrectly assumed that each of the employees who tested positive was using the drug with 99% certainty and, therefore, that the chance that both were using it was 98%. Shanice explained that it was actually far more likely that neither employee was using Allighate.

How so? Let’s find out by applying Bayes’ theorem!

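Let’s focus on just one employee right now; let H be the hypothesis that the employee is using Ally, and let E represent the evidence that the employee tested positive.

$$P(\text{Ally User} \mid \text{Positive Test}) = \frac{P(\text{Positive Test} \mid \text{Ally User}) \, P(\text{Ally User})}{P(\text{Positive Test})}$$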

We want to solve the left side of the equation, so let’s plug in values. The first part of the right side of the equation, *P(Positive Test | Ally User)*, is called the likelihood. The probability of testing positive if you use the drug is 99%; this is what tripped up Ronald, and it trips up most other people when they first hear of the problem. The second part, *P(Ally User)*, is called the prior. This is our belief that any one person has used the drug before we receive any evidence. Since we know that only .1% of people use Ally, this would be a reasonable choice for a prior. The denominator of the equation is a normalizing constant, which ensures that the posterior probabilities of all possible hypotheses sum to one. Finally, the value we are trying to solve, *P(Ally User | Positive Test)*, is the posterior. It is the probability of our hypothesis updated to reflect new evidence.

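In many practical settings, computing the normalizing factor is very difficult. In this case, because there are only two possible hypotheses, being a user or not, the probability of finding the evidence of a positive test is given as follows:

$$P(\text{Positive Test}) = P(\text{Positive Test} \mid \text{Ally User}) \, P(\text{Ally User}) + P(\text{Positive Test} \mid \neg\text{Ally User}) \, P(\neg\text{Ally User})$$

Which is: (.99 * .001) + (.01 * .999) = 0.01098

Plugging that into the denominator, our final answer is calculated as follows:

$$P(\text{Ally User} \mid \text{Positive Test}) = \frac{.99 \times .001}{0.01098} \approx 0.09$$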

Note that the new evidence, which favored the hypothesis that the employee was using Ally, shifted our belief from a prior of *.001* to a posterior of about *.09*. Even so, our prior belief about whether an employee was using Ally was so extraordinarily low that it would take very strong evidence indeed to convince us that an employee was an Ally user.

Ignoring the prior probability in cases like these is known as *base-rate fallacy*. Shanice assuaged Ronald’s embarrassment by assuring him that it was a very common mistake.
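The single-employee calculation can be reproduced in a few lines of R. This is only a sketch: `posterior_given_positive` is a hypothetical helper name introduced here, not a built-in function, and it assumes exactly two hypotheses (user or non-user).

```r
# Posterior probability of being a user given a positive test, via Bayes' Theorem.
# `posterior_given_positive` is a hypothetical helper, not part of base R.
posterior_given_positive <- function(prior, sensitivity, specificity) {
  likelihood <- sensitivity               # P(positive | user)
  false_positive_rate <- 1 - specificity  # P(positive | non-user)
  # Normalizing constant: total probability of a positive test.
  evidence <- likelihood * prior + false_positive_rate * (1 - prior)
  likelihood * prior / evidence
}

p_user <- posterior_given_positive(prior = 0.001,
                                   sensitivity = 0.99,
                                   specificity = 0.99)
print(round(p_user, 4))  # 0.0902
```

Rerunning the helper with a larger prior shows how strongly the base rate drives the answer; with a prior of 0.5, for instance, the same test yields a posterior of 0.99.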

Now to extend this to two employees: before any testing, the probability of any two employees both using the drug is .001 squared, or one in a million (assuming their drug use is independent). Squaring our new posterior, we get about *.0081*. The probability that both employees use Ally, even given their positive results, is less than 1%. So, they are exonerated.

Sally is a different story, though. Her friends noticed her behavior had dramatically changed as of late—she snaps at co-workers and has taken to eating pencils. Her concerned cubicle-mate even followed her after work and saw her crawl into a sewer, not to emerge until the next day to go back to work.

Even though Sally passed the drug test, we know that it’s likely (almost certain) that she uses Ally. Bayes’ theorem gives us a way to quantify that probability!

Our prior is the same, but now our likelihood is pretty much as close to 1 as you can get – after all, how many non-Ally users do you think eat pencils and live in sewers?

# A tale of two interpretations

Though it may seem strange to hear, there is actually a hot philosophical debate about what probability really is. Though there are others, the two primary camps into which virtually all mathematicians fall are the frequentist camp and the Bayesian camp.

The frequentist interpretation describes probability as the relative frequency of an outcome when you repeat an experiment many times. Flipping a coin is a perfect example; the proportion of heads converges to 50% as the number of flips goes to infinity.

The frequentist interpretation of probability is inherently objective; there is a true probability out there in the world, which we are trying to estimate.
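The long-run behavior that the frequentist view relies on can be watched in a small simulation. The sketch below, in R, flips a simulated fair coin 10,000 times and tracks the running proportion of heads; the seed and the number of flips are arbitrary choices made for reproducibility.

```r
set.seed(1)  # arbitrary seed, only so the run is reproducible
n_flips <- 10000
flips <- sample(c("heads", "tails"), size = n_flips, replace = TRUE)

# Running proportion of heads after each flip.
running_prop <- cumsum(flips == "heads") / seq_len(n_flips)

# Early proportions bounce around; later ones settle near 0.5.
print(running_prop[c(10, 100, 1000, 10000)])
```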

The Bayesian interpretation, however, views probability as our degree of belief about something. Because of this, the Bayesian interpretation is subjective; when evidence is scarce, there are sometimes wildly different degrees of belief among different people.

Described in this manner, *Bayesianism* may scare many people off, but it is actually quite intuitive. For example, when a meteorologist describes the probability of rain as 70%, people rarely bat an eyelash. But this number only really makes sense within a Bayesian framework because exact meteorological conditions are not repeatable, as is required by frequentist probability.

Not simply a heady academic exercise, these two interpretations lead to different methodologies in solving problems in data analysis. Many times, both approaches lead to similar results.

Though practitioners may strongly align themselves with one side over another, good statisticians know that there’s a time and a place for both approaches.

Though Bayesianism as a valid way of looking at probability is debated, Bayes’ Theorem is a fact about probability and is undisputed and non-controversial.

# Sampling from distributions

Observing the outcome of trials that involve a random variable, a variable whose value changes due to chance, can be thought of as sampling from a probability distribution—one that describes the likelihood of each member of the sample space occurring.

That sentence probably sounds much scarier than it needs to be. Take a die roll for example.

Figure 4.1: Probability distribution of outcomes of a die roll

Each roll of a die is like sampling from a discrete probability distribution for which each outcome in the sample space has a probability of 1/6, or about 0.167. This is an example of a uniform distribution, because all the outcomes are equally likely to occur. Further, there is a finite number of outcomes, so this is a discrete uniform distribution (there also exist continuous uniform distributions).
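Sampling from this distribution is easy to try out. The following sketch uses R's `sample` function to simulate many rolls of a fair die and compares the observed proportions with 1/6; the seed and sample size are arbitrary.

```r
set.seed(42)  # arbitrary seed for reproducibility
rolls <- sample(1:6, size = 60000, replace = TRUE)

# Observed relative frequency of each face; each should be close to 1/6.
observed <- table(rolls) / length(rolls)
print(round(observed, 3))
```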

Flipping a fair coin is like sampling from a uniform distribution with only two outcomes. More specifically, the probability distribution that describes coin-flip events is called a *Bernoulli distribution*; it is a distribution with only two possible outcomes.

## Parameters

We use probability distributions to describe the behavior of random variables because they make it easy to compute with and give us a lot of information about how a variable behaves. But before we perform computations with probability distributions, we have to specify the parameters of those distributions. These parameters will determine exactly what the distribution looks like and how it will behave.

For example, the behavior of both a 6-sided die and a 12-sided die is modeled with a uniform distribution, yet the behavior of each is a little different. To further specify the behavior of each distribution, we detail its parameter; in the case of the (discrete) uniform distribution, the parameter is called *n*. A uniform distribution with parameter *n* has *n* equally likely outcomes, each of probability *1 / n*. The *n* for a 6-sided die and a 12-sided die is 6 and 12, respectively.

For a Bernoulli distribution, which describes the probability distribution of an event with only two outcomes, the parameter is *p*. Outcome 1 occurs with probability *p*, and the other outcome occurs with probability *1 – p*, because they are collectively exhaustive. The flip of a fair coin is modeled as a Bernoulli distribution with *p = 0.5*.

Imagine a six-sided die with one side labeled 1 and the other five sides labeled 2. The outcome of the die roll trials can be described with a Bernoulli distribution, too! This time, *p = 1/6* (about 0.167). Therefore, the probability of not rolling a 1 is 5/6.
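One way to simulate this oddly labeled die in R is with `rbinom`, since a binomial draw with `size = 1` is a single Bernoulli trial. In this sketch, 1 stands for rolling the side labeled 1 and 0 for rolling a side labeled 2; the seed and number of trials are arbitrary.

```r
set.seed(7)  # arbitrary seed for reproducibility
p <- 1 / 6

# size = 1 makes each draw a single Bernoulli trial (1 = success, 0 = failure).
trials <- rbinom(n = 60000, size = 1, prob = p)

print(mean(trials))      # observed proportion of 1s, close to 1/6
print(1 - mean(trials))  # observed proportion of 2s, close to 5/6
```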

## The binomial distribution

The binomial distribution is a fun one. Like our uniform distribution described in the previous section, it is discrete.

When an event has two possible outcomes, success or failure, this distribution describes the number of successes in a certain number of trials. Its parameters are *n*, the number of trials, and *p*, the probability of success.

Concretely, a binomial distribution with *n=1* and *p=0.5* describes the behavior of a single coin flip, if we choose to view heads as successes (we could also choose to view tails as successes). A binomial distribution with *n=30* and *p=0.5* describes the number of *heads* we should expect in 30 flips of a fair coin.

Figure 4.2: A binomial distribution (n=30, p=0.5)

On average, of course, we would expect to have 15 heads. However, *randomness* is the name of the game, and seeing more or fewer heads is totally expected.
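The expected value and the spread around it can be read straight off the distribution. The sketch below uses R's `dbinom` to compute the probability of every possible number of heads in 30 fair flips; the window of 12 to 18 heads is an arbitrary choice for illustration.

```r
n <- 30
p <- 0.5
k <- 0:n  # every possible number of heads
pmf <- dbinom(k, size = n, prob = p)

total <- sum(pmf)               # probabilities over the sample space sum to 1
expected_heads <- sum(k * pmf)  # expected number of heads: n * p = 15
most_likely <- k[which.max(pmf)]

# Probability of seeing between 12 and 18 heads, inclusive.
p_near_mean <- sum(pmf[k >= 12 & k <= 18])

print(c(total, expected_heads, most_likely))
print(p_near_mean)
```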

*How can we use the binomial distribution in practice?,* you ask. Well, let’s look at an application.

Larry the Untrustworthy Knave—who can only be trusted some of the time—gives us a coin that he alleges is fair. We flip it 30 times and observe 10 heads.

It turns out that the probability of getting exactly 10 heads on 30 flips is about *2.8%*. We can use R to tell us the probability of getting 10 *or fewer* heads using the *pbinom* function:

```
> pbinom(10, size=30, prob=.5)
[1] 0.04936857
```

It appears as if the probability of this occurring, in a correctly balanced coin, is roughly 5%. Do you think we should take Larry at his word?
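Both figures quoted for Larry's coin can be reproduced with `dbinom`. The sketch below computes the probability of exactly 10 heads, cross-checks it against the closed-form binomial expression `choose(n, k) * p^k * (1 - p)^(n - k)`, and confirms that summing the probabilities for 0 through 10 heads matches `pbinom`.

```r
# Probability of exactly 10 heads in 30 fair flips.
p_exactly_10 <- dbinom(10, size = 30, prob = 0.5)

# The same value from the closed-form binomial probability.
p_by_formula <- choose(30, 10) * 0.5^10 * (1 - 0.5)^20

# pbinom is cumulative: P(X <= 10) is the sum of P(X = 0), ..., P(X = 10).
p_at_most_10 <- sum(dbinom(0:10, size = 30, prob = 0.5))

print(round(p_exactly_10, 3))  # 0.028
print(all.equal(p_at_most_10, pbinom(10, size = 30, prob = 0.5)))  # TRUE
```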

*If you’re interested:* The way we determined the probability of getting exactly 10 heads is by using the probability formula for Bernoulli trials. The probability of getting *k* successes in *n* trials is equal to:

$$P(X = k) = \binom{n}{k} p^k (1 - p)^{n - k}$$

where *p* is the probability of getting one success and:

$$\binom{n}{k} = \frac{n!}{k! \, (n - k)!}$$

## The normal distribution

Remember when we described the normal distribution and how ubiquitous it is? The behavior of many random variables in real life is very well described by a normal distribution with certain parameters.

The two parameters that uniquely specify a normal distribution are µ (*mu*) and σ (*sigma*). µ, the mean, describes where the distribution’s peak is located and σ, the standard deviation, describes how wide or narrow the distribution is.

Figure 4.3: Normal distributions with different parameters

The heights of American females are approximately normally distributed with parameters µ = 65 inches and σ = 3.5 inches.

Figure 4.4: Normal distributions with different parameters

With this information, we can easily answer questions about how probable it is to choose, at random, US women of certain heights.

We can’t really answer the question *What is the probability that we choose a person who is exactly 60 inches?*, because virtually no one is exactly 60 inches. Instead, we answer questions about how probable it is that a random person is within a certain range of heights.

What is the probability that a randomly chosen woman is 70 inches or taller? If you recall, the probability of a height within a range is the area under the curve, or the integral over that range. In this case, the range we will integrate looks like this:

Figure 4.5: Area under the curve of the height distribution from 70 inches to positive infinity

```
> f <- function(x){ dnorm(x, mean=65, sd=3.5) }
> integrate(f, 70, Inf)
0.07656373 with absolute error < 2.2e-06
```

The preceding R code indicates that there is a 7.66% chance of randomly choosing a woman who is 70 inches or taller.

Luckily for us, the normal distribution is so popular and well studied that there is a function built into R, so we don’t need to do the integration ourselves.

```
> pnorm(70, mean=65, sd=3.5)
[1] 0.9234363
```

The *pnorm* function tells us the probability of choosing a woman who is shorter than 70 inches. If we want to find P(> 70 inches), we can either subtract this value from 1 (which gives us the complement) or use the optional argument *lower.tail=FALSE*. If you do this, you’ll see that the result matches the 7.66% chance we arrived at earlier.
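Both routes to the upper-tail probability can be written out as follows. This sketch also computes the probability of a height between 60 and 70 inches as a difference of two `pnorm` values; that range is an extra illustration chosen here, not a figure from the text.

```r
mu <- 65
sigma <- 3.5

# P(height > 70) as the complement of the lower tail...
p_tall_complement <- 1 - pnorm(70, mean = mu, sd = sigma)

# ...and directly, by asking pnorm for the upper tail.
p_tall_upper <- pnorm(70, mean = mu, sd = sigma, lower.tail = FALSE)

# P(60 < height < 70): the area under the curve between the two bounds.
p_between <- pnorm(70, mean = mu, sd = sigma) - pnorm(60, mean = mu, sd = sigma)

print(round(p_tall_complement, 4))  # 0.0766
print(round(p_tall_upper, 4))       # 0.0766
print(round(p_between, 4))          # 0.8469
```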
