How data scientists test hypotheses and probability

April 23, 2018 - 11:26 am

Why hypotheses are important in statistical analysis

Hypothesis testing lets researchers and statisticians state hypotheses and then assess how likely their observed findings would be if chance alone were at work.

This statistics tutorial has been taken from Basic Statistics and Data Mining for Data Science.

Whenever you wish to make an inference about a population from a sample, you must test a specific hypothesis. It’s common practice to state two different hypotheses:

The null hypothesis, which states that there is no effect
The alternative (or research) hypothesis, which states that there is an effect

So, the null hypothesis is the one that says there is no difference. For example, you might be comparing the mean income of males and females; the null hypothesis you are testing is that there is no difference between the two group means.

The alternative hypothesis, meanwhile, is generally, although not exclusively, the one that researchers are really interested in. In this example, you might hypothesize that the mean income of males differs from the mean income of females.

Why probability is important in statistical analysis

In statistics, nothing is ever certain, because we are almost always dealing with samples rather than whole populations. This is why we have to work in probabilities. Hypotheses are assessed by calculating the probability, or likelihood, of finding our result. A probability value ranges from zero to one (0% to 100% when expressed as a percentage) and is essentially a way of measuring how likely a particular event is to occur. You can use these values to assess whether the differences you have found are likely to be the result of random chance.

How do hypotheses and probability interact?

It starts getting really interesting once we begin looking at how hypotheses and probability interact. Here’s an example. Suppose you want to know who is going to win the Super Bowl. You ask a fellow statistician, and he tells you that he has built a predictive model and knows which team is going to win. Fine, but your next question is how confident he is in that prediction. He says he’s 50% confident. Are you going to trust his prediction? Of course not: there are only two possible outcomes, so 50% is no better than random chance.

So, say you ask another statistician. He also tells you that he has built a predictive model, and he’s 75% confident in the prediction he has made. You’re more likely to trust this prediction, since it has a 75% chance of being right and a 25% chance of being wrong.

But let’s say you’re feeling cautious, and a 25% chance of being wrong is too high. So, you ask a third statistician for her prediction. She tells you that she has also built a predictive model, and that she is 90% confident it is correct.

The same question, how much uncertainty we are willing to accept, arises in hypothesis testing. Having formally stated our hypotheses, we then have to select a criterion for rejecting or failing to reject the null hypothesis.

With statistical tests such as the chi-squared test, the t-test, regression, or correlation, you’re testing the likelihood that a statistic of the magnitude you obtained, or greater, would have occurred by chance, assuming that the null hypothesis is true.
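To make the idea of "a statistic of this magnitude or greater, occurring by chance" concrete, here is a minimal sketch of a permutation test in Python. It is not one of the tests named above, but it follows the same logic: if the null hypothesis is true, the group labels are interchangeable, so we shuffle them many times and count how often the shuffled difference in means is at least as large as the observed one. The function name `permutation_p_value` and the income figures are made up for illustration.

```python
import random


def permutation_p_value(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test for a difference in group means."""
    rng = random.Random(seed)
    n_a = len(group_a)
    # The statistic we actually observed: absolute difference in means.
    observed = abs(sum(group_a) / n_a - sum(group_b) / len(group_b))

    pooled = list(group_a) + list(group_b)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        # Under the null hypothesis the labels carry no information,
        # so any reshuffling of the pooled values is equally plausible.
        rng.shuffle(pooled)
        mean_a = sum(pooled[:n_a]) / n_a
        mean_b = sum(pooled[n_a:]) / (len(pooled) - n_a)
        if abs(mean_a - mean_b) >= observed:
            at_least_as_extreme += 1
    return at_least_as_extreme / n_permutations


# Hypothetical incomes (in thousands) for two groups; invented data.
incomes_a = [52, 48, 61, 55, 58, 49, 63, 57]
incomes_b = [45, 50, 47, 53, 44, 51, 46, 49]

p_value = permutation_p_value(incomes_a, incomes_b)
print(f"Estimated probability under the null hypothesis: {p_value:.4f}")
```

The returned value is an estimate of the probability described above: the share of label shufflings that produce a difference at least as large as the one in the data.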

It’s important to remember that you always assess the probability of your results under the assumption that the null hypothesis is true; you do not calculate the probability that the null hypothesis itself is true. You only reject the null hypothesis if you can say that the results would have been extremely unlikely under the conditions set by the null hypothesis. In that case, by rejecting the null hypothesis, you have found support for the alternative (or research) hypothesis. This doesn’t prove the alternative hypothesis, but it does tell you that the data are hard to reconcile with the null hypothesis.

The criterion we typically use is whether the probability value (the p-value) sits below a significance level of 0.05 (5%). A p-value below 0.05 indicates that, if the null hypothesis were true, a statistic at least as large as the one we obtained would occur on fewer than 5% of occasions. By choosing a 5% criterion, you are accepting that, when the null hypothesis is in fact true, you will mistakenly reject it about 1 time in 20.
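The "1 in 20" figure can be checked by simulation. The sketch below assumes a deliberately simple setting that is not taken from the text: samples drawn from a normal distribution whose true mean really is zero (so the null hypothesis is true), tested with a two-sided z-test with known standard deviation. The function names and parameters are hypothetical; only Python's standard library is used.

```python
import random
from statistics import NormalDist, mean


def z_test_p_value(sample, null_mean=0.0, sigma=1.0):
    """Two-sided p-value for a z-test with known standard deviation."""
    standard_error = sigma / len(sample) ** 0.5
    z = (mean(sample) - null_mean) / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


def false_positive_rate(alpha=0.05, n_experiments=5_000, sample_size=30, seed=1):
    """Share of experiments that reject a null hypothesis that is true."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_experiments):
        # Every sample comes from a population with mean 0,
        # so any rejection here is a mistake.
        sample = [rng.gauss(0.0, 1.0) for _ in range(sample_size)]
        if z_test_p_value(sample) < alpha:
            rejections += 1
    return rejections / n_experiments


rate = false_positive_rate()
print(f"Share of true null hypotheses rejected at the 5% level: {rate:.3f}")
```

With a large number of simulated experiments, the rejection rate should settle close to the chosen significance level, which is the sense in which a 5% criterion accepts a mistake roughly 1 time in 20.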

Replication and data mining

In traditional statistics, we work with hypotheses and probabilities to deal with the fact that we’re working with a sample rather than a population. In data mining, we can work in a slightly different way: we can use something called replication instead.

In a data mining project we might have two data sets: a training data set and a testing data set. We build our model on the training set and, once we’ve done that, we apply the model to the testing set to see whether we find similar results.
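As a rough illustration of replication, the sketch below fits a straight line to a training set and then checks whether its error is similar on a held-out testing set. Everything here is an assumption made for the example: the data are synthetic, the 70/30 split is arbitrary, and the helper names `fit_line` and `mean_squared_error` are hypothetical.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    return mean_y - slope * mean_x, slope


def mean_squared_error(xs, ys, intercept, slope):
    """Average squared gap between predictions and actual values."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Synthetic data: y depends linearly on x, plus random noise.
rng = random.Random(42)
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [3.0 + 2.0 * x + rng.gauss(0, 1) for x in xs]

# Shuffle the row indices, then split 70% training / 30% testing.
indices = list(range(len(xs)))
rng.shuffle(indices)
train_idx, test_idx = indices[:140], indices[140:]
train_x = [xs[i] for i in train_idx]
train_y = [ys[i] for i in train_idx]
test_x = [xs[i] for i in test_idx]
test_y = [ys[i] for i in test_idx]

# Build the model on the training set only, then apply it to the testing set.
intercept, slope = fit_line(train_x, train_y)
train_mse = mean_squared_error(train_x, train_y, intercept, slope)
test_mse = mean_squared_error(test_x, test_y, intercept, slope)

print(f"slope={slope:.2f}, intercept={intercept:.2f}")
print(f"training error={train_mse:.2f}, testing error={test_mse:.2f}")
```

If the testing error is close to the training error, the result has replicated on data the model never saw; a much larger testing error would suggest the model had fitted quirks of the training sample.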