Estimating population statistics with Point Estimation

[box type="note" align="" class="" width=""]This article is an extract from the book Principles of Data Science, written by Sinan Ozdemir. The book is a great way to get into the field of data science. It takes a unique approach that bridges the gap between mathematics and computer science, taking you through the entire data science pipeline.[/box]

In this extract, we’ll learn how to estimate population means, variances and other statistics using the Point Estimation method. For the code samples, we’ve used Python 2.7.

A point estimate is an estimate of a population parameter based on sample data. To obtain these estimates, we simply apply the function that we wish to measure for our population to a sample of the data.

For example, suppose there is a company of 9,000 employees and we are interested in ascertaining the average length of breaks taken by employees in a single day. As we probably cannot ask every single person, we will take a sample of the 9,000 people and take a mean of the sample. This sample mean will be our point estimate.

The following code is broken into three parts:

We will use a probability distribution, known as the Poisson distribution, to randomly generate 9,000 answers to the question: for how many minutes in a day do you usually take breaks? This will represent our "population".

We will take a sample of 100 employees (using NumPy's np.random.choice function) and find a point estimate of the mean (called a sample mean).

We will compare our sample mean (the mean of the sample of 100 employees) to our population mean.

Let's take a look at the following code:

import numpy as np
import pandas as pd
from scipy import stats

np.random.seed(1234)

# represents 3000 people who take about a 60 minute break
long_breaks = stats.poisson.rvs(loc=10, mu=60, size=3000)

The long_breaks variable represents 3,000 answers to the question "On average, how many minutes of breaks do you take in a day?", and these answers will be on the longer side. Let's visualize this distribution:

pd.Series(long_breaks).hist()

[Figure: histogram of long_breaks]

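We see that our average of 60 minutes is to the left of the distribution. This is because loc=10 shifts every generated value up by 10, so the distribution is centered near 70 minutes. Also, because we only generated 3,000 people, our bins are at their highest around 700-800 people.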
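The average of 60 sits left of center because loc=10 adds 10 minutes to every draw, so the expected break length is loc + mu = 70. The following is a minimal sketch of that check using NumPy alone. It assumes that adding 10 to plain Poisson draws is equivalent to passing loc=10 to stats.poisson.rvs; the names shift, rate, rng, and long_breaks_check are illustrative and not from the book, and the random values will differ from those generated above.

```python
import numpy as np

rng = np.random.RandomState(1234)  # local generator, illustrative only

shift = 10   # plays the role of loc=10
rate = 60    # plays the role of mu=60

# Poisson draws shifted by a constant; assumed equivalent to
# stats.poisson.rvs(loc=10, mu=60, size=3000)
long_breaks_check = rng.poisson(lam=rate, size=3000) + shift

expected_mean = shift + rate               # 70 minutes
observed_mean = long_breaks_check.mean()   # should land close to 70

print("expected mean: {}".format(expected_mean))
print("observed mean: {:.2f}".format(observed_mean))
```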

Now, let's model 6,000 people who take, on average, about 15 minutes' worth of breaks. Let's again use the Poisson distribution to simulate 6,000 people, as shown:

short_breaks = stats.poisson.rvs(loc=10, mu=15, size=6000)

# represents 6000 people who take about a 15 minute break

pd.Series(short_breaks).hist()

[Figure: histogram of short_breaks]

Okay, so we have a distribution for the people who take longer breaks and a distribution for the people who take shorter breaks. Again, note how our average break length of 15 minutes falls to the left-hand side of the distribution (loc=10 shifts its center to about 25 minutes), and note that the tallest bar is about 1,600 people.

breaks = np.concatenate((long_breaks, short_breaks))

# put the two arrays together to get our "population" of 9000 people

The breaks variable is the amalgamation of all the 9000 employees, both long and short break takers. Let's see the entire distribution of people in a single visualization:

pd.Series(breaks).hist()

[Figure: histogram of breaks, the combined population]

We see how we have two humps. On the left, we have our larger hump of people who take about a 15 minute break, and on the right, we have a smaller hump of people who take longer breaks. Later on, we will investigate this graph further.

We can find the total average break length by running the following code:

breaks.mean()

# 39.99 minutes is our parameter

Our average company break length is about 40 minutes. Remember that our population is the entire company of 9,000 employees, and our parameter is 40 minutes. In the real world, we would rarely have the resources to survey every single employee about their average break length, so our goal would be to estimate the population parameter. To do that, we will use a point estimate.
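The figure of roughly 40 minutes can also be reasoned out by hand as a weighted average of the two groups. The sketch below assumes each group's expected break length is loc + mu (70 and 25 minutes); the variable names are illustrative only.

```python
# Group sizes and expected break lengths (loc + mu for each group)
n_long, mean_long = 3000, 10 + 60     # long-break takers: about 70 minutes
n_short, mean_short = 6000, 10 + 15   # short-break takers: about 25 minutes

# Weighted average of the two group means
expected_population_mean = (
    (n_long * mean_long + n_short * mean_short) / float(n_long + n_short)
)

print(expected_population_mean)  # 40.0
```

The simulated value of 39.99 differs slightly from 40 only because the 9,000 answers are random draws.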

So, to make our point, we want to simulate a world where we ask 100 random people about the length of their breaks. To do this, let's take a random sample of 100 employees out of the 9,000 employees we simulated, as shown:

sample_breaks = np.random.choice(a = breaks, size=100)

# taking a sample of 100 employees
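Note that np.random.choice draws with replacement by default, so the same employee could be picked more than once. If each employee should be surveyed at most once, one option is to pass replace=False. The sketch below assumes a stand-in population built with NumPy alone (not the exact values generated above); population, chosen, and sample_no_repeat are hypothetical names.

```python
import numpy as np

rng = np.random.RandomState(1234)

# Stand-in population of 9,000 break lengths (not the article's exact values)
population = np.concatenate((
    rng.poisson(lam=60, size=3000) + 10,
    rng.poisson(lam=15, size=6000) + 10,
))

# Pick 100 distinct employee indices, then look up their break lengths
chosen = rng.choice(len(population), size=100, replace=False)
sample_no_repeat = population[chosen]

print(len(set(chosen)))          # 100 distinct employees
print(sample_no_repeat.mean())   # point estimate of the mean
```

With a sample this small relative to 9,000 people, the two approaches usually give very similar estimates.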

Now, let's take the mean of the sample and subtract it from the population mean and see how far off we were:

breaks.mean() - sample_breaks.mean()

# difference between means is 4.09 minutes, not bad!

This is extremely interesting, because with only about 1% of our population (100 out of 9,000), we were able to get within about 4 minutes of our population parameter of 40 minutes and get a reasonably close estimate of our population mean. Not bad!
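A single sample gives a single point estimate, and a different sample of 100 employees would miss the population mean by a different amount. To get a feel for how much the estimate typically varies, the sketch below repeats the sampling 1,000 times. It is an illustration under assumptions, not the book's code: the population is a stand-in regenerated with NumPy alone, and the number of repetitions is arbitrary.

```python
import numpy as np

rng = np.random.RandomState(1234)

# Stand-in population of 9,000 break lengths (not the article's exact values)
population = np.concatenate((
    rng.poisson(lam=60, size=3000) + 10,
    rng.poisson(lam=15, size=6000) + 10,
))

# Draw 1,000 independent samples of 100 employees and record each sample mean
sample_means = np.array([
    rng.choice(population, size=100).mean() for _ in range(1000)
])

print("population mean:       {:.2f}".format(population.mean()))
print("mean of sample means:  {:.2f}".format(sample_means.mean()))
print("spread (std) of means: {:.2f}".format(sample_means.std()))
```

The spread of the sample means indicates how far a typical single estimate lands from the population mean.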

Here, we calculated a point estimate for the mean, but we can also do this for proportion parameters. By proportion, I am referring to a ratio of two quantitative values.

Let's suppose that in a company of 10,000 people, our employees are 20% white, 10% black, 10% Hispanic, 30% Asian, and 30% identify as other. We will take a sample of 1,000 employees and see if their race proportions are similar.

employee_races = ((["white"]*2000) + (["black"]*1000) +
                  (["hispanic"]*1000) + (["asian"]*3000) +
                  (["other"]*3000))

employee_races represents our employee population. For example, in our company of 10,000 people, 2,000 people are white (20%) and 3,000 people are Asian (30%).

Let's take a random sample of 1,000 people, as shown:

import random

demo_sample = random.sample(employee_races, 1000) # Sample 1000 values

for race in set(demo_sample):
    print( race + " proportion estimate:" )
    print( demo_sample.count(race)/1000. )

The output obtained would be as follows:

hispanic proportion estimate:

0.103

white proportion estimate:

0.192

other proportion estimate:

0.288

black proportion estimate:

0.1

asian proportion estimate:

0.317

We can see that the race proportion estimates are very close to the underlying population's proportions. For example, we got 10.3% for Hispanic in our sample and the population proportion for Hispanic was 10%.
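To compare every estimate with its population proportion at once, the counts can be tallied with collections.Counter. This is a minimal sketch rather than the book's code: the population list is rebuilt so the block runs on its own, and the seed of 0 is an arbitrary choice, so the numbers will differ from the output shown above.

```python
import random
from collections import Counter

# Rebuild the population of 10,000 employees
population = (["white"] * 2000 + ["black"] * 1000 + ["hispanic"] * 1000
              + ["asian"] * 3000 + ["other"] * 3000)

rng = random.Random(0)                 # seeded so the run is repeatable
sample = rng.sample(population, 1000)  # sampling without replacement

population_counts = Counter(population)
sample_counts = Counter(sample)

# Print the true proportion next to its point estimate for each group
for race in sorted(population_counts):
    true_p = population_counts[race] / float(len(population))
    estimate = sample_counts[race] / float(len(sample))
    print("{:<9} population: {:.3f}  estimate: {:.3f}".format(
        race, true_p, estimate))
```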

To summarize, you should now be familiar with using point estimation to estimate population means, proportions, and other statistics, and with implementing these estimates in Python.

If you found our post useful, you can check out Principles of Data Science for more interesting Data Science tips and techniques.