Feature Improvement: Identifying missing values using EDA (Exploratory Data Analysis) technique

Today, we will develop a better sense of our data by identifying missing values in a dataset, using Exploratory Data Analysis (EDA) techniques and Python packages.

Identifying missing values in data

Our first step, identifying missing values, will give us a better understanding of how to work with real-world data. Data can have missing values for a variety of reasons; with survey data, for example, some observations may not have been recorded. It is important to analyze our data and get a sense of what the missing values are, so that we can decide how to handle them for our machine learning. To start, let's dive into the Pima Indian Diabetes Prediction dataset, which is available on the UCI Machine Learning Repository at:

https://archive.ics.uci.edu/ml/datasets/pima+indians+diabetes

From the main website, we can learn a few things about this publicly available dataset. We have nine columns and 768 instances (rows). The dataset is primarily used for predicting the onset of diabetes within five years in females of Pima Indian heritage over the age of 21 given medical details about their bodies.

The dataset corresponds to a binary (2-class) classification machine learning problem. Namely, it answers the question: will this person develop diabetes within five years? The columns are provided as follows (in order):

Number of times pregnant
Plasma glucose concentration at 2 hours in an oral glucose tolerance test
Diastolic blood pressure (mm Hg)
Triceps skinfold thickness (mm)
2-Hour serum insulin measurement (mu U/ml)
Body mass index (weight in kg/(height in m)^2)
Diabetes pedigree function
Age (years)
Class variable (zero or one)

The goal is to predict the final column, the class variable, which indicates whether the patient developed diabetes, using the other eight features as inputs to a machine learning function.

There are two very important reasons we will be working with this dataset:

We will have to work with missing values
All of the features we will be working with will be quantitative

The first point makes sense as a reason for now, because the point of this chapter is to deal with missing values. As for choosing to work only with quantitative data, this will be the case for this chapter only: we do not yet have the tools to deal with missing values in categorical columns. In the next chapter, when we talk about feature construction, we will deal with this procedure.

The exploratory data analysis (EDA)

To identify our missing values, we will begin with an EDA of our dataset. We will use two useful Python packages, pandas and numpy, to store our data and make some simple calculations, as well as some popular visualization tools to see what the distribution of our data looks like. Let's begin and dive into some code. First, we will do some imports:

# import packages we need for exploratory data analysis (EDA)
import pandas as pd # to store tabular data
import numpy as np # to do some math
import matplotlib.pyplot as plt # a popular data visualization tool
import seaborn as sns # another popular data visualization tool

# Jupyter notebook magic: render plots inline
%matplotlib inline

plt.style.use('fivethirtyeight') # a popular data visualization theme

We will import our tabular data through a CSV, as follows:

# load in our dataset using pandas
pima = pd.read_csv('../data/pima.data')
pima.head()

The head method allows us to see the first few rows in our dataset. The output is as follows:

[Figure: first rows of the pima DataFrame, loaded without column names]

Something's not right here: there are no column names. The CSV must not have the column names built into the file, so pandas has treated the first row of data as the header. No matter, we can use the data source's website to fill this in, as shown in the following code:

pima_column_names = ['times_pregnant', 'plasma_glucose_concentration',
                     'diastolic_blood_pressure', 'triceps_thickness',
                     'serum_insulin', 'bmi', 'pedigree_function', 'age',
                     'onset_diabetes']

pima = pd.read_csv('../data/pima.data', names=pima_column_names)
pima.head()

Now, using the head method again, we can see our columns with the appropriate headers. The output of the preceding code is as follows:

[Figure: first rows of the pima DataFrame, with column names]
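To see why the names argument matters, here is a minimal sketch that reads the same header-less text twice. The two rows are made-up values in the nine-column layout, not rows from the real file, and io.StringIO stands in for the file on disk.

```python
import io
import pandas as pd

# Two hypothetical rows in the nine-column Pima layout (no header row).
raw = "2,120,70,30,85,31.5,0.4,35,1\n4,150,80,22,130,36.2,0.65,48,0\n"

cols = ['times_pregnant', 'plasma_glucose_concentration',
        'diastolic_blood_pressure', 'triceps_thickness', 'serum_insulin',
        'bmi', 'pedigree_function', 'age', 'onset_diabetes']

# Default behaviour: the first data row is consumed as the header.
no_names = pd.read_csv(io.StringIO(raw))

# Passing names= keeps every row as data and labels the columns.
with_names = pd.read_csv(io.StringIO(raw), names=cols)

print(no_names.shape)    # one row fewer, because it became the header
print(with_names.shape)  # both rows kept
```

Without names, one observation silently disappears into the header, which is easy to miss on a 768-row file.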

Much better. Now we can use the column names to do some basic stats, selecting, and visualizations. Let's first get our null accuracy, that is, the accuracy we would achieve by always predicting the most common class:

# get null accuracy, 65% did not develop diabetes
pima['onset_diabetes'].value_counts(normalize=True)

0    0.651042
1    0.348958
Name: onset_diabetes, dtype: float64
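As a minimal sketch of what this number means, the null accuracy is simply the share of the most common class. The labels below are hypothetical (13 zeros and 7 ones), chosen only to keep the arithmetic easy to follow.

```python
import pandas as pd

# Hypothetical labels: 13 patients without diabetes (0) and 7 with (1).
labels = pd.Series([0] * 13 + [1] * 7, name='onset_diabetes')

# Share of each class in the data.
class_shares = labels.value_counts(normalize=True)

# A model that always predicts the most common class scores this accuracy.
null_accuracy = class_shares.max()

print(class_shares)
print(null_accuracy)
```

Any model we build later should beat this baseline to be considered useful.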

If our eventual goal is to exploit patterns in our data in order to predict the onset of diabetes, let us try to visualize some of the differences between those that developed diabetes and those that did not. Our hope is that the histogram will reveal some sort of pattern, or obvious difference in values between the classes of prediction:

# get a histogram of the plasma_glucose_concentration column for
# both classes
col = 'plasma_glucose_concentration'

plt.hist(pima[pima['onset_diabetes']==0][col], 10, alpha=0.5,
         label='non-diabetes')
plt.hist(pima[pima['onset_diabetes']==1][col], 10, alpha=0.5,
         label='diabetes')
plt.legend(loc='upper right')
plt.xlabel(col)
plt.ylabel('Frequency')
plt.title('Histogram of {}'.format(col))
plt.show()

The output of the preceding code is as follows:

[Figure: histogram of plasma_glucose_concentration by class]

This histogram seems to show a pretty big difference in plasma_glucose_concentration between the two prediction classes. Let's show the same histogram style for multiple columns as follows:

for col in ['bmi', 'diastolic_blood_pressure',
            'plasma_glucose_concentration']:
    plt.hist(pima[pima['onset_diabetes']==0][col], 10, alpha=0.5,
             label='non-diabetes')
    plt.hist(pima[pima['onset_diabetes']==1][col], 10, alpha=0.5,
             label='diabetes')
    plt.legend(loc='upper right')
    plt.xlabel(col)
    plt.ylabel('Frequency')
    plt.title('Histogram of {}'.format(col))
    plt.show()

The preceding code gives us three histograms. The first one shows us the distributions of bmi for the two classes (non-diabetes and diabetes):

[Figure: histogram of bmi by class]

The next histogram again shows contrasting distributions of a feature across our two classes. This time we are looking at diastolic_blood_pressure:

[Figure: histogram of diastolic_blood_pressure by class]

The final graph shows the differences in plasma_glucose_concentration between our two classes:

[Figure: histogram of plasma_glucose_concentration by class]
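Histograms can be complemented with a per-class numeric summary. The following is only a sketch on a small, made-up DataFrame that reuses two of our column names; it is not output from the real dataset.

```python
import pandas as pd

# Hypothetical values: three patients per class.
toy = pd.DataFrame({
    'plasma_glucose_concentration': [85, 95, 105, 140, 160, 180],
    'bmi': [24.0, 27.5, 30.0, 33.0, 35.5, 38.0],
    'onset_diabetes': [0, 0, 0, 1, 1, 1],
})

# Mean and median of each feature, computed separately for each class.
class_summary = (
    toy.groupby('onset_diabetes')[['plasma_glucose_concentration', 'bmi']]
       .agg(['mean', 'median'])
)

print(class_summary)
```

A large gap between the rows for class 0 and class 1 is the numeric counterpart of two histograms that barely overlap.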

We can definitely see some major differences simply by looking at just a few histograms. For example, there seems to be a large jump in plasma_glucose_concentration for those who will eventually develop diabetes. To solidify this, perhaps we can visualize a linear correlation matrix in an attempt to quantify the relationship between these variables. We will use the visualization tool, seaborn, which we imported at the beginning of this chapter for our correlation matrix as follows:

# look at the heatmap of the correlation matrix of our dataset
sns.heatmap(pima.corr())

# plasma_glucose_concentration definitely seems to be an interesting
# feature here

The following heatmap shows the correlation among the different columns in our Pima dataset:

[Figure: heatmap of the correlation matrix of the pima DataFrame]

This correlation matrix shows that, of all the features, plasma_glucose_concentration has the strongest correlation with onset_diabetes. Let's take a closer look at the numerical correlations for the onset_diabetes column with the following code:

pima.corr()['onset_diabetes'] # correlation of each column with onset_diabetes

# plasma_glucose_concentration definitely seems to be an interesting
# feature here

times_pregnant                  0.221898
plasma_glucose_concentration    0.466581
diastolic_blood_pressure        0.065068
triceps_thickness               0.074752
serum_insulin                   0.130548
bmi                             0.292695
pedigree_function               0.173844
age                             0.238356
onset_diabetes                  1.000000
Name: onset_diabetes, dtype: float64
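To make the ranking easier to read, one option is to drop the target's correlation with itself and sort the rest by absolute value. This is a sketch on hypothetical data, not a step taken in the original analysis.

```python
import pandas as pd

# Hypothetical values: glucose separates the classes, blood pressure barely does.
toy = pd.DataFrame({
    'plasma_glucose_concentration': [85, 90, 100, 140, 160, 180],
    'diastolic_blood_pressure': [70, 64, 72, 66, 74, 68],
    'onset_diabetes': [0, 0, 0, 1, 1, 1],
})

# Correlation of each feature with the target, excluding the target itself.
corr = toy.corr()['onset_diabetes'].drop('onset_diabetes')

# Order by strength, so strong negative correlations are not pushed to the bottom.
ranked = corr.reindex(corr.abs().sort_values(ascending=False).index)

print(ranked)
```

Sorting by absolute value matters because a feature with a correlation of -0.5 is just as informative, linearly speaking, as one with +0.5.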

We will explore the power of correlation later, in Chapter 4, Feature Construction, but for now we are using exploratory data analysis (EDA) to hint at the fact that the plasma_glucose_concentration column will be an important factor in our prediction of the onset of diabetes.

Moving on to more important matters at hand, let's see if we are missing any values in our dataset by invoking the built-in isnull() method of the pandas DataFrame:

pima.isnull().sum()

times_pregnant                  0
plasma_glucose_concentration    0
diastolic_blood_pressure        0
triceps_thickness               0
serum_insulin                   0
bmi                             0
pedigree_function               0
age                             0
onset_diabetes                  0
dtype: int64

Great! It looks like we don't have any missing values. Let's go on to do some more EDA, first using the shape attribute to see the number of rows and columns we are working with:

pima.shape # (# rows, # cols)

(768, 9)

This confirms that we have 9 columns (including our response variable) and 768 data observations (rows). Now, let's take a peek at the percentage of patients who developed diabetes, using the following code:

# get null accuracy, 65% did not develop diabetes
pima['onset_diabetes'].value_counts(normalize=True)

0    0.651042
1    0.348958
Name: onset_diabetes, dtype: float64

This shows us that 65% of the patients did not develop diabetes, while about 35% did. We can use a nifty built-in method of a pandas DataFrame called describe to look at some basic descriptive statistics:

pima.describe() # get some basic descriptive statistics

We get the output as follows:

[Figure: output of pima.describe()]

This shows us quite quickly some basic stats such as mean, standard deviation, and some different percentile measurements of our data. But, notice that the minimum value of the BMI column is 0. That is medically impossible; there must be a reason for this to happen.

Perhaps missing values have been encoded as the number zero instead of as None or an empty cell. Upon closer inspection, we see that the value 0 appears as a minimum value for the following columns:

times_pregnant
plasma_glucose_concentration
diastolic_blood_pressure
triceps_thickness
serum_insulin
bmi
onset_diabetes
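Rather than scanning the describe output by eye, the columns whose minimum is zero can be listed programmatically. The sketch below uses a small hypothetical DataFrame with a few of our column names.

```python
import pandas as pd

# Hypothetical values; age never reaches zero, the other columns do.
toy = pd.DataFrame({
    'times_pregnant': [0, 2, 5],
    'bmi': [0.0, 31.5, 36.2],
    'age': [22, 35, 48],
    'onset_diabetes': [0, 0, 1],
})

# Minimum of every column, then keep only the columns whose minimum is 0.
minimums = toy.min()
zero_min_columns = minimums[minimums == 0].index.tolist()

print(zero_min_columns)
```

The result is only a list of candidates: domain knowledge still has to decide which zeros are legitimate values and which are placeholders.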

Because zero is a class for onset_diabetes and 0 is actually a viable number for times_pregnant, we may conclude that the number 0 is encoding missing values for:

plasma_glucose_concentration
diastolic_blood_pressure
triceps_thickness
serum_insulin
bmi

So, we actually do have missing values! It was obviously not luck that we happened upon the zeros as missing values; we knew it beforehand. As a data scientist, you must be ever vigilant and make sure that you know as much about the dataset as possible in order to find missing values encoded as other symbols. Be sure to read any and all documentation that comes with open datasets in case it mentions any missing values.
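Once the suspect columns are known, a quick way to gauge the size of the problem is to count the zeros in each of them. This is a sketch on made-up values; the suspect_columns name is introduced here for illustration only.

```python
import pandas as pd

# Hypothetical values for three of the suspect columns.
toy = pd.DataFrame({
    'plasma_glucose_concentration': [120, 0, 150, 99],
    'serum_insulin': [0, 0, 130, 0],
    'bmi': [31.5, 0.0, 36.2, 28.0],
})

suspect_columns = ['plasma_glucose_concentration', 'serum_insulin', 'bmi']

# Comparing to 0 gives a boolean frame; summing counts the True cells per column.
zero_counts = (toy[suspect_columns] == 0).sum()

# Taking the mean instead gives the share of rows affected in each column.
zero_share = (toy[suspect_columns] == 0).mean()

print(zero_counts)
print(zero_share)
```

Note that isnull() reports none of these cells, because a zero is a perfectly valid number as far as pandas is concerned.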

If no documentation is available, some common placeholders used to encode missing values are:

0 (for numerical values)
unknown or Unknown (for categorical variables)
? (for categorical variables)
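When such placeholders are known in advance, pandas can be told to treat them as missing at load time through the na_values argument of read_csv. The CSV text and its column names below are hypothetical and only illustrate the idea.

```python
import io
import pandas as pd

# Hypothetical survey data that uses "?" and "unknown" for missing answers.
raw = "age,smoker,blood_type\n34,yes,?\n51,unknown,A\n29,no,B\n"

# na_values lists extra strings that should be read as NaN.
df = pd.read_csv(io.StringIO(raw), na_values=['?', 'unknown', 'Unknown'])

# Now isnull() can see the missing cells.
missing_per_column = df.isnull().sum()

print(missing_per_column)
```

A numeric placeholder such as 0 needs more care, since zero may be a legitimate value in some columns (as with times_pregnant here), so it is safer to handle it column by column than to declare it missing everywhere.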

To summarize, we have five columns in which missing values are encoded as zeros.

You just read an excerpt from the book Feature Engineering Made Easy, co-authored by Sinan Ozdemir and Divya Susarla. To learn more about missing values and manipulating features, do check out Feature Engineering Made Easy and develop expert proficiency in Feature Selection, Learning, and Optimization.