This article written by Trent Hauck, the author of scikit-learn Cookbook, Packt Publishing, will cover the following recipes:
(For more resources related to this topic, see here.)
Even though by design the articles are unordered, you could argue by virtue of the art of data science, we’ve saved the best for last.
For the most part, each recipe within this article is applicable to the various models we’ve worked with. In some ways, you can think about this article as tuning the parameters and features. Ultimately, we need to choose some criteria to determine the “best” model. We’ll use various measures to define best. Then in the Cross validation with ShuffleSplit recipe, we will randomize the evaluation across subsets of the data to help avoid overfitting.
In this recipe, we’ll create, quite possibly, the most important post-model validation exercise—cross validation. We’ll talk about k-fold cross validation in this recipe. There are several varieties of cross validation, each with slightly different randomization schemes. K-fold is perhaps one of the most well-known randomization schemes.
We’ll create some data and then fit a classifier on the different folds. It’s probably worth mentioning that if you can keep a holdout set, then that would be best. For example, we have a dataset where N = 1000. If we hold out 200 data points, then use cross validation between the other 800 points to determine the best parameters.
First, we’ll create some fake data, then we’ll examine the parameters, and finally, we’ll look at the size of the resulting dataset:
>>> N = 1000
>>> holdout = 200
>>> from sklearn.datasets import make_regression
>>> X, y = make_regression(1000, shuffle=True)
Now that we have the data, let’s hold out 200 points, and then go through the fold scheme like we normally would:
>>> X_h, y_h = X[:holdout], y[:holdout]
>>> X_t, y_t = X[holdout:], y[holdout:]
>>> from sklearn.cross_validation import KFold
K-fold gives us the option of choosing how many folds we want, if we want the values to be indices or Booleans, if want to shuffle the dataset, and finally, the random state (this is mainly for reproducibility). Indices will actually be removed in later versions. It’s assumed to be True.
Let’s create the cross validation object:
>>> kfold = KFold(len(y_t), n_folds=4)
Now, we can iterate through the k-fold object:
>>> output_string = "Fold: {}, N_train: {}, N_test: {}"
>>> for i, (train, test) in enumerate(kfold):
print output_string.format(i, len(y_t[train]),
len(y_t[test]))
Fold: 0, N_train: 600, N_test: 200
Fold: 1, N_train: 600, N_test: 200
Fold: 2, N_train: 600, N_test: 200
Fold: 3, N_train: 600, N_test: 200
Each iteration should return the same split size.
It’s probably clear, but k-fold works by iterating through the folds and holds out 1/n_folds * N, where N for us was len(y_t).
From a Python perspective, the cross validation objects have an iterator that can be accessed by using the in operator. Often times, it’s useful to write a wrapper around a cross validation object that will iterate a subset of the data. For example, we may have a dataset that has repeated measures for data points or we may have a dataset with patients and each patient having measures.
We’re going to mix it up and use pandas for this part:
>>> import numpy as np
>>> import pandas as pd
>>> patients = np.repeat(np.arange(0, 100, dtype=np.int8), 8)
>>> measurements = pd.DataFrame({'patient_id': patients,
'ys': np.random.normal(0, 1, 800)})
Now that we have the data, we only want to hold out certain customers instead of data points:
>>> custids = np.unique(measurements.patient_id)
>>> customer_kfold = KFold(custids.size, n_folds=4)
>>> output_string = "Fold: {}, N_train: {}, N_test: {}"
>>> for i, (train, test) in enumerate(customer_kfold):
train_cust_ids = custids[train]
training = measurements[measurements.patient_id.isin(
train_cust_ids)]
testing = measurements[~measurements.patient_id.isin(
train_cust_ids)]
print output_string.format(i, len(training), len(testing))
Fold: 0, N_train: 600, N_test: 200
Fold: 1, N_train: 600, N_test: 200
Fold: 2, N_train: 600, N_test: 200
Fold: 3, N_train: 600, N_test: 200
We’ve looked at the using cross validation iterators that scikit-learn comes with, but we can also use a helper function to perform cross validation for use automatically. This is similar to how other objects in scikit-learn are wrapped by helper functions, pipeline for instance.
First, we’ll need to create a sample classifier; this can really be anything, a decision tree, a random forest, whatever. For us, it’ll be a random forest. We’ll then create a dataset and use the cross validation functions.
First import the ensemble module and we’ll get started:
>>> from sklearn import ensemble
>>> rf = ensemble.RandomForestRegressor(max_features='auto')
Okay, so now, let’s create some regression data:
>>> from sklearn import datasets
>>> X, y = datasets.make_regression(10000, 10)
Now that we have the data, we can import the cross_validation module and get access to the functions we’ll use:
>>> from sklearn import cross_validation
>>> scores = cross_validation.cross_val_score(rf, X, y)
>>> print scores
[ 0.86823874 0.86763225 0.86986129]
For the most part, this will delegate to the cross validation objects. One nice thing is that, the function will handle performing the cross validation in parallel.
We can activate verbose mode play by play:
>>> scores = cross_validation.cross_val_score(rf, X, y, verbose=3, cv=4)
[CV] no parameters to be set
[CV] no parameters to be set, score=0.872866 - 0.7s
[CV] no parameters to be set
[CV] no parameters to be set, score=0.873679 - 0.6s
[CV] no parameters to be set
[CV] no parameters to be set, score=0.878018 - 0.7s
[CV] no parameters to be set
[CV] no parameters to be set, score=0.871598 - 0.6s
[Parallel(n_jobs=1)]: Done 1 jobs | elapsed: 0.7s
[Parallel(n_jobs=1)]: Done 4 out of 4 | elapsed: 2.6s finished
As we can see, during each iteration, we scored the function. We also get an idea of how long the model runs.
It’s also worth knowing that we can score our function predicated on which kind of model we’re trying to fit.
ShuffleSplit is one of the simplest cross validation techniques. This cross validation technique will simply take a sample of the data for the number of iterations specified.
ShuffleSplit is another cross validation technique that is very simple. We’ll specify the total elements in the dataset, and it will take care of the rest. We’ll walk through an example of estimating the mean of a univariate dataset. This is somewhat similar to resampling, but it’ll illustrate one reason why we want to use cross validation while showing cross validation.
First, we need to create the dataset. We’ll use NumPy to create a dataset, where we know the underlying mean. We’ll sample half of the dataset to estimate the mean and see how close it is to the underlying mean:
>>> import numpy as np
>>> true_loc = 1000
>>> true_scale = 10
>>> N = 1000
>>> dataset = np.random.normal(true_loc, true_scale, N)
>>> import matplotlib.pyplot as plt
>>> f, ax = plt.subplots(figsize=(7, 5))
>>> ax.hist(dataset, color='k', alpha=.65, histtype='stepfilled');
>>> ax.set_title("Histogram of dataset");
>>> f.savefig("978-1-78398-948-5_06_06.png")
NumPy will give the following output:
Now, let’s take the first half of the data and guess the mean:
>>> from sklearn import cross_validation
>>> holdout_set = dataset[:500]
>>> fitting_set = dataset[500:]
>>> estimate = fitting_set[:N/2].mean()
>>> import matplotlib.pyplot as plt
>>> f, ax = plt.subplots(figsize=(7, 5))
>>> ax.set_title("True Mean vs Regular Estimate")
>>> ax.vlines(true_loc, 0, 1, color='r', linestyles='-', lw=5,
alpha=.65, label='true mean')
>>> ax.vlines(estimate, 0, 1, color='g', linestyles='-', lw=5,
alpha=.65, label='regular estimate')
>>> ax.set_xlim(999, 1001)
>>> ax.legend()
>>> f.savefig("978-1-78398-948-5_06_07.png")
We’ll get the following output:
Now, we can use ShuffleSplit to fit the estimator on several smaller datasets:
>>> from sklearn.cross_validation import ShuffleSplit
>>> shuffle_split = ShuffleSplit(len(fitting_set))
>>> mean_p = []
>>> for train, _ in shuffle_split:
mean_p.append(fitting_set[train].mean())
shuf_estimate = np.mean(mean_p)
>>> import matplotlib.pyplot as plt
>>> f, ax = plt.subplots(figsize=(7, 5))
>>> ax.vlines(true_loc, 0, 1, color='r', linestyles='-', lw=5,
alpha=.65, label='true mean')
>>> ax.vlines(estimate, 0, 1, color='g', linestyles='-', lw=5,
alpha=.65, label='regular estimate')
>>> ax.vlines(shuf_estimate, 0, 1, color='b', linestyles='-', lw=5,
alpha=.65, label='shufflesplit estimate')
>>> ax.set_title("All Estimates")
>>> ax.set_xlim(999, 1001)
>>> ax.legend(loc=3)
The output will be as follows:
As we can see, we got an estimate that was similar to what we expected, but we were able to take many samples to get that estimate.
In this recipe, we’ll quickly look at stratified k-fold valuation. We’ve walked through different recipes where the class representation was unbalanced in some manner. Stratified k-fold is nice because its scheme is specifically designed to maintain the class proportions.
We’re going to create a small dataset. In this dataset, we will then use stratified k-fold validation. We want it small so that we can see the variation. For larger samples. it probably won’t be as big of a deal.
We’ll then plot the class proportions at each step to illustrate how the class proportions are maintained:
>>> from sklearn import datasets
>>> X, y = datasets.make_classification(n_samples=int(1e3), weights=[1./11])
Let’s check the overall class weight distribution:
>>> y.mean()
0.90300000000000002
Roughly, 90.5 percent of the samples are 1, with the balance 0.
Let’s create a stratified k-fold object and iterate it through each fold. We’ll measure the proportion of verse that are 1. After that we’ll plot the proportion of classes by the split number to see how and if it changes. This code will hopefully illustrate how this is beneficial. We’ll also plot this code against a basic ShuffleSplit:
>>> from sklearn import cross_validation
>>> n_folds = 50
>>> strat_kfold = cross_validation.StratifiedKFold(y,
n_folds=n_folds)
>>> shuff_split = cross_validation.ShuffleSplit(n=len(y),
n_iter=n_folds)
>>> kfold_y_props = []
>>> shuff_y_props = []
>>> for (k_train, k_test), (s_train, s_test) in zip(strat_kfold,
shuff_split):
kfold_y_props.append(y[k_train].mean())
shuff_y_props.append(y[s_train].mean())
Now, let’s plot the proportions over each fold:
>>> import matplotlib.pyplot as plt
>>> f, ax = plt.subplots(figsize=(7, 5))
>>> ax.plot(range(n_folds), shuff_y_props, label="ShuffleSplit",
color='k')
>>> ax.plot(range(n_folds), kfold_y_props, label="Stratified",
color='k', ls='--')
>>> ax.set_title("Comparing class proportions.")
>>> ax.legend(loc='best')
The output will be as follows:
We can see that the proportion of each fold for stratified k-fold is stable across folds.
Stratified k-fold works by taking the y value. First, getting the overall proportion of the classes, then intelligently splitting the training and test set into the proportions. This will generalize to multiple labels:
>>> import numpy as np
>>> three_classes = np.random.choice([1,2,3], p=[.1, .4, .5],
size=1000)
>>> import itertools as it
>>> for train, test in cross_validation.StratifiedKFold(three_classes, 5):
print np.bincount(three_classes[train])
[ 0 90 314 395]
[ 0 90 314 395]
[ 0 90 314 395]
[ 0 91 315 395]
[ 0 91 315 396]
As we can see, we got roughly the sample sizes of each class for our training and testing proportions.
In this recipe, we’re going to introduce grid search with basic Python, though we will use sklearn for the models and matplotlib for the visualization.
In this recipe, we will perform the following tasks:
Also, the model we’ll fit is a basic decision tree classifier. Our parameter space will be 2 dimensional to help us with the visualization:
The parameter space will then be the Cartesian product of the those two sets:
We’ll see in a bit how we can iterate through this space with itertools.
Let’s create the dataset and then get started:
>>> from sklearn import datasets
>>> X, y = datasets.make_classification(n_samples=2000, n_features=10)
Earlier we said that we’d use grid search to tune two parameters—criteria and max_features. We need to represent those as Python sets, and then use itertools product to iterate through them:
>>> criteria = {'gini', 'entropy'}
>>> max_features = {'auto', 'log2', None}
>>> import itertools as it
>>> parameter_space = it.product(criteria, max_features)
Great! So now that we have the parameter space, let’s iterate through it and check the accuracy of each model as specified by the parameters. Then, we’ll store that accuracy so that we can compare different parameter spaces. We’ll also use a test and train split of 50, 50:
import numpy as np
train_set = np.random.choice([True, False], size=len(y))
from sklearn.tree import DecisionTreeClassifier
accuracies = {}
for criterion, max_feature in parameter_space:
dt = DecisionTreeClassifier(criterion=criterion,
max_features=max_feature)
dt.fit(X[train_set], y[train_set])
accuracies[(criterion, max_feature)] = (dt.predict(X[~train_set])
== y[~train_set]).mean()
>>> accuracies
{('entropy', None): 0.974609375, ('entropy', 'auto'): 0.9736328125,
('entropy', 'log2'): 0.962890625, ('gini', None): 0.9677734375, ('gini',
'auto'): 0.9638671875, ('gini', 'log2'): 0.96875}
So we now have the accuracies and its performance. Let’s visualize the performance:
>>> from matplotlib import pyplot as plt
>>> from matplotlib import cm
>>> cmap = cm.RdBu_r
>>> f, ax = plt.subplots(figsize=(7, 4))
>>> ax.set_xticklabels([''] + list(criteria))
>>> ax.set_yticklabels([''] + list(max_features))
>>> plot_array = []
>>> for max_feature in max_features:
m = []
>>> for criterion in criteria:
m.append(accuracies[(criterion, max_feature)])
plot_array.append(m)
>>> colors = ax.matshow(plot_array, vmin=np.min(accuracies.values())
- 0.001, vmax=np.max(accuracies.values()) + 0.001,
cmap=cmap)
>>> f.colorbar(colors)
The following is the output:
It’s fairly easy to see which one performed best here. Hopefully, you can see how this process can be taken to the further stage with a brute force method.
This works fairly simply, we just have to perform the following steps:
In this recipe, we’ll do an exhaustive grid search through scikit-learn. This is basically the same thing we did in the previous recipe, but we’ll utilize built-in methods.
We’ll also walk through an example of performing randomized optimization. This is an alternative to brute force search. Essentially, we’re trading computer cycles to make sure that we search the entire space. We were fairly calm in the last recipe. However, you could imagine a model that has several steps, first imputation for fix missing data, then PCA reduce the dimensionality to classification. Your parameter space could get very large, very fast; therefore, it can be advantageous to only search a part of that space.
To get started, we’ll need to perform the following steps:
Run the following code to create some classification data:
>>> from sklearn.datasets import make_classification
>>> X, y = make_classification(1000, n_features=5)
Now, we’ll create our logistic regression object:
>>> from sklearn.linear_model import LogisticRegression
>>> lr = LogisticRegression(class_weight='auto')
We need to specify the parameters we want to search. For GridSearch, we can just specify the ranges that we care about, but for RandomizedSearchCV, we’ll need to actually specify the distribution over the same space from which to sample:
>>> lr.fit(X, y)
LogisticRegression(C=1.0, class_weight={0: 0.25, 1: 0.75},
dual=False,fit_intercept=True,
intercept_scaling=1, penalty='l2',
random_state=None, tol=0.0001)
>>> grid_search_params = {'penalty': ['l1', 'l2'],'C': [1, 2, 3, 4]}
The only change we’ll need to make is to describe the C parameter as a probability distribution. We’ll keep it simple right now, though we will use scipy to describe the distribution:
>>> import scipy.stats as st
>>> import numpy as np
>>> random_search_params = {'penalty': ['l1', 'l2'],
'C': st.randint(1, 4)}
Now, we’ll fit the classifier. This works by passing lr to the parameter search objects:
>>> from sklearn.grid_search import GridSearchCV, RandomizedSearchCV
>>> gs = GridSearchCV(lr, grid_search_params)
GridSearchCV implements the same API as the other models:
>>> gs.fit(X, y)
GridSearchCV(cv=None, estimator=LogisticRegression(C=1.0,
class_weight=’auto’, dual=False, fit_intercept=True,
intercept_scaling=1, penalty=’l2′, random_state=None,
tol=0.0001), fit_params={}, iid=True, loss_func=None,
n_jobs=1, param_grid={‘penalty’: [‘l1’, ‘l2’], ‘C’:
[1, 2, 3, 4]}, pre_dispatch=’2*n_jobs’, refit=True,
score_func=None, scoring=None, verbose=0)
As we can see with the param_grid parameter, our penalty and C are both arrays.
To access the scores, we can use the grid_scores_ attribute of the grid search. We also want to find the optimal set of parameters. We can also look at the marginal performance of the grid search:
>>> gs.grid_scores_
[mean: 0.90300, std: 0.01192, params: {'penalty': 'l1', 'C': 1},
mean: 0.90100, std: 0.01258, params: {'penalty': 'l2', 'C': 1},
mean: 0.90200, std: 0.01117, params: {'penalty': 'l1', 'C': 2},
mean: 0.90100, std: 0.01258, params: {'penalty': 'l2', 'C': 2},
mean: 0.90200, std: 0.01117, params: {'penalty': 'l1', 'C': 3},
mean: 0.90100, std: 0.01258, params: {'penalty': 'l2', 'C': 3},
mean: 0.90100, std: 0.01258, params: {'penalty': 'l1', 'C': 4},
mean: 0.90100, std: 0.01258, params: {'penalty': 'l2', 'C': 4}]
We might want to get the max score:
>>> gs.grid_scores_[1][1]
0.90100000000000002
>>> max(gs.grid_scores_, key=lambda x: x[1])
mean: 0.90300, std: 0.01192, params: {'penalty': 'l1', 'C': 1}
The parameters obtained are the best choices for our logistic regression.
This recipe is about creating fake estimators; this isn’t the pretty or exciting stuff, but it is worthwhile to have a reference point for the model you’ll eventually build.
In this recipe, we’ll perform the following tasks:
We’ll perform these two steps for regression data and classification data.
First, we’ll create the random data:
>>> from sklearn.datasets import make_regression, make_classification
# classification if for later
>>> X, y = make_regression()
>>> from sklearn import dummy
>>> dumdum = dummy.DummyRegressor()
>>> dumdum.fit(X, y)
DummyRegressor(constant=None, strategy='mean')
By default, the estimator will predict by just taking the mean of the values and predicting the mean values:
>>> dumdum.predict(X)[:5]
array([ 2.23297907, 2.23297907, 2.23297907, 2.23297907, 2.23297907])
There are other two other strategies we can try. We can predict a supplied constant (refer to constant=None from the preceding command). We can also predict the median value.
Supplying a constant will only be considered if strategy is “constant”.
Let’s have a look:
>>> predictors = [("mean", None),
("median", None),
("constant", 10)]
>>> for strategy, constant in predictors:
dumdum = dummy.DummyRegressor(strategy=strategy,
constant=constant)
>>> dumdum.fit(X, y)
>>> print "strategy: {}".format(strategy), ",".join(map(str,
dumdum.predict(X)[:5]))
strategy: mean 2.23297906733,2.23297906733,2.23297906733,2.23297906733,2.23297906733
strategy: median 20.38535248,20.38535248,20.38535248,20.38535248,20.38535248
strategy: constant 10.0,10.0,10.0,10.0,10.0
We actually have four options for classifiers. These strategies are similar to the continuous case, it’s just slanted toward classification problems:
>>> predictors = [("constant", 0),
("stratified", None),
("uniform", None),
("most_frequent", None)]
We’ll also need to create some classification data:
>>> X, y = make_classification()
>>> for strategy, constant in predictors:
dumdum = dummy.DummyClassifier(strategy=strategy,
constant=constant)
dumdum.fit(X, y)
print "strategy: {}".format(strategy), ",".join(map(str,
dumdum.predict(X)[:5]))
strategy: constant 0,0,0,0,0
strategy: stratified 1,0,0,1,0
strategy: uniform 0,0,0,1,1
strategy: most_frequent 1,1,1,1,1
It’s always good to test your models against the simplest models and that’s exactly what the dummy estimators give you. For example, imagine a fraud model. In this model, only 5 percent of the data set is fraud. Therefore, we can probably fit a pretty good model just by never guessing any fraud.
We can create this model by using the stratified strategy, using the following command. We can also get a good example of why class imbalance causes problems:
>>> X, y = make_classification(20000, weights=[.95, .05])
>>> dumdum = dummy.DummyClassifier(strategy='most_frequent')
>>> dumdum.fit(X, y)
DummyClassifier(constant=None, random_state=None, strategy='most_frequent')
>>> from sklearn.metrics import accuracy_score
>>> print accuracy_score(y, dumdum.predict(X))
0.94575
We were actually correct very often, but that’s not the point. The point is that this is our baseline. If we cannot create a model for fraud that is more accurate than this, then it isn’t worth our time.
This article taught us how we can take a basic model produced from one of the recipes and tune it so that we can achieve better results than we could with the basic model.
Further resources on this subject:
I remember deciding to pursue my first IT certification, the CompTIA A+. I had signed…
Key takeaways The transformer architecture has proved to be revolutionary in outperforming the classical RNN…
Once we learn how to deploy an Ubuntu server, how to manage users, and how…
Key-takeaways: Clean code isn’t just a nice thing to have or a luxury in software projects; it's a necessity. If we…
While developing a web application, or setting dynamic pages and meta tags we need to deal with…
Software architecture is one of the most discussed topics in the software industry today, and…