In this article, written by Daniele Teti, the author of IPython Interactive Computing and Visualization Cookbook, we introduce the basics of the machine learning package scikit-learn (http://scikit-learn.org). Its clean API makes it really easy to define, train, and test models. Plus, scikit-learn is specifically designed for speed and (relatively) big data.
A very basic example of linear regression in the context of curve fitting is shown here. This toy example will allow us to illustrate key concepts such as linear models, overfitting, underfitting, regularization, and cross-validation.
You can find all the instructions to install scikit-learn in the main documentation; refer to http://scikit-learn.org/stable/install.html. With Anaconda, you can type conda install scikit-learn in a terminal.
We will generate a one-dimensional dataset with a simple model (including some noise), and we will try to fit a function to this data. With this function, we can predict values on new data points. This is a curve fitting regression problem.
In [1]: import numpy as np
        import scipy.stats as st
        import sklearn.linear_model as lm
        import matplotlib.pyplot as plt
        %matplotlib inline
In [2]: f = lambda x: np.exp(3 * x)
In [3]: x_tr = np.linspace(0., 2, 200)
        y_tr = f(x_tr)
In [4]: x = np.array([0, .1, .2, .5, .8, .9, 1])
        y = f(x) + np.random.randn(len(x))
In [5]: plt.plot(x_tr[:100], y_tr[:100], '--k')
        plt.plot(x, y, 'ok', ms=10)
In the figure, the dotted curve represents the generative model.
In [6]: # We create the model.
        lr = lm.LinearRegression()
        # We train the model on our training dataset.
        lr.fit(x[:, np.newaxis], y)
        # Now, we predict points with our trained model.
        y_lr = lr.predict(x_tr[:, np.newaxis])
We need to convert x and x_tr to column vectors, as it is a general convention in scikit-learn that observations are rows, while features are columns. Here, we have seven observations with one feature.
In [7]: plt.plot(x_tr, y_tr, '--k')
        plt.plot(x_tr, y_lr, 'g')
        plt.plot(x, y, 'ok', ms=10)
        plt.xlim(0, 1)
        plt.ylim(y.min()-1, y.max()+1)
        plt.title("Linear regression")
In [8]: lrp = lm.LinearRegression()
        plt.plot(x_tr, y_tr, '--k')
        for deg in [2, 5]:
            lrp.fit(np.vander(x, deg + 1), y)
            y_lrp = lrp.predict(np.vander(x_tr, deg + 1))
            plt.plot(x_tr, y_lrp, label='degree ' + str(deg))
            plt.legend(loc=2)
            plt.xlim(0, 1.4)
            plt.ylim(-10, 40)
            # Print the model's coefficients.
            print(' '.join(['%.2f' % c for c in lrp.coef_]))
        plt.plot(x, y, 'ok', ms=10)
        plt.title("Linear regression")
25.00 -8.57 0.00
-132.71 296.80 -211.76 72.80 -8.68 0.00
We have fitted two polynomial models of degree 2 and degree 5. The degree 2 polynomial appears to fit the data points less precisely than the degree 5 polynomial. However, it seems more robust: the degree 5 polynomial seems really bad at predicting values outside the data points (look, for example, at the curves beyond x = 1).
Note the large coefficients of the degree 5 polynomial; this is generally a sign of overfitting.
The ridge regression model has a meta-parameter, which represents the weight of the regularization term. We could try different values by trial and error, using the Ridge class. However, scikit-learn includes another model called RidgeCV, which includes a parameter search with cross-validation. In practice, this means that we don't have to tweak this parameter by hand: scikit-learn does it for us. As the models of scikit-learn always follow the fit-predict API, all we have to do is replace lm.LinearRegression() with lm.RidgeCV() in the previous code. We will give more details in the next section.
In [9]: ridge = lm.RidgeCV()
        plt.plot(x_tr, y_tr, '--k')
        for deg in [2, 5]:
            ridge.fit(np.vander(x, deg + 1), y)
            y_ridge = ridge.predict(np.vander(x_tr, deg + 1))
            plt.plot(x_tr, y_ridge, label='degree ' + str(deg))
            plt.legend(loc=2)
            plt.xlim(0, 1.5)
            plt.ylim(-5, 80)
            # Print the model's coefficients.
            print(' '.join(['%.2f' % c for c in ridge.coef_]))
        plt.plot(x, y, 'ok', ms=10)
        plt.title("Ridge regression")
11.36 4.61 0.00
2.84 3.54 4.09 4.14 2.67 0.00
This time, the degree 5 polynomial seems more precise than the simpler degree 2 polynomial (which now causes underfitting). Ridge regression reduces the overfitting issue here. Observe how the degree 5 polynomial’s coefficients are much smaller than in the previous example.
In this section, we explain the concepts used in the example above.
scikit-learn implements a clean and coherent API for supervised and unsupervised learning. Our data points should be stored in a (N,D) matrix X, where N is the number of observations and D is the number of features. In other words, each row is an observation. The first step in a machine learning task is to define what the matrix X is exactly.
In a supervised learning setup, we also have a target, an N-long vector y with a scalar value for each observation. This value is continuous or discrete, depending on whether we have a regression or classification problem, respectively.
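As a quick illustration of these conventions (a minimal sketch, reusing the seven observations from our example), here is how the data fits into the (N, D) layout:

```python
import numpy as np

# Seven observations with one feature: X must be a (7, 1) matrix,
# and the target y an N-long vector.
x = np.array([0, .1, .2, .5, .8, .9, 1])
X = x[:, np.newaxis]      # equivalently: x.reshape(-1, 1)
y = np.exp(3 * x)         # one scalar target per observation

print(X.shape)  # (7, 1)
print(y.shape)  # (7,)
```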
In scikit-learn, models are implemented in classes that have the fit() and predict() methods. The fit() method accepts the data matrix X as input, and y as well for supervised learning models. This method trains the model on the given data.
The predict() method also takes data points as input (as a (M,D) matrix). It returns the labels or transformed points as predicted by the trained model.
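A minimal sketch of this fit-predict pattern (the linear toy data below is made up purely for illustration):

```python
import numpy as np
import sklearn.linear_model as lm

# N=7 observations, D=1 feature; the target is a simple linear function.
x = np.array([0., .1, .2, .5, .8, .9, 1.])
y = 1. + 2. * x
model = lm.LinearRegression()
model.fit(x[:, np.newaxis], y)     # train on the (N, D) matrix and y

# predict() takes a (M, D) matrix of new points and returns M predictions.
x_new = np.array([[.25], [.5], [.75]])
y_new = model.predict(x_new)
```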
Ordinary least squares regression is one of the simplest regression methods. It consists of approaching the output values yi with a linear combination of the Xij:

    yi ≈ w1 Xi1 + w2 Xi2 + ... + wD XiD

Here, w = (w1, ..., wD) is the (unknown) parameter vector. Ordinary least squares finds the w that minimizes the total squared error between the predictions and the targets:

    sum over i of (yi - (w1 Xi1 + ... + wD XiD))^2

This sum of squared components is the (squared) L2 norm of the error vector y - Xw. It is convenient because it leads to differentiable loss functions so that gradients can be computed and common optimization procedures can be performed.
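This loss also has a well-known closed-form minimizer, w = (X^T X)^(-1) X^T y. The sketch below (synthetic data, solved with the numerically stable np.linalg.lstsq rather than an explicit inverse) recovers the true coefficients:

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(50, 3)                    # 50 observations, 3 features
w_true = np.array([1.5, -2.0, 0.5])
y = X @ w_true + 0.01 * rng.randn(50)   # linear model plus small noise

# Least squares solution of min ||y - Xw||^2.
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w_hat)  # close to w_true
```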
Ordinary least squares regression fits a linear model to the data. The model is linear both in the data points Xi and in the parameters wj. In our example, we obtain a poor fit because the data points were generated according to a nonlinear generative model (an exponential function).
However, we can still use the linear regression method with a model that is linear in wj, but nonlinear in xi. To do this, we need to increase the number of dimensions in our dataset by using a basis of polynomial functions. In other words, we consider the following data points:

    (1, xi, xi^2, ..., xi^D)
Here, D is the maximum degree. The input matrix X is therefore the Vandermonde matrix associated to the original data points xi. For more information on the Vandermonde matrix, refer to http://en.wikipedia.org/wiki/Vandermonde_matrix.
Here, it is easy to see that training a linear model on these new data points is equivalent to training a polynomial model on the original data points.
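This equivalence is easy to check numerically: in the sketch below, a least squares fit on the Vandermonde matrix gives the same coefficients as NumPy's direct polynomial fit, np.polyfit (both return coefficients from the highest degree down):

```python
import numpy as np

x = np.array([0., .1, .2, .5, .8, .9, 1.])
y = np.exp(3 * x)
deg = 2

# Linear least squares on the Vandermonde matrix...
coef_vander, *_ = np.linalg.lstsq(np.vander(x, deg + 1), y, rcond=None)
# ...matches a direct degree-2 polynomial fit.
coef_poly = np.polyfit(x, y, deg)
print(np.allclose(coef_vander, coef_poly))  # True
```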
Polynomial interpolation with linear regression can lead to overfitting if the degree of the polynomials is too large. By capturing the random fluctuations (noise) instead of the general trend of the data, the model loses some of its predictive power. This corresponds to a divergence of the polynomial’s coefficients wj.
A solution to this problem is to prevent these coefficients from growing unboundedly. With ridge regression (also known as Tikhonov regularization), this is done by adding a regularization term to the loss function:

    ||y - Xw||^2 + alpha * ||w||^2

For more details on Tikhonov regularization, refer to http://en.wikipedia.org/wiki/Tikhonov_regularization. By minimizing this loss function, we not only minimize the error between the model and the data (the first term, related to the bias), but also the size of the model's coefficients (the second term, related to the variance). The bias-variance trade-off is quantified by the hyperparameter alpha.
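Like ordinary least squares, this regularized loss has a closed-form solution, w = (X^T X + alpha * I)^(-1) X^T y. The sketch below (synthetic data; fit_intercept=False so that scikit-learn's Ridge solves exactly this formula) checks the formula against scikit-learn:

```python
import numpy as np
import sklearn.linear_model as lm

rng = np.random.RandomState(42)
X = rng.randn(30, 4)
y = X @ np.array([1., -1., 2., 0.]) + 0.1 * rng.randn(30)
alpha = 0.5

# Closed form: w = (X^T X + alpha * I)^{-1} X^T y
w = np.linalg.solve(X.T @ X + alpha * np.eye(4), X.T @ y)

ridge = lm.Ridge(alpha=alpha, fit_intercept=False)
ridge.fit(X, y)
print(np.allclose(w, ridge.coef_))  # True
```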
Here, ridge regression led to a polynomial with smaller coefficients, and thus a better fit.
A drawback of the ridge regression model compared to the ordinary least squares model is the presence of an extra hyperparameter, alpha. To solve this problem, we can use a grid search: we loop over many possible values of alpha and evaluate the model for each of them. How can we assess the performance of a model with a given alpha? A solution is cross-validation: we split the dataset into a train set and a test set, train the model on the former, and evaluate its predictions on the latter.
There are many ways to split the initial dataset into two parts like this. One possibility is to remove one sample to form the train set and to put this one sample into the test set. This is called Leave-One-Out cross-validation. With N samples, we obtain N pairs of train and test sets. The cross-validated performance is the average performance over all these decompositions.
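As a sketch, Leave-One-Out cross-validation can be written explicitly with scikit-learn's LeaveOneOut splitter (located in sklearn.model_selection in recent versions), here for a degree 2 polynomial fit on our toy data:

```python
import numpy as np
import sklearn.linear_model as lm
from sklearn.model_selection import LeaveOneOut

x = np.array([0., .1, .2, .5, .8, .9, 1.])
y = np.exp(3 * x)
X = np.vander(x, 3)  # degree 2 polynomial features

# Each of the N samples serves as the test set exactly once.
errors = []
for train_idx, test_idx in LeaveOneOut().split(X):
    model = lm.LinearRegression()
    model.fit(X[train_idx], y[train_idx])
    pred = model.predict(X[test_idx])
    errors.append((pred[0] - y[test_idx][0]) ** 2)

# Cross-validated performance: average over the N decompositions.
print(np.mean(errors))
```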
As we will see later, scikit-learn implements several easy-to-use functions to do cross-validation and grid search. In scikit-learn, there exists a special estimator called RidgeCV that implements a cross-validation and grid-search procedure specific to the ridge regression model. Using this model ensures that the best hyperparameter alpha is found automatically for us.
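For instance, we can pass an explicit grid of candidate values to RidgeCV (the alpha values below are arbitrary, chosen only for illustration) and read the selected one from the alpha_ attribute after fitting:

```python
import numpy as np
import sklearn.linear_model as lm

x = np.array([0., .1, .2, .5, .8, .9, 1.])
y = np.exp(3 * x) + np.random.RandomState(0).randn(7)

# RidgeCV evaluates each candidate alpha with (leave-one-out)
# cross-validation and keeps the best one.
ridge = lm.RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0])
ridge.fit(np.vander(x, 6), y)
print(ridge.alpha_)  # the selected regularization weight
```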
Using the scikit-learn Python package, this article illustrates fundamental data mining and machine learning concepts such as supervised and unsupervised learning, classification, regression, feature selection, feature extraction, overfitting, regularization, cross-validation, and grid search.