Let’s start using one of the most well-known toy datasets, explore it, and select one of the dimensions to learn how to build a linear regression model for its values. Let’s start by importing all the libraries (scikit-learn, seaborn, and matplotlib); one of the excellent features of Seaborn is its ability to define very professional-looking style settings. In this case, we will use the whitegrid style:
import numpy as np from sklearn import datasets import seaborn.apionly as sns %matplotlib inline import matplotlib.pyplot as plt sns.set(style='whitegrid', context='notebook')
The Iris Dataset
It’s time to load the Iris dataset. This is one of the most well-known historical datasets. You will find it in many books and publications. Given the good properties of the data, it is useful for classification and regression examples.
The Iris dataset (https://archive.ics.uci.edu/ml/datasets/Iris) contains 50 records for each of the three types of iris, 150 lines in a total over five fields. Each line is a measurement of the following:
Sepal length in cm Sepal width in cm Petal length in cm Petal width in cm
The final field is the type of flower (setosa, versicolor, or virginica). Let’s use the
load_dataset method to create a matrix of values from the dataset:
iris2 = sns.load_dataset('iris')
In order to understand the dependencies between variables, we will implement the covariance operation. It will receive two arrays as parameters and will return the covariance(x,y) value:
def covariance (X, Y): xhat=np.mean(X) yhat=np.mean(Y) epsilon=0 for x,y in zip (X,Y): epsilon=epsilon+(x-xhat)*(y-yhat) return epsilon/(len(X)-1)
Let’s try the implemented function and compare it with the NumPy function. Note that we calculated
cov (a,b), and NumPy generated a matrix of all the combinations
cov(a,b), so our result should be equal to the
values (1,0) and
(0,1) of that matrix:
print (covariance ([1,3,4], [1,0,2])) print (np.cov([1,3,4], [1,0,2])) 0.5 [[ 2.33333333 0.5 ] [ 0.5 1. ]]
Having done a minimal amount of testing of the correlation function as defined earlier, receive two arrays, such as covariance, and use them to get the final value:
def correlation (X, Y): return (covariance(X,Y)/(np.std(X, ddof=1)*np.std(Y, ddof=1))) ##We have to indicate ddof=1 the unbiased std
Let’s test this function with two sample arrays, and compare this with the
(1,0) values of the correlation matrix from NumPy:
print (correlation ([1,1,4,3], [1,0,2,2])) print (np.corrcoef ([1,1,4,3], [1,0,2,2])) 0.870388279778 [[ 1. 0.87038828] [ 0.87038828 1. ]]
Getting an intuitive idea with Seaborn pairplot
A very good idea when starting worke on a problem is to get a graphical representation of all the possible variable combinations.
Seaborn’s pairplot function provides a complete graphical summary of all the variable pairs, represented as scatterplots, and a representation of the univariate distribution for the matrix diagonal.
Let’s look at how this plot type shows all the variables dependencies, and try to look for a linear relationship as a base to test our regression methods:
sns.pairplot(iris2, size=3.0) <seaborn.axisgrid.PairGrid at 0x7f8a2a30e828>
Pairplot of all the variables in the dataset.
Lets’ select two variables that, from our initial analysis, have the property of being linearly dependent. They are
Let’s now take a look at this variable combination, which shows a clear linear tendency:
This is the representation of the chosen variables, in a scatter type graph:
This is the current distribution of data that we will try to model with our linear prediction function.
Creating the prediction function
First, let’s define the function that will abstractedly represent the modeled data, in the form of a linear function, with the form y=beta*x+alpha:
def predict(alpha, beta, x_i): return beta * x_i + alpha
Defining the error function
It’s now time to define the function that will show us the difference between predictions and the expected output during training. We have two main alternatives: measuring the absolute difference between the values (or
L1), or measuring a variant of the square of the difference (or L2). Let’s define both versions, including the first formulation inside the second:
def error(alpha, beta, x_i, y_i): #L1 return y_i - predict(alpha, beta, x_i) def sum_sq_e(alpha, beta, x, y): #L2 return sum(error(alpha, beta, x_i, y_i) ** 2 for x_i, y_i in zip(x, y))
Now, we will define a function implementing the correlation method to find the parameters for our regression:
def correlation_fit(x, y): beta = correlation(x, y) * np.std(y, ddof=1) / np.std(x,ddof=1) alpha = np.mean(y) - beta * np.mean(x) return alpha, beta
Let’s then run the fitting function and print the guessed parameters:
alpha, beta = correlation_fit(X, Y) print(alpha) print(beta) 1.08355803285 2.22994049512
Let’s now graph the regressed line with the data in order to intuitively show the appropriateness of the solution:
plt.scatter(X,Y) xr=np.arange(0,3.5) plt.plot(xr,(xr*beta)+alpha)
This is the final plot we will get with our recently calculated slope and intercept:
Final regressed line
Polynomial regression and an introduction to underfitting and overfitting
When looking for a model, one of the main characteristics we look for is the power of generalizing with a simple functional expression. When we increase the complexity of the model, it’s possible that we are building a model that is good for the training data, but will be too optimized for that particular subset of data.
Underfitting, on the other hand, applies to situations where the model is too simple, such as this case, which can be represented fairly well with a simple linear model.
In the following example, we will work on the same problem as before, using the scikit- learn library to search higher-order polynomials to fit the incoming data with increasingly complex degrees.
Going beyond the normal threshold of a quadratic function, we will see how the function looks to fit every wrinkle in the data, but when we extrapolate, the values outside the normal range are clearly out of range:
from sklearn.linear_model import Ridge from sklearn.preprocessing import PolynomialFeatures from sklearn.pipeline import make_pipeline ix=iris2['petal_width'] iy=iris2['petal_length'] # generate points used to represent the fitted function x_plot = np.linspace(0, 2.6, 100) # create matrix versions of these arrays X = ix[:, np.newaxis] X_plot = x_plot[:, np.newaxis] plt.scatter(ix, iy, s=30, marker='o', label="training points") for count, degree in enumerate([3, 6, 20]): model = make_pipeline(PolynomialFeatures(degree), Ridge()) model.fit(X, iy) y_plot = model.predict(X_plot) plt.plot(x_plot, y_plot, label="degree %d" % degree) plt.legend(loc='upper left') plt.show()
The combined graph shows how the different polynomials’ coefficients describe the data population in different ways. The 20 degree polynomial shows clearly how it adjusts perfectly for the trained dataset, and after the known values, it diverges almost spectacularly, going against the goal of generalizing for future data.
Curve ﬁtting of the initial dataset, with polynomials of increasing values
With this, we successfully explored how to develop an efficient linear regression model in Python and how you can make predictions using the designed model. We’ve reviewed ways to identify and optimize the correlation between the prediction and the expected output using simple and definite functions.
If you enjoyed our post, you must check out Machine Learning for Developers to uncover advanced tools for building machine learning applications on your fingertips.