In this article by Alberto Boschetti and Luca Massaron authors of book Regression Analysis with Python, we will learn about gradient descent, its feature scaling and a simple implementation.

(For more resources related to this topic, see here.)

As an alternative from the usual classical optimization algorithms, the gradient descent technique is able to minimize the cost function of a linear regression analysis using much less computations. In terms of complexity, gradient descent ranks in the order O(n*p), thus making learning regression coefficients feasible even in the occurrence of a large n (that stands for the number of observations) and large p (number of variables).

The method works by leveraging a simple heuristic that gradually converges to the optimal solution starting from a random one. Explaining it using simple words, it resembles walking blind in the mountains. If you want to descend to the lowest valley, even if you don't know and can't see the path, you can proceed approximately by going downhill for a while, then stopping, then directing downhill again, and so on, always directing at each stage where the surface descends until you arrive at a point when you cannot descend anymore. Hopefully, at that point, you will have reached your destination.

In such a situation, your only risk is to pass by an intermediate valley (where there is a wood or a lake for instance) and mistake it for your desired arrival because the land stops descending there.

In an optimization process, such a situation is defined as a local minimum (whereas your target is a global minimum, instead of the best minimum possible) and it is a possible outcome of your journey downhill depending on the function you are working on minimizing. The good news, in any case, is that the error function of the linear model family is a bowl-shaped one (technically, our cost function is a concave one) and it is unlikely that you can get stuck anywhere if you properly descend.

The necessary steps to work out a gradient-descent-based solution are hereby described.

Given our cost function for a set of coefficients (the vector w):

gradient-descent-work-img-0

We first start by choosing a random initialization for w by choosing some random numbers (taken from a standardized normal curve, for instance, having zero mean and unit variance).

Then, we start reiterating an update of the values of w (opportunely using the gradient descent computations) until the marginal improvement from the previous J(w) is small enough to let us figure out that we have finally reached an optimum minimum.

We can opportunely update our coefficients, separately one by one, by subtracting from each of them a portion alpha (α, the learning rate) of the partial derivative of the cost function:

Here, in our formula, wj is to be intended as a single coefficient (we are iterating over them). After resolving the partial derivative, the final resolution form is:

Simplifying everything, our gradient for the coefficient of x is just the average of our predicted values multiplied by their respective x value. We have to notice that by introducing more parameters to be estimated during the optimization procedure, we are actually introducing more dimensions to our line of fit (turning it into a hyperplane, a multidimensional surface) and such dimensions have certain communalities and differences to be taken into account.

Alpha, called the learning rate, is very important in the process, because if it is too large, it may cause the optimization to detour and fail. You have to think of each gradient as a jump or as a run in a direction. If you fully take it, you may happen to pass over the optimum minimum and end up in another rising slope. Too many consecutive long steps may even force you to climb up the cost slope, worsening your initial position (given by a cost function that is its summed square, the loss of an overall score of fitness).

Using a small alpha, the gradient descent won't jump beyond the solution, but it may take much longer to reach the desired minimum. How to choose the right alpha is a matter of trial and error. Anyway, starting from an alpha, such as 0.01, is never a bad choice based on our experience in many optimization problems.

Naturally, the gradient, given the same alpha, will in any case produce shorter steps as you approach the solution. Visualizing the steps in a graph can really give you a hint about whether the gradient descent is working out a solution or not.

Though quite conceptually simple (it is based on an intuition that we surely applied ourselves to move step by step where we can optimizing our result), gradient descent is very effective and indeed scalable when working with real data. Such interesting characteristics elevated it to be the core optimization algorithm in machine learning, not being limited to just the linear model's family, but also, for instance, extended to neural networks for the process of back propagation that updates all the weights of the neural net in order to minimize the training errors. Surprisingly, the gradient descent is also at the core of another complex machine learning algorithm, the gradient boosting tree ensembles, where we have an iterative process minimizing the errors using a simpler learning algorithm (a so-called weak learner because it is limited by an high bias) for progressing toward the optimization.

Scikit-learn linear_regression and other linear models present in the linear methods module are actually powered by gradient descent, making Scikit-learn our favorite choice while working on data science projects with large and big data.

Feature scaling

While using the classical statistical approach, not the machine learning one, working with multiple features requires attention while estimating the coefficients because of their similarities that can cause a variance inflection of the estimates.

Moreover, multicollinearity between variables also bears other drawbacks because it can render very difficult, if not impossible to achieve, matrix inversions, the matrix operation at the core of the normal equation coefficient estimation (and such a problem is due to the mathematical limitation of the algorithm). Gradient descent, instead, is not affected at all by reciprocal correlation, allowing the estimation of reliable coefficients even in the presence of perfect collinearity.

Anyway, though being quite resistant to the problems that affect other approaches, gradient descent's simplicity renders it vulnerable to other common problems, such as the different scale present in each feature. In fact, some features in your data may be represented by the measurements in units, some others in decimals, and others in thousands, depending on what aspect of reality each feature represents. For instance, in the dataset we decide to take as an example, the Boston houses dataset (http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_boston.html), a feature is the average number of rooms (a float ranging from about 5 to over 8), others are the percentage of certain pollutants in the air (float between 0 and 1), and so on, mixing very different measurements. When it is the case that the features have a different scale, though the algorithm will be processing each of them separately, the optimization will be dominated by the variables with the more extensive scale. Working in a space of dissimilar dimensions will require more iterations before convergence to a solution (and sometimes, there could be no convergence at all).

The remedy is very easy; it is just necessary to put all the features on the same scale. Such an operation is called feature scaling. Feature scaling can be achieved through standardization or normalization. Normalization rescales all the values in the interval between zero and one (usually, but different ranges are also possible), whereas standardization operates removing the mean and dividing by the standard deviation to obtain a unit variance. In our case, standardization is preferable both because it easily permits retuning the obtained standardized coefficients into their original scale and because, centering all the features at the zero mean, it makes the error surface more tractable by many machine learning algorithms, in a much more effective way than just rescaling the maximum and minimum of a variable.

An important reminder while applying feature scaling is that changing the scale of the features implies that you will have to use rescaled features also for predictions.

A simple implementation

Let's try the algorithm first using the standardization based on the Scikit-learn preprocessing module:

import numpy as np

import random

from sklearn.datasets import load_boston

from sklearn.preprocessing import StandardScaler

from sklearn.linear_model import LinearRegression

 

boston = load_boston()

standardization = StandardScaler()

y = boston.target

X = np.column_stack((standardization.fit_transform(boston.data), np.ones(len(y))))

In the preceding code, we just standardized the variables using the StandardScaler class from Scikit-learn. This class can fit a data matrix, record its column means and standard deviations, and operate a transformation on itself as well as on any other similar matrixes, standardizing the column data. By means of this method, after fitting, we keep a track of the means and standard deviations that have been used because they will come handy if afterwards we will have to recalculate the coefficients using the original scale.

Now, we just record a few functions for the following computations:

def random_w(p):

    return np.array([np.random.normal() for j in range(p)])

 

def hypothesis(X, w):

    return np.dot(X,w)

 

def loss(X, w, y):

    return hypothesis(X, w) - y

 

def squared_loss(X, w, y):

    return loss(X, w, y)**2

 

def gradient(X, w, y):

    gradients = list()

    n = float(len(y))

    for j in range(len(w)):

        gradients.append(np.sum(loss(X, w, y) * X[:,j]) / n)

    return gradients

 

def update(X, w, y, alpha=0.01):

    return [t - alpha*g for t, g in zip(w, gradient(X, w, y))]

 

def optimize(X, y, alpha=0.01, eta = 10**-12, iterations = 1000):

    w = random_w(X.shape[1])

    for k in range(iterations):

        SSL = np.sum(squared_loss(X,w,y))

        new_w = update(X,w,y, alpha=alpha)

        new_SSL = np.sum(squared_loss(X,new_w,y))

        w = new_w

        if k>=5 and (new_SSL - SSL <= eta and new_SSL - SSL >= -eta):

            return w

    return w

We can now calculate our regression coefficients:

w = optimize(X, y, alpha = 0.02, eta = 10**-12, iterations = 20000)

print ("Our standardized coefficients: " + 

  ', '.join(map(lambda x: "%0.4f" % x, w)))

Our standardized coefficients: -0.9204, 1.0810, 0.1430, 0.6822, -2.0601, 2.6706, 0.0211, -3.1044, 2.6588, -2.0759, -2.0622, 0.8566, -3.7487, 22.5328

A simple comparison with Scikit-learn's solution can prove if our code worked fine:

sk=LinearRegression().fit(X[:,:-1],y)

w_sk = list(sk.coef_) + [sk.intercept_]

print ("Scikit-learn's standardized coefficients: " + ', '.join(map(lambda x: "%0.4f" % x, w_sk)))

Scikit-learn's standardized coefficients: -0.9204, 1.0810, 0.1430, 0.6822, -2.0601, 2.6706, 0.0211, -3.1044, 2.6588, -2.0759, -2.0622, 0.8566, -3.7487, 22.5328

A noticeable particular to mention is our choice of alpha. After some tests, the value of 0.02 has been chosen for its good performance on this very specific problem. Alpha is the learning rate and, during optimization, it can be fixed or changed according to a line search method, modifying its value in order to minimize the cost function at each single step of the optimization process. In our example, we opted for a fixed learning rate and we had to look for its best value by trying a few optimization values and deciding on which minimized the cost in the minor number of iterations.

Summary

In this article we learned about gradient descent, its feature scaling and a simple implementation using an algorithm based on Scikit-learn preprocessing module.