In this article by **Patrick R. Nicolas**, the author of the book Scala for Machine Learning, we will cover the basics of ridge regression. The purpose of regression is to minimize a loss function, the **residual sum of squares** (**RSS**) being the one commonly used. The problem of overfitting can be addressed by adding a penalty term to the loss function. The **penalty term** is an element of the larger concept of regularization.


# L_{n} roughness penalty

**Regularization** consists of adding a penalty function *J(w)* to the loss function (or RSS in the case of a regression model) in order to prevent the model parameters (or weights) from reaching high values. A model that fits a training set very well tends to have many feature variables with relatively large weights. Penalizing these large weights is known as **shrinkage**. Practically, shrinkage consists of adding a function with model parameters as an argument to the loss function, weighted by a factor *λ*:

$$\hat{w} = \arg\min_{w} \left\{ RSS(w) + \lambda J(w) \right\}$$

The penalty function is completely independent from the training set *{x,y}*. The penalty term is usually expressed as a power of the norm of the model parameters (or weights) *w_d*. For a model of dimension *D*, the generic **Lp-norm** is defined as follows:

$$\|w\|_p = \left( \sum_{d=1}^{D} |w_d|^p \right)^{1/p}$$
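The Lp-norm can be sketched in a few lines of Scala. The object `Penalty` and its two methods are hypothetical names introduced for illustration only; the sketch assumes the weights are passed without the intercept.

```scala
object Penalty {
  /** Lp-norm of the weights w1..wD (the intercept is assumed to be excluded by the caller). */
  def lpNorm(w: Array[Double], p: Double): Double = {
    require(p >= 1.0, "p must be >= 1 for a proper norm")
    math.pow(w.map(x => math.pow(math.abs(x), p)).sum, 1.0 / p)
  }

  /** Penalty term lambda * sum(|w_d|^p), that is, lambda times the p-th power of the Lp-norm. */
  def penalty(w: Array[Double], p: Double, lambda: Double): Double =
    lambda * w.map(x => math.pow(math.abs(x), p)).sum
}
```

With *p = 1* the penalty is the sum of the absolute values of the weights, and with *p = 2* it is the sum of their squares.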

**Notation** Regularization applies to parameters or weights associated with an observation. In order to be consistent with our notation, *w0* being the intercept value, the regularization applies to the parameters *w1…wd*.

The two most commonly used penalty functions for regularization are L_{1} and L_{2}.

**Regularization in machine learning** The regularization technique is not specific to linear or logistic regression. Any algorithm that minimizes a loss function such as the residual sum of squares, for instance a support vector machine or a feed-forward neural network, can be regularized by adding a roughness penalty function to that loss.

The L_{1} regularization applied to the linear regression is known as the **Lasso regularization**. The **Ridge regression** is a linear regression that uses the L_{2} regularization penalty.

You may wonder which regularization makes sense for a given training set. In a nutshell, L_{2} and L_{1} regularizations differ in terms of computational efficiency, estimation, and feature selection (refer to the *13.3 L1 regularization: basics* section in the book *Machine Learning: A Probabilistic Perspective*, and the *Feature selection, L1 vs. L2 regularization, and rotational invariance* paper available at http://www.machinelearning.org/proceedings/icml2004/papers/354.pdf).

The various differences between the two regularizations are as follows:

- **Model estimation**: L_{1} generates a sparser estimation of the regression parameters than L_{2}. For a large non-sparse dataset, L_{2} has a smaller estimation error than L_{1}.
- **Feature selection**: L_{1} is more effective than L_{2} in reducing the regression weights for features with high values. Therefore, L_{1} is a reliable feature selection tool.
- **Overfitting**: Both L_{1} and L_{2} reduce the impact of overfitting. However, L_{1} has a significant advantage in overcoming overfitting (or excessive complexity of a model) for the same reason it is more appropriate for selecting features.
- **Computation**: L_{2} is conducive to a more efficient computation model. The sum of the loss function and the L_{2} penalty *w²* is a continuous and differentiable function for which the first and second derivatives can be computed (**convex minimization**). The L_{1} term is the sum of *|wi|* and is therefore not differentiable at *wi = 0*.

**Terminology** The ridge regression is sometimes called the **penalized least squares** regression. The L_{2} regularization is also known as **weight decay**.
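The difference in sparsity between the two penalties can be illustrated in the simplified case of an orthonormal design matrix, where each weight can be updated independently from its least squares estimate. This is a sketch under that assumption only, for a loss written as RSS plus λ times the penalty; `Shrinkage`, `ridgeShrink`, and `lassoShrink` are hypothetical names.

```scala
object Shrinkage {
  /** L2 penalty: the least squares weight is scaled down but never reaches exactly zero. */
  def ridgeShrink(wOls: Double, lambda: Double): Double = wOls / (1.0 + lambda)

  /** L1 penalty: soft-thresholding at lambda/2, so small weights are set to zero. */
  def lassoShrink(wOls: Double, lambda: Double): Double =
    math.signum(wOls) * math.max(math.abs(wOls) - lambda / 2.0, 0.0)
}

object ShrinkageDemo extends App {
  val olsWeights = Array(3.0, 0.4, -0.2)
  val lambda = 1.0
  // Ridge keeps all three weights non-zero; lasso zeroes the two small ones
  println(olsWeights.map(Shrinkage.ridgeShrink(_, lambda)).mkString(", "))
  println(olsWeights.map(Shrinkage.lassoShrink(_, lambda)).mkString(", "))
}
```

Under this assumption, the L_{1} penalty removes the features with small weights entirely, which is why it doubles as a feature selection tool.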

Let’s implement the ridge regression, and then evaluate the impact of the L_{2}-norm penalty factor.

# Ridge regression

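The ridge regression is a multivariate linear regression with an L_{2} norm penalty term, and can be calculated as follows:

$$\hat{w}_{Ridge} = \arg\min_{w} \left\{ \sum_{i=1}^{n} \left( y_i - w_0 - \sum_{d=1}^{D} w_d x_{i,d} \right)^2 + \lambda \sum_{d=1}^{D} w_d^2 \right\}$$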

The computation of the ridge regression parameters requires solving a system of linear equations, similar to the linear regression.
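The link to a system of linear equations can be made explicit with a short derivation. It is a sketch that assumes the intercept has been removed by centering the data, so that the penalty applies to every component of *w*. The symbols *X* (matrix of observations) and *y* (vector of labels) are notation assumed here.

```latex
\begin{aligned}
L(w) &= (y - Xw)^T (y - Xw) + \lambda\, w^T w \\
\nabla_w L(w) &= -2 X^T (y - Xw) + 2 \lambda w = 0 \\
\Rightarrow\quad (X^T X + \lambda I)\, w &= X^T y
\end{aligned}
```

For *λ = 0*, these equations reduce to the normal equations of the ordinary least squares regression.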

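With *X* the matrix of observations and *y* the vector of labels, the matrix representation of the ridge regression closed form is as follows:

$$\hat{w}_{Ridge} = (X^T X + \lambda I)^{-1} X^T y$$

*I* is the identity matrix. The system is solved using the QR decomposition, as shown here:

$$X^T X + \lambda I = QR, \qquad \hat{w}_{Ridge} = R^{-1} Q^T X^T y$$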
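The closed form can be checked on a small dataset with plain Scala. The sketch below is not the implementation described in this article: it solves the system with Gaussian elimination rather than a QR decomposition, it ignores the intercept, and the names `RidgeClosedForm`, `solve`, and `fit` are hypothetical.

```scala
object RidgeClosedForm {
  type Matrix = Array[Array[Double]]

  /** Solves A w = b by Gaussian elimination with partial pivoting. */
  def solve(a: Matrix, b: Array[Double]): Array[Double] = {
    val n = b.length
    // Augmented matrix [A | b]; rows are copied so the inputs are not mutated
    val m = Array.tabulate(n)(i => a(i) :+ b(i))
    for (col <- 0 until n) {
      val pivot = (col until n).maxBy(r => math.abs(m(r)(col)))
      val tmp = m(col); m(col) = m(pivot); m(pivot) = tmp
      for (row <- col + 1 until n) {
        val f = m(row)(col) / m(col)(col)
        for (k <- col to n) m(row)(k) -= f * m(col)(k)
      }
    }
    // Back substitution on the resulting upper-triangular system
    val w = new Array[Double](n)
    for (i <- n - 1 to 0 by -1) {
      val s = (i + 1 until n).map(k => m(i)(k) * w(k)).sum
      w(i) = (m(i)(n) - s) / m(i)(i)
    }
    w
  }

  /** Ridge weights for an n x d matrix x: solves (X^T X + lambda I) w = X^T y. */
  def fit(x: Matrix, y: Array[Double], lambda: Double): Array[Double] = {
    val d = x.head.length
    val xtx = Array.tabulate(d, d) { (i, j) =>
      x.map(row => row(i) * row(j)).sum + (if (i == j) lambda else 0.0)
    }
    val xty = Array.tabulate(d)(i => x.indices.map(r => x(r)(i) * y(r)).sum)
    solve(xtx, xty)
  }
}
```

Calling `RidgeClosedForm.fit` with `lambda = 0.0` returns the least squares weights, and larger values of `lambda` shrink them toward zero.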

## Implementation

The implementation of the ridge regression adds an L_{2} regularization term to the multiple linear regression computation of the Apache Commons Math library.

The methods of *RidgeRegression* have the same signatures as their ordinary least squares counterparts. However, the class has to inherit the abstract base class *AbstractMultipleLinearRegression* of the Apache Commons Math library and override the generation of the QR decomposition to include the penalty term, as shown in the following code:

    class RidgeRegression[T <% Double](
        val xt: XTSeries[Array[T]],
        val y: DblVector,
        val lambda: Double)
      extends AbstractMultipleLinearRegression
      with PipeOperator[Array[T], Double] {

      // QR decomposition that includes the L2 penalty term
      private var qr: QRDecomposition = null
      // Model (weights and RSS) trained during instantiation
      private[this] val model: Option[RegressionModel] = …
      …
    }

Besides the input time series *xt* and the labels *y*, the ridge regression requires the *lambda* factor of the L_{2} penalty term. The instantiation of the class trains the *model*. The steps to create the ridge regression model are as follows:

- Extract the Q and R matrices for the input values, *newXSampleData* (line *1*)
- Compute the weights using the *calculateBeta* method defined in the base class (line *2*)
- Return the tuple of regression weights, *calculateBeta*, and residuals, *calculateResiduals*

These steps are implemented as follows:

    private val model: Option[RegressionModel] = {
      this.newXSampleData(xt.toDblMatrix) //1
      newYSampleData(y)
      // RSS is the sum of the squared residuals
      val _rss = calculateResiduals.toArray.map(x => x*x).sum
      val wRss = (calculateBeta.toArray, _rss) //2
      Some(RegressionModel(wRss._1, wRss._2))
    }

The QR decomposition in the *AbstractMultipleLinearRegression* base class does not include the penalty term (line *3*); the identity matrix with the *lambda* factor in the diagonal has to be added to the matrix to be decomposed (line *4*).

    override protected def newXSampleData(x: DblMatrix): Unit = {
      super.newXSampleData(x) //3
      val xtx: RealMatrix = getX
      val nFeatures = xt(0).size
      // Add lambda to each diagonal element before the decomposition
      Range(0, nFeatures).foreach(i =>
        xtx.setEntry(i, i, xtx.getEntry(i, i) + lambda)) //4
      qr = new QRDecomposition(xtx)
    }

The regression weights are computed by solving the system of linear equations using substitution on the QR matrices. The computation overrides the *calculateBeta* function from the base class:

    override protected def calculateBeta: RealVector =
      qr.getSolver().solve(getY())
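The solver call can be tried in isolation on a small system. The following sketch assumes that the Apache Commons Math 3 library (`commons-math3`) is on the classpath; the object `QrSolveExample` and its `solve` method are hypothetical names, not part of the article's code.

```scala
import org.apache.commons.math3.linear.{Array2DRowRealMatrix, ArrayRealVector, QRDecomposition}

object QrSolveExample {
  /** Solves A w = b, in the least squares sense, through the QR decomposition of A. */
  def solve(a: Array[Array[Double]], b: Array[Double]): Array[Double] = {
    val qr = new QRDecomposition(new Array2DRowRealMatrix(a))
    // The solver applies Q^T to b and then back-substitutes on R
    qr.getSolver().solve(new ArrayRealVector(b)).toArray()
  }
}
```

For instance, `QrSolveExample.solve(Array(Array(2.0, 0.0), Array(0.0, 2.0)), Array(2.0, 4.0))` solves a diagonal system whose solution is *(1, 2)*.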

## Test case

The objective of the test case is to identify the impact of the L_{2} penalization on the RSS value, and then compare the predicted values with original values.

Let’s consider the first test case related to the regression on the daily price variation of the Copper ETF (symbol: CU) using the stock daily volatility and volume as features. The extraction of observations is implemented in the same way as for the least squares regression:

    val src = DataSource(path, true, true, 1)
    val price = src |> YahooFinancials.adjClose
    val volatility = src |> YahooFinancials.volatility
    val volume = src |> YahooFinancials.volume //1

    // Daily price change: price(t+1) - price(t)
    val _price = price.get.toArray
    val deltaPrice = XTSeries[Double](_price
        .drop(1)
        .zip(_price.take(_price.size -1))
        .map(z => z._1 - z._2)) //2

    // Each observation is a (volatility, volume) pair
    val data = volatility.get
        .zip(volume.get)
        .map(z => Array[Double](z._1, z._2)) //3

    val features = XTSeries[DblVector](data.take(data.size-1))
    val regression =
      new RidgeRegression[Double](features, deltaPrice, lambda) //4
    regression.rss match {
      case Some(rss) => Display.show(rss, logger) //5
      ….

The observed data, ETF daily *price*, and the features (*volatility* and *volume*) are extracted from the source *src* (line *1*). The daily price change, *deltaPrice*, is computed using a combination of Scala *take* and *drop* methods (line *2*). The *features* vector is created by zipping *volatility* and *volume* (line *3*). The model is created by instantiating the *RidgeRegression* class (line *4*). The RSS value, *rss*, is finally displayed (line *5*).
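The *drop* and *take* combination can be tried on a plain array, independently of the *XTSeries* and *DataSource* classes. The object `PriceDelta` is a hypothetical name used for this sketch.

```scala
object PriceDelta {
  /** Daily change price(t+1) - price(t); the result has one element fewer than the input. */
  def delta(prices: Array[Double]): Array[Double] =
    prices.drop(1)                          // price(t+1)
      .zip(prices.take(prices.length - 1))  // paired with price(t)
      .map { case (next, prev) => next - prev }
}
```

Because the series of price changes is one element shorter than the series of prices, the features are truncated accordingly with *data.take(data.size-1)* before the regression is created.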

The RSS value, *rss*, is plotted for different values of *lambda <= 1.0* in the following graph:

Graph of RSS versus Lambda for Copper ETF

The residual sum of squares decreases as *λ* increases. The curve seems to be reaching for a minimum around *λ = 1*. The case of *λ = 0* corresponds to the least squares regression.

Next, let’s plot the RSS value for *λ* varying between 1 and 100:

Graph of RSS versus large values of Lambda for Copper ETF

This time around, RSS increases with *λ* before reaching a maximum for *λ > 60*. This behavior is consistent with other findings (refer to *Lecture 5: Model selection and assessment*, a lecture by H. Bravo and R. Irizarry from the Department of Computer Science, University of Maryland, in 2010, available at http://www.cbcb.umd.edu/~hcorrada/PracticalML/pdf/lectures/selection.pdf). As *λ* increases, overfitting is penalized more heavily, and therefore the RSS value increases.

The regression *weights* can be retrieved as follows:

regression.weights.get

Let’s plot the predicted price variation of the Copper ETF using the ridge regression with different values of lambda (*λ*):

Graph of ridge regression on Copper ETF price variation with variable Lambda

The original price variation of the Copper ETF, *Δ = price(t+1) - price(t)*, is plotted as *λ = 0*. The predicted values for *λ = 0.8* are very similar to the original data: they follow the pattern of the original data with a reduction of the large variations (peaks and troughs). The predicted values for *λ = 5* correspond to a smoothed dataset. The pattern of the original data is preserved but the magnitude of the price variation is significantly reduced.

The reader is invited to apply the more elaborate K-fold validation routine and compute precision, recall, and F_{1} measure to confirm the findings.
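As a starting point, the index bookkeeping of a K-fold validation can be sketched as follows. The object `KFold` and its `folds` method are hypothetical names; the sketch uses contiguous folds and does not address the ordering constraints of time series data, which may call for a different splitting scheme.

```scala
object KFold {
  /** Splits the indices 0 until n into k (training, validation) pairs of contiguous folds. */
  def folds(n: Int, k: Int): Seq[(Seq[Int], Seq[Int])] = {
    require(k > 1 && k <= n, "need 1 < k <= n")
    (0 until k).map { fold =>
      val start = fold * n / k
      val end = (fold + 1) * n / k
      val validation = start until end
      // Training indices are everything outside the validation block
      val train = (0 until start) ++ (end until n)
      (train, validation)
    }
  }
}
```

Each pair can then be used to train a ridge regression on the training indices and to evaluate it on the validation indices for a given value of *λ*.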

# Summary

The ridge regression is a powerful alternative to the more common least squares regression because it reduces the risk of overfitting. Contrary to the Naïve Bayes classifiers, it does not require conditional independence of the model features.
