In this article by Andrea Isoni, author of the book Machine Learning for the Web, the most relevant regression and classification techniques are discussed. All of these algorithms share the same background procedure, and usually the name of the algorithm refers to both a classification and a regression method. The linear regression algorithms, Naive Bayes, decision trees, and support vector machines are going to be discussed in the following sections. To understand how to employ these techniques, a classification and a regression problem will be solved using the mentioned methods. Essentially, a labeled training dataset will be used to train the models, which means finding the values of the parameters, as we discussed in the introduction. As usual, the code is available in my GitHub folder at https://github.com/ai2010/machine_learning_for_the_web/tree/master/chapter_3/.
We will conclude the article with an extra algorithm that may be used for classification, although it is not specifically designed for this purpose (the hidden Markov model). We will begin by explaining the general causes of error in these methods when predicting the true labels associated with a dataset.
We said that the trained model is used to predict the labels of new data, and the quality of the prediction depends on the ability of the model to generalize, that is, to correctly predict cases not present in the training data. This is a well-known problem in the literature, and it is related to two concepts: the bias and the variance of the outputs.
The bias is the error due to a wrong assumption in the algorithm: given a point x(t) with label yt, the model is biased if, when trained on different training sets, the predicted label ytpred is systematically different from yt. The variance error instead refers to how widely spread the predicted labels for the given point x(t) are across those training sets. A classic example to explain the concepts is to consider a circle with the true value at the center (true label), as shown in the following figure. The closer the predicted labels are to the center, the more unbiased the model and the lower the variance (top left in the following figure). The other three cases are also shown here:
Variance and bias example.
A model with low variance and low bias errors will have its predicted labels (the blue dots in the preceding figure) concentrated on the red center (the true label). A high bias error occurs when the predictions are far away from the true label, while high variance appears when the predictions are spread over a wide range of values.
We have already seen that labels can be continuous or discrete, corresponding to regression and classification problems respectively. Most of the models are suitable for solving both problems, and we are going to use the words regression and classification referring to the same model. More formally, given a set of N data points x(i) and corresponding labels y(i), a model h with parameters θ is trained so that its predictions hθ(x(i)) are as close as possible to the true labels y(i).
We will use the mean squared error (MSE), the average of the squared differences between predicted and true labels, as a measure to evaluate the methods discussed in this article. Now we will start describing the generalized linear methods.
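As a concrete illustration of the MSE, here is a minimal NumPy sketch (the sample values are ours, chosen only for the example):

```python
import numpy as np

# True labels and model predictions for five sample points
y_true = np.array([3.0, -0.5, 2.0, 7.0, 4.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0, 4.5])

# Mean squared error: the average of the squared residuals
mse = np.mean((y_true - y_pred) ** 2)
print(mse)  # 0.35
```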
The generalized linear model is a group of models that try to find the M parameters θj of a linear model

hθ(x) = Σj θj xj

that minimize the cost function

J(θ) = 1/(2N) Σi (hθ(x(i)) − y(i))²

Here, x(i) is the i-th training point and y(i) is its label; by convention the first component of x is set to 1 so that the corresponding parameter acts as an intercept. The minimization of J is achieved using an iterative algorithm called batch gradient descent:

θj := θj − (α/N) Σi (hθ(x(i)) − y(i)) xj(i)
Here, α is called the learning rate, and it controls a trade-off between convergence speed and convergence precision. An alternative algorithm, called stochastic gradient descent, loops over the training examples i = 1, …, N and updates the parameters after each one:

θj := θj − α (hθ(x(i)) − y(i)) xj(i)
The θj are updated for each training example i instead of waiting to sum over the entire training set. This algorithm converges near the minimum of J, typically faster than batch gradient descent, but the final solution may oscillate around the real values of the parameters. The following paragraphs describe the most common models.
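The two update schemes can be sketched in a few lines of NumPy. This is a toy implementation under our own choice of learning rates and iteration counts, not the book's code; the noiseless data y = 2x lets both variants converge to the exact parameters:

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.5, n_iters=5000):
    """Batch update: the gradient is averaged over all N examples per step."""
    N, M = X.shape
    theta = np.zeros(M)
    for _ in range(n_iters):
        grad = X.T @ (X @ theta - y) / N
        theta -= alpha * grad
    return theta

def stochastic_gradient_descent(X, y, alpha=0.05, n_epochs=200, seed=0):
    """Stochastic update: theta is adjusted after every single example."""
    rng = np.random.RandomState(seed)
    N, M = X.shape
    theta = np.zeros(M)
    for _ in range(n_epochs):
        for i in rng.permutation(N):
            err = X[i] @ theta - y[i]
            theta -= alpha * err * X[i]
    return theta

# Toy data: y = 2*x, with an intercept column of ones (no noise)
x = np.linspace(0.0, 1.0, 20)
X = np.c_[np.ones_like(x), x]
y = 2.0 * x

theta_batch = batch_gradient_descent(X, y)
theta_sgd = stochastic_gradient_descent(X, y)
print(theta_batch)  # close to [0, 2]
print(theta_sgd)    # close to [0, 2], possibly with a small oscillation
```

On noisy data the stochastic variant would keep oscillating around the minimum, which is the behavior described above.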
Linear regression is the simplest algorithm and is based on the model:

hθ(x) = θᵀx = Σj θj xj

The cost function and update rule are the ones given in the previous section:

J(θ) = 1/(2N) Σi (θᵀx(i) − y(i))²,  θj := θj − (α/N) Σi (θᵀx(i) − y(i)) xj(i)
Ridge regression, also known as Tikhonov regularization, adds a term to the cost function J such that:

J_ridge(θ) = 1/(2N) Σi (hθ(x(i)) − y(i))² + λ Σj θj²

where λ is the regularization parameter.
This clearly shows that the larger the λ value, the more the ridge parameters are shrunk towards 0.
Lasso regression is an algorithm similar to ridge regression, the only difference being that the regularization term is the sum of the absolute values of the parameters:

J_lasso(θ) = 1/(2N) Σi (hθ(x(i)) − y(i))² + λ Σj |θj|
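A quick comparison of the three variants with scikit-learn (a sketch on synthetic data of our own; in sklearn the regularization parameter λ is called `alpha`):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

rng = np.random.RandomState(0)
X = rng.randn(50, 5)
# Only the first two features actually matter
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.randn(50)

for model in (LinearRegression(), Ridge(alpha=1.0), Lasso(alpha=0.1)):
    model.fit(X, y)
    print(type(model).__name__, np.round(model.coef_, 2))
```

Ridge shrinks all coefficients towards 0, while lasso typically drives the irrelevant ones exactly to 0, which is why it is often used for feature selection.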
Despite the name, this algorithm is used for (binary) classification problems, so we define the labels y(i) ∈ {0, 1} and model the probability of the positive class with the logistic (sigmoid) function:

hθ(x) = 1 / (1 + e^(−θᵀx))

In this case, the cost function is defined as follows:

J(θ) = −1/N Σi [ y(i) log hθ(x(i)) + (1 − y(i)) log(1 − hθ(x(i))) ]
From this, the update rule is formally the same as for linear regression (although the model definition, hθ, is different):

θj := θj − (α/N) Σi (hθ(x(i)) − y(i)) xj(i)
Note that the prediction for a point p, hθ(x(p)), is the estimated probability that its label is 1, so the label is assigned by thresholding: 1 if hθ(x(p)) ≥ 0.5, and 0 otherwise.
The logistic regression algorithm is applicable to multiple label problems using the techniques one versus all or one versus one. Using the first method, a problem with K classes is solved by training K logistic regression models, each one assuming the labels of the considered class j as 1 and all the rest as 0. The second approach consists of training a model for each pair of labels (j, k), giving K(K − 1)/2 binary models; a new point is then assigned the class that wins the most pairwise comparisons.
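The one-versus-all strategy can be sketched with scikit-learn, which wraps any binary classifier into a multiclass one via `OneVsRestClassifier` (the Iris dataset and split below are our example choices):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)  # 3 classes -> one-vs-rest trains 3 models
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)
print(len(clf.estimators_))        # one binary logistic model per class
print(clf.score(X_test, y_test))   # test-set accuracy
```

The analogous wrapper for the second strategy is `sklearn.multiclass.OneVsOneClassifier`.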
Now that we have seen the generalized linear model, let's find the parameters θj that satisfy the relationship:

y(i) = hθ(x(i)) + ε(i)

where ε(i) is the error on the i-th point. In the case of linear regression, we can assume that the errors ε(i) are independent and Gaussian with zero mean and variance σ², so that the probability of a label given its point is

p(y(i) | x(i); θ) = (1/√(2πσ²)) exp(−(y(i) − θᵀx(i))² / (2σ²))

Therefore, the total likelihood of the system can be expressed as follows:

L(θ) = Πi p(y(i) | x(i); θ)
In the case of the logistic regression algorithm, we are assuming that the logistic function itself is the probability:

p(y = 1 | x; θ) = hθ(x),  p(y = 0 | x; θ) = 1 − hθ(x)

Then the likelihood can be expressed by:

L(θ) = Πi hθ(x(i))^y(i) (1 − hθ(x(i)))^(1 − y(i))
In both cases, it can be shown that maximizing the likelihood is equivalent to minimizing the cost function, so the gradient descent will be the same.
This is a very simple classification (or regression) method: given a set of feature vectors x(i) with known labels y(i), the label of a new point x(t) is predicted by finding its K nearest neighbors in the training set (typically using the Euclidean distance) and taking the majority of their labels.
In the case of regression, the value yt is calculated by replacing the majority vote with the average of the labels of the K nearest neighbors.
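Both variants are available in scikit-learn; this sketch uses tiny one-dimensional data of our own so the neighbor sets are easy to verify by hand:

```python
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

X = [[0], [1], [2], [10], [11], [12]]

# Classification: majority vote among the K nearest training points
y_cls = [0, 0, 0, 1, 1, 1]
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y_cls)
print(knn.predict([[1.5], [10.5]]))  # [0 1]

# Regression: average of the labels of the K nearest points
y_reg = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
knn_r = KNeighborsRegressor(n_neighbors=3).fit(X, y_reg)
print(knn_r.predict([[1.0]]))  # neighbors 0, 1, 2 -> average 1.0
```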
In this article, the major classification and regression algorithms, together with the techniques to implement them, were discussed. You should now be able to understand in which situation each method can be used and how to implement it using Python and its libraries (sklearn and pandas).