
Note: This article is an excerpt from a book written by Alberto Boschetti and Luca Massaron, titled Python Data Science Essentials – Second Edition. This book provides the fundamentals of data science with Python by leveraging the latest tools and libraries, such as Jupyter notebooks, NumPy, pandas, and scikit-learn.

In this article, we will learn about two easy and effective predictors known as linear and logistic regressors.

Linear and logistic regression are two methods that can be used to linearly predict a target value or a target class, respectively. Let's start with an example of linear regression predicting a target value.

In this article, we will again use the Boston dataset, which contains 506 samples, 13 features (all real numbers), and a (real) numerical target (which renders it ideal for regression problems). We will divide our dataset into two sections by using a train/test split cross-validation to test our methodology (in the example, 80 percent of our dataset goes in training and 20 percent in test):

In: from sklearn.datasets import load_boston
    boston = load_boston()
    from sklearn.cross_validation import train_test_split
    X_train, X_test, Y_train, Y_test = train_test_split(boston.data,
        boston.target, test_size=0.2, random_state=0)

The dataset is now loaded and the train/test pairs have been created. In the next few steps, we're going to train and fit the regressor on the training set and predict the target variable in the test dataset. We are then going to measure the accuracy of the regression task by using the MAE score. As the scoring function, we decided on the mean absolute error in order to penalize errors just proportionally to the size of the error itself (using the more common mean squared error would have emphasized larger errors more, since errors are squared):

In: from sklearn.linear_model import LinearRegression
    regr = LinearRegression()
    regr.fit(X_train, Y_train)
    Y_pred = regr.predict(X_test)
    from sklearn.metrics import mean_absolute_error
    print("MAE", mean_absolute_error(Y_test, Y_pred))

Out: MAE 3.84281058945
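Just to see how the two metrics weigh errors differently, the following optional check (not part of the original example) computes both scores on the same predictions; the squared metric is inflated much more by the few large residuals:

In: from sklearn.metrics import mean_absolute_error, mean_squared_error
    print("MAE:", mean_absolute_error(Y_test, Y_pred))   # penalizes errors proportionally
    print("MSE:", mean_squared_error(Y_test, Y_pred))    # squares errors, so big misses dominate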

Great! We achieved our goal in the simplest possible way. Now, let’s take a look at the time needed to train the system:

In: %timeit regr.fit(X_train, Y_train)

Out: 1000 loops, best of 3: 381 µs per loop

That was really quick! The results, of course, are not all that great. However, linear regression offers a very good trade-off between predictive performance, training speed, and simplicity. Now, let's take a look under the hood of the algorithm. Why is it so fast but not that accurate? The answer is somewhat expected: it is because it's a very simple linear method.

Let's briefly dig into a mathematical explanation of this technique. Let's name X(i) the ith sample (it is actually a row vector of numerical features) and Y(i) its target. The goal of linear regression is to find a good weight (column) vector W, which is best suited for approximating the target value when multiplied by the observation vector, that is, X(i) * W ≈ Y(i) (note that this is a dot product). W should be the same, and the best possible one, for every observation. Thus, solving the following equation becomes easy:

X * W = Y (where X stacks the row vectors X(i) and Y stacks the targets Y(i))

W can be found easily with the help of a matrix inversion (or, more likely, a pseudo-inversion, which is computationally more efficient) and a dot product. That is the reason linear regression is so fast. Note that this is a simplistic explanation; the real method adds another virtual feature to compensate for the bias of the process. However, this does not change the complexity of the regression algorithm much.
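As a quick illustration of why this is cheap, the sketch below (an addition to the excerpt) solves the same training problem directly with NumPy's pseudo-inverse, after appending a column of ones to play the role of the bias feature mentioned above; the resulting weights should closely match those learned by LinearRegression:

In: import numpy as np
    # add a constant column so the last weight acts as the bias (intercept)
    X1 = np.hstack([X_train, np.ones((X_train.shape[0], 1))])
    W = np.linalg.pinv(X1).dot(Y_train)   # W = pinv(X) * Y
    print(W[:-1])   # weights, comparable to regr.coef_
    print(W[-1])    # bias, comparable to regr.intercept_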

We progress now to logistic regression. In spite of what the name suggests, it is a classifier and not a regressor. It must be used in classification problems where you are dealing with only two classes (binary classification). Typically, target labels are Boolean; that is, they have values of either True/False or 0/1 (indicating the presence or absence of the expected outcome). In our example, we keep on using the same dataset. The target is now to guess whether a house value is over or under the average price, which we use as our threshold. In essence, we moved from a regression problem to a binary classification one because now our target is to guess how likely an example is to be a part of a group. We start preparing the dataset by using the following commands:

In: import numpy as np
    avg_price_house = np.average(boston.target)
    high_priced_idx = (Y_train >= avg_price_house)
    Y_train[high_priced_idx] = 1
    Y_train[np.logical_not(high_priced_idx)] = 0
    Y_train = Y_train.astype(np.int8)
    high_priced_idx = (Y_test >= avg_price_house)
    Y_test[high_priced_idx] = 1
    Y_test[np.logical_not(high_priced_idx)] = 0
    Y_test = Y_test.astype(np.int8)

Now, we will train and apply the classifier. To measure its performance, we will simply print the classification report:

In: from sklearn.linear_model import LogisticRegression
    clf = LogisticRegression()
    clf.fit(X_train, Y_train)
    Y_pred = clf.predict(X_test)
    from sklearn.metrics import classification_report
    print(classification_report(Y_test, Y_pred))

Out:
                 precision    recall  f1-score   support

              0       0.81      0.90      0.85        61
              1       0.82      0.68      0.75        41

    avg / total       0.83      0.81      0.81       102

The output of this command can change on your machine depending on the optimization process of the LogisticRegression classifier (no seed has been set for replicability of the results).
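If you need the run to be replicable, one option (not shown in the excerpt) is to fix the seed used by the stochastic parts of the solver through the random_state parameter when you build the classifier:

In: clf = LogisticRegression(random_state=0)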

The precision and recall values are over 80 percent. This is already a good result for a very simple method. The training speed is impressive, too. Thanks to the Jupyter Notebook, we can easily compare the algorithm with a more advanced classifier in terms of performance and speed:

In: %timeit clf.fit(X_train, Y_train)

Out: 100 loops, best of 3: 2.54 ms per loop

What’s under the hood of a logistic regression? The simplest classifier a person could imagine (apart from a mean) is a linear regressor followed by a hard threshold:

Y(i) = sign(X(i) * W)

Here, sign(a) = +1 if a is greater than or equal to zero, and 0 otherwise.
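To make the idea concrete, here is a small sketch (not in the original text) that applies it to our binarized targets: a plain LinearRegression is fitted on the 0/1 labels and its continuous output is then cut with a hard threshold (0.5 here, since our labels are 0/1 rather than -1/+1):

In: from sklearn.linear_model import LinearRegression
    import numpy as np
    lin = LinearRegression()
    lin.fit(X_train, Y_train)                                 # regress on the 0/1 labels
    Y_hard = (lin.predict(X_test) >= 0.5).astype(np.int8)     # hard threshold on the output
    print("Hard-threshold accuracy:", np.mean(Y_hard == Y_test))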

To smooth down the hardness of the threshold and predict the probability of belonging to a class, logistic regression resorts to the logistic (sigmoid) function. Its output is a real number in the open interval (0, 1) (0.0 and 1.0 are attainable only via rounding, otherwise the function just tends toward them), which indicates the probability that the observation belongs to class 1. Using a formula, that becomes:

Prob(Y(i) = 1 | X(i)) = sigmoid(X(i) * W)

Here, sigmoid(a) = 1 / (1 + exp(-a)).
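In scikit-learn, you can inspect these probabilities directly. The sketch below (an addition to the excerpt) asks the classifier fitted above for its class-1 probabilities and then reproduces them by hand from the learned coefficients with the sigmoid formula; the two arrays should match:

In: import numpy as np
    proba = clf.predict_proba(X_test)[:, 1]   # P(class = 1) as estimated by scikit-learn
    manual = 1.0 / (1.0 + np.exp(-(X_test.dot(clf.coef_.ravel()) + clf.intercept_)))
    print(np.allclose(proba, manual))         # should print True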

Why the logistic function instead of some other function? Well, because it just works pretty well in most real cases. In the remaining cases, if you're not completely satisfied with its results, you may want to try some other nonlinear functions instead (there is a limited variety of suitable ones, though).

To summarize, we learned about two classic algorithms used in machine learning, namely linear and logistic regression. With the help of an example, we put the theory into practice by predicting a target value and a target class, which helped us understand the trade-offs and benefits of these models.

If you enjoyed this excerpt, check out the book Python Data Science Essentials – Second Edition to learn more about other popular machine learning algorithms, such as Naive Bayes, k-Nearest Neighbors (kNN), and Support Vector Machines (SVM).
