In this article by Rui Miguel Forte, the author of the book Mastering Predictive Analytics with R, we'll learn about linear regression. Regression problems involve predicting a numerical output. The simplest and most common type of regression is linear regression. In this article, we'll explore why linear regression is so commonly used, what its limitations are, and how it can be extended.

Introduction to linear regression

In linear regression, the output variable is predicted by a linearly weighted combination of input features. Here is an example of a simple linear model:

$\hat{y} = \beta_0 + \beta_1 x$

The preceding model essentially says that we are estimating one output, denoted by ŷ, and this is a linear function of a single predictor variable (that is, a feature) denoted by the letter x. The terms involving the Greek letter β are the parameters of the model and are known as regression coefficients. Once we train the model and settle on values for these parameters, we can make a prediction on the output variable for any value of x by a simple substitution in our equation. Another example of a linear model, this time with three features and with values assigned to the regression coefficients, is given by the following equation:

$\hat{y} = \beta_0 + 1.91 x_1 + \beta_2 x_2 - 7.56 x_3$

In this equation, just as with the previous one, we can observe that we have one more coefficient than the number of features. This additional coefficient, β0, is known as the intercept and is the expected value of the model when the value of all input features is zero. The other β coefficients can be interpreted as the expected change in the value of the output per unit increase of a feature. For example, in the preceding equation, if the value of the feature x1 rises by one unit, the expected value of the output will rise by 1.91 units. Similarly, a unit increase in the feature x3 results in a decrease of the output by 7.56 units. In a simple one-dimensional regression problem, we can plot the output on the y axis of a graph and the input feature on the x axis. In this case, the model predicts a straight-line relationship between these two, where β0 represents the point at which the straight line crosses or intercepts the y axis and β1 represents the slope of the line. We often refer to the case of a single feature (hence, two regression coefficients) as simple linear regression and the case of two or more features as multiple linear regression.
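
To make the prediction-by-substitution idea concrete, here is a minimal R sketch for a three-feature model. Apart from the values 1.91 and -7.56 discussed above, the coefficient and feature values are made-up numbers used purely for illustration.

# Hypothetical coefficients: intercept, beta1, beta2, beta3
beta <- c(1.3, 1.91, 2.56, -7.56)

# A single observation of the three features x1, x2, x3
x <- c(2.0, 5.0, 0.5)

# Prediction by substitution: beta0 + beta1*x1 + beta2*x2 + beta3*x3
y_hat <- beta[1] + sum(beta[-1] * x)
y_hat

# Increasing x1 by one unit raises the prediction by exactly beta1 = 1.91
x_plus <- x + c(1, 0, 0)
(beta[1] + sum(beta[-1] * x_plus)) - y_hat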

Assumptions of linear regression

Before we delve into the details of how to train a linear regression model and how it performs, we’ll look at the model assumptions. The model assumptions essentially describe what the model believes about the output variable y that we are trying to predict. Specifically, linear regression models assume that the output variable is a weighted linear function of a set of feature variables. Additionally, the model assumes that for fixed values of the feature variables, the output is normally distributed with a constant variance. This is the same as saying that the model assumes that the true output variable y can be represented by an equation such as the following one, shown for two input features:

$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon$

Here, ε represents an error term, which is normally distributed with zero mean and constant variance σ2:

$\varepsilon \sim N(0, \sigma^2)$

We might hear the term homoscedasticity as a more formal way of describing the notion of constant variance. By homoscedasticity or constant variance, we are referring to the fact that the variance in the error component does not vary with the values or levels of the input features. In the following plot, we are visualizing a hypothetical example of a linear relationship with heteroscedastic errors, which are errors that do not have a constant variance. The data points lie close to the line at low values of the input feature, because the variance is low in this region of the plot, but lie farther away from the line at higher values of the input feature because of the higher variance.

[Figure: a simulated linear relationship with heteroscedastic errors; the points lie close to the line at low values of the input feature and progressively farther from it at higher values]
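
If we want to see what such a pattern looks like for ourselves, a quick way is to simulate it. The following sketch generates data from a hypothetical straight-line relationship whose error standard deviation grows with the input feature, so the scatter widens as we move to the right; all of the numbers are arbitrary choices.

set.seed(1234)
n <- 100
x <- runif(n, 0, 10)

# The error standard deviation grows with x, violating constant variance
e <- rnorm(n, mean = 0, sd = 0.5 * x)
y <- 2 + 1.5 * x + e

plot(x, y, main = "Heteroscedastic errors")
abline(a = 2, b = 1.5, lwd = 2)   # the true underlying line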

The ε term is an irreducible error component of the true function y and can be used to represent random errors, such as measurement errors in the feature values. When training a linear regression model, we always expect to observe some amount of error in our estimate of the output, even if we have all the right features, enough data, and the system being modeled really is linear. Put differently, even with a true function that is linear, we still expect that once we find a line of best fit through our training examples, our line will not go through all, or even any of our data points because of this inherent variance exhibited by the error component. The critical thing to remember, though, is that in this ideal scenario, because our error component has zero mean and constant variance, our training criterion will allow us to come close to the true values of the regression coefficients given a sufficiently large sample, as the errors will cancel out.

Another important assumption relates to the independence of the error terms. This means that we do not expect the residual or error term associated with one particular observation to be somehow correlated with that of another observation. This assumption can be violated if observations are functions of each other, which is typically the result of an error in the measurement. If we were to take a portion of our training data, double all the values of the features and outputs, and add these new data points to our training data, we could create the illusion of having a larger data set; however, there will be pairs of observations whose error terms will depend on each other as a result, and hence our model assumption would be violated. Incidentally, artificially growing our data set in such a manner is never acceptable for any model. Similarly, correlated error terms may occur if observations are related in some way by an unmeasured variable. For example, if we are measuring the malfunction rate of parts from an assembly line, then parts from the same factory might have a correlation in the error, for example, due to different standards and protocols used in the assembly process. Therefore, if we don’t use the factory as a feature, we may see correlated errors in our sample among observations that correspond to parts from the same factory. The study of experimental design is concerned with identifying and reducing correlations in error terms, but this is beyond the scope of this book.

Finally, another important assumption concerns the notion that the features themselves are statistically independent of each other. It is worth clarifying here that in linear models, although the input features must be linearly weighted, they themselves may be the output of another function. To illustrate this, one may be surprised to see that the following is a linear model of three features, sin(z1), ln(z2), and exp(z3):

$\hat{y} = \beta_0 + \beta_1 \sin(z_1) + \beta_2 \ln(z_2) + \beta_3 e^{z_3}$

We can see that this is a linear model by making a few transformations on the input features and then making the replacements in our model:

$x_1 = \sin(z_1), \quad x_2 = \ln(z_2), \quad x_3 = e^{z_3}$

$\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3$

Now, we have an equation that is more recognizable as a linear regression model. If the previous example made us believe that nearly everything could be transformed into a linear model, then the following two examples will emphatically convince us that this is not in fact the case:

$\hat{y} = \beta_0 + x_1^{\beta_1}$

$\hat{y} = \beta_0 + \sin(\beta_1 x_1)$

Neither of these is a linear model, and in both cases the problem lies with the regression coefficient β1. The first model is not linear because β1 acts as the exponent of the first input feature. In the second model, β1 appears inside a sine function. The important lesson to take away from these examples is that there are cases where we can apply transformations to our input features in order to fit our data to a linear model; however, we need to be careful that our regression coefficients are always the linear weights of the resulting new features.
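
The following short R sketch illustrates the earlier point about transformed features: we generate three raw variables z1, z2, and z3, build the transformed features, and the prediction is then just a linear combination of those new features, so the model remains linear in its coefficients. The coefficient values and the ranges of the raw variables are made up for illustration.

set.seed(99)
n <- 50
z1 <- runif(n, 0, pi)
z2 <- runif(n, 1, 10)
z3 <- runif(n, 0, 2)

# Transform the raw inputs into new features
x1 <- sin(z1)
x2 <- log(z2)
x3 <- exp(z3)

# The model is linear in x1, x2, and x3 (hypothetical coefficients)
beta <- c(0.7, 2.0, -1.2, 0.3)
y_hat <- beta[1] + beta[2] * x1 + beta[3] * x2 + beta[4] * x3
head(y_hat)

When we meet R's lm() function a little later, the same idea can be expressed directly in a model formula such as y ~ sin(z1) + log(z2) + exp(z3), since the formula interface allows transformations of the inputs.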

Simple linear regression

Before looking at some real-world data sets, it is very helpful to try to train a model on artificially generated data. In an artificial scenario such as this, we know what the true output function is beforehand, something that as a rule is not the case when it comes to real-world data. The advantage of performing this exercise is that it gives us a good idea of how our model works under the ideal scenario when all of our assumptions are fully satisfied, and it helps visualize what happens when we have a good linear fit. We’ll begin by simulating a simple linear regression model. The following R snippet is used to create a data frame with 100 simulated observations of the following linear model with a single input feature:

$y = 1.67 x_1 - 2.93 + \varepsilon, \quad \varepsilon \sim N(0, 2^2)$

Here is the code for the simple linear regression model:

> set.seed(5427395)
> nObs = 100
> x1minrange = 5
> x1maxrange = 25
> x1 = runif(nObs, x1minrange, x1maxrange)
> e = rnorm(nObs, mean = 0, sd = 2.0)
> y = 1.67 * x1 - 2.93 + e
> df = data.frame(y, x1)

For our input feature, we randomly sample points from a uniform distribution. We used a uniform distribution to get a good spread of data points. Note that our final df data frame is meant to simulate a data frame that we would obtain in practice, and as a result, we do not include the error terms, as these would be unavailable to us in a real-world setting.

When we train a linear model using some data such as those in our data frame, we are essentially hoping to produce a linear model with the same coefficients as the ones from the underlying model of the data. Put differently, the original coefficients define a population regression line. In this case, the population regression line represents the true underlying model of the data. In general, we will find ourselves attempting to model a function that is not necessarily linear. In this case, we can still define the population regression line as the best possible linear regression line, but a linear regression model will obviously not perform equally well.

Estimating the regression coefficients

For our simple linear regression model, the process of training the model amounts to an estimation of our two regression coefficients from our data set. As we can see from our previously constructed data frame, our data is effectively a series of observations, each of which is a pair of values (xi, yi) where the first element of the pair is the input feature value and the second element of the pair is its output label. It turns out that for the case of simple linear regression, it is possible to write down two equations that can be used to compute our two regression coefficients. Instead of merely presenting these equations, we’ll first take a brief moment to review some very basic statistical quantities that the reader has most likely encountered previously, as they will be featured very shortly.

The mean of a set of values is just the average of these values and is often described as a measure of location, giving a sense of where the values are centered on the scale in which they are measured. In statistical literature, the average value of a random variable is often known as the expectation, so we often find that the mean of a random variable X is denoted as E(X). Another notation that is commonly used is bar notation, where we can represent the notion of taking the average of a variable by placing a bar over that variable. To illustrate this, the following two equations show the mean of the output variable y and input feature x:

$\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i, \qquad \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$

A second very common quantity, which should also be familiar, is the variance of a variable. The variance measures the average square distance that individual values have from the mean. In this way, it is a measure of dispersion, so that a low variance implies that most of the values are bunched up close to the mean, whereas a higher variance results in values that are spread out. Note that the definition of variance involves the definition of the mean, and for this reason, we’ll see the use of the x variable with a bar on it in the following equation that shows the variance of our input feature x:

$\mathrm{Var}(x) = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2$

Finally, we’ll define the covariance between two random variables x and y using the following equation:

$\mathrm{Cov}(x, y) = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$

From the previous equation, it should be clear that the variance, which we just defined previously, is actually a special case of the covariance where the two variables are the same. The covariance measures how strongly two variables are correlated with each other and can be positive or negative. A positive covariance implies a positive correlation; that is, when one variable increases, the other will increase as well. A negative covariance suggests the opposite; when one variable increases, the other will tend to decrease. When two variables are statistically independent of each other and hence uncorrelated, their covariance will be zero (although it should be noted that a zero covariance does not necessarily imply statistical independence).
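
A quick numerical illustration of these three cases is easy to produce in R with the built-in cov() function; the simulated variables below are arbitrary.

set.seed(42)
n <- 1000
x <- rnorm(n)

cov(x,  2 * x + rnorm(n))   # positively related pair: positive covariance
cov(x, -3 * x + rnorm(n))   # negatively related pair: negative covariance
cov(x, rnorm(n))            # independent pair: covariance close to zero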

Armed with these basic concepts, we can now present equations for the estimates of the two regression coefficients for the case of simple linear regression:

$\hat{\beta}_1 = \frac{\mathrm{Cov}(x, y)}{\mathrm{Var}(x)} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$

The first regression coefficient is computed as the ratio of the covariance between the output and the input feature to the variance of the input feature. Note that if the output variable were independent of the input feature, the covariance would be zero and our linear model would consist of a horizontal line with no slope. In practice, it should be noted that even when two variables are statistically independent, we will still typically see a small degree of covariance due to the random nature of the errors, so if we were to train a linear regression model to describe their relationship, our first regression coefficient would be nonzero in general. Later, we'll see how significance tests can be used to detect features we should not include in our models.
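
Because we already have the simulated df data frame from the previous section, we can apply these two formulas directly using R's built-in cov(), var(), and mean() functions; this is just a sketch to check the arithmetic by hand.

# Slope: ratio of the covariance to the variance of the input feature
b1 <- cov(df$x1, df$y) / var(df$x1)

# Intercept: mean of the output minus the slope times the mean of the input
b0 <- mean(df$y) - b1 * mean(df$x1)

c(intercept = b0, slope = b1)

These values should agree with the coefficients reported by the lm() function below (approximately -2.38 and 1.64 for this particular simulated sample).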

To implement linear regression in R, it is not necessary to perform these calculations as R provides us with the lm() function, which builds a linear regression model for us. The following code sample uses the df data frame we created previously and calculates the regression coefficients:

> myfit <- lm(y~x1, df)
> myfit

Call:
lm(formula = y ~ x1, data = df)
Coefficients:
(Intercept)          x1
     -2.380       1.641

In the first line, we see that the usage of the lm() function involves first specifying a formula and then following up with the data parameter, which in our case is our data frame. For the case of simple linear regression, the syntax of the formula that we specify for the lm() function is the name of the output variable, followed by a tilde (~) and then by the name of the single input feature. Finally, the output shows us the values for the two regression coefficients. Note that the β0 coefficient is labeled as the intercept, and the β1 coefficient is labeled by the name of the corresponding feature (in this case, x1) in the equation of the linear model.
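
Once the fitted model is stored in myfit, the estimated coefficients and predictions for new inputs can be obtained with standard accessor functions; the new x1 values below are arbitrary examples.

# Extract the estimated regression coefficients as a named vector
coef(myfit)

# Predict the output for a few new values of x1
new_data <- data.frame(x1 = c(10, 15, 20))
predict(myfit, newdata = new_data)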

The following graph shows the population line and the estimated line on the same plot:

[Figure: the simulated data points with the population regression line and the estimated regression line plotted together; the two lines nearly coincide]

As we can see, the two lines are so close to each other that they are barely distinguishable, showing that the model has estimated the true population line very closely.
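
A plot along these lines can be reproduced with base R graphics by drawing the simulated points, the population line using the true coefficients from the simulation, and the estimated line from myfit; this is only a sketch, and the styling is a matter of taste.

plot(df$x1, df$y, xlab = "x1", ylab = "y")

# Population regression line: the true coefficients used in the simulation
abline(a = -2.93, b = 1.67, lty = 2)

# Estimated regression line from the fitted model
abline(myfit)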

Multiple linear regression

Whenever we have more than one input feature and want to build a linear regression model, we are in the realm of multiple linear regression. The general equation for a multiple linear regression model with k input features is:

$\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k$

Our assumptions about the model and about the error component ε remain the same as with simple linear regression, remembering that as we now have more than one input feature, we assume that these are independent of each other. Instead of using simulated data to demonstrate multiple linear regression, we will analyze two real-world data sets.

Predicting CPU performance

Our first real-world data set was presented by the researchers Dennis F. Kibler, David W. Aha, and Marc K. Albert in a 1989 paper titled Instance-based prediction of real-valued attributes, published in the journal Computational Intelligence. The data contain the characteristics of different CPU models, such as the cycle time and the amount of cache memory. When deciding between processors, we would like to take all of these characteristics into account, but ideally, we'd like to compare processors on a single numerical scale. For this reason, we often develop programs to benchmark the relative performance of a CPU. Our data set also comes with the published relative performance of our CPUs, and our objective will be to use the available CPU characteristics as features to predict this. The data set can be obtained online from the UCI Machine Learning Repository via this link: http://archive.ics.uci.edu/ml/datasets/Computer+Hardware.

The UCI Machine Learning Repository is a wonderful online resource that hosts a large number of data sets, many of which are often cited by authors of books and tutorials. It is well worth the effort to familiarize yourself with this website and its data sets. A very good way to learn predictive analytics is to practice using the techniques you learn in this book on different data sets, and the UCI repository provides many of these for exactly this purpose.

The machine.data file contains all our data in a comma-separated format, with one line per CPU model. We’ll import this in R and label all the columns. Note that there are 10 columns in total, but we don’t need the first two for our analysis, as these are just the brand and model name of the CPU. Similarly, the final column is a predicted estimate of the relative performance that was produced by the researchers themselves; our actual output variable, PRP, is in column 9. We’ll store the data that we need in a data frame called machine:

> machine <- read.csv("machine.data", header = F)
> names(machine) <- c("VENDOR", "MODEL", "MYCT", "MMIN",
"MMAX", "CACH", "CHMIN", "CHMAX", "PRP", "ERP")
> machine <- machine[, 3:9]
> head(machine, n = 3)
  MYCT MMIN  MMAX CACH CHMIN CHMAX PRP
1  125  256  6000  256    16   128 198
2   29 8000 32000   32     8    32 269
3   29 8000 32000   32     8    32 220

The data set also comes with the definition of the data columns:

Column name   Definition
MYCT          The machine cycle time in nanoseconds
MMIN          The minimum main memory in kilobytes
MMAX          The maximum main memory in kilobytes
CACH          The cache memory in kilobytes
CHMIN         The minimum channels in units
CHMAX         The maximum channels in units
PRP           The published relative performance (our output variable)

The data set contains no missing values, so no observations need to be removed or modified. One thing that we’ll notice is that we only have roughly 200 data points, which is generally considered a very small sample. Nonetheless, we will proceed with splitting our data into a training set and a test set, with an 85-15 split, as follows:

> library(caret)
> set.seed(4352345)
> machine_sampling_vector <- createDataPartition(machine$PRP,
   p = 0.85, list = FALSE)
> machine_train <- machine[machine_sampling_vector,]
> machine_train_features <- machine_train[, 1:6]
> machine_train_labels <- machine$PRP[machine_sampling_vector]
> machine_test <- machine[-machine_sampling_vector,]
> machine_test_labels <- machine$PRP[-machine_sampling_vector]
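
As a quick sanity check on the partition, we can confirm that the training and test sets together account for all of the data and that the split is roughly 85-15; the exact row counts depend on the random seed.

nrow(machine_train)
nrow(machine_test)
nrow(machine_train) + nrow(machine_test) == nrow(machine)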

Now that we have our data set up and running, we'd usually want to investigate further and check whether some of our assumptions for linear regression are valid. For example, we would like to know whether we have any highly correlated features. To do this, we can construct a correlation matrix with the cor() function and use the findCorrelation() function from the caret package to get suggestions for which features to remove:

> machine_correlations <- cor(machine_train_features)
> findCorrelation(machine_correlations)
integer(0)
> findCorrelation(machine_correlations, cutoff = 0.75)
[1] 3
> cor(machine_train$MMIN, machine_train$MMAX)
[1] 0.7679307

Using the default cutoff of 0.9 for a high degree of correlation, we found that none of our features should be removed. When we reduce this cutoff to 0.75, we see that caret recommends that we remove the third feature (MMAX). As the final line of the preceding code shows, the degree of correlation between this feature and MMIN is 0.768. While this value is not very high, it is still high enough to cause us a certain degree of concern that it will affect our model. Intuitively, of course, if we look at the definitions of our input features, we will certainly expect that a CPU model with a relatively high value for the minimum main memory is also likely to have a relatively high value for the maximum main memory. Linear regression can sometimes still give us a good model with correlated variables, but we would expect to get better results if our variables were uncorrelated. For now, we've decided to keep all our features for this data set.
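
With all six features retained, a natural next step would be to fit the multiple linear regression model on the training data with the same lm() function used earlier, using the dot shorthand to include every remaining column as a feature. This is only a sketch: the object names machine_fit and machine_predictions are illustrative choices, and the output of summary() is not shown here.

# Fit PRP as a linear function of all six CPU characteristics
machine_fit <- lm(PRP ~ ., data = machine_train)
summary(machine_fit)

# Predictions on the held-out test set
machine_predictions <- predict(machine_fit, newdata = machine_test)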

Summary

In this article, we studied linear regression, a method that allows us to fit a linear model in a supervised learning setting where we have a number of input features and a single numeric output. Simple linear regression is the name given to the scenario where we have only one input feature, and multiple linear regression describes the case where we have multiple input features. Linear regression is very commonly used as a first approach to solving a regression problem. It assumes that the output is a linear weighted combination of the input features in the presence of an irreducible error component that is normally distributed and has zero mean and constant variance. The model also assumes that the features are independent.
