[box type=”note” align=”” class=”” width=””]The following book excerpt is taken from the title Statistics for Data Science, written by James D. Miller. This book is a comprehensive primer on the basic concepts of statistics and their application in different data science tasks.[/box]

In this article, we explain the implementation of boosting – a popular technique used to improve the performance of a data model – using the popular R programming language.

We will take a high-level look at a thought-provoking prediction problem drawn from Mastering Predictive Analytics with R, Second Edition, by James D. Miller and Rui Miguel Forte.

Here, an original example of patterns made by radiation on a telescope camera are analyzed in an attempt to predict whether a certain pattern came from gamma rays leaking into the atmosphere or from regular background radiation.

Gamma rays leave distinctive elliptical patterns and so we can create a set of features to describe these. The dataset used is the MAGIC Gamma Telescope Data Set, hosted by the UCI Machine Learning Repository at http://archive.ics.uci.edu/ml/datasets/MAGIC+Gamma+Telescope

This data consists of 19,020 observations, holding the following list of attributes:

## Prepping the data

First, various steps need to be performed on our example data.

The data is first loaded into an R data frame object named magic, recoding the CLASS output variable to use classes 1 and -1 for gamma rays and background radiation respectively:

> names(magic) <- c("FLENGTH", "FWIDTH", "FSIZE", "FCONC", "FCONC1",

"FASYM", "FM3LONG", "FM3TRANS", "FALPHA", "FDIST", "CLASS")

> magic\$CLASS <- as.factor(ifelse(magic\$CLASS =='g', 1, -1))

Next, the data is split into two files: a training data and a test data frame using an 80-20 split:

> library(caret)

> set.seed(33711209)

> magic_sampling_vector <- createDataPartition(magic\$CLASS,

p = 0.80, list = FALSE)

> magic_train <- magic[magic_sampling_vector, 1:10]

> magic_train_output <- magic[magic_sampling_vector, 11]

> magic_test <- magic[-magic_sampling_vector, 1:10]

> magic_test_output <- magic[-magic_sampling_vector, 11]

The model used for boosting is a simple multilayer perceptron with a single hidden layer leveraging R’s nnet package.

Neural networks, often produce higher accuracy when inputs are normalized, so, in this example, before training any models, this preprocessing is performed:

> magic_pp <- preProcess(magic_train, method = c("center",

"scale"))

> magic_train_pp <- predict(magic_pp, magic_train)

> magic_train_df_pp <- cbind(magic_train_pp,

CLASS = magic_train_output)

> magic_test_pp <- predict(magic_pp, magic_test)

## Training

Boosting is designed to work best with weak learners, so a very small number of hidden neurons in the model’s hidden layer are used.

Concretely, we will begin with the simplest possible multilayer perceptron that uses a single hidden neuron. To understand the effect of using boosting, a baseline performance is established by training a single neural network (and measuring its performance).

This is to accomplish the following:

> library(nnet)

> n_model <- nnet(CLASS ~ ., data = magic_train_df_pp, size = 1)

> n_test_predictions <- predict(n_model, magic_test_pp,

type = "class")

> (n_test_accuracy <- mean(n_test_predictions ==

magic_test_output))

[1] 0.7948988

This establishes that we have a baseline accuracy of around 79.5 percent. Not too bad, but can boost to improve upon this score?

To that end, the function AdaBoostNN(), which is shown as follows, is used. This function will take input from a data frame, the name of the output variable, the number of single hidden layer neural network models to be built, and finally, the number of hidden units these neural networks will have.

The function will then implement the AdaBoost algorithm and return a list of models with their corresponding weights.

Here is the function:

hidden_units) {

require("nnet")

models <- list()

alphas <- list()

n <- nrow(training_data)

model_formula <- as.formula(paste(output_column, '~ .', sep = ''))

w <- rep((1/n), n)

for (m in 1:M) {

model <- nnet(model_formula, data = training_data,

size = hidden_units, weights = w)

models[[m]] <- model

predictions <- as.numeric(predict(model,

training_data[, -which(names(training_data) ==

output_column)], type = "class"))

errors <- predictions != training_data[, output_column]

error_rate <- sum(w * as.numeric(errors)) / sum(w)

alpha <- 0.5 * log((1 - error_rate) / error_rate)

alphas[[m]] <- alpha

temp_w <- mapply(function(x, y) if (y) { x * exp(alpha) }

else { x * exp(-alpha)}, w, errors)

w <- temp_w / sum(temp_w)

}

return(list(models = models, alphas = unlist(alphas)))

}

The preceding function uses the following logic:

1. First, initialize empty lists of models and model weights (alphas). Compute the number of observations in the training data, storing this in the variable n. The name of the output column provided is then used to create a formula that describes the neural network that will be built.
2. In the dataset used, this formula will be CLASS ~ ., meaning that the neural network will compute CLASS as a function of all the other columns as input features.
3. Next, initialize the weights vector and define a loop that will run for M iterations in order to build M models.
4. In every iteration, the first step is to use the current setting of the weights vector to train a neural network using as many hidden units as specified in the input, hidden_units.
5. Then, compute a vector of predictions that the model generates on the training data using the predict() function. By comparing these predictions to the output column of the training data, calculate the errors that the current model makes on the training data. This then allows the computation of the error rate.
6. This error rate is set as the weight of the current model and, finally, the observation weights to be used in the next iteration of the loop are updated according to whether each observation was correctly classified.
7. The weight vector is then normalized and we are ready to begin the next iteration!
8. After completing M iterations, output a list of models and their corresponding model weights.

There is now a function able to train our ensemble classifier using AdaBoost, but we also need a function to make the actual predictions. This function will take in the output list produced by our training function, AdaBoostNN(), along with a test dataset.

This function is AdaBoostNN.predict() and it is shown as follows:

prediction_matrix <- sapply(models, function (x)

as.numeric(predict(x, test_data, type = "class")))

weighted_predictions <- t(apply(prediction_matrix, 1,

function(x) mapply(function(y, z) y * z, x, alphas)))

final_predictions <- apply(weighted_predictions, 1,

function(x) sign(sum(x)))

return(final_predictions)

}

This function first extracts the models and the model weights (from the list produced by the previous function). A matrix of predictions is created, where each column corresponds to the vector of predictions made by a particular model. Thus, there will be as many columns in this matrix as the models that we used for boosting.

We then multiply the predictions produced by each model with their corresponding model weight. For example, every prediction from the first model is in the first column of the prediction matrix and will have its value multiplied by the first model weight α1.

.Lastly, the matrix of weighted observations is reduced into a single vector of observations by summing the weighted predictions for each observation and taking the sign of the result. This vector of predictions is then returned by the function.

As an experiment, we will train ten neural network models with a single hidden unit and see if boosting improves accuracy:

'CLASS')

> mean(predictions == magic_test_output)

[1] 0.804365

We see in this example, boosting ten models shows a marginal improvement in accuracy, but perhaps training more models might make more of a difference.

## What we learned

From the preceding example, you may conclude that, for the neural networks with one hidden unit, as the number of boosting models increases, we see an improvement in accuracy, but after 100 models, this tapers off and is actually slightly less for 200 models. The improvement over the baseline of a single model is substantial for these networks. When we increase the complexity of our learner by having a hidden layer with three hidden neurons, we get a much smaller improvement in performance. At 200 models, both ensembles perform at a similar level, indicating that, at this point, our accuracy is being limited by the type of model trained.

If you found this article useful, make sure to check out the book Statistics for Data Science for interesting statistical techniques and their implementation in R.

Data Science Enthusiast. A massive science fiction and Manchester United fan. Loves to read, write and listen to music.