The purpose of this article by Michele Usuelli, author of the book R Machine Learning Essentials, is to show how machine learning helps in solving a business problem.

Predicting the output

The past marketing campaign targeted part of the customer base. Among another 1,000 clients, how do we identify the 100 that are most likely to subscribe? We can build a model that learns from the data and estimates which clients are more similar to the ones that subscribed in the previous campaign. For each client, the model estimates a score that is higher if the client is more likely to subscribe. Different machine learning models can determine the scores; we use two well-performing techniques, as follows:

  • Logistic regression: This is a variation of linear regression used to predict a binary output
  • Random forest: This is an ensemble of decision trees that works well in the presence of many features

In the end, we need to choose one of the two techniques. Cross-validation methods allow us to estimate model accuracy; starting from that, we can measure the accuracy of both options and pick the one that performs better.

After choosing the more suitable machine learning algorithm, we could optimize it further using cross-validation. However, in order to avoid overcomplicating the model building, we don't perform any feature selection or parameter optimization.

These are the steps to build and evaluate the models:

  1. Load the randomForest package containing the random forest algorithm:
    library('randomForest')
  2. Define the formula relating the output to the feature names. The formula is in the format output ~ feature1 + feature2 + … (an equivalent one-liner is sketched after this list):
    arrayFeatures <- names(dtBank)
    arrayFeatures <- arrayFeatures[arrayFeatures != 'output']
    formulaAll <- paste('output', '~')
    formulaAll <- paste(formulaAll, arrayFeatures[1])
    for(nameFeature in arrayFeatures[-1]){
      formulaAll <- paste(formulaAll, '+', nameFeature)
    }
    formulaAll <- formula(formulaAll)
  3. Initialize the table containing all the testing sets:
    dtTestBinded <- data.table()
  4. Define the number of iterations:
    nIter <- 10
  5. Start a for loop:
    for(iIter in 1:nIter)
    {
  6. Define the training and the test datasets:
    indexTrain <- sample(
      x = c(TRUE, FALSE),
      size = nrow(dtBank),
      replace = TRUE,
      prob = c(0.8, 0.2)
    )
    dtTrain <- dtBank[indexTrain]
    dtTest <- dtBank[!indexTrain]
  7. Select a subset of the test set such that we have the same number of rows with output == 0 and output == 1. First, we split dtTest into two parts (dtTest0 and dtTest1) on the basis of the output and count the number of rows in each part (n0 and n1). Then, as dtTest0 has more rows, we randomly select n1 of its rows. In the end, we redefine dtTest by binding dtTest0 and dtTest1, as follows:
    dtTest1 <- dtTest[output == 1]
    dtTest0 <- dtTest[output == 0]
    n0 <- nrow(dtTest0)
    n1 <- nrow(dtTest1)
    dtTest0 <- dtTest0[sample(x = 1:n0, size = n1)]
    dtTest <- rbind(dtTest0, dtTest1)
  8. Build the random forest model using randomForest. The formula argument defines the relationship between variables and the data argument defines the training dataset. In order to avoid overcomplicating the model, all the other parameters are left as their defaults:
    modelRf <- randomForest(
      formula = formulaAll,
      data = dtTrain
    )
  9. Build the logistic regression model using glm, which is a function used to build Generalized Linear Models (GLMs). GLMs are a generalization of linear regression and they allow us to define a link function that connects the linear predictor with the output. The input is the same as for the random forest, with the addition of family = binomial(logit), which specifies that the regression is logistic. Note that the model must be fitted on the training set, not the test set:
    modelLr <- glm(
      formula = formulaAll,
      data = dtTrain,
      family = binomial(logit)
    )
  10. Predict the output of the random forest. The function is predict and its main arguments are object defining the model and newdata defining the test set, as follows:
    dtTest[, outputRf := predict(
      object = modelRf, newdata = dtTest, type = 'response'
    )]
  11. Predict the output of the logistic regression, using predict as with the random forest. Here, the type = 'response' argument is necessary: without it, glm predictions are returned on the scale of the linear predictor (the log-odds) rather than as probabilities:
    dtTest[, outputLr := predict(
      object = modelLr, newdata = dtTest, type = 'response'
    )]
  12. Add the new test set to dtTestBinded:
    dtTestBinded <- rbind(dtTestBinded, dtTest)
  13. End the for loop:
    }
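
As an aside to step 2, base R's reformulate function builds the same formula object in a single call. This is an equivalent sketch, not part of the original code:

# builds output ~ feature1 + feature2 + ... from the feature names
formulaAll <- reformulate(termlabels = arrayFeatures, response = 'output')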

We built dtTestBinded, which contains the output column defining which clients subscribed and the scores estimated by the models. By comparing the scores with the real output, we can validate the models' performance.
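
Before the visual exploration, we can also compare the two models numerically with the area under the ROC curve (AUC). This is a minimal sketch assuming the pROC package is installed; it is not part of the original article:

library('pROC')
# AUC ranges from 0.5 (random scores) to 1 (perfect separation);
# the model with the higher AUC separates the two groups better
aucRf <- auc(roc(response = dtTestBinded$output, predictor = dtTestBinded$outputRf))
aucLr <- auc(roc(response = dtTestBinded$output, predictor = dtTestBinded$outputLr))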

In order to explore dtTestBinded, we can build a chart showing how the scores of the non-subscribing clients are distributed. Then, we add the distribution of the subscribing clients to the chart and compare them. In this way, we can see the difference between the scores of the two groups. Since we use the same chart for the random forest and for the logistic regression, we define a function that builds the chart, following these steps:

  1. Define the function and its input arguments, which are the data table and the name of the score column:
    plotDistributions <- function(dtTestBinded, colPred)
    {
  2. Compute the distribution density for the clients that didn't subscribe. With output == 0, we extract the clients that didn't subscribe, and using density, we build a density object. The adjust parameter scales the smoothing bandwidth, which determines how the curve is built from the data; the bandwidth can be interpreted as the level of detail:
    densityLr0 <- dtTestBinded[
       output == 0,
       density(get(colPred), adjust = 0.5)
       ]
  3. Compute the distribution density for the clients that subscribed:
    densityLr1 <- dtTestBinded[
       output == 1,
       density(get(colPred), adjust = 0.5)
       ]
  4. Define the colors in the chart using rgb. The colors are transparent red and transparent blue:
    col0 <- rgb(1, 0, 0, 0.3)
    col1 <- rgb(0, 0, 1, 0.3)
  5. Build the plot with the density of the clients not subscribing. Here, polygon is a function that adds the area to the chart:
    plot(densityLr0, xlim = c(0, 1), main = 'density')
    polygon(densityLr0, col = col0, border = 'black')
  6. Add the clients that subscribed to the chart:
    polygon(densityLr1, col = col1, border = 'black')
  7. Add the legend:
    legend(
       'top',
       c('0', '1'),
       pch = 16,
       col = c(col0, col1)
    )
  8. End the function:
    return()
    }

Now, we can use plotDistributions on the random forest output:

par(mfrow = c(1, 1))
plotDistributions(dtTestBinded, 'outputRf')

The chart obtained is as follows:

[Figure: density of the random forest scores for the non-subscribing (red) and subscribing (blue) clients]

The x-axis represents the score and the y-axis represents the density, which is proportional to the number of clients with similar scores. Since we don't have a client for every possible score, the density curve is smoothed: the density at each score is estimated from the data with similar scores.
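
To see how adjust changes the level of detail, we can overlay the same density computed with two different bandwidth values; a quick illustrative sketch, following the same data.table pattern used in plotDistributions:

# smaller adjust = more detail (a wigglier curve); larger adjust = a smoother curve
densityDetail <- dtTestBinded[output == 0, density(outputRf, adjust = 0.1)]
densitySmooth <- dtTestBinded[output == 0, density(outputRf, adjust = 2)]
plot(densitySmooth, xlim = c(0, 1), main = 'bandwidth comparison')
lines(densityDetail, col = 'red')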

The red and blue areas represent the non-subscribing and subscribing clients respectively; the violet area comes from the overlap of the two curves. For each score, we can identify which density is higher: if the higher curve is blue, a client with that score is more likely to subscribe, and vice versa.

For the random forest, most of the non-subscribing clients' scores are between 0 and 0.2, with a density peak around 0.05. The subscribing clients' scores are more spread out, and higher overall, with a peak around 0.1. The two distributions overlap a lot, so it's not easy to identify which clients will subscribe from their scores alone. However, if the marketing campaign targets all the customers with a score higher than 0.3, it will reach clients that very likely belong to the blue cluster. In conclusion, using the random forest, we are able to identify a small set of customers that are very likely to subscribe.
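
Since plotDistributions takes the score column name as an argument, the corresponding chart for the logistic regression is one call away, which makes it easy to compare the two models on the same footing:

plotDistributions(dtTestBinded, 'outputLr')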

Summary

In this article, you learned how to predict an output using suitable machine learning techniques.
