The purpose of this article by Michele Usuelli, author of the book R Machine Learning Essentials, is to show how machine learning helps in solving a business problem.
The past marketing campaign targeted part of the customer base. Given 1,000 new clients, how do we identify the 100 that are most likely to subscribe? We can build a model that learns from the data and estimates which clients are most similar to the ones that subscribed in the previous campaign. For each client, the model estimates a score that is higher if the client is more likely to subscribe. Different machine learning models can determine the scores, and we use two well-performing techniques, as follows:
- The random forest
- The logistic regression
In the end, we need to choose one of the two techniques. Cross-validation methods allow us to estimate model accuracy; starting from that, we can measure the accuracy of both options and pick the one performing better.
After choosing the most suitable machine learning algorithm, we could optimize it further using cross-validation. However, in order to avoid overcomplicating the model building, we don't perform any feature selection or parameter optimization.
These are the steps to build and evaluate the models:
library(data.table)
library(randomForest)

# Define the formula: 'output' against all the other features
arrayFeatures <- names(dtBank)
arrayFeatures <- arrayFeatures[arrayFeatures != 'output']
formulaAll <- paste('output', '~')
formulaAll <- paste(formulaAll, arrayFeatures[1])
for(nameFeature in arrayFeatures[-1]){
  formulaAll <- paste(formulaAll, '+', nameFeature)
}
formulaAll <- formula(formulaAll)
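As a side note, base R's `reformulate` builds the same formula in a single call. A minimal sketch, using illustrative feature names in place of the columns of dtBank:

```r
# reformulate() joins the predictors with '+' and prepends the response;
# the feature names below are illustrative stand-ins for names(dtBank)
arrayFeatures <- c('age', 'job', 'balance')
formulaAll <- reformulate(termlabels = arrayFeatures, response = 'output')
print(formulaAll)   # output ~ age + job + balance
```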
# Validate the models over 10 random train/test splits
dtTestBinded <- data.table()
nIter <- 10
for(iIter in 1:nIter)
{
  # Randomly assign about 80% of the rows to the training set
  indexTrain <- sample(
    x = c(TRUE, FALSE),
    size = nrow(dtBank),
    replace = TRUE,
    prob = c(0.8, 0.2)
  )
  dtTrain <- dtBank[indexTrain]
  dtTest <- dtBank[!indexTrain]

  # Balance the test set: keep as many non-subscribers as subscribers
  dtTest1 <- dtTest[output == 1]
  dtTest0 <- dtTest[output == 0]
  n0 <- nrow(dtTest0)
  n1 <- nrow(dtTest1)
  dtTest0 <- dtTest0[sample(x = 1:n0, size = n1)]
  dtTest <- rbind(dtTest0, dtTest1)

  # Build both models on the training set
  modelRf <- randomForest(
    formula = formulaAll,
    data = dtTrain
  )
  modelLr <- glm(
    formula = formulaAll,
    data = dtTrain,
    family = binomial(logit)
  )

  # Score the test set with each model
  dtTest[, outputRf := predict(
    object = modelRf, newdata = dtTest, type = 'response'
  )]
  dtTest[, outputLr := predict(
    object = modelLr, newdata = dtTest, type = 'response'
  )]
  dtTestBinded <- rbind(dtTestBinded, dtTest)
}
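One simple way to compare the two score columns is the AUC, which can be computed in base R through the rank-sum (Mann-Whitney) identity. A minimal sketch, using synthetic scores in place of the dtTestBinded built by the loop above:

```r
# AUC via the rank-sum identity: the probability that a randomly chosen
# subscriber gets a higher score than a randomly chosen non-subscriber
auc <- function(score, label) {
  r  <- rank(score)                  # average ranks handle ties
  n1 <- sum(label == 1)
  n0 <- sum(label == 0)
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Synthetic stand-ins for dtTestBinded's columns: subscribers (label 1)
# tend to get higher scores than non-subscribers (label 0)
set.seed(1)
output   <- rep(c(0, 1), each = 50)
outputRf <- c(runif(50, 0.0, 0.6), runif(50, 0.2, 1.0))
outputLr <- c(runif(50, 0.0, 0.7), runif(50, 0.1, 1.0))

aucRf <- auc(outputRf, output)
aucLr <- auc(outputLr, output)
```

The model with the higher AUC ranks the subscribers above the non-subscribers more consistently.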
We built dtTestBinded, which contains the output column defining which clients subscribed, together with the scores estimated by the models. By comparing the scores with the real output, we can validate the model performance.
In order to explore dtTestBinded, we can build a chart showing how the scores of the non-subscribing clients are distributed. Then, we add the distribution of the subscribing clients to the chart and compare them. In this way, we can see the difference between the scores of the two groups. Since we use the same chart for the random forest and for the logistic regression, we define a function that builds the chart, as follows:
plotDistributions <- function(dtTestBinded, colPred)
{
  # Estimate the score density within each group
  densityLr0 <- dtTestBinded[
    output == 0,
    density(get(colPred), adjust = 0.5)
  ]
  densityLr1 <- dtTestBinded[
    output == 1,
    density(get(colPred), adjust = 0.5)
  ]

  # Semi-transparent red for non-subscribers, blue for subscribers
  col0 <- rgb(1, 0, 0, 0.3)
  col1 <- rgb(0, 0, 1, 0.3)

  # Set the y-axis range so that both curves fit in the chart
  yMax <- max(densityLr0$y, densityLr1$y)
  plot(densityLr0, xlim = c(0, 1), ylim = c(0, yMax), main = 'density')
  polygon(densityLr0, col = col0, border = 'black')
  polygon(densityLr1, col = col1, border = 'black')
  legend(
    'top',
    c('0', '1'),
    pch = 16,
    col = c(col0, col1)
  )
}
Now, we can use plotDistributions on the random forest output:
par(mfrow = c(1, 1))
plotDistributions(dtTestBinded, 'outputRf')
The chart obtained is as follows:
The x-axis represents the score and the y-axis represents the density, which is proportional to the number of clients with a similar score. Since we don't have a client for every possible score, the density curve is smoothed: the density at each score is estimated from the data with similar scores.
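The smoothing comes from R's kernel density estimator: the adjust argument used in plotDistributions scales the bandwidth, trading detail for smoothness. A minimal sketch with illustrative scores:

```r
# A small vector of illustrative scores, clustered at a few values
x <- c(rep(0.05, 30), rep(0.10, 20), rep(0.30, 5))

dSharp  <- density(x, adjust = 0.5)  # narrower bandwidth: more detail
dSmooth <- density(x, adjust = 2)    # wider bandwidth: smoother curve
```

Lowering adjust below 1 (as in plotDistributions) keeps more of the local structure of the scores visible in the chart.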
The red and blue areas represent the non-subscribing and subscribing clients respectively; the violet area comes from the overlap of the two curves. For each score, we can identify which density is higher: if the blue curve is on top, a client with that score is more likely to subscribe; if the red curve is on top, the client is more likely not to.
For the random forest, most of the non-subscribing client scores are between 0 and 0.2, with a density peak around 0.05. The subscribing clients have more spread-out, generally higher scores, with a peak around 0.1. The two distributions overlap a lot, so it's not easy to tell from its score alone which client will subscribe. However, almost all customers with a score higher than 0.3 belong to the blue group, so if the marketing campaign targets them, they will very likely subscribe. In conclusion, using the random forest, we are able to identify a small set of customers that will very likely subscribe.
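The effect of such a cut-off can be checked numerically: among the clients scoring above the threshold, what share actually subscribed, compared with the overall base rate? A minimal sketch, using deterministic stand-in scores shaped roughly like the densities described above:

```r
# Illustrative stand-in scores: 900 non-subscribers concentrated at low
# scores, 100 subscribers spread over a higher range
scoreNo  <- seq(0.00, 0.35, length.out = 900)
scoreYes <- seq(0.05, 0.80, length.out = 100)
score  <- c(scoreNo, scoreYes)
output <- c(rep(0, 900), rep(1, 100))

# Target every client whose score exceeds the cut-off
cutoff <- 0.3
targeted <- score > cutoff

# Share of subscribers among the targeted clients vs the overall base rate
precisionTargeted <- mean(output[targeted])
baseRate <- mean(output)
```

If precisionTargeted is well above baseRate, targeting by score beats contacting clients at random, at the cost of reaching fewer of them.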
In this article, you learned how to predict your output using suitable machine learning techniques.