Implementing cost-effective IoT analytics for predictive maintenance [Tutorial]

Predictive maintenance is a common value proposition cited for IoT analytics. In this tutorial, we will look at a value formula for net savings. We then walk through an example that highlights how to think financially about when it makes sense to implement a proactive repair decision and when it does not.

The economics of predictive maintenance may not be entirely obvious. Believe it or not, it does not always make sense, even if you can predict early failures accurately. In many cases, you will actually lose money by doing it. Even when it can save you money, there is an optimal point for when it should be used. The optimal point depends on the costs and the accuracy of the predictive model.

This article is an excerpt from a book written by Andrew Minteer titled Analytics for the Internet of Things (IoT).

The value formula

A formula to guide decision making compares the cost of allowing a failure to occur versus the cost to proactively repair the component while considering the probability of predicting the failure:

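Net Savings = (Cost of Failure * Expected True Positives) - (Proactive Repair Cost * (Expected True Positives + Expected False Positives))

The first term is the failure cost avoided by the correctly predicted failures. The second term is the cost of every proactive repair performed, whether or not the unit would actually have failed.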

If the cost of failure is the same as the proactive repair cost, then there will be no savings, even with a perfect prediction model. Make sure to include intangible costs in the cost of failure. Some examples of intangible costs include legal expenses, loss of brand equity, and even the customer's expenses.
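
As a minimal sketch, net savings can be computed as the failure costs avoided (one per true positive) minus the cost of every proactive repair performed (true and false positives). The function name `net_savings` and its argument names are hypothetical, and the sketch assumes that a proactive repair on a true positive fully prevents the failure. The counts in the second call are illustrative only.

```r
# Hypothetical helper (not from the book): net savings of a proactive repair program.
# Assumes each true positive repair fully avoids one failure.
net_savings <- function(cost_failure, cost_proactive, true_pos, false_pos) {
  avoided_failure_cost <- cost_failure * true_pos
  proactive_repair_spend <- cost_proactive * (true_pos + false_pos)
  avoided_failure_cost - proactive_repair_spend
}

# Perfect model (no false positives) but equal costs: nothing is saved
equal_cost_case <- net_savings(cost_failure = 500, cost_proactive = 500,
                               true_pos = 100, false_pos = 0)

# Illustrative counts only: 60 failures caught, 140 unnecessary repairs
example_case <- net_savings(cost_failure = 1000, cost_proactive = 253,
                            true_pos = 60, false_pos = 140)

print(equal_cost_case)
print(example_case)
```

The first call returns zero, matching the equal-cost case described above. The second shows that a program can still save money when most of its repairs turn out to be unnecessary, provided the cost spread is wide enough.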

Predictive repair does make sense when there is a large spread between the cost of failure and the cost of proactive replacement, combined with a well-performing prediction model. For example, if the cost of a failure is a locomotive engine replacement at $1 million USD and the cost of a proactive repair is $200 USD, then the accuracy of the model does not even have to be all that great before a proactive replacement program makes financial sense.

On the other hand, if the failure is a $400 USD automotive turbocharger replacement, and the proactive repair cost is $350 USD for a turbocharger actuator subcomponent replacement, the predictive model would need to be highly accurate for that to make financial sense.
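
The two examples can be compared by the precision a model needs in order to break even. Assuming a proactive repair on a true positive fully avoids the failure, savings are positive only when the failure costs avoided exceed the total proactive spend, which reduces to TP / (TP + FP) > proactive repair cost / cost of failure. The sketch below applies that ratio; the function name is hypothetical.

```r
# Hypothetical helper: minimum share of proactive repairs that must be true
# positives for the program to break even (assumes a repair fully avoids the failure)
break_even_precision <- function(cost_failure, cost_proactive) {
  cost_proactive / cost_failure
}

locomotive <- break_even_precision(cost_failure = 1e6, cost_proactive = 200)
turbocharger <- break_even_precision(cost_failure = 400, cost_proactive = 350)

cat(sprintf("Locomotive engine: %.2f%% of repairs must be true positives\n", 100 * locomotive))
cat(sprintf("Turbocharger: %.1f%% of repairs must be true positives\n", 100 * turbocharger))
```

Under these assumptions, roughly one correct prediction in 5,000 repairs is enough for the locomotive engine, while the turbocharger program needs seven correct predictions out of every eight repairs.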

An example of making a value decision

To illustrate, we will walk through a business situation and then some R code that simulates a cost-benefit curve for that decision. The code uses a fitted predictive model to calculate the net savings (or lack thereof) and generate a cost curve. The cost curve can then inform the business decision on what proportion of units with predicted failures should receive a proactive replacement.

Imagine you work for a company that builds diesel-powered generators. There is a coolant control valve that normally lasts for 4,000 hours of operation, at which point it has a planned replacement. From analysis, your company has realized that the generators built two years prior are experiencing earlier-than-expected failures of the valve.

When the valve fails, the engine overheats and several other components are damaged. The cost of failure, including labor rates for repair personnel and the cost to the customer for downtime, is an average of $1,000 USD. The cost of a proactive replacement of the valve is $253 USD.

Should you replace all coolant valves in the population? It depends on how high a failure rate is expected. In this case, about 10% of the current non-failed units are expected to fail before the scheduled replacement. Also, importantly, it matters how well you can predict the failures.
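
A quick back-of-the-envelope check of the blanket replacement option, using the figures from this scenario, might look like the following. It assumes, as a simplification, that a proactive replacement always prevents the failure; the variable names are illustrative.

```r
failure_cost <- 1000      # average cost of a valve failure (USD)
proactive_cost <- 253     # cost of a proactive valve replacement (USD)
failure_rate <- 0.10      # share of non-failed units expected to fail early

# Expected cost per unit if nothing is done: only the failing units incur a cost
do_nothing_per_unit <- failure_rate * failure_cost

# Cost per unit if every valve is replaced: every unit incurs the proactive cost
replace_all_per_unit <- proactive_cost

# Positive means blanket replacement saves money, negative means it loses money
blanket_savings_per_unit <- do_nothing_per_unit - replace_all_per_unit

cat("Do nothing:  $", do_nothing_per_unit, "per unit\n")
cat("Replace all: $", replace_all_per_unit, "per unit\n")
cat("Savings:     $", blanket_savings_per_unit, "per unit\n")
```

Under these assumptions, replacing every valve costs about $153 more per unit than doing nothing, which is why the replacements need to be targeted at the units most likely to fail.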

The following R code simulates this situation and uses a simple predictive model (logistic regression) to estimate a cost curve. The model has an AUC of close to 0.75. This will vary as you run the code since the dataset is randomly simulated:

#make sure all needed packages are installed
if(!require(caret)){
  install.packages("caret")
}
if(!require(pROC)){
  install.packages("pROC")
}
if(!require(dplyr)){
  install.packages("dplyr")
}
if(!require(data.table)){
  install.packages("data.table")
}

#Load required libraries
library(caret)
library(pROC)
library(dplyr)
library(data.table)

#Generate sample data
simdata = function(N=1000) {

  #simulate 4 features
  X = data.frame(replicate(4,rnorm(N)))
  #create a hidden data structure to learn
  hidden = X[,1]^2+sin(X[,2]) + rnorm(N)*1
  #10% TRUE, 90% FALSE
  rare.class.probability = 0.1
  #simulate the true classification values
  y.class = factor(hidden<quantile(hidden,c(rare.class.probability)))
  return(data.frame(X,Class=y.class))
}

#make some data structure
model_data = simdata(N=50000)

#train a logistic regression model on the simulated data
training <- createDataPartition(model_data$Class, p = 0.6, list=FALSE)
trainData <- model_data[training,]
testData <- model_data[-training,]
glmModel <- glm(Class~ . , data=trainData, family=binomial)
testData$predicted <- predict(glmModel, newdata=testData, type="response")

#calculate AUC
roc.glmModel <- pROC::roc(testData$Class, testData$predicted)
auc.glmModel <- pROC::auc(roc.glmModel)
print(auc.glmModel)

#Pull together test data and predictions
simModel <- data.frame(trueClass = testData$Class,
                       predictedClass = testData$predicted)

# Reorder rows and columns
simModel <- simModel[order(simModel$predictedClass, decreasing = TRUE), ]
simModel <- select(simModel, trueClass, predictedClass)
simModel$rank <- 1:nrow(simModel)

#Assign costs for failures and proactive repairs
proactive_repair_cost <- 253 # Cost of proactively repairing a part
failure_repair_cost <- 1000 # Cost of a failure of the part (include all costs such as lost production, etc not just the repair cost)

# Define each predicted/actual combination
fp.cost <- proactive_repair_cost # The part was predicted to fail but did not (False Positive)
fn.cost <- failure_repair_cost # The part was not predicted to fail and it did (False Negative)
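tp.cost <- (proactive_repair_cost - failure_repair_cost) # The part was predicted to fail and it did (True Positive). This will be negative for a savings.
# Note: tp.cost is net of the avoided failure, while fn.cost is a gross cost. The savings plot further down
# subtracts these costs from a do-nothing baseline that already counts every failure, so each avoided
# failure is credited twice there. For strict gross-cost accounting, set tp.cost <- proactive_repair_cost.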
tn.cost <- 0.0 # The part was not predicted to fail and it did not (True Negative)

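#incorporate probability of future failure
# prob_failure was not defined in the original listing; 0.10 is assumed here from the
# roughly 10% expected failure rate in the scenario. assignCost() below does not yet use it.
prob_failure <- 0.10
simModel$future_failure_prob <- prob_failure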

#Function to assign costs for each instance
assignCost <- function(pred, outcome, tn.cost, fn.cost, fp.cost, tp.cost, prob){
  cost <- ifelse(pred == 0 & outcome == FALSE, tn.cost, # No cost since no action was taken and no failure
          ifelse(pred == 0 & outcome == TRUE, fn.cost, # The cost of no action and a repair resulted
          ifelse(pred == 1 & outcome == FALSE, fp.cost, # The cost of proactive repair which was not needed
          ifelse(pred == 1 & outcome == TRUE, tp.cost, 999999999)))) # The cost of proactive repair which avoided a failure
  return(cost)
}

# Initialize list to hold final output (one slot per threshold step: 101 steps from 1.00 to 0.00)
master <- vector(mode = "list", length = 101)

#use the simulated model. In practice, this code can be adapted to compare multiple models
test_model <- simModel

# Add predicted class with percentile ranking (does not depend on the threshold, so compute it once)
test_model$prob_ntile <- ntile(test_model$predictedClass, 100) / 100

# Create a loop to step through the threshold, from 1.00 [only the top percentile is repaired] to 0.00 [all proactive repairs]
for (i in 1:101) {
  # Derive the threshold from the loop index: repeatedly subtracting 0.01 accumulates
  # floating-point error, which can shift the >= comparison below by a percentile
  threshold <- (101 - i) / 100

  # Dynamically determine if proactive repair would apply based on the current threshold
  test_model$glm_failure <- ifelse(test_model$prob_ntile >= threshold, 1, 0)
  test_model$threshold <- threshold

  # Compare to actual outcome to assign costs
  test_model$glm_impact <- assignCost(test_model$glm_failure, test_model$trueClass, tn.cost, fn.cost, fp.cost, tp.cost, test_model$future_failure_prob)

  # Compute cost for not doing any proactive repairs
  test_model$nochange_impact <- ifelse(test_model$trueClass == TRUE, fn.cost, tn.cost) # *test_model$future_failure_prob)

  # Running sum to produce the overall impact
  test_model$glm_cumul_impact <- cumsum(test_model$glm_impact) / nrow(test_model)
  test_model$nochange_cumul_impact <- cumsum(test_model$nochange_impact) / nrow(test_model)

  # Count the # of classified failures
  test_model$glm_failure_ct <- cumsum(test_model$glm_failure)

  # Keep the last row, which holds the totals for this threshold, for the final plot
  master[[i]] <- test_model[nrow(test_model),]
}

finalOutput <- rbindlist(master)
finalOutput <- subset(finalOutput,
                      select = c(threshold,
                                 glm_cumul_impact, glm_failure_ct, nochange_cumul_impact)
)

# Set baseline to costs of not doing any proactive repairs
baseline <- finalOutput$nochange_cumul_impact

# Plot the cost curve
par(mfrow = c(2,1))
plot(row(finalOutput)[,1],
     finalOutput$glm_cumul_impact,
     type = "l",
     lwd = 3,
     main = paste("Net Costs: Proactive Repair Cost of $", proactive_repair_cost, ", Failure cost $", failure_repair_cost, sep = ""),
     ylim = c(min(finalOutput$glm_cumul_impact) - 100,
              max(finalOutput$glm_cumul_impact) + 100),
     xlab = "Percent of Population",
     ylab = "Net Cost ($) / Unit")

# Plot the cost difference of proactive repair program and a 'do nothing' approach
plot(row(finalOutput)[,1],
     baseline - finalOutput$glm_cumul_impact,
     type = "l",
     lwd = 3,
     col = "black",
     main = paste("Savings: Proactive Repair Cost of $", proactive_repair_cost, ", Failure cost $", failure_repair_cost, sep = ""),
     ylim = c(min(baseline - finalOutput$glm_cumul_impact) - 100,
              max(baseline - finalOutput$glm_cumul_impact) + 100),
     xlab = "Percent of Population",
     ylab = "Savings ($) / Unit")
abline(h=0,col="gray")

As seen in the resulting net cost and savings curves, based on the model's predictions, the optimal savings would come from a proactive repair program covering the units in the top 30 percent of predicted failure risk. Savings decrease after this point, although you would still save money when replacing up to 75% of the population. Beyond that, you should expect to spend more than you save. The following set of charts is the output from the preceding code:

Cost and savings curves for the proactive repair $253 and failure cost at $1,000 scenario
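
The optimal point and the break-even point can also be read off a savings curve programmatically. The following is a minimal sketch with a hypothetical helper, `summarize_savings_curve`, applied to a toy curve that is made up for illustration and is not the output of the simulation:

```r
# Hypothetical helper: summarize a savings curve.
# pct_repaired: percent of the population proactively repaired (ascending)
# savings: savings per unit at each of those percentages
summarize_savings_curve <- function(pct_repaired, savings) {
  best <- which.max(savings)
  profitable <- pct_repaired[savings > 0]
  list(
    optimal_pct = pct_repaired[best],
    max_savings = savings[best],
    # Largest repaired share that still saves money (NA if none does)
    break_even_pct = if (length(profitable) > 0) max(profitable) else NA
  )
}

# Toy curve for illustration only (not output of the simulation)
pct <- seq(0, 100, by = 10)
toy_savings <- c(0, 20, 35, 40, 38, 30, 20, 8, -5, -25, -50)

curve_summary <- summarize_savings_curve(pct, toy_savings)
print(curve_summary)
```

With the simulation's output, a call such as `summarize_savings_curve(seq_len(nrow(finalOutput)), baseline - finalOutput$glm_cumul_impact)` should give the same kind of summary, assuming `finalOutput` and `baseline` exist as defined in the preceding code.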

Note the changes in the following graph when the failure cost drops to $300 USD. At no point do you save money, as the cost of the proactive repairs, including those performed on units that would not have failed, always outweighs the reduced failure cost that is avoided. This does not mean you should not do a proactive repair; you may still want to do so in order to satisfy your customers. Even in such a case, this cost curve method can help in decisions on how much you are willing to spend to address the problem. You can rerun the code with proactive_repair_cost set to 253 and failure_repair_cost set to 300 to generate the following charts:

Cost and savings curves for the proactive repair $253 and failure cost at $300 scenario

Finally, notice how the savings curve changes when the failure cost moves to $5,000 USD. The spread between the proactive repair cost and the failure cost largely determines when a proactive repair makes business sense. You can rerun the code with proactive_repair_cost set to 253 and failure_repair_cost set to 5000 to generate the following charts:

Cost and savings curves for the proactive repair $253 and failure cost at $5,000 scenario
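
One way to see why the spread matters so much is to ask how much better than random selection the model must be in each scenario. The sketch below is a simplified calculation, separate from the simulation. It assumes that each correctly predicted repair avoids exactly one failure and uses the 10% expected failure rate from the scenario; the column names are illustrative.

```r
proactive_cost <- 253                   # proactive replacement cost (USD)
failure_costs <- c(300, 1000, 5000)     # the three failure-cost scenarios (USD)
base_failure_rate <- 0.10               # expected early-failure rate in the population

scenarios <- data.frame(
  failure_cost = failure_costs,
  # Minimum share of repaired units that must be real failures to break even
  break_even_precision = proactive_cost / failure_costs
)

# How many times better than picking units at random the model must be
scenarios$required_lift <- scenarios$break_even_precision / base_failure_rate

# If the base rate already exceeds the break-even precision, even replacing
# every unit saves money
scenarios$blanket_replacement_pays <- base_failure_rate > scenarios$break_even_precision

print(scenarios)
```

Under these assumptions, the $300 scenario needs more than 84% of repairs to be real failures, which is far beyond what a model with an AUC near 0.75 is likely to deliver at a 10% failure rate. In the $5,000 scenario the break-even precision falls below the failure rate itself, so even untargeted replacement would pay for itself.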

Ultimately, the decision is a business case based on the expected costs and benefits. ML modeling can help optimize savings under the right conditions. Utilizing cost curves helps to determine the expected costs and savings of proactive replacements.

In this tutorial, we looked at implementing cost-effective IoT analytics for predictive maintenance, with an example. To further explore IoT analytics and the cloud, check out the book Analytics for the Internet of Things (IoT).
