15 min read

(For more resources related to this topic, see here.)

Using the Feature Selection node creatively to remove or decapitate perfect predictors

In this recipe, we will identify perfect or near perfect predictors in order to insure that they do not contaminate our model. Perfect predictors earn their name by being correct 100 percent of the time, usually indicating circular logic and not a prediction of value. It is a common and serious problem.

When this occurs we have accidentally allowed information into the model that could not possibly be known at the time of the prediction. Everyone 30 days late on their mortgage receives a late letter, but receiving a late letter is not a good predictor of lateness because their lateness caused the letter, not the other way around.

The rather colorful term decapitate is borrowed from the data miner Dorian Pyle. It is a reference to the fact that perfect predictors will be found at the top of any list of key drivers (“caput” means head in Latin). Therefore, to decapitate is to remove the variable at the top. Their status at the top of the list will be capitalized upon in this recipe.

The following table shows the three time periods; the past, the present, and the future. It is important to remember that, when we are making predictions, we can use information from the past to predict the present or the future but we cannot use information from the future to predict the future. This seems obvious, but it is common to see analysts use information that was gathered after the date for which predictions are made. As an example, if a company sends out a notice after a customer has churned, you cannot say that the notice is predictive of churning.






Contract Start



Renewal Date


January 1, 2010

January 1, 2012


January 2, 2012


February 15, 2010

February 15, 2012

Out of Contract



March 21, 2010

March 21, 2012




April 5, 2010

April 5, 2012


April 9, 2012

New Customer

24 Months Ago




Getting ready

We will start with a blank stream, and will be using the cup98lrn reduced vars2.txt data set.

How to do it…

To identify perfect or near-perfect predictors in order to insure that they do not contaminate our model:

  1. Build a stream with a Source node, a Type node, and a Table then force instantiation by running the Table node.
  2. Force TARGET_B to be flag and make it the target.
  3. Add a Feature Selection Modeling node and run it.

  4. Edit the resulting generated model and examine the results. In particular, focus on the top of the list.

  5. Review what you know about the top variables, and check to see if any could be related to the target by definition or could possibly be based on information that actually postdates the information in the target.
  6. Add a CHAID Modeling node, set it to run in Interactive mode, and run it.
  7. Examine the first branch, looking for any child node that might be perfectly predicted; that is, look for child nodes whose members are all found in one category.

  8. Continue steps 6 and 7 for the first several variables.

  9. Variables that are problematic (steps 5 and/or 7) need to be set to None in the Type node.

How it works…

Which variables need decapitation? The problem is information that, although it was known at the time that you extracted it, was not known at the time of decision. In this case, the time of decision is the decision that the potential donor made to donate or not to donate. Was the amount, Target_D known before the decision was made to donate? Clearly not. No information that dates after the information in the target variable can ever be used in a predictive model.

This recipe is built of the following foundation—variables with this problem will float up to the top of the Feature Selection results.

They may not always be perfect predictors, but perfect predictors always must go. For example, you might find that, if a customer initially rejects or postpones a purchase, there should be a follow up sales call in 90 days. They are recorded as rejected offer in the campaign, and as a result most of them had a follow up call in 90 days after the campaign. Since a couple of the follow up calls might not have happened, it won’t be a perfect predictor, but it still must go.

Note that variables such as RFA_2 and RFA_2A are both very recent information and highly predictive. Are they a problem? You can’t be absolutely certain without knowing the data. Here the information recorded in these variables is calculated just prior to the campaign. If the calculation was made just after, they would have to go. The CHAID tree almost certainly would have shown evidence of perfect prediction in this case.

There’s more…

Sometimes a model has to have a lot of lead time; predicting today’s weather is a different challenge than next year’s prediction in the farmer’s almanac. When more lead time is desired you could consider dropping all of the _2 series variables. What would the advantage be? What if you were buying advertising space and there was a 45 day delay for the advertisement to appear? If the _2 variables occur between your advertising deadline and your campaign you might have to use information attained in the _3 campaign.

Next-Best-Offer for large datasets

Association models have been the basis for next-best-offer recommendation engines for a long time. Recommendation engines are widely used for presenting customers with cross-sell offers. For example, if a customer purchases a shirt, pants, and a belt; which shoes would he also likely buy? This type of analysis is often called market-basket analysis as we are trying to understand which items customers purchase in the same basket/transaction.

Recommendations must be very granular (for example, at the product level) to be usable at the check-out register, website, and so on. For example, knowing that female customers purchase a wallet 63.9 percent of the time when they buy a purse is not directly actionable. However, knowing that customers that purchase a specific purse (for example, SKU 25343) also purchase a specific wallet (for example, SKU 98343) 51.8 percent of the time, can be the basis for future recommendations.

Product level recommendations require the analysis of massive data sets (that is, millions of rows). Usually, this data is in the form of sales transactions where each line item (that is, row of data) represents a single product. The line items are tied together by a single transaction ID.

IBM SPSS Modeler association models support both tabular and transactional data. The tabular format requires each product to be represented as column. As most product level recommendations would contain thousands of products, this format is not practical. The transactional format uses the transactional data directly and requires only two inputs, the transaction ID and the product/item.

Getting ready

This example uses the file stransactions.sav and scoring.csv.

How to do it…

To recommend the next best offer for large datasets:

  1. Start with a new stream by navigating to File | New Stream.
  2. Go to File | Stream Properties from the IBM SPSS Modeler menu bar. On the Options tab change the Maximum members for nominal fields to 50000. Click on OK.
  3. Add a Statistics File source node to the upper left of the stream. Set the file field by navigating to transactions.sav. On the Types tab, change the Product_Code field to Nominal and click on the Read Values button. Click on OK.
  4. Add a CARMA Modeling node connected to the Statistics File source node in step 3. On the Fields tab, click on the Use custom settings and check the Use transactional format check box. Select Transaction_ID as the ID field and Product_Code as the Content field.

  5. On the Model tab of the CARMA Modeling node, change the Minimum rule support (%) to 0.0 and the Minimum rule confidence (%) to 5.0. Click on the Run button to build the model. Double-click the generated model to ensure that you have approximately 40,000 rules.

  6. Add a Var File source node to the middle left of the stream. Set the file field by navigating to scoring.csv. On the Types tab, click on the Read Values button. Click on the Preview button to preview the data. Click on OK to dismiss all dialogs.

  7. Add a Sort node connected to the Var File node in step 6. Choose Transaction_ID and Line_Number (with Ascending sort) by clicking the down arrow on the right of the dialog. Click on OK.
  8. Connect the Sort node in step 7 to the generated model (replacing the current link).
  9. Add an Aggregate node connected to the generated model.
  10. Add a Merge node connected to the generated model. Connect the Aggregate node in step 9 to the Merge node. On the Merge tab, choose Keys as the Merge Method, select Transaction_ID, and click on the right arrow. Click on OK.
  11. Add a Select node connected to the Merge node in step 10. Set the condition to Record_Count = Line_Number. Click on OK. At this point, the stream should look as follows:

  12. Add a Table node connected to the Select node in step 11. Right-click on the Table node and click on Run to see the next-best-offer for the input data.

How it works…

In steps 1-5, we set up the CARMA model to use the transactional data (without needing to restructure the data). CARMA was selected over A Priori for its improved performance and stability with large data sets. For recommendation engines, the settings for the Model tab are somewhat arbitrary and are driven by the practical limitations of the number of rules generated. Lowering the thresholds for confidence and rule support generates more rules. Having more rules can have a negative impact on scoring performance but will result in more (albeit weaker) recommendations.

Rule Support

How many transactions contain the entire rule (that is, both antecedents (“if” products) and consequents (“then” products))


If a transaction contains all the antecedents (“if” products), what percentage of the time does it contain the consequents (“then” products)

In step 5, when we examine the model we see the generated Association Rules with the corresponding rules support and confidences.

In the remaining steps (7-12), we score a new transaction and generate 3 next-best-offers based on the model containing the Association Rules. Since the model was built with transactional data, the scoring data must also be transactional. This means that each row is scored using the current row and the prior rows with the same transaction ID. The only row we generally care about is the last row for each transaction where all the data has been presented to the model. To accomplish this, we count the number of rows for each transaction and select the line number that equals the total row count (that is, the last row for each transaction).

Notice that the model returns 3 recommended products, each with a confidence, in order of decreasing confidence. A next-best-offer engine would present the customer with the best option first (or potentially all three options ordered by decreasing confidence). Note that, if there is no rule that applies to the transaction, nulls will be returned in some or all of the corresponding columns.

There’s more…

In this recipe, you’ll notice that we generate recommendations across the entire transactional data set. By using all transactions, we are creating generalized next-best-offer recommendations; however, we know that we can probably segment (that is, cluster) our customers into different behavioral groups (for example, fashion conscience, value shoppers, and so on.). Partitioning the transactions by behavioral segment and generating separate models for each segment will result in rules that are more accurate and actionable for each group. The biggest challenge with this approach is that you will have to identify the customer segment for each customer before making recommendations (that is, scoring). A unified approach would be to use the general recommendations for a customer until a customer segment can be assigned then use segmented models.

Correcting a confusion matrix for an imbalanced target variable by incorporating priors

Classification models generate probabilities and a classification predicted class value. When there is a significant imbalance in the proportion of True values in the target variable, the confusion matrix as seen in the Analysis node output will show that the model has all predicted class values equal to the False value, leading an analyst to conclude the model is not effective and needs to be retrained. Most often, the conventional wisdom is to use a Balance node to balance the proportion of True and False values in the target variable, thus eliminating the problem in the confusion matrix.

However, in many cases, the classifier is working fine without the Balance node; it is the interpretation of the model that is biased. Each model generates a probability that the record belongs to the True class and the predicted class is derived from this value by applying a threshold of 0.5. Often, no record has a propensity that high, resulting in every predicted class value being assigned False.

In this recipe we learn how to adjust the predicted class for classification problems with imbalanced data by incorporating the prior probability of the target variable.

Getting ready

This recipe uses the datafile cup98lrn_reduced_vars3.sav and the stream Recipe – correct with priors.str.

How to do it…

To incorporate prior probabilities when there is an imbalanced target variable:

  1. Open the stream Recipe – correct with priors.str by navigating to File | Open Stream.
  2. Make sure the datafile points to the correct path to the datafile cup98lrn_reduced_vars3.sav.
  3. Open the generated model TARGET_B, and open the Settings tab. Note that compute Raw Propensity is checked. Close the generated model.
  4. Duplicate the generated model by copying and pasting the node in the stream. Connect the duplicated model to the original generated model.
  5. Add a Type node to the stream and connect it to the generated model. Open the Type node and scroll to the bottom of the list. Note that the fields related to the two models have not yet been instantiated. Click on Read Values so that they are fully instantiated.

  6. Insert a Filler node and connect it to the Type node.
  7. Open the Filler node and, in the variable list, select $N1-TARGET_B. Inside the Condition section, type $RP1-TARGET_B’ >= TARGET_B_Mean, Click on OK to dismiss the Filler node (after exiting the Expression Builder).

  8. Insert an Analysis node to the stream. Open the Analysis node and click on the check box for Coincidence Matrices. Click on OK.
  9. Run the stream to the Analysis node. Notice that the coincidence matrix (confusion matrix) for $N-TARGET_B has no predictions with value = 1, but the coincidence matrix for the second model, the one adjusted by step 7 ($N1-TARGET_B), has more than 30 percent of the records labeled as value = 1.

How it works…

Classification algorithms do not generate categorical predictions; they generate probabilities, likelihoods, or confidences. For this data set, the target variable, TARGET_B, has two values: 1 and 0. The classifier output from any classification algorithm will be a number between 0 and 1. To convert the probability to a 1 or 0 label, the probability is thresholded, and the default in Modeler (and all predictive analytics software) is the threshold at 0.5. This recipe changes that default threshold to the prior probability.

The proportion of TARGET_B = 1 values in the data is 5.1 percent, and therefore this is the classic imbalanced target variable problem. One solution to this problem is to resample the data so that the proportion of 1s and 0s are equal, normally achieved through use of the Balance node in Modeler. Moreover, one can create the Balance node from running a Distribution node for TARGET_B, and using the Generate | Balance node (reduce) option. The justification for balancing the sample is that, if one doesn’t do it, all the records will be classified with value = 0.

The reason for all the classification decisions having value 0 is not because the Neural Network isn’t working properly. Consider the histogram of predictions from the Neural Network shown in the following screenshot. Notice that the maximum value of the predictions is less than 0.4, but the center of density is about 0.05. The actual shape of the histogram and the maximum predicted value depend on the Neural Network; some may have maximum values slightly above 0.5.

If the threshold for the classification decision is set to 0.5, since no neural network predicted confidence is greater than 0.5, all of the classification labels will be 0. However, if one sets the threshold to the TARGET_B prior probability, 0.051, many of the predictions will exceed that value and be labeled as 1. We can see the result of the new threshold by color-coding the histogram of the previous figure with the new class label, in the following screenshot.

This recipe used a Filler node to modify the existing predicted target value. The categorical prediction from the Neural Network whose prediction is being changed is $N1-TARGET_B. The $ variables are special field names that are used automatically in the Analysis node and Evaluation node. It’s possible to construct one’s own $ fields with a Derive node, but it is safer to modify the one that’s already in the data.

There’s more…

This same procedure defined in this recipe works for other modeling algorithms as well, including logistic regression. Decision trees are a different matter. Consider the following screenshot. This result, stating that the C5 tree didn’t split at all, is the result of the imbalanced target variable.

Rather than balancing the sample, there are other ways to get a tree built. For C&RT or Quest trees, go to the Build Options, select the Costs & Priors item, and select Equal for all classes for priors: equal priors. This option forces C&RT to treat the two classes mathematically as if their counts were equal. It is equivalent to running the Balance node to boost samples so that there are equal numbers of 0s and 1s. However, it’s done without adding additional records to the data, slowing down training; equal priors is purely a mathematical reweighting.

The C5 tree doesn’t have the option of setting priors. An alternative, one that will work not only with C5 but also with C&RT, CHAID, and Quest trees, is to change the Misclassification Costs so that the cost of classifying a one as a zero is 20, approximately the ratio of the 95 percent 0s to 5 percent 1s.

Subscribe to the weekly Packt Hub newsletter. We'll send you the results of our AI Now Survey, featuring data and insights from across the tech landscape.

* indicates required


Please enter your comment!
Please enter your name here