7 min read

[box type=”note” align=”” class=”” width=””]The following excerpt is taken from the book IBM SPSS Modeler Essentials written by Keith McCormick and Jesus Salcedo. This book gets you up and running with the fundamentals of SPSS Modeler, a premium tool for data mining and predictive analytics.[/box]

In today’s tutorial we will demonstrate how easy it is to work with missing values in a dataset using the SPSS Modeler.

Missing data is different than other topics in data modeling that you cannot choose to ignore . This is because failing to make a choice just means you are using the default option for a procedure, which most of the time is not optimal. In fact, it is important to remember that every model deals with missing data in a certain way, and some modeling techniques handle missing data better than others.

In SPSS Modeler, there are four types of missing data:

Type of missing data Definition
$Null$ value Applies only to numeric fields. This is a cell that is empty or has an illegal value
White space Applies only to string fields. This is a cell that is empty or has spaces.
Empty string Applies only to string fields. This is a cell that is empty. Empty string is a subset of white space
Blank value This is predefined code, and it applies to any type of field

The first step in dealing with missing data is to assess the type and amount of missing data for each field. Consider whether there is a pattern as to why data might be missing. This can help determine if missing values could have affected responses. Only then can we decide how to handle it.

There are two problems associated with missing data, and these affect the quantity and quality of the data:

  • Missing data reduces sample size (quantity)
  • Responders may be different from non-responders (quality—there could be biased results)

Ways to address missing data

There are three ways to address missing data:

  • Remove fields
  • Remove cases
  • Impute missing values

It can be necessary at times to remove fields with a large proportion of missing values. The easiest way to remove fields is to use a Filter node (discussed later in the book), however you can also use the Data Audit node to do this.

[box type=”info” align=”” class=”” width=””]Note that in some cases missing data can be predictive of behavior, so it is important to assess the importance of a variable before removing a field.[/box]

In some situations, it may be necessary to remove cases instead of fields. For example, you may be developing a predictive model to predict customers’ purchasing behavior and you simply do not have enough information concerning new customers. The easiest way to remove cases would be to use a Select node (discussed in the next chapter); however, you can also use the Data Audit node to do this.

Imputing missing values implies replacing values for fields. However, some people do not estimate values for categorical fields because it does not seem right. In general, it is easier to estimate missing values for numeric fields, such as age, where often analysts will use the mean, median, or mode.

[box type=”info” align=”” class=”” width=””]Note that it is not a good idea to estimate missing data if you are missing a large percentage of information for that field, because estimates will not be accurate. Typically, we try not to impute more than 5% of values.[/box]

To close out of the Data Audit node:

  1. Click OK to return to the stream canvas

Defining missing values in the Type node

When working with missing data, the first thing you need to do is define the missing data so that Modeler knows there is missing data, otherwise Modeler will think that the missing data is another value for a field (which, in some situations, it is, as in our dataset, but quite often this is not the case).

Although the Data Audit node provides a report of missing values, blank values need to be defined within a Type node (or Type tab of a source node) in order for these to be identified by the Data Audit node. The Type tab (or node) is the only place where users can define missing values (Missing column).

[box type=”info” align=”” class=”” width=””]Note that in the Type node, blank values and $null$ values are not shown; however, empty strings and white space are depicted by “” or ” “.[/box]

To define blank values:

  1. Edit the Var.File node.
  2. Click on the Types tab.
  3. Click on the Missing cell for the field Region.
  4. Select Specify in the Missing column.
  5. Click Define blanks.

Selecting Define blanks chooses Null and White space (remember, Empty String is a subset of White space, so it is also selected), and in this way these types of missing data are specified. To specify a predefined code, or a blank value, you can add each individual value to a separate cell in the Missing values area, or you can enter a range of numeric values if they are consecutive.

6. Type “Not applicable” in the first Missing values cell.

7. Hit Enter:

Region values

We have now specified that “Not applicable” is a code for missing data for the field Region.

8. Click OK.

In our dataset, we will only define one field as having missing data.

9. Click on the Clear Values button.

10. Click on the Read Values button:

Demographic

The asterisk indicates that missing values have been defined for the field Region. Now Not applicable is no longer considered a valid value for the field Region, but it will still be shown in graphs and other output. However, models will now treat the category Not applicable as a missing value.

11. Click OK.

Imputing missing values with the Data Audit node

As we have seen, the Data Audit node allows you to identify missing values so that you can get a sense of how much missing data you have. However, the Data Audit node also allows you to remove fields or cases that have missing data, as well as providing several options for data imputation:

  1. Rerun the Data Audit node.

Note that the field Region only has 15,774 valid cases now, because we have correctly identified that the Not applicable category was a predefined code for missing data.

2. Click on the Quality tab.

We are not going to impute any missing values in this example because it is not necessary, but we are going to show you some of the options, since these will be useful in other situations. To impute missing values you first need to specify when you want to impute missing values. For example:

3. Click in the Impute when cell for the field Region.

4. Select the Blank & Null Values.

Now you need to specify how the missing values will be imputed.

5. Click in the Impute Method cell for the field Region.

6. Select Specify.

In this dialog box, you can specify which imputation method you want to use, and once you have chosen a method, you can then further specify details about the imputation. There are several imputation methods:

  • Fixed uses the same value for all cases. This fixed value can be a constant, the mode, the mean, or a midpoint of the range (the options will vary depending on the measurement level of the field).
  • Random uses a random (different) value based on a normal or uniform distribution. This allows for there to be variation for the field with imputed values.
  • Expression allows you to create your own equation to specify missing values.
  • Algorithm uses a value predicted by a C&R Tree model.

Imputation settings

We are not going to impute any values now so click Cancel.

If we had selected an imputation method, we would then:

  1. Click on the field Region to select it.
  2. Click on the Generate menu.

The Generate menu of the Data Audit node allows you to remove fields, remove cases, or impute missing values, explained as follows:

  • Missing Values Filter Node: This removes fields with too much missing data, or keeps fields with missing data so that you can investigate them further
  • Missing Values Select Node: This removes cases with missing data, or keeps cases with missing data so that you can further investigate
  • Missing Values SuperNode: This imputes missing values:

Generate

If we were going to impute values, we would then click Missing Values SuperNode.

In this way you can impute missing values using SPSS Modeler, and it makes your analysis a lot more easier.

If you found our post useful, make sure to check out our book IBM SPSS Modeler Essentials, for more information on data mining and generating hidden insights using the popular SPSS Modeler tool.

IBM SPSS Modeler Essentials

 

Data Science Enthusiast. A massive science fiction and Manchester United fan. Loves to read, write and listen to music.

LEAVE A REPLY

Please enter your comment!
Please enter your name here