How to combine data files within IBM SPSS Modeler

[box type="note" align="" class="" width=""]The following extract is taken from the book IBM SPSS Modeler Essentials, written by Keith McCormick and Jesus Salcedo. SPSS Modeler is one of the popularly used enterprise tools for data mining and predictive analytics. [/box]

In this article, we will explore how SPSS Modeler can be effectively used to combine different file types for efficient data modeling.

In many organizations, different pieces of information for the same individuals are held in separate locations. To be able to analyze such information within Modeler, the data files must be combined into one single file. The Merge node joins two or more data sources so that information held for an individual in different locations can be analyzed collectively. The following diagram shows how the Merge node can be used to combine two separate data files that contain different types of information:

combine-data-files-within-ibm-spss-modeler-img-0

Like the Append node, the Merge node is found in the Record Ops palette. This node takes multiple data sources and creates a single source containing all or some of the input fields.

Let's go through an example of how to use the Merge node to combine data files:

Open the Merge stream.

The Merge stream contains the files we previously appended, as well as the main data file we were working with in earlier chapters.

2. Place a Merge node from the Record Ops palette on the canvas.

3. Connect the last Reclassify node to the Merge node.

4. Connect the Filter node to the Merge node.

[box type="info" align="" class="" width=""]Like the Append node, the order in which data sources are connected to the Merge node impacts the order in which the sources are displayed. The fields of the first source connected to the Merge node will appear first, followed by the fields of the second source connected to the Merge node, and so on.[/box]

5. Connect the Merge node to a Table node:

combine-data-files-within-ibm-spss-modeler-img-1

6. Edit the Merge node.

Since the Merge node can cope with a variety of different situations, the Merge tab allows you to specify the merging method.

There are four methods for merging:

Order: It joins the first record in the first dataset with the first record in the second dataset, and so on. If any of the datasets run out of records, no further output records are produced. This method can be dangerous if there happens to be any cases that are missing from a file, or if files have been sorted differently.
Keys: It is the most commonly used method, used when records that have the same value in the field(s) defined as the key are merged. If multiple records contain the same value on the key field, all possible merges are returned.
Condition: It joins records from files that meet a specified condition.
Ranked condition: It specifies whether each row pairing in the primary dataset and all secondary datasets are to be merged; use the ranking expression to sort any multiple matches from low to high order.

Let's combine these files. To do this:

Set Merge Method to Keys.

Fields contained in all input sources appear in the Possible keys list. To identify one of more fields as the key field(s), move the selected field into the Keys for merge list. In our case, there are two fields that appear in both files, ID and Year.

2. Select ID in the Possible keys list and place it into the Keys for merge list:

combine-data-files-within-ibm-spss-modeler-img-2

There are five major methods of merging using a key field:

Include only matching records (inner join) merges only complete records, that is, records; that are available in all datasets.
Include matching and non-matching records (full outer join) merges records that appear in any of the datasets; that is, the incomplete records are still retained. The undefined value ($null$) is added to the missing fields and included in the output.
Include matching and selected non-matching records (partial outerjoin) performs left and right outer-joins. All records from the specified file are retained, along with only those records from the other file(s) that match records in the specified file on the key field(s). The Select... button allows you to designate which file is to contribute incomplete records.
Include records in first dataset not matching any others (anti-join) provides an easy way of identifying records in a dataset that do not have records with the same key values in any of the other datasets involved in the merge. This option only retains records from the dataset that match with no other records.
Combine duplicate key fields is the final option in this dialog, and it deals with the problem of duplicate field names (one from each dataset) when key fields are used. This option ensures that there is only one output field with a given name, and this is enabled by default.

The Filter tab

The Filter tab lists the data sources involved in the merge, and the ordering of the sources determines the field ordering of the merged data. Here, you can rename and remove fields. Earlier, we saw that the field Year appeared in both datasets; here we can remove one version of this field (we could also rename one version of the field to keep both):

Click on the arrow next to the second Year field:

combine-data-files-within-ibm-spss-modeler-img-3

The second Year field will no longer appear in the combined data file.

The Optimization tab

The Optimization tab provides two options that allow you to merge data more efficiently when one input dataset is significantly larger than the other datasets, or when the data is already presorted by all or some of the key fields that you are using to merge:

Click OK.
Run the Table:

combine-data-files-within-ibm-spss-modeler-img-4

All of these files have now been combined. The resulting table should have 44 fields and 143,531 records.

We saw how the Merge node is used to join data files that contain different information for the same records.

If you found this post useful, make sure to check out IBM SPSS Modeler Essentials for more information on leveraging SPSS Modeler to get faster and efficient insights from your data.