Merge Optimization Settings

The system provides two options that can help you merge data more efficiently in certain situations. These options allow you to optimize merging when one input dataset is significantly larger than the other datasets or when your data are already sorted by all or some of the key fields that you are using for the merge.

Note: Optimizations from this tab only apply to IBM® SPSS® Modeler native node execution only; that is, when the Merge node does not pushback to SQL. Optimization settings have no effect on SQL generation.

One input dataset is relatively large. Select to indicate that one of the input datasets is much larger than the others. The system will cache the smaller datasets in memory and then perform the merge by processing the large dataset without caching or sorting it. You will commonly use this type of join with data designed using a star-schema or similar design, where there is a large central table of shared data (for example, in transactional data). If you select this option, click Select to specify the large dataset. Note that you can select only one large dataset. The following table summarizes which joins can be optimized using this method.

Table 1. Summary of join optimizations
Type of Join Can be optimized for a large input dataset?
Inner Yes
Partial Yes, if there are no incomplete records in the large dataset.
Full No
Anti-join Yes, if the large dataset is the first input.

All inputs are already sorted by key field(s). Select to indicate that the input data are already sorted by one or more of the key fields that you are using for the merge. Make sure all your input datasets are sorted.

Specify existing sort order. Specify the fields that are already sorted. Using the Select Fields dialog box, add fields to the list. You can select from only the key fields that are being used for the merge (specified in the Merge tab). In the Order column, specify whether each field is sorted in ascending or descending order. If you are specifying multiple fields, make sure that you list them in the correct sorting order. Use the arrows to the right of the list to arrange the fields in the correct order. If you make a mistake in specifying the correct existing sort order, an error will appear when you run the stream, displaying the record number where the sorting is inconsistent with what you specified.

Depending on the case sensitivity of the collation method used by the database, optimization may not function correctly where one or more inputs are sorted by the database. For example, if you have two inputs where one is case sensitive and the other is case insensitive, the results of sorting could be different. Merge optimization causes records to be processed using their sorted order. As a result, if inputs are sorted using different collation methods, the Merge node reports an error and displays the record number where sorting is inconsistent. When all inputs are from one source, or are sorted using mutually inclusive collations, records can be merged successfully.

Note: Merging speed may benefit from enabling parallel processing.