Order of Nodes
Even when you are not using SQL optimization, the order of nodes in a stream can affect performance. The general goal is to minimize downstream processing; therefore, when you have nodes that reduce the amount of data, place them near the beginning of the stream. IBM® SPSS® Modeler Server can apply some reordering rules automatically during compilation to bring forward certain nodes when it can be proven safe to do so. (This feature is enabled by default. Check with your system administrator to make sure it is enabled in your installation.)
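The effect of node order on downstream processing can be illustrated with a toy cost model. This is only a sketch: the function name, the node labels, the input size, and the 10% selectivity are all assumptions made up for illustration, not measurements or SPSS Modeler APIs. It also assumes the select condition does not depend on the derived fields, so the two orderings produce the same result.

```python
# Toy cost model (hypothetical): each node processes every record it receives.
# Row counts and the 10% selectivity are made-up illustration values.

def records_processed(stream, input_rows):
    """Return the total number of records handled across all nodes.

    `stream` is a list of (name, keep_percent) pairs, where keep_percent
    is the percentage of incoming records the node passes downstream.
    """
    total = 0
    rows = input_rows
    for name, keep_percent in stream:
        total += rows                      # the node sees every incoming record
        rows = rows * keep_percent // 100  # records passed to the next node
    return total

derive = ("derive", 100)  # passes every record through
select = ("select", 10)   # keeps 10% of records (assumed selectivity)

late_select = records_processed([derive, derive, select], 1000000)
early_select = records_processed([select, derive, derive], 1000000)

print(late_select)   # 3000000
print(early_select)  # 1200000
```

Under these assumptions, moving the select to the front cuts the total records handled from 3,000,000 to 1,200,000, because both derive operations then work on the reduced data.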
When using SQL optimization, you want to maximize its availability and efficiency. Since optimization halts when the stream contains an operation that cannot be performed in the database, it is best to group SQL-optimized operations together at the beginning of the stream. This strategy keeps more of the processing in the database, so less data is carried into IBM SPSS Modeler.
The following operations can be done in most databases. Try to group them at the beginning of the stream:
- Merge by key (join)
- Select
- Aggregate
- Sort
- Sample
- Append
- Distinct operations in include mode, in which all fields are selected
- Filler operations
- Basic derive operations using standard arithmetic or string manipulation (depending on which operations are supported by the database)
- Set-to-flag
The following operations cannot be performed in most databases. They should be placed in the stream after the operations in the preceding list:
- Operations on any nondatabase data, such as flat files
- Merge by order
- Balance
- Distinct operations in discard mode or where only a subset of fields is selected as distinct
- Any operation that requires accessing data from records other than the one being processed
- State and count field derivations
- History node operations
- Operations involving "@" (time-series) functions
- Type-checking modes Warn and Abort
- Model construction, application, and analysis
Note: Decision trees, rulesets, linear regression, and factor-generated models can generate SQL and can therefore be pushed back to the database.
- Data output to anywhere other than the same database that is processing the data
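The grouping advice can be sketched with a simplified model in which everything up to the first operation that cannot run in the database stays in the database, and everything after it is processed in IBM SPSS Modeler. The operation labels and the function name below are hypothetical, chosen to mirror the lists above; they are not SPSS Modeler node identifiers, and the real optimizer is likely more nuanced than this linear model. The two example streams are not claimed to be equivalent; they only show where optimization would halt.

```python
# Illustrative labels only (assumed), loosely mirroring the first list above.
DATABASE_FRIENDLY = {
    "merge_by_key", "select", "aggregate", "sort", "sample",
    "append", "filler", "basic_derive", "set_to_flag",
}

def split_at_first_non_database_op(operations):
    """Split a linear stream into (in_database, in_modeler).

    Simplified assumption: optimization halts at the first operation
    that is not in DATABASE_FRIENDLY, and never resumes after it.
    """
    for i, op in enumerate(operations):
        if op not in DATABASE_FRIENDLY:
            return operations[:i], operations[i:]
    return list(operations), []

# A non-database operation early in the stream halts optimization early.
scattered = ["select", "merge_by_order", "aggregate", "sort"]
# The same kinds of operations, with database-friendly ones grouped first.
grouped = ["select", "aggregate", "sort", "merge_by_order"]

in_db_scattered, rest_scattered = split_at_first_non_database_op(scattered)
in_db_grouped, rest_grouped = split_at_first_non_database_op(grouped)

print(in_db_scattered)  # ['select']
print(in_db_grouped)    # ['select', 'aggregate', 'sort']
```

In this model, the grouped stream keeps three operations in the database instead of one, so less data would need to be carried into IBM SPSS Modeler. Whether a particular reordering is valid still depends on whether the operations involved produce the same result in either order.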