Tips for maximizing SQL generation

To get the best performance boost from SQL optimization, pay attention to the following items.

Stream order. SQL generation may be halted when the function of the node has no semantic equivalent in SQL because IBM® SPSS® Modeler’s data-mining functionality is richer than the traditional data-processing operations supported by standard SQL. When this happens, SQL generation is also suppressed for any downstream nodes. Therefore, you may be able to significantly improve performance by reordering nodes to put operations that halt SQL as far downstream as possible. The SQL optimizer can do a certain amount of reordering automatically (just make sure stream rewriting is enabled), but further improvements may be possible. A good candidate for this is the Select node, which can often be brought forward. See the topic Nodes supporting SQL generation for more information.

CLEM expressions. If a stream cannot be reordered, you may be able to change node options or CLEM expressions or otherwise recast the way the operation is performed, so that it no longer inhibits SQL generation. Derive, Select, and similar nodes can commonly be rendered into SQL, provided that all of the CLEM expression operators have SQL equivalents. Most operators can be rendered, but there are a number of operators that inhibit SQL generation (in particular, the sequence functions [“@ functions”]). Sometimes generation is halted because the generated query has become too complex for the database to handle. See the topic CLEM Expressions and Operators Supporting SQL Generation for more information.

Multiple source nodes. Where a stream has multiple database source nodes, SQL generation is applied to each input branch independently. If generation is halted on one branch, it can continue on another. Where two branches merge (and both branches can be expressed in SQL up to the merge), the merge itself can often be replaced with a database join, and generation can be continued downstream.

Database algorithms. Model estimation is always performed on IBM SPSS Modeler Server rather than the database, except when using database-native algorithms from Microsoft, IBM, or Oracle.

Scoring models. In-database scoring is supported for some models by rendering the generated model into SQL. However, some models generate extremely complex SQL expressions that aren’t always evaluated effectively within the database. For this reason, SQL generation must be enabled separately for each model node. If you find that a model node is inhibiting SQL generation, go to the Settings tab on the node dialog box and select Generate SQL for this model (with some models, you may have additional options controlling generation). Run tests to confirm that the option is beneficial for your application. See the topic Nodes supporting SQL generation for more information.

When testing modeling nodes to see if SQL generation for models works effectively, we recommend first saving all streams from IBM SPSS Modeler. Some database systems may hang while trying to process the (potentially complex) generated SQL, requiring IBM SPSS Modeler to be closed from the Windows task manager.

Database caching. If you are using a node cache to save data at critical points in the stream (for example, following a Merge or Aggregate node), make sure that database caching is enabled along with SQL optimization. This will allow data to be cached to a temporary table in the database (rather than the file system) in most cases. See the topic Configuring SQL Optimization for more information.

Vendor-specific SQL. Most of the generated SQL is standards-conforming (SQL-92), but some nonstandard, vendor-specific features are exploited where practical. The degree of SQL optimization can vary, depending on the database source.

SQL configuration option. By default, SPSS Modeler considers SQL queries written in an ODBC source node to be non-replayable, meaning that the query is considered to return different results when being executed multiple times. However, in some cases, this may prevent SPSS Modeler from generating SQL for downstream nodes. You can override this behavior by changing the following value to Y in the odbc-db2-custom-properties.cfg. The file is located in the SPSS Modeler config directory.

assume_custom_sql_replayable, Y