How SQL generation works

The initial fragments of a stream leading from the database source nodes are the main targets for SQL generation. When a node is encountered that cannot be compiled to SQL, the data are extracted from the database and subsequent processing is performed by IBM® SPSS® Modeler Server.

During stream preparation and prior to execution, the SQL generation process happens as follows:

  • The server reorders streams to move downstream nodes into the “SQL zone” where it can be proven safe to do so. (This feature can be disabled on the server.)
  • Working from the source nodes toward the terminal nodes, SQL expressions are constructed incrementally. This phase stops when a node is encountered that cannot be converted to SQL or when the terminal node (for example, Table node or Graph node) is converted to SQL. At the end of this phase, each node is labeled with an SQL statement if the node and its predecessors have an SQL equivalent.
  • Working from the nodes with the most complicated SQL equivalents back toward the source nodes, the SQL is checked for validity. The SQL that was successfully validated is chosen for execution.
  • Nodes for which all operations have generated SQL are highlighted in purple on the stream canvas. Based on the results, you may want to further reorganize your stream where appropriate to take full advantage of database execution. See the topic Tips for maximizing SQL generation for more information.

Where Improvements Occur

SQL optimization improves performance in a number of data operations:

  • Joins (merge by key). Join operations can increase optimization within databases.
  • Aggregation. The Aggregate, Distribution, and Web nodes all use aggregation to produce their results. Summarized data uses considerably less bandwidth than the original data.
  • Selection. Choosing records based on certain criteria reduces the quantity of records.
  • Sorting. Sorting records is a resource-intensive activity that is performed more efficiently in a database.
  • Field derivation. New fields are generated more efficiently in a database.
  • Field projection. IBM SPSS Modeler Server extracts only fields that are required for subsequent processing from the database, which minimizes bandwidth and memory requirements. The same is also true for superfluous fields in flat files: although the server must read the superfluous fields, it does not allocate any storage for them.
  • Scoring. SQL can be generated from decision trees, rulesets, linear regression, and factor-generated models.