SQL optimization
One of the most powerful capabilities of IBM® SPSS® Modeler is the ability to perform many data preparation and mining operations directly in the database. By generating SQL code that can be pushed back to the database for execution, many operations, such as sampling, sorting, deriving new fields, and certain types of graphing, can be performed in the database rather than on the IBM SPSS Modeler or IBM SPSS Modeler Server computer. When you are working with large data sets, these pushbacks can dramatically enhance performance in several ways:
- By reducing the size of the result set to be transferred from the DBMS to IBM SPSS Modeler. When large result sets are read through an ODBC driver, network I/O or driver inefficiencies can result. For this reason, the operations that benefit most from SQL optimization are row and column selection and aggregation (Select, Sample, Aggregate nodes). These operations typically reduce the size of the data set to be transferred. Data can also be cached to a temporary table in the database at critical points in the stream (after a Merge or Select node, for example) to further improve performance.
- By using the performance and scalability of the database. Efficiency is increased because a DBMS can often take advantage of parallel processing, more powerful hardware, more sophisticated management of disk storage, and the presence of indexes.
Given these advantages, IBM SPSS Modeler is designed to maximize the amount of SQL generated by each stream so that only those operations that can't be compiled to SQL will be run by IBM SPSS Modeler Server. Because of limitations in what can be expressed in standard SQL (SQL-92), however, certain operations may not be supported. For more information, see Tips for maximizing SQL generation.
- Because of minor differences in SQL implementation, streams that run in a database can return slightly different results when they run in IBM SPSS Modeler. These differences may also vary depending on the database vendor, for similar reasons. For example, depending on the database configuration for case sensitivity in string comparison and string collation, IBM SPSS Modeler streams that run by using SQL pushback might produce different results from those that run without SQL pushback. Contact your database administrator for advice on configuring your database. To maximize compatibility with IBM SPSS Modeler, we recommend making sure database string comparisons are case-sensitive.
- Database modeling and SQL optimization require that IBM SPSS Modeler Server connectivity is enabled on the
IBM SPSS Modeler computer. With this setting
enabled, you can access database algorithms, push back SQL directly from IBM SPSS Modeler, and access IBM SPSS Modeler Server. To verify the current license
status, from the IBM SPSS Modeler menu, go to:
-
If connectivity is enabled, you see the option Server Enablement in the License Status tab.
See Connecting to IBM SPSS Modeler Server for more information.
-
- When using IBM SPSS Modeler to generate SQL, the result that uses SQL push back might not be consistent with IBM SPSS Modeler native on some platforms (Linux/zLinux, for example). The reason is that floating point is handled differently on different platforms.
Database requirements
For more information about which databases and ODBC drivers are supported and tested for use with IBM SPSS Modeler, see the product compatibility matrices on the corporate Support site at http://www.ibm.com/support.
You might gain more performance improvements by using database modeling.
ODBC driver setup
To ensure that time details (such as HH:MM:SS) are processed correctly when using SQL 2012 on Windows 32bit systems, when setting up your ODBC SQL Server Wire Protocol Driver, select both the Enable Quoted Identifiers and Fetch TWFS as Time options.