Writing SQL Queries

When using the Database node, you should pay special attention to any SQL queries that result in a dataset with duplicate column names. These duplicate names often prevent SQL optimization for any downstream nodes.

IBM® SPSS® Modeler uses nested SELECT statements to push back SQL for streams that use an SQL query in the Database source node. In other words, the stream nests the query specified in the Database source node inside of one or more SELECT statements generated during the optimization of downstream nodes. Therefore, if the result set of a query contains duplicate column names, the statement cannot be nested by the RDBMS. Nesting difficulties occur most often during a table join where a column with the same name is selected in more than one of the joined tables. For example, consider this query in the source node:

SELECT e.ID, e.LAST_NAME, d.*
FROM EMP e RIGHT OUTER JOIN
DEPT d ON e.ID = d.ID;

The query will prevent subsequent SQL optimization, since this SELECT statement would result in a dataset with two columns called ID.

In order to allow full SQL optimization, you should be more explicit when writing SQL queries and specify column aliases when a situation with duplicate column names arises. The statement below illustrates a more explicit query:

SELECT e.ID AS ID1, e.LAST_NAME, d.*
FROM EMP e RIGHT OUTER JOIN
DEPT d ON e.ID = d.ID;