Writing SQL Queries
When using the Database node, you should pay special attention to any SQL queries that result in a dataset with duplicate column names. These duplicate names often prevent SQL optimization for any downstream nodes.
IBM® SPSS® Modeler uses nested
SELECT
statements to push back SQL for streams that use an SQL query in the
Database source node. In other words, the stream nests the query specified in the Database source
node inside of one or more SELECT
statements generated during the optimization of
downstream nodes. Therefore, if the result set of a query contains duplicate column names, the
statement cannot be nested by the RDBMS. Nesting difficulties occur most often during a table join
where a column with the same name is selected in more than one of the joined tables. For example,
consider this query in the source node:
SELECT e.ID, e.LAST_NAME, d.*
FROM EMP e RIGHT OUTER JOIN
DEPT d ON e.ID = d.ID;
The query will prevent subsequent SQL optimization, since this
SELECT
statement would result in a dataset with two columns called
ID
.
In order to allow full SQL optimization, you should be more explicit when writing SQL queries and specify column aliases when a situation with duplicate column names arises. The statement below illustrates a more explicit query:
SELECT e.ID AS ID1, e.LAST_NAME, d.*
FROM EMP e RIGHT OUTER JOIN
DEPT d ON e.ID = d.ID;