I have a profound statement to make about SQL skills - on average most DB2® developers are seven years behind. The reason I can make this claim is that I have traveled all over the globe in the past 12 months (almost 100,000 miles) and have spoken to over 5,000 DB2 developers on the subject of advanced SQL. From all those contacts, there hasn't been one person that did not learn one new SQL feature, or a use of and old SQL feature, from my presentation or seminar. The one feature that I find especially underutilized is table expressions. These are called in-line views on the non-OS/390 platforms and both features have been around for seven years. These are nothing more than SELECT statements that are nested in the FROM clause and are surrounded by parentheses. If they are used correctly, they can be very powerful. I don't blame developers for the lack of knowledge on this feature. I believe it is because they have been sent to a JavaTM class or two instead of DB2 Version x Transition classes.
Although table expressions seem like a simple feature syntactically, they can perform many practical and powerful processes including:
- Pre-filtering full outer joins.
- Pre-filtering NULL supplying tables in LEFT/RIGHT joins.
- Separating GROUP BY work.
- Generating data from existing sources.
- Using table expressions to help tune queries.
Some of these uses of table expressions you may be familiar with, others you may not. I will discuss examples of each table expression use to show off the power.
I am on a quest to update the SQL skills worldwide and I am confident that you will learn at least one new use of table expressions that will make an impact at your site, so please read on.
Pre-filtering full outer joins
Full outer joins have been enhanced for performance in recent years with work file removal and parallelism; however, the syntax did not receive the same magic that the left and right joins received, such as automatic predicate pushdown and predicate transitive closure. The full outer join shown in Figure 1 does not produce the expected results, e.g., give me all the orders within the specified date range whether or not they have descriptions less than 40 characters, and give me all the descriptions less than 40 characters whether or not there are orders for them:
Figure 1. This statement does not produce the expected results
SELECT COALESCE(O.ORDER_NO, 'Order Number not Available')
, COALESCE(D.DESC, 'Description not Available')
FROM ORDER O
FULL JOIN
DESCRIPTIONS D
ON O.ORDER_NO = D.ORDER_NO
WHERE O.ORDER_DATE BETWEEN '1999-01-01' AND CURRENT DATE
AND LENGTH(D.DESC < 40)
|
Instead, the results will look like an inner join because the syntax used in Figure 1 forces the local filtering to being applied in stage 2 -- in other words, after the join. This means that many rows will be participating in the join only to be eliminated in a later stage. To force the local filtering to be applied, before the join (stage 1), you can use table expressions, as shown in Figure 2.
Figure 2. Forcing filtering before the join to obtain the correct results
SELECT COALESCE(O.ORDER_NO, 'Order Number not Available')
, COALESCE(D.DESC, 'Description not Available')
FROM (SELECT O.ORDER
FROM ORDER O
WHERE O.ORDER_DATE BETWEEN '1999-01-01' AND CURRENT DATE) AS O
FULL JOIN
(SELECT D.DESC, D.ORDER_NO
FROM DESCRIPTIONS D
WHERE LENGTH(D.DESC < 40) AS D
ON O.ORDER_NO = D.ORDER_NO
|
The 'O' table expression explicitly pre-filters the outer table, and the 'D' table expression explicitly pre-filters the inner table. This syntax gives me all the orders within the specified date range whether or not they have descriptions less than 40 characters, and give me all the descriptions less than 40 characters whether or not there are orders for them. The before join conditions ensure that only final result rows will be processed in the join.
Pre-filtering null-supplying tables of left and right joins
For left and right joins, if the null-supplying table is filtered in the WHERE clause, then it's really an inner join, which probably isn't what you wanted when you coded the join. If you really want a left or right join, then you have two choices:
- Place the local filtering of the null-supplying table in the ON clause.
This will move the filtering to a during-join condition; however, it will not eliminate final result rows. During-join conditions on null-supplying tables only serve to "null out" (the row remains but is filled with nulls), the final result rows, including values from the protected table, that are predetermined by the before join conditions. See Terry Purcell's previous Meet the Expert column for excellent examples of during-join conditions.
- Place the local filtering of the null-supplying table in a table expression.
This will only "null out" rows from null-supplying table. The query in Figure 3 demonstrates the use of table expressions, and most likely, what developers need:
Figure 3. Using a table expression to force local filtering before the join
SELECT O.ORDER_NO
, COALESCE(D.DESC, 'Short Description not Available')
FROM ORDER O
LEFT JOIN
(SELECT D.DESC, D.ORDER_NO
FROM DESCRIPTIONS D
WHERE LENGTH(D.DESC) < 40) AS D
ON O.ORDER_NO = D.ORDER_NO
WHERE O.ORDER_DATE BETWEEN '1999-01-01' AND CURRENT DATE
|
The query in Figure 3 returns all the orders within the specified date range whether or not they have descriptions less than 40 characters.
If you develop on DB2 on platforms other than 0S/390 and z/0S, you can skip this section because your optimizer will either automatically rewrite the GROUP BY statement to separate the GROUP BY work and eliminate the sort, or you may have constructed automatic summary tables, ASTs, that can eliminate the sort. For all mainframers, you need to think about rewriting your own queries.
A typical report quite often has to supply detail information, as well as summary information. GROUP BY is a very picky clause in DB2:
- The first rule of GROUP BY is that you must SELECT what you are grouping by.
- The second rule is that you must aggregate everything else with a built-in column function.
This combination of rules usually forces you to join tables so that the detail and summary information can be merged in one result row as in the following query:
SELECT C.CUST_ID, MIN(C.CUST_NAME) AS CUST_NAME, MIN(C.CUST_PHONE)AS CUST_PHONE, SUM(S.SALES)AS TOTAL_SALES FROM CUSTOMER C, SALES S WHERE C.CUST_ID = S.CUST_ID AND S.SALES_DATE BETWEEN :date-lo AND :date-hi GROUP BY C.CUST_ID |
This usually means the detail rows that qualify from each table are joined prior to the GROUP BY processing. Let say in this example there were 1000 CUSTOMER rows joined to 200,000 SALES rows and then the GROUP BY was processed. Instead, you can use table expressions to force the optimizer to process the aggregations on a table-by-table basis. The query shown in Figure 4, explicitly influences the optimizer toward earlier GROUP BY processing:
Figure 4. Moving GROUP BY inside a table expression to reduce the number of rows joined
SELECT C.CUST_NAME, C.CUST_PHONE, S.TOTAL_SALES FROM CUSTOMER C, , (SELECT S.CUST_ID, SUM(S.SALES) AS TOTAL_SALES FROM SALES S WHERE S.SALES_DATE BETWEEN :date-lo AND :date-hi GROUP BY S.CUST_ID) AS S WHERE C.CUST_ID = S.CUST_ID |
The table expression 'S' in this example is explicitly separating the GROUP BY work. Quite often the aggregation is only needed on one or two tables and the detail comes from different tables. This syntax allows the optimizer to search for index support on the CUST_ID column of the SALES table to perform the GROUP BY. If available, the sort will be avoided. This greatly reduces the number of rows involved in the join regardless if a sort is performed. Instead of a 1000 CUSTOMER rows joined to 200,000 SALES rows in the previous example, it is now 1000 CUSTOMER rows joined to only 10,000 sales summary rows. The GROUP BY processing compresses the table expression result to summary rows.
Generating data from existing sources
By far the most powerful use of table expressions can be explained using a quote from a client, "When we need data not available from DB2, we use table expressions to manufacture it." The query shown in Figure 5 contains a table expression that calculates sales for groupings that are generated in the table expression.
Figure 5. Generating subcategory data in the table expression
SELECT C.SUBCATEGORY, SUM(S.SALES) AS S.TOTAL_SALES
FROM SALES S ,
(SELECT C.CUST_ID, SUBSTR(C.COLX, :hvstart, :hvlngth) AS SUBCATEGORY
FROM CUSTOMER C
WHERE C.CUST_REGION BETWEEN :reg-lo AND :reg-high) AS C
WHERE C.CUST_ID = S.CUST_ID
GROUP BY C.SUBCATEGORY
|
This query is effective for periodic reporting even if a sort is required. It can also be good for finding out if there is hidden meaning in column values. However, if end users are frequently requesting a common starting position and length, i.e. SUBSTR(C.COLX, 2, 4), that portion of the column should be added to the table as its own column. This would allow an index to be provided on that new column to avoid the sort.
Figure 6 is an example of using a table expression to generate the correct quarters for a quarterly report.
Figure 6. Generating subcategory data for a quarterly report
SELECT C.SUBCATEGORY, SUM(S.SALES) AS S.TOTAL_SALES FROM SALES S , (SELECT C.CUST_ID, (CASE WHEN MONTH(SALE_DT) BETWEEN 2 AND 4 THEN 'Q1' WHEN MONTH(SALE_DT) BETWEEN 5 AND 7 THEN 'Q2' WHEN MONTH(SALE_DT) BETWEEN 8 AND 10 THEN 'Q3' ELSE 'Q4' END) AS SUBCATEGORY FROM CUSTOMER C WHERE C.CUST_REGION BETWEEN :reg-lo AND :reg-high) AS C WHERE C.CUST_ID = S.CUST_ID GROUP BY C.SUBCATEGORY |
You are allowed to refer to the generated data anywhere outside the table expressions as long as the data you are generating is named using the AS identifier inside the table expression. This means that you can also order by that identifier, include it in calculations, and put it anywhere an expression can be placed. The example in Figure 6 can be extended to handle more options by using host variables, as shown in Figure 7.
Figure 7. Using host variables for extended flexibility
SELECT C.SUBCATEGORY, SUM(S.SALES) AS S.TOTAL_SALES FROM SALES S , (SELECT C.CUST_ID, (CASE WHEN MONTH(SALE_DT) BETWEEN :hv1a AND :hv1b THEN :hv1 WHEN MONTH(SALE_DT) BETWEEN :hv2a AND :hv2b THEN :hv2 WHEN MONTH(SALE_DT) BETWEEN :hv3a AND :hv3b THEN :hv3 WHEN MONTH(SALE_DT) BETWEEN :hv4a AND :hv4b THEN :hv4 WHEN MONTH(SALE_DT) BETWEEN :hv5a AND :hv5b THEN :hv5 WHEN MONTH(SALE_DT) BETWEEN :hv6a AND :hv6b THEN :hv6 WHEN MONTH(SALE_DT) BETWEEN :hv7a AND :hv7b THEN :hv7 WHEN MONTH(SALE_DT) BETWEEN :hv8a AND :hv8b THEN :hv8 WHEN MONTH(SALE_DT) BETWEEN :hv9a AND :hv9b THEN :hv9 WHEN MONTH(SALE_DT) BETWEEN :hv10a AND :hv10b THEN :hv10 WHEN MONTH(SALE_DT) BETWEEN :hv11a AND :hv11b THEN :hv11 ELSE :hv12 END) AS SUBCATEGORY FROM CUSTOMER C WHERE C.CUST_REGION BETWEEN :reg-lo AND :reg-high) AS C WHERE C.CUST_ID = S.CUST_ID GROUP BY C.SUBCATEGORY |
Now the query can handle endless requests for monthly groupings - What do you want to GROUP BY today? How about the first 4 months in one period, the next two in another, months 6 + 12 together and the remaining as the fourth grouping? Keep this in mind when building applications. It would take hundreds of views to emulate the same options. This is why at most of my clients, views have been moved out of production applications and allowed only in end user environments.
For even greater power and flexibility, add a user-defined function, as shown in Figure 8.
Figure 8. Using a UDF to enable access to nonrelational data
SELECT C.SUBCATEGORY, SUM(S.SALES) AS S.TOTAL_SALES FROM SALES S , (SELECT C.CUST_ID, UDFUNC2(:hvparm1, :hvparm2) AS SUBCATEGORY FROM CUSTOMER C WHERE C.CUST_REGION BETWEEN :reg-lo AND :reg-high) AS C WHERE C.CUST_ID = S.CUST_ID GROUP BY C.SUBCATEGORY |
Now the query can access data from any source available from a mainframe platform. For example, UDFUNC2 can access data in a flat file stored in distributed environment, pull it in to table expression 'C', call it SUBCATEGORY, and use GROUP BY on it. What do you really want to GROUP BY today? Think of using table expressions any time you need to pull in or create data that is not living and breathing natively in your DB2 environment but which needs to be included in a relational operation (join, union, subquery, ORDER BY, GROUP BY, etc.) and only for one execution of the query. For multiple executions or iterations, I would recommend global temporary tables (a topic for another day).
Using table expressions to help tune queries
Query tuning is a very large subject so I will narrow it down by just discussing the role that table expressions can play in tuning. First of all, moving views out of SYSIBM.SYSVIEWS and placing them in-line in the FROM clause helps query tuners explicitly see more of what the query is trying to accomplish. This is important if you are the one who is beeped at 0200 hours. The indirect benefit of the move is allowing for more extensibility with the use of host variables.
To demonstrate proactive tuning with table expressions, I will show you a query and walk you through the performance problems before I show the solution. This first query is the following 5-table join:
SELECT Columns .. FROM TABX, TABY, TABZ, TAB1, TAB2 WHERE join conditions |
The problem with the previous query is the table join sequence. The optimizer analyzes the numbers of rows participating in the join from each table, costs out different sequencing and join methods, and keeps starting with the TABX, TABY, and TABZ combination. The desired join sequence starts with TAB1 and TAB2. There are many ways to influence join sequencing for static queries; however, dynamic queries are not so lucky. Table expressions to the rescue! The table expression works by forcing the optimizer to pre-join tables through the use of the DISTINCT keyword, which also requests a sort, as follows:
SELECT Columns
FROM TABX, TABY, TABZ
,(SELECT DISTINCT COL1, COL2
..
FROM TAB1, TAB2
WHERE join conditions) AS PREJOIN
WHERE remaining join conditions
|
This query now:
- Joins TAB1 and TAB2.
- Sorts to remove possible duplicates.
- Processes the other tables.
The only drawback to this technique is the added cost of the sort and creation and scanning of a logical work file. Most of the time, this drawback is compensated for with the optimal join sequence.
This next scenario is more likely on a non-OS/390 platform, but the solution can work for all platforms. In the example below, the optimizer chooses to materialize the entire table expression T2 and then joins to T1 using a merge scan join technique:
SELECT Columns from either set of data
FROM table1 T1,
(SELECT T2.STUDENT_YEAR, SUM(C8) AS SUM8, MAX(C9) AS MAX9
FROM table2 T2
GROUP BY T2.STUDENT_YEAR) AS T2
WHERE T1.MAJOR_ID = T2.MAJOR_ID
|
The solution requires that you insert the keyword TABLE just before the table expression. This enables the table expression to be correlated to the outer table. The correlation influences the optimizer toward nested loop joins and can greatly reduce the materialization. In the following example, a table expression was made correlated by moving the join condition inside the table expression:
SELECT Columns from either set of data
FROM table1 T1,
TABLE(SELECT T2.STUDENT_YEAR, SUM(C8) AS SUM8, MAX(C9) AS MAX9
FROM table2 T2
WHERE T1.MAJOR_ID = T2.MAJOR_ID
GROUP BY T2.STUDENT_YEAR) AS T2
|
The query now processes one T1.MAJOR_ID at a time. Only the rows that qualify for that MAJOR_ID are materialized in a logical work file. The smaller the optimizer thinks the work file will be, the more likely nested loop join will be chosen. Be alert if you try this on DB2 for OS/390 or z/OSTM, because this solution does not always improve performance. Check EXPLAIN output, use estimation tools, or run benchmarks to verify the success of this tuning effort.
If your goal is reduced program code and increased processing throughput, the powerful but practical table expressions can really help. It takes much less program code when a table expression can be used to generate new data and stay in the DB2 engine to perform joining, grouping, ordering and calculations, instead of declaring multiple cursors. Increasing a query's extensibility is easy when host variables are added to table expressions. Tuning queries, separating the GROUP BY work, and pre-filtering tables for outer joins, all help to increase processing throughput. Take your pick. The technology has been there for a while but it is up to you to take advantage of it. I hope you will assist me on my quest to update SQL skills worldwide by spreading your new discoveries throughout your site.

Sheryl Larsen is an internationally recognized researcher, consultant and lecturer, specializing in DB2 and is known for her extensive expertise in SQL. Sheryl has over 16 years experience in DB2, has published many articles, several popular DB2 posters, and co-authored a book, DB2 Answers, published by Osborne-McGraw-Hill. She was voted Best Overall Speaker at the 1999 & 2001 IDUG conferences and the Executive Editor of the IDUG Solutions Journal magazine 1997-2000. Currently, she is President of the Midwest Database Users Group (mwdug.org), a member of IBM's DB2 Gold Consultants program, and is President of Sheryl M. Larsen, Inc.(www.smlsql.com), a firm specializing in Advanced SQL Consulting and Education. She can be contacted at SherylMLarsen@cs.com.
Comments (Undergoing maintenance)





