Meet the experts: Terry Purcell on coding predicates in outer joins: Advanced outer join constructs

Part 2 of 2

In this second of two parts, Terry Purcell, an IBM Gold Consultant with Yevich, Lawson, and Associates, and the the winner of the Best Overall Speaker Award at the 2001 IDUG Asia Pacific conference in Canberra, Australia, offers more advice on coding predicates in outer joins.

Terry Purcell (terry_purcell@ylassoc.com), Senior Consultant, Yevich, Lawson and Associates

Terry Purcell is a Senior Consultant with YL&A (Yevich, Lawson & Associates) and is an industry recognized authority in DB2 SQL. Terry has presented at numerous conferences on complex SQL and published many articles on SQL performance including an IBM white paper entitled "Evolution of Star Joins." He is a regular author for the DB2 Performance Journal and frequent contributor to the DB2 List Server. Terry has been working with DB2 for over 10 years in both database administration and application development. He is also an IBM Certified Solutions Expert in DB2 V7 Database Administration for OS/390 and a member of the IBM Gold Consultants program for DB2. Terry can be reached at Terry_Purcell@ylassoc.com.



17 January 2002

© 2002 International Business Machines Corporation. All rights reserved.

Introduction

This is part two of my column in which I attempt to make it easier for you to understand and use the powerful outer join feature of the SQL language.

Part one provided a simple comparison of inner and outer joins, and also introduced new terminology for explaining the tables in the outer join operations. Finally, I discussed the different predicate types and at what stages DB2 ® is able to apply them.

In this part, I will provide background on coding predicates on the NULL-supplying table, how DB2 may simplify your query to improve performance, and what steps you can take to ensure that you get the results you expect.


Left and right outer join NULL-supplying table predicates

Outer join simplification

To recap some terminology that I introduced in Part 1:

  • The preserved row table is the table for which rows are preserved for an unmatched row in the join operation.
  • The NULL-supplying table is the table for which rows are supplied with NULLs for an unmatched row in the join.

Whereas the most overwhelming attribute of a WHERE clause predicate applied to the preserved-row table is that DB2 can apply the predicate before or after the join; a WHERE clause predicate applied to the NULL-supplying table has a very different impact, in that it causes DB2 to simplify the join if the predicate negates NULLs introduced by the outer join.

To explain what I mean by this, see Figure 13 , which shows an example of a WHERE clause predicate that references the NULL-supplying table.

Left Outer Join -- join simplification
Left Outer Join -- join simplification

A WHERE clause predicate must evaluate to TRUE for a row to qualify. If there is an unmatched row in the join, the columns from the NULL-supplying table are NULL. When DB2 compares the WHERE clause predicate D.DEPTNAME NOT LIKE '%CENTER%' to a NULL, the result is neither TRUE nor FALSE, but UNKNOWN. This row is not returned because it does not evaluate to TRUE. Thus, NULLs that are supplied by the left outer join are negated by the WHERE clause predicate. This causes DB2 to determine that a left outer join is unnecessary, and it causes the query to be rewritten as an inner join, which may or may not be what you intended when you coded the query.

The advantage to having DB2 rewrite the left outer as an inner join is that performance might improve. The predicate that is applied to table D can now be applied before the join, rather than after, since it now references a non-NULL-supplying table. Preserved-row and NULL-supplying are unrelated terms for inner joins; given that there will never be an unmatched row that is returned in the join, that is, both tables are non-NULL-supplying.

Preserving NULLs

If this outer join simplification does not give you the results you wanted -- that is, you require that the NULL (or optional) rows be returned -- then use OR D.DEPTNAME IS NULL to preserve the NULLs in the answer set.

An example of this is shown in Figure 14.

Left Outer Join: NULL preservation
Left Outer Join -- NULL preservation

DB2 must apply the WHERE clause predicate after the join, because it is not known until after the join whether the row has been matched (and thus the first part of the predicate -- D.DEPTNAME NOT LIKE '%CENTER%' -- applies) or unmatched (and thus the second part of the predicate -- OR D.DEPTNAME IS NULL -- applies).

Before-join predicates

What happens if you choose to code the WHERE clause predicate to be applied by DB2 upon the NULL-supplying table before the join?

If you do this, there is no WHERE clause predicate to limit rows on the preserved-row table or in the final result. You are only limiting rows from the NULL-supplying table. Figure 15 shows the result of this.

Predicates on NULL-supplying table
Predicates on NULL-supplying table

Figure 16 shows that if you recode the before-join predicate to an ON clause predicate, the same results are returned.

Predicates on NULL-supplying table: simplified
Predicates on NULL-supplying table -- simplified

Whether the rows from the NULL-supplying table are filtered before or during the join is a matter for performance only, and DB2 decides this based on the join method used. For producing the correct result, either method is valid. Because the row does not match in the join, removing the row before the join does not affect the output.

DB2 is able (from Version 6 on) to merge any unnecessary nested table expressions such as those on the NULL-supplying table ( Figure 15 ) to be rewritten as a single query block and applied as a during-join or before-join predicate ( Figure 16 ).


Full outer join NULL-supplying table predicates

Outer join simplification

The rule for outer join simplification that is applicable to left and right outer joins is also valid for full joins. A WHERE clause predicate applied to the NULL-supplying table that causes the NULLs to be negated causes DB2 to simplify the join. Because both tables supply NULLs, this clause negates NULLs regardless of which table it is applied to.

Figure 17 shows an example of a WHERE clause predicate applied to a full outer join, which causes DB2 to simplify the join.

Full Outer Join: join simplification
Full Outer Join -- join simplification

Without join simplification, the predicate is applied as a totally-after-join predicate. Given that the optimizer can determine that the predicate negates NULLs, it can rewrite the query as a left outer join. A full outer join can be rewritten as either a left or right outer join, depending on the table join sequence chosen.

Rewriting the query as a left outer join means that the WHERE predicate can now be applied as a before-join predicate, resulting in fewer rows being joined. If the WHERE clause predicates negate the NULLs from both tables, then simplification causes the query to be rewritten as an inner join. Executing the query as an inner join allows predicates on both tables to be applied before the join.

Join simplification can be identified by the JOIN_TYPE column of the explain (plan table) output. A value of 'L' indicates that the join has been simplified to a left outer join (as there is no runtime right outer join), and a 'blank' indicates an inner join.

Preserving NULLs

If you require that the NULLs from both tables be returned in the result set, alter the WHERE clause predicate to ensure that they are not removed.

Figure 18 shows an example of a WHERE clause predicate that does preserves NULLs. COALESCE returns the first value in the list that is not NULL. Thus, the WHERE comparison is always against an actual value (unless the column is defined to allow NULLs).

Full Outer Join: NULL preservation
Full Outer Join -- NULL preservation

In the example shown in Figure 18 , the WHERE clause predicate is applied totally after the join, because the WHERE predicate depends on columns from both tables.

Is there a better alternative?

Before-join predicates

You can get better performance by coding the query to apply these predicates before the join (as shown in Figure 19 ).

Full Outer Join: NULL preservation alternative
Full Outer Join -- NULL preservation alternative

Be careful when coding multiple outer joins

Missing rows

In Part one , I mentioned that obtaining the correct result is the major focus when writing any SQL statement. And with multiple joins involving outer joins, it is very easy to accidentally lose rows based upon the source of the join predicates.

Of course, I have already explained that columns from a NULL-supplied table will be NULL for an unmatched row. If a column from a NULL-supplied table is referenced as a join predicate in a subsequent join, then NULL will never equate, and thus a further match will not be made. Without interrogating the data, there may be no indication that these rows are missing, because it is possible that outer join simplification was not necessary by DB2. Outer join simplification is at least identifiable in the plan table (based on JOIN_TYPE column).

Figure 20 shows an example of a NULL-supplied column that is used in a subsequent join.

Multiple Outer Joins: Missing Rows
Multiple Outer Joins -- Missing Rows

Step 1 applies the before-join predicate. Step 2 performs the left outer join. The left outer join from the department and employee tables (step 2) does not produce a matching row in this example. Thus, when step 3 is performed (subsequent left outer join to the project table), the value of the join predicate from the previous join is NULL, because it was from the NULL-supplied table. The row is still preserved, however, because it is a left outer join. For an inner join, the row would not have been preserved.

Finding missing rows

If you ensure that subsequent join predicates always refer to columns from preserved-row tables, the true value will be available for the later join. This is an extremely important point to ensure that the correct rows are returned in the join.

Figure 21 shows an example whereby the second join refers to a join predicate from the preserved-row table.

Multiple Outer Joins: Finding Missing Rows
Multiple Outer Joins -- Finding Missing Rows

In this example, I have corrected the error from the previous example ( Figure 20 ). The resultant data from step 2 (first left outer join) contains the actual value 'D01' from the preserved row table. This value is used in step 3 (subsequent left outer join) to match against the project table, rather than against NULL.

The dependence on join sequence for this query requires that the department table be accessed first. The DB2 for z/OS TM optimizer (from Version 6 on) is able to determine the remaining join sequences based on the lowest cost, rather than the coding sequence. Each of the joins highlighted as step 2 and 3 depend on the department table only, and not on each other.

For full outer joins, it is more difficult to specify a join column from the preserved-row (or non-NULL-supplying) table, because both tables can supply NULLs. You must always specify the COALESCE clause to ensure that a preserved value is always specified by the join. For example:

ON P.DEPTNO = COALESCE (D.DEPTNO, E.WORKDEPT)


Nesting outer joins

Outer join simplification

I have described the situation in which a WHERE clause causes DB2 to simplify the outer join. But there is also a situation in which the ON clause can cause this simplification to occur, and that is when an outer join is nested within another outer join. This type of simplification performed by DB2 is only briefly explained in the DB2 Application Programming and SQL Guide .

Nesting joins appear to defy the following rule for coding an ON clause: the ON clause must immediately succeed the join for which it relates. This is shown in Figure 22 .

Outer Join Nesting -- join simplification
Outer Join Nesting -- join simplification

In Figure 22 , table D is"left joined" to the result of the full outer join of table P and E. Thus, the ON clause dictates that the join between P and E occurs first. This becomes more logical if you strategically place parentheses and indent the query for readability, such as:

SELECT D.DEPTNO, D.DEPTNAME, P.DEPTNO, P.PROJNAME, E.EMPNO

FROMDEPARTMENT D LEFT OUTER JOIN

(PROJECT P FULL OUTER JOIN EMPLOYEE EM

ONP.DEPTNO = E.WORKDEPT)

OND.DEPTNO = P.DEPTNO

ORDER BY D.DEPTNO

The parentheses more easily show that the full outer join is performed first, and that the resultant table is the right table of a left outer join.

After the full outer join (step 1), the result contains NULLs that are supplied from either the left or right table for any unmatched rows. Next, the Department table is left joined to that result set (step 2). For step 2, there are no rows in D that can equal the supplied NULLs, so these rows are not joined. The consequence is that the NULLs are negated by the final ON clause. Therefore, DB2 realizes that there is no requirement for a full join to be coded initially.

The full outer join from the previous example can be rewritten by the DB2 optimizer as a left outer join (as in Figure 23 ).

Outer Join Nesting: query rewrite
Outer Join Nesting -- query rewrite

Now there are no NULL-supplied rows negated by the ON clause.

Preserving NULLs

To avoid DB2 performing this simplification, or more importantly, to maintain all rows introduced by the full outer join from Figure 22 , use COALESCE to ensure that the resultant join column is not NULL.

Outer Join Nesting: NULL preservation
Outer Join Nesting -- NULL preservation

The presence of COALESCE ensures that the subsequent join column cannot be NULL. COALESCE dictates that when the column is NULL-supplied take the value from the preserved row table. For example, when the left table is preserved, the right is NULL-supplied, and vice versa.

Now, because of COALESCE, there aren't any NULLs introduced on the subsequent join column, and thus no rows are inadvertently negated by the ON clause.

Important: As I mentioned earlier, any subsequent table join after a full join must contain the COALESCE in the ON clause to ensure that NULL-supplied rows are not negated by the join.


Recapping what we've learned about predicates

From the examples shown in Part 1 and 2 of this article, we can draw the following conclusions about how DB2 evaluates predicates:

WHERE clause predicates:

  • Applied to a preserved-row table:
    • Filter rows as either:
      • Before-join predicates
      • After-join-step predicates
      • Totally-after-join predicates.
    • Explicitly coded in a nested table expression can be written in the WHERE clause.
  • Applied to a NULL-supplying table:
    • Optimizer simplifies the join, unless the predicate includes either OR colx IS NULL or COALESCE to preserve NULLs.
    • Explicitly coded in a nested table expression can be written in ON clause.

ON clause predicates:

  • Only filter rows on the NULL-supplying table as with during-join predicates.
  • Do not filter rows on the preserved-row table.

Conclusion

In this part of my discussion about outer joins I specifically focused coding predicates on the NULL-supplied table.

On the top of the list was outer join simplification, whereby a SQL coder inadvertently writes a WHERE or ON clause predicate that negates NULL-supplied rows. If a user explicitly codes the outer join, then it is unlikely that it was intentional to negate these rows. Of course, producing the correct result is the objective when coding any SQL statement.

With this newfound knowledge I hope that you are able to write your outer join queries to produce the result you expect, with the performance you desire. If nothing else I hope to have opened your eyes to the power of DB2's implementation of outer joins and given you something to think about when you code your queries.

Comments

developerWorks: Sign in

Required fields are indicated with an asterisk (*).


Need an IBM ID?
Forgot your IBM ID?


Forgot your password?
Change your password

By clicking Submit, you agree to the developerWorks terms of use.

 


The first time you sign into developerWorks, a profile is created for you. Information in your profile (your name, country/region, and company name) is displayed to the public and will accompany any content you post, unless you opt to hide your company name. You may update your IBM account at any time.

All information submitted is secure.

Choose your display name



The first time you sign in to developerWorks, a profile is created for you, so you need to choose a display name. Your display name accompanies the content you post on developerWorks.

Please choose a display name between 3-31 characters. Your display name must be unique in the developerWorks community and should not be your email address for privacy reasons.

Required fields are indicated with an asterisk (*).

(Must be between 3 – 31 characters.)

By clicking Submit, you agree to the developerWorks terms of use.

 


All information submitted is secure.

Dig deeper into Information management on developerWorks


static.content.url=http://www.ibm.com/developerworks/js/artrating/
SITE_ID=1
Zone=Information Management
ArticleID=13817
ArticleTitle=Meet the experts: Terry Purcell on coding predicates in outer joins: Advanced outer join constructs
publish-date=01172002