Further understand column group statistics in DB2

Leverage the extended use of multi-column statistics in DB2 9.5 to improve cardinality estimates

With multi-column statistics in IBM® DB2® for Linux®, UNIX®, and Windows® (DB2), the optimizer can determine a better query access plan and improve query performance when there is correlation between multiple predicates. In this article, learn how to use multi-column statistics to take advantage of the enhancements to the optimizer in DB2 9.5 that extend their use to a broader range of predicates.

Samir Kapoor, DB2 Advanced Support Analyst, IBM

Samir Kapoor photoSamir Kapoor is an IBM Certified Advanced Technical Expert for DB2. Samir currently works with the DB2 UDB Advanced Support -- Down system division (DSD) team and has in-depth knowledge in the engine area.



Vincent Corvinelli, DB2 Optimizer Developer, IBM

Vincent Corvinelli is an advisory software developer in the DB2 Query Optimizer Development team at the IBM Toronto Lab.



04 September 2008

Also available in Chinese

Introduction

The article "Understand column group statistics in DB2" (developerWorks, December 2006) describes the importance of collecting column group statistics and how the DB2 SQL Optimizer (referred to as optimizer hereafter) makes use of these multi-column statistics to detect a statistical correlation between two or more local or join equality predicates. In DB2 9.5, the optimizer further extended the use of multi-column statistics to a broader range of predicates.

The optimizer depends on accurate cardinality estimates to properly compute the cost of each query access plan considered. Cardinality estimation is a process by which the optimizer uses statistics to determine the size of partial query results after predicates are applied or aggregation is performed. At each operator in the access plan, the optimizer estimates the cardinality output from the operator. The application of one or more predicates may reduce the output stream cardinality.

It is common practice to assume the predicates are independent of each other when computing their combined filtering effect on the cardinality estimate. However, the predicates can be statistically correlated. Treating multiple predicates independently typically results in the optimizer under-estimating the cardinality. Under-estimating the cardinality could lead the optimizer to choose a sub-optimal access plan.

The optimizer considers using multi-column statistics to detect a statistical correlation and estimate more accurately the combined filtering effect of multiple predicates. This article describes how the optimizer makes use of multi-column statistics to detect a statistical correlation and estimate more accurately the combined filtering effect of multiple predicates for SQL statements that apply at least two local IN, OR, and equality predicates, and the filtering effect of predicates for SQL statements that apply some classes of OR predicates. "Understand column group statistics in DB2" describes how the optimizer makes use of multi-column statistics to detect a correlation between two or more local equality predicates and for the join of two or more tables that apply at least two equality join predicates between the pair of tables. The RUNSTATS command options, as described in that article, are used in the same manner, so those command options will not be described in this article.


Statistical correlation of multiple local equality and local IN predicates

If the WHERE clause of an SQL statement applies multiple predicates, as follows:

    	        C1=? AND C2 IN ( ?, ?, ? )

and multi-column statistics on (C1, C2) are collected, then the optimizer attempts to detect a statistical correlation between the predicates in order to improve the cardinality estimates. This does not apply to:

  • Join predicates with IN or OR operators
  • Local predicates with inequality, LIKE, or IS NULL operators
  • Predicates with subqueries

The C1=? predicate is an example of a local equality predicate, which is an equality predicate applied to a single table and is described as follows:

	        COLUMN = literal

where the literal can be any one of these:

  • A constant value
  • A parameter marker or host variable
  • A special register (for example, CURRENT DATE)

The C2 IN ( ?, ?, ? ) predicate is an example of a local IN predicate, which is a predicate applied to the same single table that the equality predicate is applied to, and is described as follows:

          COLUMN IN ( <VALUE LIST> )

where the <VALUE LIST> is a comma separated list of one or more literals, as described for the local equality predicate.

An OR predicate that is equivalent to an IN predicate can be specified in the SQL statement instead of the IN predicate, and the optimizer will treat it in the same manner when accounting for statistical correlation; that is,

    	    COL IN  ( literal_1, literal_2, ..., literal_n )

is equivalent to

          COL=literal_1 OR COL=literal_2 OR ... OR COL=literal_n

The following are some examples for which the optimizer tries to detect a correlation between local IN, OR, and equality predicates:

a) COL_1 IN ( <VALUE LIST> ) AND COL_2=literal AND COL_3=literal
b) (COL_1=literal_1 OR COL_1=literal_2 OR ... OR COL_1=liternal_n) AND COL_2=literal AND ... AND COL_m=literal
c) COL_1 IN ( <VALUE LIST> ) AND COL_2 IN ( <VALUE LIST> ) AND ... AND COL_m IN ( <VALUE LIST> )
d) (COL_1=literal_1 OR COL_1=literal_2) AND (COL_2=literal_1 OR COL_2=literal_2) AND ... AND (COL_m=literal_1 OR COL_M=literal_2)
e) COL_1 IN ( <VALUE LIST> ) AND ... And COL_m IN ( <VALUE LIST> ) AND COL_1_2=literal AND ... AND COL_1_k=literal
f) (COL_1=literal_1 OR COL_1=literal_2) AND COL_2=literal AND COL_3=literal
g) (C)L_1=literal_1 OR COL_1=literal_2) AND (COL_2=literal_1 OR COL_2=literal_2) AND COL_3=literal

The following are some examples of predicates that are not considered for statistical correlation detection by the optimizer:

a) (COL_1=literal AND COL_2=literal) OR (COL_1=literal AND COL_2=literal AND COL_3=literal)
b)((COL_1=literal AND COL_2=literal) OR (COL_1=literal AND COL_2=literal)) AND COL_3=literal
c)( COL_1 IN ( <VALUE LIST> ) OR (COL_2 IN ( <VALUE LIST> ) ) AND COL_3=literal

Example 1: C1 IN ( <VALUE LIST> ) AND C2 = literal

Note: Please replace SKAPOOR with your own schema in all the examples described in this article.

These examples were tested in the following environment, using the SAMPLE database, which can be created by executing db2sampl:

Listing 1. Testing environment for samples
      DB21085I  Instance "skapoor" uses "64" bits and DB2 code release "SQL09051"
      with level identifier "03020107".
      Informational tokens are "DB2 v9.5.0.1", "s080328", "U814639", and Fix Pack"1".
      Product is installed at "/home2/skapoor/sqllib".

      Configuration:  (as displayed by the db2exfmt tool)

        Database Context:
        ----------------
             Parallelism:            None
             CPU Speed:              4.000000e-05
             Comm Speed:             100
             Buffer Pool size:       1000
             Sort Heap size:         256
             Database Heap size:     1200
             Lock List size:         100
             Maximum Lock List:      10
             Average Applications:   1
             Locks Available:        640

        Package Context:
        ---------------
             SQL Type:               Dynamic
             Optimization Level:     5
             Blocking:               Block All Cursors
             Isolation Level:        Cursor Stability


        STMTHEAP: (Statement heap size)
                  6402

Consider the following query on the EMPLOYEE table in the SAMPLE database:

Listing 2. Query on the EMPLOYEE table in the SAMPLE database
      SELECT FIRSTNME, LASTNAME, JOB, WORKDEPT, SALARY 
        FROM EMPLOYEE 
       WHERE JOB IN ('CLERK', 'SALESREP') AND 
	           WORKDEPT = 'A00' 
      ORDER BY JOB, SALARY

It returns four records from the EMPLOYEE table:

Listing 3. Records returned from the EMPLOYEE table
FIRSTNME     LASTNAME        JOB      WORKDEPT SALARY
------------ --------------- -------- -------- -----------
GREG         ORLANDO         CLERK    A00         39250.00
SEAN         O'CONNELL       CLERK    A00         49250.00
DIAN         HEMMINGER       SALESREP A00         46500.00
VINCENZO     LUCCHESSI       SALESREP A00         66500.00

4 record(s) selected.

The EXPLAIN tool, which requires the existence of the EXPLAIN tables, can be used to view the query access plan chosen by the optimizer. To create the EXPLAIN tables, execute:

      db2 -tvf $DB2PATH/misc/EXPLAIN.DDL

When the SAMPLE database is initially created, statistics are not collected on the tables. To collect statistics on the EMPLOYEE table, the RUNSTATS tool can be used. The following RUNSTATS command collects statistics on each column, including distribution statistics, and detailed statistics on all indexes defined in the EMPLOYEE table, if any:

RUNSTATS ON TABLE SKAPOOR.EMPLOYEE 
    WITH DISTRIBUTION AND DETAILED INDEXES ALL

Once the EXPLAIN tables are created and the statistics are collected, the SET CURRENT EXPLAIN MODE statement can be used to insert the query access plan details for one or more statements into the EXPLAIN tables, as follows:

Listing 4. Insert the query access plan details into the EXPLAIN tables
SET CURRENT EXPLAIN MODE EXPLAIN;

SELECT FIRSTNME, LASTNAME, JOB, WORKDEPT, SALARY 
  FROM EMPLOYEE 
  WHERE JOB IN ('CLERK', 'SALESREP') AND 
         WORKDEPT = 'A00' 
  ORDER BY JOB, SALARY;
     
SET CURRENT EXPLAIN MODE NO;

The db2exfmt tool reads the data in the EXPLAIN tables, and formats the query access plan in a text file:

db2exfmt -d SAMPLE -1 -g -o exfmt_example1.out

The file exfmt_example1.out contains a query access plan similar to the following, with an estimated cardinality of 1:

Listing 5. Query access plan
                  Rows
                 RETURN
                 (   1)
                  Cost
                   I/O
                   |
                 1.19048
                 TBSCAN
                 (   2)
                 10.7902
                    1
                   |
                 1.19048
                 SORT
                 (   3)
                 10.7387
                    1
                   |
                 1.19048
                 FETCH
                 (   4)
                 10.6299
                    1
                /---+---\
              5           42
           IXSCAN   TABLE: SKAPOOR
           (   5)      EMPLOYEE
           2.27828
              0
             |
             42
       INDEX: SKAPOOR
            XEMP2

The cardinality estimate of 1 does not match the actual result of 4. The optimizer assumes the two predicates are independent because relevant index or column group statistics do not exist. The RUNSTATS tool can be used to collect column group statistics on the group (JOB,WORKDEPT) to provide the optimizer with the appropriate information to detect a statistical correlation, if any, between the two columns:

RUNSTATS ON TABLE SKAPOOR.EMPLOYEE ON ALL COLUMNS 
	AND COLUMNS ((JOB,WORKDEPT)) WITH DISTRIBUTION 
	AND DETAILED INDEXES ALL

After repeating the above steps to explain the query again to generate the query access plan, the optimizer computes a better cardinality estimate as a result of collecting column group statistics on the two columns:

Listing 6. Query access plan, with better cardinality estimate
                  Rows
                 RETURN
                 (   1)
                  Cost
                   I/O
                   |
                    5
                 TBSCAN
                 (   2)
                 10.8458
                    1
                   |
                    5
                 SORT
                 (   3)
                 10.7944
                    1
                   |
                    5
                 FETCH
                 (   4)
                 10.6299
                    1
                /---+---\
              5           42
           IXSCAN   TABLE: SKAPOOR
           (   5)      EMPLOYEE
           2.27828
              0
             |
             42
       INDEX: SKAPOOR
            XEMP2

The cardinality estimate is slightly higher than the actual value of 4 since the column group statistic is a uniform distribution statistic. You may have noticed that the query access plan itself did not change with the increase in cardinality estimate. The examples described in this article are simple in order to illustrate how to improve the cardinality estimate. Statements involving larger tables and joins of two or more tables are more likely to exhibit a change in query access plan as a result of the improved cardinality estimate.

Example 2: C1 IN ( <VALUE LIST> ) AND C2 IN ( <VALUE LIST> )

This example illustrates the effect of column group statistics on two IN predicates. Consider the following query that retrieves the bonus and salaries for managers and designers in certain departments:

Listing 7. Bonus and salaries query
      SELECT FIRSTNME, LASTNAME, WORKDEPT, JOB, BONUS, SALARY
        FROM EMPLOYEE
       WHERE WORKDEPT IN ('D11','D21') AND 
	           JOB IN ('MANAGER','DESIGNER') 
      ORDER BY WORKDEPT, SALARY

This query returns 12 records from the EMPLOYEE table:

Listing 8. Records returned from the EMPLOYEE table
    FIRSTNME     LASTNAME        WORKDEPT JOB      BONUS       SALARY
    ------------ --------------- -------- -------- ----------- -----------
    MASATOSHI    YOSHIMURA       D11      DESIGNER      500.00    44680.00
    JENNIFER     LUTZ            D11      DESIGNER      600.00    49840.00
    JAMES        WALKER          D11      DESIGNER      400.00    50450.00
    MARILYN      SCOUTTEN        D11      DESIGNER      500.00    51340.00
    BRUCE        ADAMSON         D11      DESIGNER      500.00    55280.00
    DAVID        BROWN           D11      DESIGNER      600.00    57740.00
    ELIZABETH    PIANKA          D11      DESIGNER      400.00    62250.00
    KIYOSHI      YAMAMOTO        D11      DESIGNER      500.00    64680.00
    WILLIAM      JONES           D11      DESIGNER      400.00    68270.00
    REBA         JOHN            D11      DESIGNER      600.00    69840.00
    IRVING       STERN           D11      MANAGER       500.00    72250.00
    EVA          PULASKI         D21      MANAGER       700.00    96170.00

      12 record(s) selected.

First, examine the query access plan and cardinality estimates without the column group statistic on (JOB,WORKDEPT). This is accomplished by executing another RUNSTATS command on the EMPLOYEE table as follows:

RUNSTATS ON TABLE SKAPOOR.EMPLOYEE 
   WITH DISTRIBUTION AND DETAILED INDEXES ALL

The previous statistics collected are cleared by the latest RUNSTATS command, so the column group statistics collected earlier are no longer kept. Generating the query access plan using EXPLAIN and the db2exfmt tool, as in Example 1, you can examine the estimated cardinality by the optimizer:

Listing 9. Insert the query access plan details into the EXPLAIN tables
      SET CURRENT EXPLAIN MODE EXPLAIN;

      SELECT FIRSTNME, LASTNAME, WORKDEPT, JOB, BONUS, SALARY
        FROM EMPLOYEE
       WHERE WORKDEPT IN ('D11','D21') AND 
	           JOB IN ('MANAGER','DESIGNER') 
      ORDER BY WORKDEPT, SALARY
      
      SET CURRENT EXPLAIN MODE NO;


      db2exfmt -d SAMPLE -1 -g -o exfmt_example2.out

The file exfmt_example2.out should contain a query access plan similar to the following, with an estimated cardinality of 7:

Listing 10. Query access plan
                Rows
               RETURN
               (   1)
                Cost
                 I/O
                 |
               7.28572
               TBSCAN
               (   2)
               13.7066
                  1
                 |
               7.28572
               SORT
               (   3)
               13.5723
                  1
                 |
               7.28572
               NLJOIN
               (   4)
               13.1318
                  1
           /------+------\
         2               3.64286
      TBSCAN             FETCH
      (   5)             (   6)
       0.006             11.0934
         0                  1
        |               /---+---\
         2            9           42
 TABFNC: SYSIBM    IXSCAN   TABLE: SKAPOOR
      GENROW       (   7)      EMPLOYEE
                   2.55364
                      0
                     |
                     42
               INDEX: SKAPOOR
                    XEMP2

In the query access plan shown in Listing 9, notice a join between the table EMPLOYEE and a table function, GENROW. When an IN predicate (or an equivalent OR predicate) is used, the optimizer considers an IN-to-JOIN transformation, converting the IN predicate to a join predicate. The GENROW table function produces the values listed in the <VALUE LIST> of the IN predicate. When the IN predicate is used in its join form, the optimizer still considers it for statistical correlation detection.

The cardinality estimate of 7 does not match the actual result of 12. As in Example 1, collecting column group statistics on the columns (JOB,WORKDEPT) provides the necessary information for the optimizer to account for a statistical correlation when computing the combined filtering effect of the two IN predicates:

RUNSTATS ON TABLE SKAPOOR.EMPLOYEE 
   ON ALL COLUMNS AND COLUMNS ((JOB,WORKDEPT)) 
   WITH DISTRIBUTION AND DETAILED INDEXES ALL

After repeating the above steps to explain the query again to generate the query access plan, the optimizer computes a better cardinality estimate that is very close to the actual result:

Listing 11. Query access plan with more accurate cardinality estimate
                Rows
               RETURN
               (   1)
                Cost
                 I/O
                 |
                11.2
               TBSCAN
               (   2)
               13.9768
                  1
                 |
                11.2
               SORT
               (   3)
               13.8033
                  1
                 |
                11.2
               NLJOIN
               (   4)
               13.1318
                  1
           /------+------\
         2                 5.6
      TBSCAN             FETCH
      (   5)             (   6)
       0.006             11.0934
         0                  1
        |               /---+---\
         2            9           42
 TABFNC: SYSIBM    IXSCAN   TABLE: SKAPOOR
      GENROW       (   7)      EMPLOYEE
                   2.55364
                      0
                     |
                     42
               INDEX: SKAPOOR
                    XEMP2

Example 3: C1 IN ( <VALUE LIST> ) AND C2 IN ( <VALUE LIST> ) AND C3=literal

In this example, you add a third predicate to the query in Example 2 to determine which employees received a bonus of $500:

Listing 12. Add a third predicate to find $500 bonus
SELECT FIRSTNME, LASTNAME, WORKDEPT, JOB, BONUS, SALARY
  FROM EMPLOYEE
  WHERE WORKDEPT IN ('D11','D21') AND 
        JOB IN ('MANAGER','DESIGNER') AND
         BONUS = 500
  ORDER BY WORKDEPT, SALARY

This query returns five records from the EMPLOYEE table:

Listing 13. Records returned from EMPLOYEE table
FIRSTNME     LASTNAME        WORKDEPT JOB      BONUS       SALARY
------------ --------------- -------- -------- ----------- -----------
MASATOSHI    YOSHIMURA       D11      DESIGNER      500.00    44680.00
MARILYN      SCOUTTEN        D11      DESIGNER      500.00    51340.00
BRUCE        ADAMSON         D11      DESIGNER      500.00    55280.00
KIYOSHI      YAMAMOTO        D11      DESIGNER      500.00    64680.00
IRVING       STERN           D11      MANAGER       500.00    72250.00

  5 record(s) selected.

If you re-collect the statistics without the column group statistics using:

RUNSTATS ON TABLE SKAPOOR.EMPLOYEE 
   WITH DISTRIBUTION AND DETAILED INDEXES ALL

A query access plan similar to the following is chosen by the optimizer, with a cardinality estimate of 2:

Listing 14. Query access plan
                Rows
               RETURN
               (   1)
                Cost
                 I/O
                 |
               2.42857
               TBSCAN
               (   2)
               13.8494
                  1
                 |
               2.42857
               SORT
               (   3)
               13.7636
                  1
                 |
               2.42857
               NLJOIN
               (   4)
               13.5765
                  1
           /------+------\
         2               1.21429
      TBSCAN             FETCH
      (   5)             (   6)
       0.006             11.3158
         0                  1
        |               /---+---\
         2            9           42
 TABFNC: SYSIBM    IXSCAN   TABLE: SKAPOOR
      GENROW       (   7)      EMPLOYEE
                   2.55364
                      0
                     |
                     42
               INDEX: SKAPOOR
                    XEMP2

With three predicates applied in the WHERE clause, assuming they are independent results in the optimizer underestimating the cardinality. To illustrate how the optimizer can use index statistics, as well as column group statistics, to detect a statistical correlation, create an index covering the three columns (JOB,WORKDEPT,BONUS) that are referenced in the predicates, and collect statistics:

Listing 15. Create index and collect statistics
CREATE INDEX JOB_DEPT_BONUS ON EMPLOYEE (JOB,WORKDEPT,BONUS)
      
-- The RUNSTATS command provides the option to collect statistics on a set of
-- indexes only, without affecting the statistics previously collected.
RUNSTATS ON TABLE SKAPOOR.EMPLOYEE F
OR DETAILED INDEXES SKAPOOR.JOB_DEPT_BONUS

With the new index created, and statistics collected on it, the optimizer corrects the cardinality estimate of the query access plan:

Listing 16. A corrected cardinality estimate from the query access plan
                Rows
               RETURN
               (   1)
                Cost
                 I/O
                 |
                5.25
               TBSCAN
               (   2)
               13.5227
                  1
                 |
                5.25
               SORT
               (   3)
               13.4087
                  1
                 |
                5.25
               NLJOIN
               (   4)
               13.0875
                  1
           /------+------\
         2                2.625
      TBSCAN             FETCH
      (   5)             (   6)
       0.006             11.0713
         0                  1
        |               /---+---\
         2          2.625         42
 TABFNC: SYSIBM    IXSCAN   TABLE: SKAPOOR
      GENROW       (   7)      EMPLOYEE
                   2.85933
                      0
                     |
                     42
               INDEX: SKAPOOR
               JOB_DEPT_BONUS

Example 4: (C1=literal OR C1=literal2) AND (C2=literal OR C2=literal2) AND C3=literal

This example is equivalent to Example 3, using equivalent OR predicates to replace the IN predicates:

Listing 17. Equivalent OR predicates to replace the IN predicates
      SELECT FIRSTNME, LASTNAME, WORKDEPT, JOB, BONUS, SALARY
        FROM EMPLOYEE
       WHERE (WORKDEPT = 'D11' OR WORKDEPT = 'D21') AND 
	           (JOB = 'MANAGER' OR JOB = 'DESIGNER') AND
	           BONUS = 500
      ORDER BY WORKDEPT, SALARY

This query returns the same result set as in Example 3. This example illustrates the effect that partial statistics have on the ability of the optimizer to estimate the cardinality. Drop the index created in Example 3 and re-collect the statistics with column group statistics on the group ((JOB,WORKDEPT)) only:

DROP INDEX JOB_DEPT_BONUS
RUNSTATS ON TABLE SKAPOOR.EMPLOYEE 
   ON ALL COLUMNS AND COLUMNS 
     ((JOB,WORKDEPT)) 
   WITH DISTRIBUTION AND DETAILED INDEXES ALL

With column group statistics collected on a subset of the columns referenced by the eligible IN, OR, and equality predicates, the optimizer estimates a cardinality that is close to the actual result, but not as accurate as shown in Example 3 when column group statistics were collected on all three columns:

Listing 18. Query access plan
                Rows
               RETURN
               (   1)
                Cost
                 I/O
                 |
               3.73333
               TBSCAN
               (   2)
               13.9174
                  1
                 |
               3.73333
               SORT
               (   3)
               13.8186
                  1
                 |
               3.73333
               NLJOIN
               (   4)
               13.5765
                  1
           /------+------\
         2               1.86667
      TBSCAN             FETCH
      (   5)             (   6)
       0.006             11.3158
         0                  1
        |               /---+---\
         2            9           42
 TABFNC: SYSIBM    IXSCAN   TABLE: SKAPOOR
      GENROW       (   7)      EMPLOYEE
                   2.55364
                      0
                     |
                     42
               INDEX: SKAPOOR
                    XEMP2

The optimizer used the column group statistic on (JOB,WORKDEPT) to account for a statistical correlation between the two OR predicates, but without including BONUS in the column group, it considered the BONUS=500 predicate as independent of the two OR predicates, resulting in the slightly underestimated final cardinality.

Note: if you analyze the Optimized Statement section of the db2exfmt output for the above query, you may notice that the OR predicates were converted to their equivalent IN predicates:

Listing 19. OR predicates converted to their equivalent IN predicates
Optimized Statement:
-------------------
SELECT Q5.FIRSTNME AS "FIRSTNME", Q5.LASTNAME AS "LASTNAME", Q5.WORKDEPT AS
        "WORKDEPT", Q5.JOB AS "JOB", +0000500.00 AS "BONUS", Q5.SALARY AS
        "SALARY"
FROM SKAPOOR.EMPLOYEE AS Q5
WHERE (Q5.BONUS = +0000500.00) AND Q5.JOB IN ('MANAGER ', 'DESIGNER') AND
        Q5.WORKDEPT IN ('D11', 'D21')
ORDER BY Q5.WORKDEPT, Q5.SALARY

Collecting column group statistics on all three columns result in the same cardinality estimate as in Example 3. In this case, you still collect the column group statistics on the previous two columns (JOB,WORKDEPT) and include the full set of three columns (JOB,WORKDEPT,BONUS):

RUNSTATS ON TABLE SKAPOOR.EMPLOYEE 
   ON ALL COLUMNS AND COLUMNS
    ((JOB,WORKDEPT), (JOB,WORKDEPT,BONUS)) 
  WITH DISTRIBUTION AND DETAILED INDEXES ALL

As described in "Understand column group statistics in DB2", you can gather one or more column group statistics between the same sets of columns. The query access plan produced after collecting these statistics is the same as the final plan in Example 3. It is left as an exercise for you to verify this is the case.

Example 5: Index oring

This example illustrates how collecting column group statistics can also improve the cardinality estimate of index oring access plans. Consider the following query on the EMPLOYEE table that retrieves all clerks and sales representatives that belong to department A00:

Listing 20. Query on the EMPLOYEE table that retrieves all clerks and sales representatives that belong to department A00
      SELECT FIRSTNME, LASTNAME, JOB, WORKDEPT, SALARY
        FROM EMPLOYEE
       WHERE JOB IN ('CLERK', 'SALESREP') AND
             WORKDEPT='A00'
      ORDER BY JOB, SALARY

This query returns four records from the EMPLOYEE table:

Listing 21. Query returns four records
    FIRSTNME     LASTNAME        JOB      WORKDEPT SALARY
    ------------ --------------- -------- -------- -----------
    GREG         ORLANDO         CLERK    A00         39250.00
    SEAN         O'CONNELL       CLERK    A00         49250.00
    DIAN         HEMMINGER       SALESREP A00         46500.00
    VINCENZO     LUCCHESSI       SALESREP A00         66500.00

      4 record(s) selected.

To better illustrate the improvement in cardinality estimation, drop all the existing indexes on the EMPLOYEE table except the primary key index:

      DROP INDEX XEMP2

and create the following index that includes both columns referenced by predicates in the WHERE clause of the above query, separated by the SALARY column:

 CREATE INDEX IND2 ON EMPLOYEE (JOB,SALARY,WORKDEPT)

Statistics are re-collected on the EMPLOYEE table and its new and remaining indexes:

RUNSTATS ON TABLE SKAPOOR.EMPLOYEE 
    WITH DISTRIBUTION AND DETAILED INDEXES ALL

In this example, the optimizer is forced to choose an index oring table access operation by using the optimization profile feature. To do so, the optimization profile includes two optimizer guidelines:

  1. A guideline to disable the transformation of the IN predicate to a join
  2. A guideline to force the optimizer to choose the index oring operation to access the EMPLOYEE table

The first step in creating the optimization profile is to create an XML file, called example5.xml, that contains the following contents:

Listing 22. XML file contents
<?xml version="1.0" encoding="UTF-8"?>

<OPTPROFILE VERSION="9.5.1">
  <STMTPROFILE ID="Example 5 Index oring test">
    <STMTKEY>
    <![CDATA[SELECT FIRSTNME, LASTNAME, JOB, WORKDEPT, SALARY
        FROM EMPLOYEE
       WHERE JOB IN ('CLERK', 'SALESREP') AND
             WORKDEPT='A00'
      ORDER BY JOB, SALARY]]>
   </STMTKEY>

    <OPTGUIDELINES>
      <INLIST2JOIN OPTION="DISABLE" TABLE="EMPLOYEE" COLUMN="JOB"/>
      <IXOR TABLE="EMPLOYEE" INDEX="IND2"/>
    </OPTGUIDELINES>
  </STMTPROFILE>
</OPTPROFILE>

The second step involves creating a del file, called example5.del, that contains the following contents:

 "SKAPOOR","IXORPLAN","example5.xml"

where SKAPOOR is the schema for the profile, IXORPLAN is the name you associated to the profile, and example5.xml is the XML file created in the first step, which contains the contents describing the profile.

The third step requires placing both the example5.xml and the example5.del files in the same location and issuing the following commands:

Listing 23. Commands to use with example5.xml and example5.del
-- Create the OPT_PROFILE table, if it does not already exist
CREATE TABLE SYSTOOLS.OPT_PROFILE (
  SCHEMA VARCHAR(128) NOT NULL,
  NAME VARCHAR(128) NOT NULL,
  PROFILE BLOB (2M) NOT NULL,
    PRIMARY KEY ( SCHEMA, NAME ))
         
-- Add an entry to OPT_PROFILE table for our index-oring guideline
IMPORT FROM example5.del OF DEL 
   MODIFIED BY LOBSINFILE 
   INSERT INTO SYSTOOLS.OPT_PROFILE

To view the query access plan using the optimization profile created, the SET CURRENT OPTIMIZATION PROFILE statement can be used in combination with the SET CURRENT EXPLAIN MODE statement, as follows:

Listing 24. View the query access plan using the optimization profile
      -- use the IXORPLAN profile
      SET CURRENT OPTIMIZATION PROFILE="IXORPLAN"

      SET CURRENT EXPLAIN MODE EXPLAIN

      SELECT FIRSTNME, LASTNAME, JOB, WORKDEPT, SALARY
        FROM EMPLOYEE
       WHERE JOB IN ('CLERK', 'SALESREP') AND
             WORKDEPT='A00'
      ORDER BY JOB, SALARY

      SET CURRENT EXPLAIN MODE NO

A query access plan similar to the following is chosen by the optimizer:

Listing 25. Query access plan
                    Rows
                   RETURN
                   (   1)
                    Cost
                     I/O
                     |
                   1.19048
                   TBSCAN
                   (   2)
                   13.0404
                  0.963719
                     |
                   1.19048
                   SORT
                   (   3)
                   12.967
                  0.963719
                     |
                   1.19048
                   FETCH
                   (   4)
                   12.8278
                  0.963719
                  /---+---\
             1.19048        42
             RIDSCN   TABLE: SKAPOOR
             (   5)      EMPLOYEE
             4.89404
                0
          /-----+-----\
    0.952381         0.238095
     SORT             SORT
     (   6)           (   8)
     2.7352           2.21536
        0                0
       |                |
    0.952381         0.238095
     IXSCAN           IXSCAN
     (   7)           (   9)
     2.6272           2.10736
        0                0
       |                |
       42               42
 INDEX: SKAPOOR   INDEX: SKAPOOR
      IND2             IND2

If the generated query access plan is not the index oring plan shown above, then there is a problem with your optimization profile setup. In the db2exfmt output, the following is seen if the optimizer used the optimization profile:

    Profile Information:
    --------------------
    OPT_PROF: (Optimization Profile Name)
            SKAPOOR.IXORPLAN
    STMTPROF: (Statement Profile Name)
            Example 5 Index oring test

It is left as an exercise to the reader to determine the appropriate method to collect a column group statistic on the columns (JOB,WORKDEPT). Once the column group statistic is collected, the query access plan displays improved cardinality estimates:

Listing 26. Improved cardinality estimates in query access plan
                    Rows
                   RETURN
                   (   1)
                    Cost
                     I/O
                     |
                      5
                   TBSCAN
                   (   2)
                   13.878
                      1
                     |
                      5
                   SORT
                   (   3)
                   13.7665
                      1
                     |
                      5
                   FETCH
                   (   4)
                   13.4746
                      1
                  /---+---\
                5           42
             RIDSCN   TABLE: SKAPOOR
             (   5)      EMPLOYEE
             4.89404
                0
          /-----+-----\
        4                1
     SORT             SORT
     (   6)           (   8)
     2.7352           2.21536
        0                0
       |                |
        4                1
     IXSCAN           IXSCAN
     (   7)           (   9)
     2.6272           2.10736
        0                0
       |                |
       42               42
 INDEX: SKAPOOR   INDEX: SKAPOOR
      IND2             IND2

At each IXSCAN operator, the cardinality is corrected to account for a correlation between the predicates:

  • JOB='CLERK' AND WORKDEPT='A00'
  • JOB='SALESREP' AND WORKDEPT='A00'

and the cardinality is corrected at the RIDSCN and FETCH operators, which accounts for the statistical correlation between the IN and equality predicates.


Statistical correlation of multiple local equality predicates within subterms of OR operators

If the WHERE clause of an SQL statement applies OR operators with multiple local predicates within each subterm, as follows:

	        (C1=literal_1 AND C2=literal_2) OR 
	        (C1=literal_3 AND C2=literal_4) OR 
	        (C1=literal_5 AND C2=literal_6)

and multi-column statistics on (C1,C2) are collected, then the optimizer will attempt to detect a statistical correlation between the predicates in order to improve the filtering effect of the OR predicate. In this article, the above OR operators are described as a single OR predicate with three subterms:

  1. (C1=literal_1 AND C2=literal_2)
  2. (C1=literal_3 AND C2=literal_4)
  3. (C1=literal_5 AND C2=literal_6)

This does not apply if the OR predicate contains any of the following:

  • Non-local equality predicates in any of the subterms
  • Different sets of columns referenced in two or more subterms

The following are some examples for which the optimizer tries to detect a correlation between local IN, OR, and equality predicates:

          a) (COL_1=literal_1 AND COL_2=literal_2) OR 
             (COL_1=literal_3 AND COL_2=literal_4) OR 
                              ...                  OR 
             (COL_1=literal_n AND COL_2=literal_m)

The following are some examples of predicates that are not considered for statistical correlation detection by the optimizer:

          a) (COL_1=literal_1 AND COL_2=literal_2) OR 
             (COL_1=literal_3 AND COL_2=literal_4 AND COL_3=literal_5)
             
          b) (COL_1=literal_1 AND COL_2=literal_2) OR 
             (COL_1=literal_3 AND COL_2=literal_4) OR 
             (COL_1=literal_5 AND COL_2=literal_6 AND COL_3=literal_7)

Example 6: (C1=LITERAL1 AND C2=LITERAL2) OR (C1=LITERAL3 AND C2=LITERAL4)

This example illustrates the effect of column group statistics on a qualifying OR predicate. Consider the following query on the EMPLOYEE table:

Listing 27. Query on the EMPLOYEE table
SELECT FIRSTNME, LASTNAME, WORKDEPT, JOB, BONUS, SALARY
  FROM EMPLOYEE
  WHERE ( WORKDEPT='E21' AND JOB='FIELDREP' ) OR 
      ( WORKDEPT='D21' AND JOB='MANAGER' )
  ORDER BY WORKDEPT, SALARY

This query returns six records from the EMPLOYEE table:

Listing 28. Query results from the EMPLOYEE table
FIRSTNME     LASTNAME        WORKDEPT JOB      BONUS       SALARY
------------ --------------- -------- -------- ----------- -----------
EVA          PULASKI         D21      MANAGER       700.00    96170.00
ROY          ALONZO          E21      FIELDREP      500.00    31840.00
HELENA       WONG            E21      FIELDREP      500.00    35370.00
RAMLAL       MEHTA           E21      FIELDREP      400.00    39950.00
JASON        GOUNOT          E21      FIELDREP      500.00    43840.00
WING         LEE             E21      FIELDREP      500.00    45370.00

 6 record(s) selected.

If you re-collect the statistics without the column group statistics using:

RUNSTATS ON TABLE SKAPOOR.EMPLOYEE 
   WITH DISTRIBUTION AND DETAILED INDEXES ALL

a query access plan similar to the following is chosen by the optimizer, with a cardinality estimate under 2:

Listing 29. Query access plan similar to the one chosen by the optimizer
           Rows
          RETURN
          (   1)
           Cost
            I/O
            |
          1.88095
          TBSCAN
          (   2)
          16.1786
             1
            |
          1.88095
          SORT
          (   3)
          16.1272
             1
            |
          1.88095
          TBSCAN
          (   4)
          16.0113
             1
            |
            42
      TABLE: SKAPOOR
         EMPLOYEE

Collecting a column group statistic on the columns (JOB,WORKDEPT) allows the optimizer to better estimate the filtering effect of the OR predicate, since each subterm of the OR predicate applies a set of local equality predicates on the columns JOB and WORKDEPT. It is left as an exercise for you to determine the appropriate RUNSTATS statement to collect a column group statistic. Once collected, a query access plan similar to the following is chosen by the optimizer, with an improved cardinality estimate that is very close to the actual result of six rows:

Listing 30. Query access plan with more accurate cardinality estimate
           Rows
          RETURN
          (   1)
           Cost
            I/O
            |
            5.6
          TBSCAN
          (   2)
          16.2651
             1
            |
            5.6
          SORT
          (   3)
          16.2136
             1
            |
            5.6
          TBSCAN
          (   4)
          16.0113
             1
            |
            42
      TABLE: SKAPOOR
         EMPLOYEE

Conclusion

The optimizer is dependent on accurate cardinality estimates to properly compute the cost of each query access plan considered. You can leverage the extended use of multi-column statistics in DB2 9.5 to provide the optimizer more information to better estimate the cardinality in order to choose an optimal query access plan.

Resources

Learn

Get products and technologies

  • Download a free trial version of DB2 Enterprise 9.
  • Now you can use DB2 for free. Download DB2 Express-C a no-charge version of DB2 Express Edition for the community that offers the same core data features as DB2 Express Edtion and provides a solid base to build and deploy applications.
  • Download IBM product evaluation versions and get your hands on application development tools and middleware products from IBM Information Management, Lotus®, Rational®, Tivoli®, and WebSphere®.

Discuss

Comments

developerWorks: Sign in

Required fields are indicated with an asterisk (*).


Need an IBM ID?
Forgot your IBM ID?


Forgot your password?
Change your password

By clicking Submit, you agree to the developerWorks terms of use.

 


The first time you sign into developerWorks, a profile is created for you. Information in your profile (your name, country/region, and company name) is displayed to the public and will accompany any content you post, unless you opt to hide your company name. You may update your IBM account at any time.

All information submitted is secure.

Choose your display name



The first time you sign in to developerWorks, a profile is created for you, so you need to choose a display name. Your display name accompanies the content you post on developerWorks.

Please choose a display name between 3-31 characters. Your display name must be unique in the developerWorks community and should not be your email address for privacy reasons.

Required fields are indicated with an asterisk (*).

(Must be between 3 – 31 characters.)

By clicking Submit, you agree to the developerWorks terms of use.

 


All information submitted is secure.

Dig deeper into Information management on developerWorks


static.content.url=http://www.ibm.com/developerworks/js/artrating/
SITE_ID=1
Zone=Information Management
ArticleID=335477
ArticleTitle=Further understand column group statistics in DB2
publish-date=09042008