Understand column group statistics in DB2

Improve cardinality estimates in an access plan

With column group statistics in IBM DB2® for Linux®, UNIX®, and Windows® (DB2), the optimizer can determine a better query access plan and improve query performance when there is a correlation between multiple equality local predicates or equality join predicates. In this article, learn all about how to use column group statistics.

Share:

Samir Kapoor, DB2 UDB Advanced Support Analyst, IBM

Samir Kapoor photoSamir Kapoor is an IBM Certified Advanced Technical Expert for DB2. Samir currently works with the DB2 UDB Advanced Support -- Down system division (DSD) team and has in-depth knowledge in the engine area.



Vincent Corvinelli, Advisory Software Developer, IBM

Vincent Corvinelli is an advisory software developer in the DB2® for Linux®, UNIX®, and Windows® (DB2) Optimizer group in IBM Canada. He has extensive experience in the diagnosis and resolution of customer problems.



21 December 2006

Also available in Chinese

Introduction

The DB2 SQL optimizer (referred to as optimizer hereafter) estimates the execution costs of each access plan alternative and selects an optimal access plan based on its estimates. An access plan specifies the order of operations required to resolve an SQL statement.

To properly determine the cost of each access plan, the DB2 optimizer requires accurate cardinality estimates. Cardinality estimation is a process by which the optimizer uses statistics to determine the size of partial query results after predicates are applied or aggregation is performed. At each operator in the access plan, the optimizer estimates the cardinality output from the operator. The application of one or more predicates may reduce the output stream cardinality.

It is common practice to assume the predicates are independent of each other when computing their combined filtering effect on the cardinality estimate. However, the predicates can be statistically correlated. Treating them independently typically results in the optimizer under-estimating the cardinality. Under-estimating the cardinality could lead the optimizer to choose a sub-optimal access plan.

The optimizer considers using multi-column statistics to detect statistical correlation and estimate more accurately the combined filtering effect of multiple predicates for SQL statements that apply at least two local equality predicates. The optimizer may also use multi-column statistics for SQL statements that join two or more tables, and for joins that apply at least two equality join predicates between a pair of tables.

A local equality predicate is an equality predicate applied to a single table, and is described as follows:

COLUMN = literal

where literal can be any one of these:

  • a constant value;
  • a parameter marker or host variable;
  • a special register (for example, CURRENT DATE)

An equality join predicate is described as follows, for a join between two tables T1 and T2:

T1.COLUMN = T2.COLUMN

DB2 Version 8.2 uses the following multi-column statistics:

  • Index keycard statistics: FIRST2KEYCARD, FIRST3KEYCARD, FIRST4KEYCARD, and FULLKEYCARD
  • Column group statistics: column group cardinality

These statistics describe the number of distinct groupings for a set of two or more columns.

Prior to DB2 Version 8.2, only index keycard statistics were used, and only under the following restrictions:

  • The index must be fully qualified. An index is fully qualified if all columns in the key are referenced by equality predicates (join or local, but not mixed join and local).
  • For join predicates, the index must also be unique.

In DB2 Version 8.2, the DB2 SQL optimizer further extended the use of multi-column statistics by considering all index keycard statistics without requiring the index to be fully qualified. It also considers any column group statistics collected by the user. This article discusses how the optimizer makes use of these statistics, and how users can identify which column group statistics to collect.


Statistical correlation of multiple local equality predicates

The DB2 SQL optimizer attempts to detect statistical correlation among multiple local equality predicates.

Example 1: Consider a table, SHOW_LISTINGS, which contains the following columns:

Table 1. Description of SHOW_LISTINGS table
COLUMN NAMEDESCRIPTION
SHOW_IDForeign key to table containing information about each SHOW listing
CHANNEL_IDForeign key to table containing information about each CHANNEL on which the show is playing
STATION_IDForeign key to table containing information about each STATION associated to the CHANNEL
CITY_IDForeign key to table containing information about each CITY in which the SHOW is available
DAYThe DAY the SHOW is playing
TIMEThe TIME of DAY the SHOW is playing
<other columns>Other columns describing other attributes of SHOW listings

Since a SHOW only plays on a certain CHANNEL and STATION, during a specific TIME in the day, the columns (SHOW_ID, CHANNEL_ID, STATION_ID, and TIME) are not independent of each other. The DAY column is independent of the TIME column, but might not be independent of SHOW_ID for all SHOW listings.

Consider an SQL statement that applies the following predicates:

P1: SHOW_ID = ?
P2: CHANNEL_ID = ?
P3: STATION_ID = ?
P4: TIME = ?

If an index exists, where the key consists of all four columns referenced by predicates P1-P4, or the leading four columns in the key contain the columns referenced, the optimizer uses the FIRST4KEYCARD index statistics (if index statistics are collected) to detect statistical correlation between the predicates P1-P4. The optimizer computes a more accurate cardinality estimate after the four predicates are applied. For example, any of the following indexes can be used to detect statistical correlation among the four predicates:

IX1 ON SHOW_LISTINGS(SHOW_ID, CHANNEL_ID, STATION_ID, TIME)
IX2 ON SHOW_LISTINGS(SHOW_ID, CHANNEL_ID, STATION_ID, TIME, CITY_ID)
IX3 ON SHOW_LISTINGS(SHOW_ID, CHANNEL_ID, STATION_ID, TIME, DAY, CITY_ID)

The optimizer uses the FULLKEYCARD and FIRST4KEYCARD statistics of IX1 to detect the correlation in all four predicates. Likewise, it can use the FIRST4KEYCARD of IX2 and IX3.

The following index cannot be used:

IX4 ON SHOW_LISTINGS(CITY_ID, SHOW_ID, CHANNEL_ID, STATION_ID, TIME)

since each keycard statistic will include the column CITY_ID, on which a local equality predicate is not defined.

If an index with the required key does not exist, then column group statistics can be collected on SHOW_ID, CHANNEL_ID, STATION_ID, and TIME. The optimizer uses these column group statistics to detect the statistical correlation between the four predicates in the same manner as the FIRST4KEYCARD statistic from the index.

The DB2 optimizer also considers indexes and column group statistics that cover a subset of the predicates. For example, the following index:

IX5 ON SHOW_LISTINGS(SHOW_ID, STATION_ID, CITY_ID, CHANNEL_ID)

can be used to detect the statistical correlation on the predicates P1 and P3, using the FIRST2KEYCARD statistic. Although this only partially corrects for the statistical correlation among the predicates P1-P4, it might be enough to allow the optimizer to choose a good query execution plan. Furthermore, even though the full key in index IX5 includes the column CHANNEL_ID, the FULLKEYCARD statistic also cannot be used to detect for correlation with P2 because it includes the CITY_ID column, which is not referenced in a predicate.


Statistical correlation of multiple equality join predicates

The DB2 SQL optimizer also attempts to detect the statistical correlation between multiple equality join predicates between two tables.

Example 2: Consider the SHOW_LISTINGS table, described in Example 1, and the following table, RATINGS, which contains these columns:

Table 2. Description of RATINGS table
COLUMN NAMEDESCRIPTION
SHOW_IDForeign key to table containing information about each SHOW listing
CHANNEL_IDForeign key to table containing information about each CHANNEL on which the show is playing
STATION_IDForeign key to table containing information about each STATION associated to the CHANNEL
CITY_IDForeign key to table containing information about each CITY in which the SHOW is available
DAYThe DAY the SHOW is playing
TIMEThe TIME of DAY the SHOW is playing
RATINGThe RATING for a SHOW listing on a specific CHANNEL and STATION in a specific CITY for a certain TIME in the DAY

The RATINGS table contains information describing the rating of SHOW listings for various cities, and a user may need to query the join of the RATINGS table with the SHOW_LISTINGS table to retrieve attributes from the two tables. The set of predicates applied by a query on the join could include the following:

P1: SHOW_LISTINGS.SHOW_ID = RATINGS.SHOW_ID
P2: SHOW_LISTINGS.CHANNEL_ID = RATINGS.CHANNEL_ID
P3: SHOW_LISTINGS.STATION_ID = RATINGS.STATION_ID

The three predicates, P1-P3, may not be independent of each other; the optimizer attempts to detect a correlation between the predicates using any available multi-column statistics.

The optimizer only detects and accounts for the statistical correlation between multiple join predicates on each pair of tables joined. For example, if an SQL statement contains the following predicates:

P1: T1.A = T2.A
P2: T1.B = T2.B
P3: T1.C = T3.C

The optimizer only attempts to detect the statistical correlation between P1 and P2, since they apply to the join of the same two tables, T1 and T2. P3 applies to the join between T1 and T3; a different join. If a fourth predicate is added, such as:

P4: T1.D = T2.D

The optimizer further attempts to detect the statistical correlation between P3 and P4.

To account for statistical correlation between equality join predicates, the optimizer uses the multi-column statistics from one of the two tables involved in the join; the table that is deemed the parent of the join. As a result, column group statistics only need to be collected on the parent table in the join. The other table in the join is denoted as the child. If a parent table cannot be identified, then the optimizer does not use the multi-column statistics to detect the statistical correlation between the join predicates.

Appendix B provides examples and describes in detail how the optimizer determines the parent and child of a two table join. Identifying the parent of the join, or whether the join has a parent, is useful to avoid collecting useless column group statistics. If a suitable index is not already available, simply collect column group statistics on each table in the join.


Collect multi-column statistics

The optimizer considers using all index keycard statistics available. If an index keycard statistic is not available on the set of columns referenced by the statistically correlated predicates, you can collect column group statistics using the RUNSTATS command. Below is the syntax taken from DB2 V82.

Note: For DB2 9, please check the syntax in the DB2 9 documentation within the Info Center.

>>-RUNSTATS--ON TABLE--table name--+-USE PROFILE------------+--->
                                   '-| Statistics Options |-'
Statistics Options:

                                 .-ALLOW WRITE ACCESS-.
|--+--------------------------+--+--------------------+--------->
   '-| Table Object Options |-'  '-ALLOW READ ACCESS--'


Table Object Options:

|--+-FOR--| Index Clause |----------------------------------+---|
      '-+-------------------------+--+-----------------------+-'
          '-| Column Stats Clause |-'  '-AND--| Index Clause |-'

Column Stats Clause:

|--+-ON--| Cols Clause |------------------------------+---------|
      '-+---------------------+--| Distribution Clause |-'
          '-ON--| Cols Clause |-'

On Cols Clause:

   	.-ON ALL COLUMNS-------------------------------------------------.
   	|                                           .-,-------------.    |
   	|                                           V               |    |
   |--+-ON--+----------------------+--COLUMNS--(----Column Option-+--)-+--|
      |     '-+-ALL-+--COLUMNS AND-'                                      |
   	|       '-KEY-'                                                  |
   	'-ON KEY COLUMNS-------------------------------------------------'

The "ON COLUMNS" clause allows you to specify a list of columns for which to collect statistics. If you specify a group of columns, the number of distinct values for the group are collected. Statistics for columns that you do not list are cleared. You can use this clause in the "on-cols-clause" and the "on-dist-cols-clause".

Note: Collecting distribution statistics for a group of columns is not currently supported.

Note: If automatic runstats is enabled and column group statistics have been collected using the RUNSTATS command, the automatic runstats override the statistics, and the column group statistics are discarded.

The AUTO_RUNSTATS database configuration setting indicates whether automatic runstats is enabled:

Automatic runstats (AUTO_RUNSTATS)=ON

If it is set to "ON", then automatic runstats is enabled.

AUTO_RUNSTATS uses the "with distribution and sampled detailed indexes all" RUNSTATS option by default.

You can use statistic profiling to override the default RUNSTATS options. You should add any column group statistics you want to collect to this profile to avoid automatic runstats overriding them.

"Statistics Profiling" provides details about statistics profiling.

For database partitioning feature (DPF) environments, automatic runstats always collect statistics on the catalog partition. If the table does not exist on the catalog node, the first partition in the nodegroup that the table resides within is used.

Column group statistics are collected using the "ON COLUMNS" option of the RUNSTATS command. For example, using the SHOW_LISTINGS table to collect the column group statistics on SHOW_LISTINGS.SHOW_ID, SHOW_LISTINGS.CHANNEL_ID and SHOW_LISTINGS.STATION_ID, issue the following RUNSTATS command:

RUNSTATS ON TABLE schema_name.SHOW_LISTINGS ON COLUMNS ((SHOW_ID, CHANNEL_ID, 
	STATION_ID))

To collect column statistics only on the DAY column and column group statistics for columns SHOW_ID, CHANNEL_ID, STATION_ID and TIME, issue the following RUNSTATS command:

RUNSTATS ON TABLE schema_name.SHOW_LISTINGS ON COLUMNS ((SHOW_ID, CHANNEL_ID, 
	STATION_ID, TIME), DAY)

Note: Column statistics are collected on all columns listed within a column group. In the above command, column statistics are also collected on each of the columns (SHOW_ID, CHANNEL_ID, STATION_ID, and TIME).

To gather statistics on all columns and column group statistics on SHOW_ID, CHANNEL_ID, STATION_ID, and TIME, issue the following RUNSTATS command:

RUNSTATS ON TABLE schema_name.SHOW_LISTINGS ON ALL COLUMNS AND COLUMNS 
	((SHOW_ID, CHANNEL_ID, STATION_ID, TIME))

To gather multiple multi-column statistics, you can provide a set of groups. The following RUNSTATS command gathers multi-column statistics on the groups (SHOW_ID, CHANNEL_ID, and STATION_ID) and (SHOW_ID, CHANNEL_ID, STATION_ID, and TIME); as well as collecting column statistics on the DAY column:

RUNSTATS ON TABLE schema_name.SHOW_LISTINGS ON COLUMNS ((SHOW_ID, 
	CHANNEL_ID, STATION_ID), (SHOW_ID, CHANNEL_ID, STATION_ID, TIME), DAY)

Note: Unlike index statistics, where multiple multi-column statistics are collected for the first two leading columns, the first three leading columns, and the first four leading columns of the index key, a single column group cardinality statistic is collected for each group of columns specified.


Identify when to collect column group statistics

Determining when to collect column group statistics and which column group statistics to collect can be a daunting task. This section identifies ways for you to detect when column group statistics are required.

The examples in this section use the SAMPLE database, which can be created by executing "db2sampl", and uses db2level:

Informational tokens are "DB2 v8.1.1.120", "s060801", "U808888", and FixPak "13".

Example 3: Local equality predicates

When the SAMPLE database is initially created, statistics are not collected on the tables. First, statistics need to be collected on the EMPLOYEE table:

RUNSTATS ON TABLE SCHEMA_NAME.EMPLOYEE WITH DISTRIBUTION;

Consider the following query on the EMPLOYEE table in the SAMPLE database:

SELECT EMPNO, FIRSTNME, LASTNAME, WORKDEPT, SEX, JOB, SALARY
FROM EMPLOYEE
WHERE JOB='ANALYST';

This query returns two records from the EMPLOYEE table:

EMPNO  FIRSTNME     LASTNAME        WORKDEPT SEX JOB      SALARY     
------ ------------ --------------- -------- --- -------- -----------
000130 DOLORES      QUINTANA        C01      F   ANALYST     23800.00
000140 HEATHER      NICHOLLS        C01      F   ANALYST     28420.00

 2 record(s) selected.

Take a look at the query execution plan chosen for this query.

To do so, the EXPLAIN tool is used, which requires the existence of the EXPLAIN tables.

  1. To create the EXPLAIN tables execute:
    db2 -tvf $DB2PATH/misc/EXPLAIN.DDL
  2. Once the EXPLAIN tables are created, EXPLAIN the query as follows:
    SET CURRENT EXPLAIN MODE EXPLAIN;
    
    SELECT EMPNO, FIRSTNME, LASTNAME, WORKDEPT, SEX, JOB, SALARY
    FROM EMPLOYEE
    WHERE JOB='ANALYST';
    
    SET CURRENT EXPLAIN MODE NO;

    and view the query execution plan using the db2exfmt tool:
    db2exfmt -d <DBNAME> -1 -g -o <FILENAME>
  3. Using the text editor of your choice, you should see a query execution plan that looks like the following:
            2 
         TBSCAN 
         (   2) 
         30.8464 
            2 
           |
           32 
     TABLE: SKAPOOR 
        EMPLOYEE

    The cardinality estimate of 2 matches the actual result.

  4. Add a couple redundant equality predicates to the query:
    SELECT EMPNO, FIRSTNME, LASTNAME, WORKDEPT, SEX, JOB, SALARY
    FROM EMPLOYEE
    WHERE JOB='ANALYST' AND SEX='F' AND WORKDEPT='C01';

    This query returns the same result set as above. But taking a look at the query execution plan generated with the EXPLAIN tool, the cardinality estimate doesn't match the actual result:

        0.0761719 
         TBSCAN 
         (   2) 
         31.4115 
            2 
           |
           32 
     TABLE: SKAPOOR 
        EMPLOYEE

    The DB2 optimizer assumes the three predicates are independent because a relevant index or column group statistics do not exist.

  5. Collect column group statistics on columns JOB, WORKDEPT, and SEX.
    RUNSTATS ON TABLE <SCHEMA_NAME>.EMPLOYEE ON ALL COLUMNS AND COLUMNS 
    	((JOB, WORKDEPT, SEX)) WITH DISTRIBUTION;

    With the column group statistics on the three columns, the DB2 optimizer computes a better cardinality estimate:

         1.77778 
         TBSCAN 
         (   2) 
         31.4214 
            2 
           |
           32 
     TABLE: SKAPOOR 
        EMPLOYEE

    The cardinality estimate is not computed exactly as 2 as the single predicate query since the column group statistic is a uniform distribution statistic.

Example 4: Equality join predicates

This example focuses on a join between tables ORG and STAFF. First, statistics need to be collected on the tables. For now, basic statistics are collected:

RUNSTATS ON TABLE <SCHEMA_NAME>.ORG;
RUNSTATS ON TABLE <SCHEMA_NAME>.STAFF;

Consider the following query joining the ORG and STAFF tables:

SELECT STAFF.NAME, ORG.DEPTNUMB, ORG.DEPTNAME, STAFF.SALARY
FROM ORG, STAFF
WHERE ORG.MANAGER = STAFF.ID AND ORG.DEPTNUMB = STAFF.DEPT
ORDER BY ORG.DEPTNUMB;

This query returns eight records:

NAME      DEPTNUMB DEPTNAME       SALARY   
--------- -------- -------------- ---------
Molinare        10 Head Office     22959.20
Hanes           15 New England     20659.80
Sanders         20 Mid Atlantic    18357.50
Marenghi        38 South Atlantic  17506.75
Plotz           42 Great Lakes     18352.80
Fraye           51 Plains          21150.00
Lea             66 Pacific         18555.50
Quill           84 Mountain        19818.00

  8 record(s) selected.
  1. Using the EXPLAIN tool, view the query execution plan:
                    1 
                 TBSCAN 
                 (   2) 
                 33.2225 
                    2 
                   |
                    1 
                 SORT   
                 (   3) 
                 33.151 
                    2 
                   |
                    1 
                 HSJOIN 
                 (   4) 
                 33.0248 
                    2 
              /-----+-----\
           35                8 
         TBSCAN           TBSCAN 
         (   5)           (   6) 
         17.2334          15.3736 
            1                1 
           |                |
           35                8 
     TABLE: SKAPOOR  TABLE: SKAPOOR 
          STAFF             ORG

    This example uses the naive method of collecting column group statistics. Appendix B contains the same example, with further analysis to determine the parent of the join.

  2. The join is on columns ID and DEPT of the STAFF table, and the MANAGER and DEPTNUMB columns of the ORG table, so collect column group statistics on those two sets of columns:
    RUNSTATS ON TABLE <SCHEMA_NAME>.STAFF ON ALL COLUMNS AND COLUMNS ((ID, DEPT));
    RUNSTATS ON TABLE <SCHEMA_NAME>.ORG ON ALL COLUMNS AND COLUMNS ((MANAGER, DEPTNUMB));

With the column group statistics, the DB2 optimizer correctly estimates the cardinality:

               8
             TBSCAN 
             (   2) 
             33.5658 
                2 
               |
               8
             SORT   
             (   3) 
             33.4243 
                2 
               |
               8      
             HSJOIN 
             (   4) 
             33.0363 
                2 
          /-----+-----\
       35                8 
     TBSCAN           TBSCAN 
     (   5)           (   6) 
     17.2334          15.3736 
        1                1 
       |                |
       35                8 
 TABLE: SKAPOOR  TABLE: SKAPOOR 
      STAFF             ORG

Example 5: View the multi-column statistics

The optimizer makes use of two types of multi-column statistics: index keycard statistics and column group statistics. This example provides the steps to view the multi-column statistics available on a table.

Option 1. Use the db2look tool

The db2look tool is used to generate DDL statements to recreate the objects defined in a database. You can use the -m option to show statistics collected for those objects.

  1. Collect column group statistics and index statistics on the ORG table:
    CREATE INDEX IX1 ON ORG (DEPTNUMB, DEPTNAME, MANAGER);
    
    RUNSTATS ON TABLE <SCHEMA_NAME>.ORG ON ALL COLUMNS
             AND COLUMNS ((DEPTNUMB, DEPTNAME, MANAGER, DIVISION, LOCATION))
             WITH DISTRIBUTION AND INDEXES ALL;
  2. Use the db2look tool to view the statistics on the ORG table:
    db2look -d sample -e -a -m -t ORG -o org.ddl

    Note: Use the -h option to view information about the DB2look tool.

  3. Look at the output in the org.ddl file. It should contain the following UPDATE statement for the column group statistics:
    UPDATE SYSSTAT.COLGROUPS SET colgroupcard = 8
            WHERE colgroupid IN (SELECT colgroupid
            FROM SYSCAT.COLGROUPCOLS
            WHERE tabschema = 'SKAPOOR ' AND tabname = 'ORG'
                  AND colname = 'DEPTNUMB' AND oridnal = 1)
            AND colgroupid IN (SELECT colgroupid
            FROM SYSCAT.COLGROUPCOLS
            WHERE tabschema = 'SKAPOOR ' AND tabname = 'ORG'
                  AND colname = 'DEPTNAME' AND oridnal = 2)
            AND colgroupid IN (SELECT colgroupid
            FROM SYSCAT.COLGROUPCOLS
            WHERE tabschema = 'SKAPOOR ' AND tabname = 'ORG'
                  AND colname = 'MANAGER' AND oridnal = 3)
            AND colgroupid IN (SELECT colgroupid
            FROM SYSCAT.COLGROUPCOLS
            WHERE tabschema = 'SKAPOOR ' AND tabname = 'ORG'
                  AND colname = 'DIVISION' AND oridnal = 4)
            AND colgroupid IN (SELECT colgroupid
            FROM SYSCAT.COLGROUPCOLS
            WHERE tabschema = 'SKAPOOR ' AND tabname = 'ORG'
                  AND colname = 'LOCATION' AND oridnal = 5)
            AND colgroupid NOT IN (SELECT colgroupid
            FROM SYSCAT.COLGROUPCOLS
            WHERE tabschema = 'SKAPOOR ' 
                  AND tabname = 'ORG' AND oridnal = 6) ;

    Note: The column group statistics information is added to the db2look tool in Version 8 FixPak 13.

    The above update statement lists all the columns from the SYSCAT.COLGROUPCOLS view and the associated multi-column statistic from the SYSSTAT.COLGROUPS, which represents the number of distinct groups in that set of columns. In this example, the above statement describes the set of columns (DEPTNUMB, DEPTNAME, MANAGER, DIVISION, and LOCATION) having eight distinct groups.

    The org.ddl file also contains the following statement for the index statistics:

    UPDATE SYSSTAT.INDEXES
    SET NLEAF=1,
        NLEVELS=1,
        FIRSTKEYCARD=8,
        FIRST2KEYCARD=8,
        FIRST3KEYCARD=8,
        FIRST4KEYCARD=-1,
        FULLKEYCARD=8,
        CLUSTERFACTOR=-1.000000,
        CLUSTERRATIO=100,
        SEQUENTIAL_PAGES=0,
        DENSITY=0,
        AVERAGE_SEQUENCE_GAP=0.000000,
        AVERAGE_SEQUENCE_FETCH_GAP=0.000000,
        AVERAGE_SEQUENCE_PAGES=0.000000,
        AVERAGE_SEQUENCE_FETCH_PAGES=0.000000,
        AVERAGE_RANDOM_PAGES=1.000000,
        AVERAGE_RANDOM_FETCH_PAGES=0.000000,
        NUMRIDS=8,
        NUMRIDS_DELETED=0,
        NUM_EMPTY_LEAFS=0
    WHERE INDNAME = 'IX1' AND INDSCHEMA = 'SKAPOOR'
          AND TABNAME = 'ORG' AND TABSCHEMA = 'SKAPOOR';

    The above update statement describes the following multi-column statistics. The FIRST2KEYCARD statistic describes the number of distinct groups in the columns (DEPTNUMB,DEPTNAME) and the FIRST3KEYCARD and FULLKEYCARD describe the number of distinct groups in the full set of columns (DEPTNUMB,DEPTNAME,MANAGER). The FIRST4KEYCARD has a value of -1 since the index only has 3 columns in the key.

Option 2. Query the catalog tables

You can query the same information described in the DB2look tool output from the DB2 catalog tables.

  1. Create the index (as in Step 1 of Option 1 above), if not already created, and collect multiple column group statistics on multiple tables:
    RUNSTATS ON TABLE <SCHEMA_NAME>.ORG ON ALL COLUMNS
             AND COLUMNS ((DEPTNUMB, DEPTNAME, MANAGER, DIVISION, LOCATION), 
             (DEPTNUMB, DEPTNAME), (MANAGER, DIVISION)) WITH DISTRIBUTION AND INDEXES ALL;
    
    RUNSTATS ON TABLE <SCHEMA_NAME>.EMPLOYEE ON ALL COLUMNS 
             AND COLUMNS ((EMPNO, WORKDEPT), (EMPNO, WORKDEPT, JOB));
  2. Query the catalog tables to retrieve column group statistics. Note that the following query is a recursive SQL statement that results in an SQL0347W warning that can be ignored. You can suppress it using the "UPDATE COMMAND OPTIONS USING W OFF" option.
    WITH TMP(ID, NUM) AS
    (
       SELECT COLGROUPID, MAX(ORIDNAL)
         FROM SYSCAT.COLGROUPCOLS
       GROUP BY COLGROUPID
    ),
    TMP2 (ID, NAME, NUM, TNAME, TSCHEMA) AS
    (
       SELECT ID, CAST(RTRIM(COLNAME) AS CHAR(128)), NUM, TABNAME, TABSCHEMA
         FROM TMP Y, SYSCAT.COLGROUPCOLS X
        WHERE X.COLGROUPID = Y.ID
          AND Y.NUM = X.ORIDNAL
      UNION ALL
       SELECT Y.ID, CAST(RTRIM(X.COLNAME) || ',' || Y.NAME AS CHAR(128)), Y.NUM-1,
       	 TNAME, TSCHEMA
         FROM TMP2 Y, SYSCAT.COLGROUPCOLS X
        WHERE Y.ID=X.COLGROUPID
          AND X.ORIDNAL=Y.NUM-1
          AND Y.NUM > 1
          AND TNAME = TABNAME
          AND TSCHEMA = TABSCHEMA
    )
    SELECT SUBSTR(TSCHEMA,1,10) TABSCHEMA, SUBSTR(TNAME,1,10) TABNAME, 
    	NAME COLS, COLGROUPCARD 
    FROM TMP2, SYSSTAT.COLGROUPS
    WHERE ID = COLGROUPID
    AND NUM = 1
    ORDER BY TABSCHEMA, TABNAME
    ;

    The above query returns the following records:

    Note: The value in the TABSCHEMA column will differ. Also, the COLS result column is cast to CHAR(128), which could truncate the result if it is larger than 128 characters. You may need to modify the CAST to a larger string in this case.

    TABSCHEMA  TABNAME    COLS                                   COLGROUPCARD        
    ---------- ---------- ------------------------------//------ -----------------
    SKAPOOR    EMPLOYEE   EMPNO,WORKDEPT                                        32
    SKAPOOR    EMPLOYEE   EMPNO,WORKDEPT,JOB                                    32
    SKAPOOR    ORG        DEPTNUMB,DEPTNAME                                      8
    SKAPOOR    ORG        MANAGER,DIVISION                                       8
    SKAPOOR    ORG        DEPTNUMB,DEPTNAME,MANAGER,DIVISION,LOCATION            8
    
      5 record(s) selected with 1 warning messages suppressed.

    The records describe the two-column group statistics on the EMPLOYEE table and the three-column group statistics on the ORG table.

    Note: In the above query, notice the column named "ORIDNAL" in the SYSCAT.SYSCOLGROUPCOLS view. In DB2 9, the spelling is corrected to "ORDINAL", so the query needs to be updated as follows to be used in DB2 9:

    WITH TMP(ID, NUM) AS
    (
       SELECT COLGROUPID, MAX(ORDINAL)
         FROM SYSCAT.COLGROUPCOLS
       GROUP BY COLGROUPID
    ),
    TMP2 (ID, NAME, NUM, TNAME, TSCHEMA) AS
    (
       SELECT ID, CAST(RTRIM(COLNAME) AS CHAR(128)), NUM, TABNAME, TABSCHEMA
         FROM TMP Y, SYSCAT.COLGROUPCOLS X
        WHERE X.COLGROUPID = Y.ID
          AND Y.NUM = X.ORDINAL
      UNION ALL
       SELECT Y.ID, CAST(RTRIM(X.COLNAME) || ',' || Y.NAME AS CHAR(128)), Y.NUM-1,
       	 TNAME, TSCHEMA
         FROM TMP2 Y, SYSCAT.COLGROUPCOLS X
        WHERE Y.ID=X.COLGROUPID
          AND X.ORDINAL=Y.NUM-1
          AND Y.NUM > 1
          AND TNAME = TABNAME
          AND TSCHEMA = TABSCHEMA
    )
    SELECT SUBSTR(TSCHEMA,1,10) TABSCHEMA, SUBSTR(TNAME,1,10) TABNAME, NAME COLS,
    	 COLGROUPCARD 
    FROM TMP2, SYSSTAT.COLGROUPS
    WHERE ID = COLGROUPID
    AND NUM = 1
    ORDER BY TABSCHEMA, TABNAME
    ;
  3. Query the catalog tables to retrieve index statistics.

    Note: Please replace the value provided in the TABSCHEMA='SKAPOOR' predicate with the appropriate schema name.

    SELECT SUBSTR(COLNAMES, 1, 30) AS COLS, FIRST2KEYCARD, FIRST3KEYCARD,
    	 FIRST4KEYCARD, FULLKEYCARD
    FROM SYSSTAT.INDEXES
    WHERE TABSCHEMA='SKAPOOR' AND TABNAME='ORG';

    The above query returns the following record:

    COLS                       FIRST2KEYCARD FIRST3KEYCARD FIRST4KEYCARD FULLKEYCARD
    -------------------------- ------------- ------------- ------------- -----------
    +DEPTNUMB+DEPTNAME+MANAGER             8             8            -1           8
    
      1 record(s) selected.

    The FIRST2KEYCARD statistic describes the number of distinct groups in the columns (DEPTNUMB and DEPTNAME), and the FIRST3KEYCARD and FULLKEYCARD describe the number of distinct groups in the full set of columns (DEPTNUMB, DEPTNAME, and MANAGER). The FIRST4KEYCARD has a value of -1 since the index only has three columns in the key.


Exercises

In both Examples 3 and 4, the optimizer can use index statistics, instead of column group statistics, to correct the cardinality estimate by accounting for statistical correlation.

Try these steps:

  1. Create the following indexes:
    IXEMP_1 ON EMPLOYEE(JOB, WORKDEPT, SEX);
    IXSTAFF_1 ON STAFF(ID, DEPT);
    IXORG_1 ON ORG(MANAGER, DEPTNUMB);
  2. Collect statistics on the indexes, and do not collect any column group statistics:
    RUNSTATS ON TABLE <SCHEMA_NAME>.EMPLOYEE AND INDEXES ALL;
    RUNSTATS ON TABLE <SCHEMA_NAME>.STAFF AND DETAILED INDEXES ALL;
    RUNSTATS ON TABLE <SCHEMA_NAME>.ORG AND INDEXES ALL;
  3. EXPLAIN the queries from Examples 3 and 4. The resulting cardinality estimate matches those in the query execution plans above; however, the actual plan can differ due to the indexes available. Note that collecting DETAILED index statistics does not affect the cardinality estimate.
  4. Drop the two indexes IXEMP_1 and IXSTAFF_1 created in Step 1, and create two new indexes:
    IXEMP_1 ON EMPLOYEE(JOB, WORKDEPT, SEX, EMPNO);
    IXSTAFF_1 ON STAFF(ID, DEPT,NAME,JOB);
  5. Repeat Steps 2 and 3. The resulting cardinality estimate is the same.

Appendix A. Automatic statistics profiling

The RUNSTATS utility provides an option to register and use a statistics profile. The DB2 automatic statistics profiling feature can also generate statistics profiles automatically. When this feature is enabled, information about database activity is collected and stored in a query feedback warehouse. Based on this data, a statistics profile is generated. This feature will also recommend two-column column group statistics for local equality predicates. This may be a useful tool to use in a test environment to identify useful column group statistics.

The Info Center contains further information about using automatic statistics profiling.

Note: Automatic statistics profile generation can only be activated in DB2 serial mode, and is blocked for queries in federated, and DPF environments, as well as when using intra-partition parallelism.


Appendix B. Determine the parent of a two-table join

The Statistical correlation of multiple equality predicates section discusses the naive method of collecting multi-column statistics to account for the statistical correlation between equality join predicates. This section describes how the optimizer determines the parent table, if any, in a two-table join. Identifying whether the optimizer will detect a parent table, and if so which one, is useful in order to understand when collecting column group statistics is valuable and on which tables.

The optimizer identifies the parent table, in a set of two-table join predicates, as the table that is the parent of at least one of the join predicates, and not the child of any of the other join predicate. The parent of a join predicate is determined as the table with the larger set of distinct values in the predicate column. For example, in the join predicate SHOW_LISTINGS.SHOW_ID = RATINGS.SHOW_ID, if the COLCARD statistic of SHOW_LISTINGS.SHOW_ID is larger than the COLCARD statistic of RATINGS.SHOW_ID, then SHOW_LISTINGS is the parent in the join predicate, and RATINGS is the child.

Furthermore, the optimizer attempts to use the range statistics (HIGH2KEY and LOW2KEY) to verify the parent by ensuring the set of values in the child is a subset of the parent. For example, in the join predicate SHOW_LISTINGS.SHOW_ID = RATINGS.SHOW_ID if:

  • the COLCARD statistic of SHOW_LISTINGS.SHOW_ID is larger than the COLCARD statistic of RATINGS.SHOW_ID, and
  • the HIGH2KEY of SHOW_LISTINGS.SHOW_ID is larger than or equal to the HIGH2KEY of RATINGS.SHOW_ID, and
  • the LOW2KEY of RATINGS.SHOW_ID is less than or equal to the LOW2KEY of RATINGS.SHOW_ID

then SHOW_LISTINGS is the parent in the join predicate, and RATINGS is the child.

Example B.1

Considering this set of predicates:

P1: SHOW_LISTINGS.SHOW_ID = RATINGS.SHOW_ID
P2: SHOW_LISTINGS.CHANNEL_ID = RATINGS.CHANNEL_ID
P3: SHOW_LISTINGS.STATION_ID = RATINGS.STATION_ID

In P1, assume COLCARD(SHOW_LISTINGS.SHOW_ID) > COLCARD(RATINGS.SHOW_ID).

In P2, assume COLCARD(SHOW_LISTINGS.CHANNEL _ID) > COLCARD(RATINGS.CHANNEL_ID).

In P3, assume COLCARD(SHOW_LISTINGS.STATION_ID) > COLCARD(RATINGS.STATION_ID).

the optimizer identifies the SHOW_LISTINGS table as the parent in all three predicates, so it also identifies SHOW_LISTINGS as the parent of the join. As a result, the optimizer uses the multi-column statistics available on the columns (SHOW_ID, CHANNEL_ID, and STATION_ID) from the SHOW_LISTINGS table.

Example B.2

Using the predicates from Example B.1:

In P1, assume COLCARD(SHOW_LISTINGS.SHOW_ID) > COLCARD(RATINGS.SHOW_ID).

In P2, assume COLCARD(SHOW_LISTINGS.CHANNEL _ID) > COLCARD(RATINGS.CHANNEL_ID).

In P3, assume COLCARD(SHOW_LISTINGS.STATION_ID) = COLCARD(RATINGS.STATION_ID).

the optimizer identifies the SHOW_LISTINGS table as the parent in the predicates P1 and P2, and neither a child nor parent of the predicate P3. So, it also identifies SHOW_LISTINGS as the parent of the join. As a result, the optimizer uses the multi-column statistics available on the columns (SHOW_ID, CHANNEL_ID, and STATION_ID) from the SHOW_LISTINGS table.

Example B.3

Using the predicates from Example B.1:

In P1, assume COLCARD(SHOW_LISTINGS.SHOW_ID) > COLCARD(RATINGS.SHOW_ID).

In P2, assume COLCARD(SHOW_LISTINGS.CHANNEL _ID) < COLCARD(RATINGS.CHANNEL_ID).

In P3, assume COLCARD(SHOW_LISTINGS.STATION_ID) = COLCARD(RATINGS.STATION_ID).

the optimizer identifies the SHOW_LISTINGS table as the parent in the predicate P1, as the child in predicate P2, and neither a child nor a parent of the predicate P3. So, it does not identify SHOW_LISTINGS as the parent of the join (nor RATINGS). As a result, the optimizer does not attempt to detect the statistical correlation between these three join predicates.

The database partitioning feature

In a DPF environment, if the tables are partitioned across multiple database partitions, you can only use the range statistics if the following conditions hold:

  1. both tables are partitioned on the same columns
  2. the tables are collocated; that is, they are in the same nodegroup
  3. the join predicates reference all the partitioning key columns
  4. statistics are gathered on the same node

If the range statistics can't be used, then the COLCARD is the only statistic the optimizer uses to determine the parent table in the join. In a DPF environment, statistics are collected on a single node, which could lead to inaccuracies in the COLCARD statistic. As a result, the optimizer allows for some tolerance (1%) in the COLCARD statistic when determining the parent. However, this tolerance is based on the multi-column statistic, so the naive approach to collecting column group statistics should be used in DPF if the join is not neutral, and neither table in the join is a clear parent.

Example B.4

1. Determine if both tables are partitioned on the same columns

You cannot use the range statistics (HIGH2KEY and LOW2KEY) unless the data is distributed equivalently for both tables. As a result, if the tables are not partitioned on the same columns, the optimizer won't infer that the data is distributed equivalently for both tables.

Consider two tables, T1 and T2. T1 is partitioned on multiple database partitions, with a partitioning key on columns (C1 and C2). Likewise, T2 is also partitioned, with a partitioning key on columns (C2 and C3).

Case 1.1 Using the following predicates to join T1 and T2:

P1:  T1.C1=T2.C2
P2:  T1.C2=T2.C3

the tables are considered to be partitioned on the same columns since the join predicates are applied on the columns in the same order as their respective partitioning keys.

Case 1.2 Using these predicates:

P1:  T1.C1=T2.C3
P2:  T1.C2=T2.C2

the predicates aren't applied on the columns in the same order as their respective partitioning keys, so the tables are not considered to be partitioned on the same columns.

Case 1.3 Using these predicates:

P1: T1.C1=T2.C1
P2: T1.C2=T2.C2
P3: T1.C3=T2.C3

the tables are not considered to be partitioned on the same columns.

Case 1.4 Let's look at a situation using the same predicates as in example 1.3, but assuming T2 is partitioned on columns (C1, C2, and C3).

Although the table T1 is partitioned on the same columns a the two leading columns in T2's partitioning key, since T2 is further partitioned on C3, it could lead to the data being distributed differently than if partitioned on (C1 and C2) only. As a result, the two tables are not considered to be partitioned on the same columns.

2. Determine if the two tables are collocated

Consider two tables, T1 and T2, with T1 belonging in nodegroup N0 and T2 belonging in nodegroup N1, where N0 contains partitions numbered 0 and 1 and N1 contains partitions numbered 1 and 2. Since the tables are not in the same nodegroup, they are not considered collocated.

3. Determine if the join predicates reference all the partitioning key columns

Consider two tables, T1 and T2. T1 is partitioned on multiple database partitions, with a partitioning key on columns (C1 and C2). Likewise, T2 is also partitioned, with partitioning key on columns (C2 and C3).

3.1 Using these predicates to join T1 and T2:

P1:  T1.C1=T2.C2
P2:  T1.C2=T2.C3

both predicates cover all the partitioning key columns.

Case 3.2 If a third predicate, P3, is added to the predicates in example 3.1, as T1.C3=T2.C1, then the three predicates together still cover all the partitioning key columns in both tables. However, if we take only P1 and P3, the partitioning key columns are not covered by all three predicates.

Case 3.3 If a third predicate, P3, is added to the predicates in example 3.1, as T1.C3=T2.C3, then the three predicates together still cover all the partitioning key columns in both tables. However, if we take only P1 and P3, the partitioning key columns of T2 are covered, but not those of T1, so the condition doesn't hold.

Fully qualified unique index

If neither table is denoted as the parent of the join based on the above conditions, the optimizer further checks for a fully qualified unique index. If the join predicates fully qualify a unique index, the optimizer uses the FULLKEYCARD statistics of the unique index to detect and account for the statistical correlation between the equality join predicates.

Neutral Joins

The optimizer identifies a join between two tables as neutral if it does not identify a parent in any of the equality join predicates between the two tables. A parent is not identified in an equality join predicate if the COLCARD of both columns are the same and they both cover the same range of values (HIGH2KEY and LOW2KEY statistics are equivalent).

In DB2 Version 8, in FixPaks 13 and earlier, the optimizer does not account for the statistical correlation between a set of equality join predicates for a join of two tables if it is a neutral NPK join. In Version 8 FixPak 14 and DB2 9, the optimizer's statistical correlation detection is extended to consider accounting for the statistical correlation of neutral NPK joins as well.

A self-join, where both tables reference the same base table, is a special case of a neutral join. The optimizer accounts for the statistical correlation for this specific case starting in Version 8.2.

Example B.5: Collect column group statistics on the parent table

This example focuses on a join between tables ORG and STAFF. First, statistics need to be collected on the tables. For now, basic statistics are collected:

RUNSTATS ON TABLE <SCHEMA_NAME>.ORG;
RUNSTATS ON TABLE <SCHEMA_NAME>.STAFF;

Consider the following query joining the ORG and STAFF tables:

SELECT STAFF.NAME, ORG.DEPTNUMB, ORG.DEPTNAME, STAFF.SALARY
FROM ORG, STAFF
WHERE ORG.MANAGER = STAFF.ID AND ORG.DEPTNUMB = STAFF.DEPT
ORDER BY ORG.DEPTNUMB;

This query returns eight records:

NAME      DEPTNUMB DEPTNAME       SALARY   
--------- -------- -------------- ---------
Molinare        10 Head Office     22959.20
Hanes           15 New England     20659.80
Sanders         20 Mid Atlantic    18357.50
Marenghi        38 South Atlantic  17506.75
Plotz           42 Great Lakes     18352.80
Fraye           51 Plains          21150.00
Lea             66 Pacific         18555.50
Quill           84 Mountain        19818.00

 8 record(s) selected.

1. Using the EXPLAIN tool, view the query execution plan:

                1 
             TBSCAN 
             (   2) 
             33.2225 
                2 
               |
                1 
             SORT   
             (   3) 
             33.151 
                2 
               |
                1 
             HSJOIN 
             (   4) 
             33.0248 
                2 
          /-----+-----\
       35                8 
     TBSCAN           TBSCAN 
     (   5)           (   6) 
     17.2334          15.3736 
        1                1 
       |                |
       35                8 
 TABLE: SKAPOOR  TABLE: SKAPOOR 
      STAFF             ORG

2. Look at the column statistics of the columns referenced in the join predicates. The following query retrieves the interesting column statistics from the DB2 catalog tables for the columns in the ORG and STAFF tables referenced in the two join predicates:

SELECT SUBSTR(COLNAME,1,20) AS COLNAME, COLCARD,
       SUBSTR(HIGH2KEY,1,10) AS HIGH2KEY,
       SUBSTR(LOW2KEY,1,10) AS LOW2KEY
FROM SYSSTAT.COLUMNS
WHERE TABNAME IN ('ORG', 'STAFF')
AND COLNAME IN ('MANAGER', 'DEPTNUMB', 'ID', 'DEPT')
ORDER BY TABNAME, COLNAME;
Table B.5 Column statistics
COLNAMECOLCARDHIGH2KEYLOW2KEY
DEPTNUMB86615
MANAGER827030
DEPT86615
ID3534020

3. Compute the cardinality estimate.

The filter factor estimate of an equality join predicate is computed as

1/max(colcard_LHS,colcard_RHS).

where LHS is the lefthand side of the join predicate and RHS is the righthand side of the join predicate. In this case, you have the following two join predicates:

P1: org.manager=staff.id
P2: org.deptnumb=staff.dept

and the filter factors (ff) of P1 and P2 are computed as:

ff(P1) = 1 / max (8,35) = 1/35 = 0.0285714
ff(P2) = 1/max(8,8) = 1/8 = 0.125

Using the filter factor estimates and the cardinality of the two tables, the join cardinality is computed as:

JoinCard  = Cardinality(ORG) * cardinality(STAFF) * ff(P1) * ff(P2)
          = 35 * 8 * 0.0285714 * 0.125
          = 1

4. Identify the parent in the join.

In Table B.5, the column statistics show that STAFF is the parent in the join because the following conditions are satisfied:

For the predicate org.manager=staff.id:

colcard(MANAGER)  < colcard(ID)
high2key(MANAGER) < high2key(ID)
low2key(MANAGER) > low2key(ID)

Therefore, the parent of this predicate is STAFF.

For the predicate org.deptnumb=staff.dept, the colcard, high2key and low2key statistics are equivalent. Therefore, a parent does not exist for this predicate; it is deemed "neutral".

5. Collect column group statistics on the parent table.

In Step 4, STAFF is identified as the parent, and ORG the child of the join, so a column group statistic should be collected on the columns (ID, DEPT) in the STAFF table:

RUNSTATS ON TABLE <SCHEMA_NAME>.STAFF ON ALL COLUMNS AND COLUMNS ((ID,DEPT));

With the column group statistic on the two columns, the DB2 optimizer correctly estimates the cardinality:

               8  
             TBSCAN 
             (   2) 
             33.5658 
               2 
               |
               8
              SORT   
             (   3) 
             33.4243 
               2 
               |
               8
             HSJOIN 
             (   4) 
             33.0363 
                2 
          /-----+-----\
       35                8 
     TBSCAN           TBSCAN 
     (   5)           (   6) 
     17.2334          15.3736 
        1                1 
       |                |
       35                8 
 TABLE: SKAPOOR  TABLE: SKAPOOR 
      STAFF             ORG

Example B.6 Neutral Join

Consider the join of two tables, T1 and T2, on the predicates:

P1: T1.C1 = T2.C1
P2: T1.C2 = T2.C2
P3: T1.C3 = T2.C3

Assume the tables have the following statistics:

Table B.6 Neutral join statistics
TABLENAMECOLNAMECOLCARDHIGH2KEYLOW2KEYTABLENAMECOLNAMECOLCARDHIGH2KEYLOW2KEY
T1C1100992T2C1100992
T1C21019991900T2C21019991900
T1C35'Y''B'T2C35'Y''B'

All predicates are denoted as neutral since the stats are equivalent for both tables. As a result, in V8 FixPak 14 and DB2 9, you should collect column group statistics on columns (C1,C2, and C3) on either table (not required on both) to allow the optimizer to detect the statistical correlation between the join predicates.

Example B.7

Using the same predicates as in example B.6, assume the tables have the following statistics:

Table B.7 Determine the parent table
TABLENAMECOLNAMECOLCARDHIGH2KEYLOW2KEYTABLENAMECOLNAMECOLCARDHIGH2KEYLOW2KEY
T1C1100992T2C110992
T1C21019991900T2C21019991900
T1C35'Y''B'T2C35'Y''B'

In this scenario, predicates P2 and P3 are denoted as neutral since the statistics for columns C2 and C3 are equivalent for both tables. However, the statistics for C1 indicate that T1 is the parent of P1 since the COLCARD of T1.C1 is greater than the COLCARD of T2.C1, and both columns have equivalent HIGH2KEY and LOW2KEY statistics. As a result, you should collect column group statistics on columns (C1, C2, and C3) on T1 to allow the optimizer to detect the statistical correlation between the join predicates.

Example B.8

Using the same predicates as in example B.6, assume the tables have the following statistics:

Table B.8 Statistics
TABLENAMECOLNAMECOLCARDHIGH2KEYLOW2KEYTABLENAMECOLNAMECOLCARDHIGH2KEYLOW2KEY
T1C1100992T2C1109992
T1C21019991900T2C21019991900
T1C35'Y''B'T2C35'Y''B'

In this scenario, predicates P2 and P3 are denoted as neutral since the statistics for columns C2 and C3 are equivalent for both tables. However, the statistics for C1 indicate that neither T1 nor T2 is the parent, nor is it considered neutral because T1.C1 has a higher COLCARD but T2.C1 has a higher HIGH2KEY statistic. As a result, you should collect column group statistics on columns (C2 and C3) on either T1 or T2 to allow the optimizer to detect the statistical correlation between the join predicates P2 and P3.

Example B.9 Using DPF and range statistics

Consider the join of two tables, T1 and T2, both partitioned across multiple database partitions. They are collocated, statistics are gathered on the same node, the partitioning key of T1 and T2 is on columns (C1 and C2), and the join is on the predicates:

P1: T1.C1 = T2.C1
P2: T1.C2 = T2.C2
P3: T1.C3 = T2.C3

Assume the tables have the following statistics:

Table B.9
TABLENAMECOLNAMECOLCARDHIGH2KEYLOW2KEYTABLENAMECOLNAMECOLCARDHIGH2KEYLOW2KEY
T1C1100992T2C1109992
T1C21019991900T2C21019991900
T1C35'Y''B'T2C35'Y''B'

This join qualifies the DPF conditions, so the optimizer can use range statistics to identify the parent side. In this scenario, predicates P2 and P3 are denoted as neutral since the statistics for columns C2 and C3 are equivalent for both tables. However, the statistics for C1 indicate that neither T1 nor T2 is the parent, nor is it considered neutral because T1.C1 has a higher COLCARD but T2.C1 has a higher HIGH2KEY statistic. As a result, you should collect column group statistics on columns (C2 and C3) on either T1 or T2 to allow the optimizer to detect the statistical correlation between the join predicates P2 and P3.

Example B.10 DPF and range statistics cannot be used

Consider the same scenario as in B.9, but the partitioning key of T1 is on columns (C1 and C2), whereas the partitioning key T2 is on columns (C2 and C1). This join fails the first DPF condition since the tables are not considered to be partitioned on the same columns; the join predicates T1.C1=T2.C1 references the first column in T1's partitioning key but the second column in T2's partitioning key. As a result, the optimizer cannot use the range statistics to identify the parent side, and only the COLCARD statistics are used to determine the parent. Based on the COLCARD statistics alone, the join is considered neutral.

Example B.10.1 DPF and range statistics cannot be used

Consider the same scenario as in B.10, but with the following statistics:

Table B.10
TABLENAMECOLNAMECOLCARDHIGH2KEYLOW2KEYTABLENAMECOLNAMECOLCARDHIGH2KEYLOW2KEY
T1C1100992T2C1989992
T1C21019991900T2C21119991900
T1C35'Y''B'T2C35'Y''B'

Based on the COLCARD statistics, T1 is the parent in predicate P1 but T2 is the parent in predicate P2. Since the COLCARD statistics are close, the multi-column statistic may fall within the 1% tolerance that the optimizer considers in a DPF environment. It can use the naive approach here, and should collect column group statistics on both tables on all three columns. After collecting the column group statistics, if the optimizer does not compute a different cardinality estimate, then the 1% tolerance was not achieved.


Appendix C. Further details concerning local predicates

The multi-column statistics (index and column group) only provide information on the number of distinct groupings for a set of columns. As a result, the optimizer assumes the data is distributed uniformly when accounting for statistical correlation. However, local equality predicate selectivity estimates are computed using distribution statistics (if available) to account for a skew in the data. To avoid overestimating the combined selectivity estimate of local equality predicates by assuming uniformity when there is skew, the statistical correlation detection in the optimizer was further improved in FixPak 10.

Resources

Learn

Get products and technologies

  • Download a free trial version of DB2 Enterprise 9.
  • Now you can use DB2 for free. Download DB2 Express-C, a no-charge version of DB2 Express Edition for the community that offers the same core data features as DB2 Express Edition and provides a solid base from which to build and deploy applications.
  • Build your next development project with IBM trial software, available for download directly from developerWorks.

Discuss

Comments

developerWorks: Sign in

Required fields are indicated with an asterisk (*).


Need an IBM ID?
Forgot your IBM ID?


Forgot your password?
Change your password

By clicking Submit, you agree to the developerWorks terms of use.

 


The first time you sign into developerWorks, a profile is created for you. Information in your profile (your name, country/region, and company name) is displayed to the public and will accompany any content you post, unless you opt to hide your company name. You may update your IBM account at any time.

All information submitted is secure.

Choose your display name



The first time you sign in to developerWorks, a profile is created for you, so you need to choose a display name. Your display name accompanies the content you post on developerWorks.

Please choose a display name between 3-31 characters. Your display name must be unique in the developerWorks community and should not be your email address for privacy reasons.

Required fields are indicated with an asterisk (*).

(Must be between 3 – 31 characters.)

By clicking Submit, you agree to the developerWorks terms of use.

 


All information submitted is secure.

Dig deeper into Information management on developerWorks


static.content.url=http://www.ibm.com/developerworks/js/artrating/
SITE_ID=1
Zone=Information Management
ArticleID=185029
ArticleTitle=Understand column group statistics in DB2
publish-date=12212006