Improving filter factors by collecting cardinality and frequency statistics

You can improve the filter factors that Db2 uses for access path selection by collecting cardinality and frequency statistics for columns and columns groups that are used in predicates.

Before you begin

About this task

Db2 needs an accurate estimate of the number of rows that qualify each predicate is applied to determine optimal access paths. When multiple tables are accessed, filtering also affects the join order, the join method, and the cost of the join.

The SYSIBM.SYSCOLUMNS and SYSIBM.SYSCOLDIST catalog tables are the main source of statistics for calculating predicate filter factors.

Cardinality statistics
The COLCARDF column of the SYSCOLUMNS catalog table indicates the cardinality of a column. A positive value is an estimate of the number of distinct values in the column. When no statistics are collected for the column, the value is -1, and a default filter factor might be used.

The value of the COLCARDF column that the RUNSTATS TABLESPACE utility generates is an estimate that is determined by a sampling method. If you know a more accurate value, you can supply it by updating the catalog. If the column is the first column of a non-DPSI index, the value that the RUNSTATS INDEX utility generates is exact.

Frequency or distribution statistics
Columns in the SYSCOLDIST catalog table contain frequency statistics (sometimes also called distribution statistics) for the values in a single column.

When frequency statistics do not exist, Db2 assumes that the data is uniformly distributed and all values in the column occur with the same frequency. This assumption can lead to an inaccurate estimate of the number of qualifying rows if the data is skewed, which can result in performance problems.

For example, assume that a column (AGE_CATEGORY) contains five distinct values (COLCARDF), each of which occur with the following frequencies:
AGE_CATEGORY             FREQUENCY
------------             ---------
INFANT                    5%
CHILD                    15%
ADOLESCENT               25%
ADULT                    40%
SENIOR                   15%

Without this frequency information, Db2 must use a default filter factor of 1/5 (or 1/COLCARDF), or 20%, to estimate the number of rows that qualify for predicate AGE_CATEGORY=ADULT. However, the actual frequency of that age category is 40%. Thus, the number of qualifying rows is underestimated by 50%.

Procedure

To improve predicate filter factors by collecting frequency and statistics, use any of the following approaches:

  • When you collect statistics at the table level, collect statistics only for columns or column groups that might be used as search conditions in WHERE clauses of queries.
    • Specify the COLUMN option to collect statistics for specified columns only.
    • Specify the COLGROUP option to collect statistics for specified groups of columns only. You can also specify the FREQVAL option with the COLGROUP option to collect distribution statistics for the column group.
    When you collect statistics on groups of columns that are used in predicates, the improved accuracy of the filter factor estimate can lead to improved query performance. However, collecting statistics on all columns of a table is costly and might be unnecessary.
  • Collect cardinality statistics on all columns that are used as predicates in WHERE clauses.
  • Collect frequency statistics for all columns that have low cardinality values and are used in COL op constant predicates.
  • Collect frequency statistics for columns that can contain default data if the default data is skewed and the column is used in a COL op constant predicate.
  • Collect column group statistics on all columns that are used in join predicates.
  • Collect column statistics periodically for columns that contain data with frequently changing ranges, such as date-time values.
    These types of columns can result in old values in the HIGH2KEY and LOW2KEY columns in the catalog. By periodically collecting column statistics on these changing columns, you can make the values in HIGH2KEY and LOW2KEY accurately reflect the range of data values. With these accurate values, Db2 can obtain accurate filter factors for range predicates.
  • When you collect statistics for indexes, you can specify the FREQVAL option to specify whether distribution statistics are collected, and number of concatenated index columns to collect.
    When you collect statistics about indexes, cardinality statistics (including intermediate key cardinality values) are automatically collected on the specified indexes. You can also specify the FREQVAL option to specify whether distribution statistics are collected, and number of concatenated index columns to collect. By default, distribution statistics are collected on the first column of each index for the 10 most frequently occurring values. FIRSTKEYCARDF and FULLKEYCARDF are also collected by default.
  • For DPSI type indexes, consider updating the FULLKECARDF value in the catalog if you know the accurate value.
    The FULLKEYCARDF value that is generated by Db2 utilities is an estimate that is based on a sampling method.