Histogram statistics filter factors

When histogram statistics are available, Db2 can more accurately interpolate the distribution of values across a large range.

Begin program-specific programming interface information. RUNSTATS normally collects frequency statistics for a single-column or single multi-column data set. Because catalog space and bind time performance concerns make the collection of these types of statistics on every distinct value found in the target column or columns very impractical, such frequency statistics are commonly collected only on the most frequent or least frequent, and therefore most biased, values.

However, such limited statistics often do not provide an accurate prediction of the value distribution because they require a rough interpolation across the entire range of values. For example, suppose that the YRS_OF_EXPERIENCE column on an EMPLOYEE table contains the following value frequencies:

Table 1. Example frequency statistics for values on the YRS_OF_EXPERIENCE column in an EMPLOYEE table
VALUE FREQUENCY
2 10%
25 15%
26 15%
27 7%
12 0.02%
13 0.01%
40 0.0001%
41 0.00001%

Example predicates that can benefit from histogram statistics

Some example predicates on values in this table include:

  • Equality predicate with unmatched value:
    SELECT EMPID FROM EMPLOYEE T 
    WHERE T.YRS_OF_EXPERIENCE = 6;
  • Range predicate:
    SELECT T.EMPID FROM EMPLOYEE T 
    WHERE T.YRS_OF_EXPERIENCE BETWEEN 5 AND 10;
  • Non-local predicate:
    SELECT T1.EMPID FROM EMPLOYEE T1, OPENJOBS T2 
    WHERE T1.SPECIALTY = T2.AREA AND T1.YRS_OF_EXPERIENCE > T2.YRS_OF_EXPERIENCE;

For each of the above predicates, distribution statistics for any single value cannot help Db2 to estimate predicate selectivity, other than by uniform interpolation of filter factors over the uncollected part of the value range. The result of such interpolation might lead to inaccurate estimation and undesirable access path selection.

How Db2 uses histogram statistics

Db2 creates a number of intervals such that each interval contains approximately the same percentage of rows from the data set. The number of intervals is specified by the value of NUMQUANTILES when you use the HISTOGRAM option of RUNSTATS. Each interval has an identifier value QUANTILENO, and values, the LOWVALUE and HIGHVALUE columns, that bound the interval. Db2 collects distribution statistics for each interval.

When you use RUNSTATS to collect statistics on a column that contains such wide-ranging frequency values, specify the HISTORGRAM option to collect more granular distribution statistics that account for the distribution of values across the entire range of values. The following table shows the result of collecting histogram statistics for the years of experience values in the employee table. In this example, the statistics have been collected with 7 intervals:

Table 2. Histogram statistics for the column YRS_OF_EXPERIENCE in an EMPLOYEE table.
QUANTILENO LOWVALUE HIGHVALUE CARDF FREQUENCYF
1 0 3 4 14%
2 4 15 8 14%
3 18 24 7 12%
4 25 25 1 15%
5 26 26 1 15%
6 27 30 4 16%
7 35 40 6 14%

End program-specific programming interface information.