Histogram statistics

Histogram statistics enable Db2 to improve access path selection by estimating predicate selectivity from value-distribution statistics that are collected over the entire range of values in a data set.

Restriction: RUNSTATS cannot collect histogram statistics on randomized key columns.

Db2 chooses the best access path for a query based on predicate selectivity estimation, which in turn relies heavily on data distribution statistics. Histogram statistics summarize data distribution on an interval scale by dividing the entire range of possible values within a data set into a number of intervals.

The following values define the intervals:

QUANTILENO
An ordinary sequence number that identifies the interval.
HIGHVALUE
The value that serves as the upper bound for the interval.
LOWVALUE
A value that serves as the lower bound for the interval.

Db2 creates histogram statistics in the following manner:

  • The first interval contains the lowest value for a column or column group.
  • The last interval contains the highest value for a column or column group.
  • If the column is nullable and contains NULL values, the NULL value and second-highest value are each placed in separate intervals.
  • The remaining intervals contain approximately equal-depth histogram statistics, which means that the whole range of values is divided into intervals that each contain about the same percentage of the total number rows.

Histogram statistics intervals

Histogram statistics intervals have the following characteristics:

  • A highly frequent single value might occupy an interval by itself.
  • A single value is never broken into more than one interval, meaning that the maximum number of intervals is equal to the number of distinct values on the column. The maximum number of intervals cannot exceed 100, which is the maximum number that Db2 supports.
  • Adjacent intervals sometime skip values that do not appear in the table, especially when doing so avoids a large range of skipped values within an interval. For example, if the value 30 above has 1% frequency, placing it in the seventh interval would balance the percentage of rows in the 6th and 7th intervals. However, doing so would introduce a large skipped range to the seventh interval.
  • HIGHVALUE and LOWVALUE can be inclusive or exclusive, but an interval generally represents a non-overlapped value range.
  • NULL values, if any exist, occupy a single interval.
  • Because Db2 cannot break any single value into two different intervals, the maximum number of intervals is limited to the number of distinct values in the column, and cannot exceed the supported maximum of 100 intervals.