DB2 Version 9.7 for Linux, UNIX, and Windows

Distribution statistics

You can collect two kinds of data distribution statistics: frequent-value statistics and quantile statistics.

Note: The runstats utility consumes more processing resources and memory (specified by the stat_heap_sz database configuration parameter) if larger num_freqvalues and num_quantiles values are used.

When to collect distribution statistics

To decide whether distribution statistics for a table or statistical view would be helpful, first determine:
  • Whether the queries in an application use host variables.

    Distribution statistics are most useful for dynamic and static queries that do not use host variables. The optimizer makes limited use of distribution statistics when assessing queries that contain host variables.

  • Whether the data in columns is uniformly distributed.
    Create distribution statistics if at least one column in the table has a highly "nonuniform" distribution of data, and the column appears frequently in equality or range predicates; that is, in clauses such as the following:
       where c1 = key;
       where c1 in (key1, key2, key3);
       where (c1 = key1) or (c1 = key2) or (c1 = key3);
       where c1 <= key;
       where c1 between key1 and key2;
Two types of nonuniform data distribution can occur, and possibly together.
  • Data might be highly clustered instead of being evenly spread out between the highest and lowest data value. Consider the following column, in which the data is clustered in the range (5,10):
      0.0
      5.1
      6.3
      7.1
      8.2
      8.4
      8.5
      9.1
     93.6
    100.0
    Quantile statistics help the optimizer to deal with this kind of data distribution.
    Queries can help you to determine whether column data is not uniformly distributed. For example:
       select c1, count(*) as occurrences
         from t1
       group by c1
       order by occurrences desc
  • Duplicate data values might often occur. Consider a column in which the data is distributed with the following frequencies:
    Table 1. Frequency of data values in a column
    Data Value Frequency
    20 5
    30 10
    40 10
    50 25
    60 25
    70 20
    80 5
    Both frequent-value and quantile statistics help the optimizer to deal with numerous duplicate values.

When to collect index statistics only

You might consider collecting statistics that are based only on index data in the following situations:
  • A new index has been created since the runstats utility was run, and you do not want to collect statistics again on the table data.
  • There have been many changes to the data that affect the first column of an index.

What level of statistical precision to specify

Use the num_quantiles and num_freqvalues database configuration parameters to specify the precision with which distribution statistics are stored. You can also specify the precision with corresponding RUNSTATS command options when you collect statistics for a table or for columns. The higher you set these values, the greater the precision that the runstats utility uses when it creates and updates distribution statistics. However, greater precision requires more resources, both during the runstats operation itself, and for storing more data in the catalog tables.

For most databases, specify between 10 and 100 as the value of the num_freqvalues database configuration parameter. Ideally, frequent-value statistics should be created in such a way that the frequencies of the remaining values are either approximately equal to one another or negligible when compared to the frequencies of the most frequent values. The database manager might collect fewer than this number, because these statistics will only be collected for data values that occur more than once. If you need to collect only quantile statistics, set the value of num_freqvalues to zero.

To specify the number of quantiles, set the num_quantiles database configuration parameter to a value between 20 and 50.
  • First determine the maximum acceptable error when estimating the number of rows for any range query, as a percentage P.
  • The number of quantiles should be approximately 100/P for BETWEEN predicates, and 50/P for any other type of range predicate (<, <=, >, or >=).

For example, 25 quantiles should result in a maximum estimate error of 4% for BETWEEN predicates and 2% for ">" predicates. In general, specify at least 10 quantiles. More than 50 quantiles should be necessary only for extremely nonuniform data. If you need only frequent-value statistics, set num_quantiles to 0. If you set this parameter to 1, because the entire range of values fits within one quantile, no quantile statistics are collected.