Distribution statistics

You can collect two kinds of data distribution statistics: frequent-value statistics and quantile statistics.

  • Frequent-value statistics provide information about a column and the data value with the highest number of duplicates, the value with the second highest number of duplicates, and so on, to the level that is specified by the value of the num_freqvalues database configuration parameter. To disable the collection of frequent-value statistics, set num_freqvalues to 0. You can also use the NUM_FREQVALUES clause on the RUNSTATS command for a specific table, statistical view, or column.
  • Quantile statistics provide information about how data values are distributed in relation to other values. Called K-quantiles, these statistics represent the value V at or less than which at least K values lie. You can compute a K-quantile by sorting the values in ascending order. The K-quantile value is the value in the Kth position from the low end of the range.

    To specify the number of sections (quantiles) into which the column data values should be grouped, set the num_quantiles database configuration parameter to a value between 2 and 32 767. The default value is 20, which ensures a maximum optimizer estimation error of plus or minus 2.5% for any equality, less-than, or greater-than predicate, and a maximum error of plus or minus 5% for any BETWEEN predicate. To disable the collection of quantile statistics, set num_quantiles to 0 or 1.

    You can set num_quantiles for a specific table, statistical view, or column.

Note: The RUNSTATS utility consumes more processing resources and memory (specified by the stat_heap_sz database configuration parameter) if larger num_freqvalues and num_quantiles values are used.

When to collect distribution statistics

To decide whether distribution statistics for a table or statistical view would be helpful, first determine:
  • Whether the queries in an application use host variables.

    Distribution statistics are most useful for dynamic and static queries that do not use host variables. The optimizer makes limited use of distribution statistics when assessing queries that contain host variables.

  • Whether the data in columns is uniformly distributed.
    Create distribution statistics if at least one column in the table has a highly nonuniform distribution of data, and the column appears frequently in equality or range predicates; that is, in clauses such as the following:
       where c1 = key;
       where c1 in (key1, key2, key3);
       where (c1 = key1) or (c1 = key2) or (c1 = key3);
       where c1 <= key;
       where c1 between key1 and key2;
Two types of nonuniform data distribution can occur, and possibly together.
  • Data might be highly clustered instead of being evenly spread out between the highest and lowest data value. Consider the following column, in which the data is clustered in the range (5,10):
    Quantile statistics help the optimizer to deal with this kind of data distribution.
    Queries can help you to determine whether column data is not uniformly distributed. For example:
       select c1, count(*) as occurrences
         from t1
       group by c1
       order by occurrences desc
  • Duplicate data values might often occur. Consider a column in which the data is distributed with the following frequencies:
    Table 1. Frequency of data values in a column
    Data Value Frequency
    20 5
    30 10
    40 10
    50 25
    60 25
    70 20
    80 5
    Both frequent-value and quantile statistics help the optimizer to deal with numerous duplicate values.

When to collect index statistics only

You might consider collecting statistics that are based only on index data in the following situations:
  • A new index was created since the RUNSTATS utility was run, and you do not want to collect statistics again on the table data.
  • There were many changes to the data that affect the first column of an index.
    Note: Quantile statistics are not collected for the first column of an index with random ordering.

What level of statistical precision to specify

Use the num_quantiles and num_freqvalues database configuration parameters to specify the precision with which distribution statistics are stored. You can also specify the precision with corresponding RUNSTATS command options when you collect statistics for a table or for columns. The higher you set these values, the greater the precision that the RUNSTATS utility uses when it creates and updates distribution statistics. However, greater precision requires more resources, both during the RUNSTATS operation itself, and for storing more data in the catalog tables.

For most databases, specify between 10 and 100 as the value of the num_freqvalues database configuration parameter. Ideally, frequent-value statistics should be created in such a way that the frequencies of the remaining values are either approximately equal to one another or negligible when compared to the frequencies of the most frequent values. The database manager might collect fewer than this number, because these statistics are collected only for data values that occur more than once. If you need to collect only quantile statistics, set the value of num_freqvalues to zero.

To specify the number of quantiles, set the num_quantiles database configuration parameter to a value between 20 and 50.
  • First determine the maximum acceptable error when estimating the number of rows for any range query, as a percentage P.
  • The number of quantiles should be approximately 100/P for BETWEEN predicates, and 50/P for any other type of range predicate (<, <=, >, or >=).

For example, 25 quantiles should result in a maximum estimate error of 4% for BETWEEN predicates and 2% for > predicates. In general, specify at least 10 quantiles. More than 50 quantiles should be necessary only for extremely nonuniform data. If you need only frequent-value statistics, set num_quantiles to 0. If you set this parameter to 1, because the entire range of values fits within one quantile, no quantile statistics are collected.