Distribution statistics
You can collect two kinds of data distribution statistics: frequent-value statistics and quantile statistics.
- Frequent-value statistics provide information about a column and the data value with the highest number of duplicates, the value with the second highest number of duplicates, and so on, to the level that is specified by the value of the num_freqvalues database configuration parameter. To disable the collection of frequent-value statistics, set num_freqvalues to 0. You can also use the NUM_FREQVALUES clause on the RUNSTATS command for a specific table, statistical view, or column.
- Quantile statistics provide information about how
data values are distributed in relation to other values. Called K-quantiles,
these statistics represent the value V at or less
than which at least K values lie. You can compute
a K-quantile by sorting the values in ascending
order. The K-quantile value is the value in the Kth
position from the low end of the range.
To specify the number of
sections
(quantiles) into which the column data values should be grouped, set the num_quantiles database configuration parameter to a value between 2 and 32 767. The default value is 20, which ensures a maximum optimizer estimation error of plus or minus 2.5% for any equality, less-than, or greater-than predicate, and a maximum error of plus or minus 5% for any BETWEEN predicate. To disable the collection of quantile statistics, set num_quantiles to 0 or 1.You can set num_quantiles for a specific table, statistical view, or column.
When to collect distribution statistics
- Whether the queries in an application use host variables.
Distribution statistics are most useful for dynamic and static queries that do not use host variables. The optimizer makes limited use of distribution statistics when assessing queries that contain host variables.
- Whether the data in columns is uniformly distributed.Create distribution statistics if at least one column in the table has a highly
nonuniform
distribution of data, and the column appears frequently in equality or range predicates; that is, in clauses such as the following:where c1 = key; where c1 in (key1, key2, key3); where (c1 = key1) or (c1 = key2) or (c1 = key3); where c1 <= key; where c1 between key1 and key2;
- Data might be highly clustered instead of being evenly spread
out between the highest and lowest data value. Consider the following
column, in which the data is clustered in the range (5,10):
Quantile statistics help the optimizer to deal with this kind of data distribution.0.0 5.1 6.3 7.1 8.2 8.4 8.5 9.1 93.6 100.0
Queries can help you to determine whether column data is not uniformly distributed. For example:select c1, count(*) as occurrences from t1 group by c1 order by occurrences desc
- Duplicate data values might often occur. Consider a column in
which the data is distributed with the following frequencies:
Both frequent-value and quantile statistics help the optimizer to deal with numerous duplicate values.
Table 1. Frequency of data values in a column Data Value Frequency 20 5 30 10 40 10 50 25 60 25 70 20 80 5
When to collect index statistics only
- A new index was created since the RUNSTATS utility was run, and you do not want to collect statistics again on the table data.
- There were many changes to the data that affect the first column
of an index.Note: Quantile statistics are not collected for the first column of an index with random ordering.
What level of statistical precision to specify
Use the num_quantiles and num_freqvalues database configuration parameters to specify the precision with which distribution statistics are stored. You can also specify the precision with corresponding RUNSTATS command options when you collect statistics for a table or for columns. The higher you set these values, the greater the precision that the RUNSTATS utility uses when it creates and updates distribution statistics. However, greater precision requires more resources, both during the RUNSTATS operation itself, and for storing more data in the catalog tables.
For most databases, specify between 10 and 100 as the value of the num_freqvalues database configuration parameter. Ideally, frequent-value statistics should be created in such a way that the frequencies of the remaining values are either approximately equal to one another or negligible when compared to the frequencies of the most frequent values. The database manager might collect fewer than this number, because these statistics are collected only for data values that occur more than once. If you need to collect only quantile statistics, set the value of num_freqvalues to zero.
- First determine the maximum acceptable error when estimating the number of rows for any range query, as a percentage P.
- The number of quantiles should be approximately 100/P for BETWEEN predicates, and 50/P for any other type of range predicate (<, <=, >, or >=).
For example, 25 quantiles should result in a maximum estimate
error of 4% for BETWEEN predicates and 2% for >
predicates.
In general, specify at least 10 quantiles. More than 50 quantiles
should be necessary only for extremely nonuniform data. If you need
only frequent-value statistics, set num_quantiles to 0.
If you set this parameter to 1, because the entire
range of values fits within one quantile, no quantile statistics are
collected.