Extended examples of the use of distribution statistics

Distribution statistics provide information about the frequency and distribution of table data that helps the optimizer build query access plans when the data is not evenly distributed and there are many duplicates.

The following examples will help you to understand how the optimizer might use distribution statistics.

Example with frequent-value statistics

Consider a query that contains an equality predicate of the form C1 = KEY. If frequent-value statistics are available, the optimizer can use those statistics to choose an appropriate access plan, as follows:
  • If KEY is one of the N most frequent values, the optimizer uses the frequency of KEY that is stored in the catalog.
  • If KEY is not one of the N most frequent values, the optimizer estimates the number of rows that satisfy the predicate under the assumption that the (COLCARD - N) non-frequent values have a uniform distribution. That is, the number of rows is estimated by the following formula (1):
       CARD - NUM_FREQ_ROWS
       --------------------
           COLCARD - N
where CARD is the number of rows in the table, COLCARD is the cardinality of the column, and NUM_FREQ_ROWS is the total number of rows with a value equal to one of the N most frequent values.
For example, consider a column C1 whose data values exhibit the following frequencies:
Data Value Frequency
1 2
2 3
3 40
4 4
5 1

The number of rows in the table is 50 and the column cardinality is 5. Exactly 40 rows satisfy the predicate C1 = 3. If it is assumed that the data is evenly distributed, the optimizer estimates the number of rows that satisfy the predicate as 50/5 = 10, with an error of -75%. But if frequent-value statistics based on only the most frequent value (that is, N = 1) are available, the number of rows is estimated as 40, with no error.

Consider another example in which two rows satisfy the predicate C1 = 1. Without frequent-value statistics, the number of rows that satisfy the predicate is estimated as 10, an error of 400%:
   estimated rows - actual rows
   ----------------------------  X 100
            actual rows

   10 - 2
   ------  X 100 = 400%
      2
Using frequent-value statistics (N = 1), the optimizer estimates the number of rows containing this value using the formula (1) given previously as:
   (50 - 40)
   --------- = 3
    (5 - 1)
and the error is reduced by an order of magnitude:
    3 - 2
    ----- = 50%
      2

Example with quantile statistics

The following discussion of quantile statistics uses the term K-quantile. The K-quantile for a column is the smallest data value, V, such that at least K rows have data values that are less than or equal to V. To compute a K-quantile, sort the column values in ascending order; the K-quantile is the data value in the Kth row of the sorted column.

If quantile statistics are available, the optimizer can better estimate the number of rows that satisfy a range predicate, as illustrated by the following examples. Consider a column C1 that contains the following values:
  0.0
  5.1
  6.3
  7.1
  8.2
  8.4
  8.5
  9.1
 93.6
100.0
Suppose that K-quantiles are available for K = 1, 4, 7, and 10, as follows:
K K-quantile
1 0.0
4 7.1
7 8.5
10 100.0
  • Exactly seven rows satisfy the predicate C <= 8.5. Assuming a uniform data distribution, the following formula (2):
           KEY2 - KEY1
        ------------------  X CARD
        HIGH2KEY - LOW2KEY
    with LOW2KEY in place of KEY1, estimates the number of rows that satisfy the predicate as:
        8.5 - 5.1
       ----------  X 10 ≈ 0
       93.6 - 5.1
    where ≈ means approximately equal to. The error in this estimate is approximately -100%.

    If quantile statistics are available, the optimizer estimates the number of rows that satisfy this predicate by the value of K that corresponds to 8.5 (the highest value in one of the quantiles), which is 7. In this case, the error is reduced to 0.

  • Exactly eight rows satisfy the predicate C <= 10. If the optimizer assumes a uniform data distribution and uses formula (2), the number of rows that satisfy the predicate is estimated as 1, an error of -87.5%.
    Unlike the previous example, the value 10 is not one of the stored K-quantiles. However, the optimizer can use quantiles to estimate the number of rows that satisfy the predicate as r_1 + r_2, where r_1 is the number of rows satisfying the predicate C <= 8.5 and r_2 is the number of rows satisfying the predicate C > 8.5 AND C <= 10. As in the previous example, r_1 = 7. To estimate r_2, the optimizer uses linear interpolation:
              10 - 8.5   
       r_2 ≈ ----------  X (number of rows with value > 8.5 and <= 100.0)
              100 - 8.5
    
              10 - 8.5   
       r_2 ≈ ----------  X (10 - 7)
              100 - 8.5
    
              1.5   
       r_2 ≈ -----   X (3)
              91.5
    
       r_2 ≈ 0
    The final estimate is r_1 + r_2 ≈ 7, and the error is only -12.5%.

Quantiles improve the accuracy of the estimates in these examples because the real data values are clustered in a range from 5 to 10, but the standard estimation formulas assume that the data values are distributed evenly between 0 and 100.

The use of quantiles also improves accuracy when there are significant differences in the frequencies of different data values. Consider a column having data values with the following frequencies:

Data Value Frequency
20 5
30 5
40 15
50 50
60 15
70 5
80 5
Suppose that K-quantiles are available for K = 5, 25, 75, 95, and 100:
K K-quantile
5 20
25 40
75 50
95 70
100 80

Suppose also that frequent-value statistics are available, based on the three most frequent values.

Exactly 10 rows satisfy the predicate C BETWEEN 20 AND 30. Assuming a uniform data distribution and using formula (2), the number of rows that satisfy the predicate is estimated as:
   30 - 20
   -------  X 100 = 25
   70 - 30
an error of 150%.
Using frequent-value statistics and quantile statistics, the number of rows that satisfy the predicate is estimated as r_1 + r_2, where r_1 is the number of rows that satisfy the predicate (C = 20) and r_2 is the number of rows that satisfy the predicate C > 20 AND C <= 30. Using formula (1), r_1 is estimated as:
   100 - 80
   -------- = 5
     7 - 3
Using linear interpolation, r_2 is estimated as:
     30 - 20
     -------  X (number of rows with a value > 20 and <= 40)
     40 - 20

     30 - 20
   = -------  X (25 - 5)
     40 - 20

   = 10
This yields a final estimate of 15 and reduces the error by a factor of three.