Extended examples of the use of distribution statistics
Distribution statistics provide information about the frequency and distribution of table data that helps the optimizer build query access plans when the data is not evenly distributed and there are many duplicates.
The following examples will help you to understand how the optimizer might use distribution statistics.
Example with frequent-value statistics
C1
= KEY
. If frequent-value statistics are available, the optimizer
can use those statistics to choose an appropriate access plan, as
follows:- If
KEY
is one of the N most frequent values, the optimizer uses the frequency ofKEY
that is stored in the catalog. - If
KEY
is not one of the N most frequent values, the optimizer estimates the number of rows that satisfy the predicate under the assumption that the (COLCARD - N) non-frequent values have a uniform distribution. That is, the number of rows is estimated by the following formula (1):CARD - NUM_FREQ_ROWS -------------------- COLCARD - N
Data Value | Frequency |
---|---|
1 | 2 |
2 | 3 |
3 | 40 |
4 | 4 |
5 | 1 |
The number of rows in the table is 50 and the column
cardinality is 5. Exactly 40 rows satisfy the predicate C1
= 3
. If it is assumed that the data is evenly distributed,
the optimizer estimates the number of rows that satisfy the predicate
as 50/5 = 10, with an error of -75%. But if frequent-value statistics
based on only the most frequent value (that is, N =
1) are available, the number of rows is estimated as 40, with no error.
C1
= 1
. Without frequent-value statistics, the number of rows
that satisfy the predicate is estimated as 10, an error of 400%: estimated rows - actual rows
---------------------------- X 100
actual rows
10 - 2
------ X 100 = 400%
2
Using frequent-value statistics (N =
1), the optimizer estimates the number of rows containing this value
using the formula (1) given previously as: (50 - 40)
--------- = 3
(5 - 1)
and the error is reduced by an order of magnitude: 3 - 2
----- = 50%
2
Example with quantile statistics
The following
discussion of quantile statistics uses the term K-quantile
.
The K-quantile for a column is the smallest data value, V,
such that at least K rows have data values that
are less than or equal to V. To compute a K-quantile,
sort the column values in ascending order; the K-quantile
is the data value in the Kth row of the sorted
column.
0.0
5.1
6.3
7.1
8.2
8.4
8.5
9.1
93.6
100.0
K | K-quantile |
---|---|
1 | 0.0 |
4 | 7.1 |
7 | 8.5 |
10 | 100.0 |
- Exactly seven rows satisfy the predicate
C <= 8.5
. Assuming a uniform data distribution, the following formula (2):
with LOW2KEY in place of KEY1, estimates the number of rows that satisfy the predicate as:KEY2 - KEY1 ------------------ X CARD HIGH2KEY - LOW2KEY
where ≈ means8.5 - 5.1 ---------- X 10 ≈ 0 93.6 - 5.1
approximately equal to
. The error in this estimate is approximately -100%.If quantile statistics are available, the optimizer estimates the number of rows that satisfy this predicate by the value of K that corresponds to 8.5 (the highest value in one of the quantiles), which is 7. In this case, the error is reduced to 0.
- Exactly eight rows satisfy the predicate
C <= 10
. If the optimizer assumes a uniform data distribution and uses formula (2), the number of rows that satisfy the predicate is estimated as 1, an error of -87.5%.Unlike the previous example, the value 10 is not one of the stored K-quantiles. However, the optimizer can use quantiles to estimate the number of rows that satisfy the predicate asr_1 + r_2
, wherer_1
is the number of rows satisfying the predicateC <= 8.5
andr_2
is the number of rows satisfying the predicateC > 8.5 AND C <= 10
. As in the previous example,r_1 = 7
. To estimater_2
, the optimizer uses linear interpolation:
The final estimate is r_1 + r_2 ≈ 7, and the error is only -12.5%.10 - 8.5 r_2 ≈ ---------- X (number of rows with value > 8.5 and <= 100.0) 100 - 8.5 10 - 8.5 r_2 ≈ ---------- X (10 - 7) 100 - 8.5 1.5 r_2 ≈ ----- X (3) 91.5 r_2 ≈ 0
Quantiles improve the accuracy of the estimates in these examples
because the real data values are clustered
in a range from
5 to 10, but the standard estimation formulas assume that the data
values are distributed evenly between 0 and 100.
The use of quantiles also improves accuracy when there are significant differences in the frequencies of different data values. Consider a column having data values with the following frequencies:
Data Value | Frequency |
---|---|
20 | 5 |
30 | 5 |
40 | 15 |
50 | 50 |
60 | 15 |
70 | 5 |
80 | 5 |
K | K-quantile |
---|---|
5 | 20 |
25 | 40 |
75 | 50 |
95 | 70 |
100 | 80 |
Suppose also that frequent-value statistics are available, based on the three most frequent values.
C BETWEEN 20 AND 30
. Assuming a uniform
data distribution and using formula (2), the number of rows that satisfy
the predicate is estimated as: 30 - 20
------- X 100 = 25
70 - 30
an error of 150%.r_1 + r_2
, where r_1
is
the number of rows that satisfy the predicate (C = 20)
and r_2
is
the number of rows that satisfy the predicate C > 20 AND
C <= 30
. Using formula (1), r_1
is estimated
as: 100 - 80
-------- = 5
7 - 3
r_2
is
estimated as: 30 - 20
------- X (number of rows with a value > 20 and <= 40)
40 - 20
30 - 20
= ------- X (25 - 5)
40 - 20
= 10
This yields a final estimate of 15 and reduces
the error by a factor of three.