# Extended examples of the use of distribution statistics

Distribution statistics provide information about the frequency and distribution of table data that helps the optimizer build query access plans when the data is not evenly distributed and there are many duplicates.

The following examples will help you to understand how the optimizer might use distribution statistics.

## Example with frequent-value statistics

```
C1
= KEY
```

. If frequent-value statistics are available, the optimizer
can use those statistics to choose an appropriate access plan, as
follows:- If
`KEY`

is one of the`N`most frequent values, the optimizer uses the frequency of`KEY`

that is stored in the catalog. - If
`KEY`

is not one of the`N`most frequent values, the optimizer estimates the number of rows that satisfy the predicate under the assumption that the (COLCARD -`N`) non-frequent values have a uniform distribution. That is, the number of rows is estimated by the following formula (1):`CARD - NUM_FREQ_ROWS -------------------- COLCARD -`

`N`

`N`most frequent values.

Data Value | Frequency |
---|---|

1 | 2 |

2 | 3 |

3 | 40 |

4 | 4 |

5 | 1 |

The number of rows in the table is 50 and the column
cardinality is 5. Exactly 40 rows satisfy the predicate ```
C1
= 3
```

. If it is assumed that the data is evenly distributed,
the optimizer estimates the number of rows that satisfy the predicate
as 50/5 = 10, with an error of -75%. But if frequent-value statistics
based on only the most frequent value (that is, `N` =
1) are available, the number of rows is estimated as 40, with no error.

```
C1
= 1
```

. Without frequent-value statistics, the number of rows
that satisfy the predicate is estimated as 10, an error of 400%:```
estimated rows - actual rows
---------------------------- X 100
actual rows
10 - 2
------ X 100 = 400%
2
```

Using frequent-value statistics (`N`= 1), the optimizer estimates the number of rows containing this value using the formula (1) given previously as:

```
(50 - 40)
--------- = 3
(5 - 1)
```

and the error is reduced by an order of magnitude:```
3 - 2
----- = 50%
2
```

## Example with quantile statistics

The following
discussion of quantile statistics uses the term K-quantile

.
The K-quantile for a column is the smallest data value, `V`,
such that at least `K` rows have data values that
are less than or equal to `V`. To compute a K-quantile,
sort the column values in ascending order; the `K`-quantile
is the data value in the `K`th row of the sorted
column.

```
0.0
5.1
6.3
7.1
8.2
8.4
8.5
9.1
93.6
100.0
```

`K`-quantiles are available for

`K`= 1, 4, 7, and 10, as follows:

K |
K-quantile |
---|---|

1 | 0.0 |

4 | 7.1 |

7 | 8.5 |

10 | 100.0 |

- Exactly seven rows satisfy the predicate
`C <= 8.5`

. Assuming a uniform data distribution, the following formula (2):

with LOW2KEY in place of KEY1, estimates the number of rows that satisfy the predicate as:`KEY2 - KEY1 ------------------ X CARD HIGH2KEY - LOW2KEY`

where ≈ means`8.5 - 5.1 ---------- X 10 ≈ 0 93.6 - 5.1`

approximately equal to

. The error in this estimate is approximately -100%.If quantile statistics are available, the optimizer estimates the number of rows that satisfy this predicate by the value of

`K`that corresponds to 8.5 (the highest value in one of the quantiles), which is 7. In this case, the error is reduced to 0. - Exactly eight rows satisfy the predicate
`C <= 10`

. If the optimizer assumes a uniform data distribution and uses formula (2), the number of rows that satisfy the predicate is estimated as 1, an error of -87.5%.Unlike the previous example, the value 10 is not one of the stored`K`-quantiles. However, the optimizer can use quantiles to estimate the number of rows that satisfy the predicate as`r_1 + r_2`

, where`r_1`

is the number of rows satisfying the predicate`C <= 8.5`

and`r_2`

is the number of rows satisfying the predicate`C > 8.5 AND C <= 10`

. As in the previous example,`r_1 = 7`

. To estimate`r_2`

, the optimizer uses linear interpolation:

The final estimate is r_1 + r_2 ≈ 7, and the error is only -12.5%.`10 - 8.5 r_2 ≈ ---------- X (number of rows with value > 8.5 and <= 100.0) 100 - 8.5 10 - 8.5 r_2 ≈ ---------- X (10 - 7) 100 - 8.5 1.5 r_2 ≈ ----- X (3) 91.5 r_2 ≈ 0`

Quantiles improve the accuracy of the estimates in these examples
because the real data values are clustered

in a range from
5 to 10, but the standard estimation formulas assume that the data
values are distributed evenly between 0 and 100.

The use of quantiles also improves accuracy when there are significant differences in the frequencies of different data values. Consider a column having data values with the following frequencies:

Data Value | Frequency |
---|---|

20 | 5 |

30 | 5 |

40 | 15 |

50 | 50 |

60 | 15 |

70 | 5 |

80 | 5 |

`K`-quantiles are available for

`K`= 5, 25, 75, 95, and 100:

K |
K-quantile |
---|---|

5 | 20 |

25 | 40 |

75 | 50 |

95 | 70 |

100 | 80 |

Suppose also that frequent-value statistics are available, based on the three most frequent values.

`C BETWEEN 20 AND 30`

. Assuming a uniform
data distribution and using formula (2), the number of rows that satisfy
the predicate is estimated as:```
30 - 20
------- X 100 = 25
70 - 30
```

an error of 150%.`r_1 + r_2`

, where `r_1`

is
the number of rows that satisfy the predicate `(C = 20)`

and `r_2`

is
the number of rows that satisfy the predicate ```
C > 20 AND
C <= 30
```

. Using formula (1), `r_1`

is estimated
as:```
100 - 80
-------- = 5
7 - 3
```

`r_2`

is
estimated as:```
30 - 20
------- X (number of rows with a value > 20 and <= 40)
40 - 20
30 - 20
= ------- X (25 - 5)
40 - 20
= 10
```

This yields a final estimate of 15 and reduces
the error by a factor of three.