# Optimizer use of distribution statistics

The optimizer uses distribution statistics for better estimates of the cost of different query access plans.

Unless it has additional information about the distribution of values between the low and high values, the optimizer assumes that data values are evenly distributed. If data values differ widely from each other, are clustered in some parts of the range, or contain many duplicate values, the optimizer will choose a less than optimal access plan.

```
select c1, c2
from table1
where c1 = 'NEW YORK'
and c2 <= 10
```

Assume that there is an index on both
columns C1 and C2. One possible access plan is to use the index on
C1 to retrieve all rows with `C1 = 'NEW YORK'`

, and
then to check whether `C2 <= 10`

for each retrieved
row. An alternate plan is to use the index on C2 to retrieve all rows
with `C2 <= 10`

, and then to check whether ```
C1
= 'NEW YORK'
```

for each retrieved row. Because the primary
cost of executing a query is usually the cost of retrieving the rows,
the best plan is the one that requires the fewest retrievals. Choosing
this plan means estimating the number of rows that satisfy each predicate.`C1 = KEY`

) is
estimated as CARD/COLCARD, and the number of rows that satisfy a range
predicate (`C1 BETWEEN KEY1 AND KEY2`

) can be estimated
with the following formula:```
KEY2 - KEY1
------------------ x CARD
HIGH2KEY - LOW2KEY
```

These estimates are accurate only when the true distribution of data values within a column is reasonably uniform. When distribution statistics are unavailable, and either the frequency of data values varies widely, or the data values are very unevenly distributed, the estimates can be off by orders of magnitude, and the optimizer might choose a suboptimal access plan.

When distribution statistics are available, the probability of such errors can be greatly reduced by using frequent-value statistics to estimate the number of rows that satisfy an equality predicate, and by using both frequent-value statistics and quantile statistics to estimate the number of rows that satisfy a range predicate.