Indexes and column statistics

While performing similar functions, indexes and column statistics are different.

If you are trying to decide whether to use statistics or indexes to provide information to the statistics manager, keep in mind the following differences.

One major difference between indexes and column statistics is that indexes are permanent objects that are updated when changes to the underlying table occur. Column statistics are not updated. If your data is constantly changing, the statistics manager might need to rely on stale column statistics. However, maintaining an index after each table change might use more system resources than refreshing stale column statistics after a group of changes have occurred.

Another difference is the effect that the existence of new indexes or column statistics has on the optimizer. When new indexes become available, the optimizer considers them for implementation. If they are candidates, the optimizer reoptimizes the query and tries to find a better implementation. However, this reoptimization is not true for column statistics. When new or refreshed column statistics are available, the statistics manager interrogates immediately. Reoptimization occurs only if the answers are different from the ones that were given before these refreshed statistics. It is possible to use statistics that are refreshed without causing a reoptimization of an access plan.

Also, column statistics and EVIs can be used to recognize data skew in joins, which standard radix indexes are not able to do. So, when join columns contain certain values that occur far more frequently than other values, column statistics or EVIs may be used to help provide better estimates.

Requests for selectivity information from indexes are obtained in the form of key range estimate operations. In order to get accurate information for a given predicate, the statistics manager may need to evaluate a significant portion of an index's data. This evaluation can take a long time for large indexes as index pages are loaded into active memory and then analyzed. The optimizer must wait for each estimate operation to complete before continuing with query optimization. Each query that is optimized will generally require multiple key range estimates. As a result, the time spent obtaining these estimates can significantly impact the overall time required to optimize a query.

In order to shorten the time required to optimize a query, the statistics manager uses several different strategies when faced with the need to perform a key range estimate. One strategy is to avoid getting the estimate when other information (e.g. file size or selectivity information from column statistics) indicate that the information returned by the estimate would not be worth the time required to obtain it. A second strategy is to cache prior estimates so that subsequent requests for the same key range can be returned quickly. Each index has a cache associated with it. Estimates are stored in this cache until the cache fills or becomes stale relative to the data in the index or until an IPL occurs. A third strategy, introduced in IBM i 7.3, is to quickly return a less accurate initial estimate while submitting a background request for a more accurate full estimate. With this strategy, the statistics manager adds a timeout value to each key range estimate operation. If the operation does not complete in the time indicated by the timeout value, the operation is interrupted. The statistics manager then looks at some other fast estimate sources (including similar ranges from the cache and a shallower evaluation of the index) and returns the best answer obtained thus far. This allows the query to resume optimizing with a reasonable estimate but without waiting for the full estimate to complete. In the meantime, the QDBFSTCCOL system job will take up the request to process the full key range estimate and will place an updated (and more accurate) final result in the estimate cache when it is ready. Once the full estimate is complete and in the cache, any query which used the initial estimate will have its plan invalidated and re-optimized on the next execution.

The amount of time that the statistics manager waits for a key range estimate to complete can be configured in the QAQQINI query options file with option KEY_RANGE_ESTIMATE_TIMEOUT. This control should be seen as a way to influence the optimization time and not as a hard limit . Although KEY_RANGE_ESTIMATE_TIMEOUT limits the amount of time that the optimizer may take for any individual estimate operation, it does not guarantee an upper bound on the overall optimization time. This is because the optimizer may request multiple key range estimates, each of which receives the full timeout duration.

When trying to determine the selectivity of predicates, the statistics manager considers column statistics and indexes as resources for its answers in the following order Start of change with some caveats and exceptions End of change :

Try to recognize if an index-based estimate would be very long running, and if so circumvent the index estimate and use column statistics instead
Try to use a multi-column keyed index when ANDed or ORed predicates reference multiple columns
If there is no perfect index that contains all the columns in the predicates, it tries to find a combination of indexes that can be used.
For single column questions, it uses available column statistics
If the answer derived from the column statistics shows a selectivity of less than 2%, indexes are used to verify this answer

Accessing column statistics to answer questions is faster than trying to obtain these answers from indexes.

Column statistics can only be used by SQE. For CQE, all statistics are retrieved from indexes.

Finally, column statistics can be used only for query optimization. They cannot be used for the actual implementation of a query, whereas indexes can be used for both.