Automatic statistics collection
The Db2® optimizer uses catalog statistics to determine the most efficient access plan for a query. With automatic statistics collection, part of the Db2 automated table maintenance feature, you can let the database manager determine whether statistics must be updated.
Instead of using automatic statistic collection, you can collect statistics manually. However, deciding which statistics to collect for a workload is complex, and keeping these statistics up-to-date is time consuming. Out-of-date or incomplete table or index statistics might lead the optimizer to select a suboptimal plan, which slows down query execution.
- For synchronous collection at statement compilation time, you can use the real-time statistics (RTS) feature. The auto_stmt_stats database configuration parameter is used for RTS statistics collection and RUNSTATS activity logging.
- For asynchronous collection, you can enable the RUNSTATS command to run in the background. The auto_runstats database configuration parameter is used for automatic background statistics collection.
You can use the Configuration Advisor to determine the initial configuration for databases, including the appropriate settings for various database configuration parameter.
In IBM® Data Studio Version 3.1 or later, you can use the task assistant for configuring automatic statistics collection. Task assistants can guide you through the process of setting options, reviewing the automatically generated commands to perform the task, and running these commands. For more details, see Administering databases with task assistants.In IBM Data Studio Version 3.1 or later, you can use the task assistant to configure automatic statistics collection. Task assistants can guide you through the process of setting options, reviewing the automatically generated commands to perform the task, and running these commands. For more details, see Administering databases with task assistants.
Understanding asynchronous and real-time statistics collection
The query optimizer determines how synchronous or asynchronous statistics are collected, based on the needs of the query and the amount of table update activity (the number of update, insert, or delete operations).
You can enable real-time statistics collection, so that statistics can be fabricated by using certain metadata. Fabrication means deriving or creating statistics, rather than collecting them as part of normal RUNSTATS command activity. For example, the number of rows in a table can be derived from knowing the number of pages in the table, the page size, and the average row width. In some cases, statistics are not derived but are maintained by the index and data manager and can be stored in the catalog. For example, the index manager maintains a count of the number of leaf pages and levels in each index.
- Synchronous statistics collection is used for sampling because the table is large.
- Synchronous statistics were fabricated.
- Synchronous statistics collection failed because the collection time is exceeded.
At most, two asynchronous requests can be processed at the same time, and only for different tables. One request must have been initiated by real-time statistics collection, and the other must have been initiated by asynchronous statistics collection.
- Asynchronous statistics collection is performed by using a throttled RUNSTATS utility. Throttling controls the amount of resource that is consumed by the RUNSTATS utility, based on current database activity. As database activity increases, the utility runs more slowly, reducing its resource demands.
- Synchronous statistics collection is limited to 5 seconds per query. The RTS optimization guideline determines the amount of time. If synchronous collection exceeds the time limit, an asynchronous collection request is submitted.
- Synchronous statistics collection does not store the statistics in the system catalog. Instead, the statistics are stored in a statistics cache and are later stored in the system catalog by an asynchronous operation. This storage sequence avoids the memory usage and possible lock contention that are involved in updating the system catalog. Statistics in the statistics cache are available for subsequent SQL compilation requests.
- Only one synchronous statistics collection operation occurs per table. Other requests requiring synchronous statistics collection fabricate statistics, if possible, and continue with statement compilation. This behavior is also enforced in a partitioned database environment, where operations on different database partitions might require synchronous statistics.
- Only tables with missing statistics or high levels of activity (as measured by the number of update, insert, or delete operations) are considered for statistics collection. Even if a table meets the statistics collection criteria, statistics are not collected synchronously unless query optimization requires them. In some cases, the query optimizer can choose an access plan without statistics. To check if asynchronous statistics collection is required, tables with more than 4000 pages are sampled to determine whether high table activity changed the statistics. Statistics for such large tables are collected only if warranted.
- Statistics collection during an online maintenance window depends
on whether the statistics are asynchronous or synchronous:
- For asynchronous statistics collection, the RUNSTATS utility is automatically scheduled to run during the online maintenance window that you specify in your maintenance policy. This policy also specifies the set of tables that are within the scope of automatic statistics collection, minimizing unnecessary resource consumption.
- Synchronous statistics collection and fabrication do not use the online maintenance window that you specify in your maintenance policy, because synchronous requests must occur immediately and have limited collection time. Instead, synchronous statistics collection and fabrication uses to the policy that specifies the set of tables that are within the scope of automatic statistics collection.
- While automatic statistics collection is being performed, the affected tables are still available for regular database activity (update, insert, or delete operations).
- Synchronous
are not collected for the following objects:
- Real-time statistics are not collected for statistical views.
- Real-time statistics are not collected for nicknames. To refresh nickname statistics in the system catalog for synchronous statistics collection, call the SYSPROC.NNSTAT procedure. For asynchronous statistics collection, Db2 software automatically calls the SYSPROC.NNSAT procedure to refresh the nickname statistics in the system catalog.
- Declared and created global temporary tables can have only real-time synchronous statistics collected.
Although real-time statistics collection is designed to minimize statistics collection memory usage, try it in a test environment first to ensure that there is no negative performance impact. There might be a negative performance impact in some online transaction processing (OLTP) scenarios, especially if there is a limit on how long a query can run.
Real-time synchronous statistics collection is performed for regular tables, materialized query tables (MQTs), and global temporary tables. Asynchronous statistics are not collected for global temporary tables. Global temporary tables cannot be excluded from real-time statistics via the automatic maintenance policy facility.
- Tables that are marked VOLATILE (tables that have the VOLATILE field set in the SYSCAT.TABLES catalog view).
- Tables
for which you manually updated statistics by issuing UPDATE statements
against SYSSTAT catalog views, including manual updates of expression-based
key column statistics for any expression-based indexes that the table
has, even though those statistics are in a separate statistical view.
When you modify table statistics manually, the database manager assumes that you are now responsible for maintaining the statistics. To induce the database manager to maintain statistics for a table with manually updated statistics, collect the statistics by using the RUNSTATS command, or specify statistics collection when using the LOAD command. Tables that you created before Version 9.5 and for which you manually updated statistics before upgrading are not affected by this limitation. Their statistics are automatically maintained by the database manager until you manually update them.
- Statistical views
- Tables for which you manually updated statistics by issuing UPDATE statements against SYSSTAT catalog views. If real-time statistics collection is not enabled, some statistics fabrication still occurs for tables for which you manually updated statistics.
In a partitioned database environment, statistics are collected on a single database partition and then extrapolated. The database manager always collects statistics (both RTS and automatic background statistics) on the first database partition of the database partition group.
No real-time statistics collection occurs until at least 5 minutes after database activation.
Real-time statistics processing occurs for both static and dynamic SQL.
A table that you truncated, either by using the TRUNCATE statement or by using the IMPORT command, is automatically recognized as having out-of-date statistics.
Automatic statistics collection, both synchronous and asynchronous, invalidates cached dynamic statements that reference tables for which statistics were collected. This invalidation is done so that cached dynamic statements can be re-optimized with the latest statistics.
Asynchronous automatic statistics collection operations might be interrupted when the database is deactivated. If you did not explicitly activate the database by using the ACTIVATE DATABASE command or the sqle_activate_db API, the database is deactivated when the last user disconnects from the database. If operations are interrupted, error messages might be recorded in the Db2 diagnostic log file. To avoid interrupting asynchronous automatic statistics collection operations, explicitly activate the database.
If a table has expression-based indexes, statistics for the expression-based key columns are collected and cached as part of statistics collection for the table. Statistics are not fabricated for expression-based key columns
Real-time statistics and explain processing
There is no real-time processing for a query that is only explained (not executed) by the EXPLAIN facility. The following table summarizes the behavior under different values of the CURRENT EXPLAIN MODE special register.
CURRENT EXPLAIN MODE special register value | Real-time statistics collection considered |
---|---|
YES | Yes |
EXPLAIN | No |
NO | Yes |
REOPT | Yes |
RECOMMEND INDEXES | No |
EVALUATE INDEXES | No |
AI Query Optimizer Model Discovery and Training
Model discovery and training is performed automatically as part of automatic asynchronous statistics collection. The discovery operation mines the data using a sample to find strongly correlated columns and identifies and ranks the columns that will benefit the most from the model. This functionality helps keep the training time low by focusing on strongly correlated columns.
Model training is performed after statistics are collected, automatically generating training and validation query sets from a sample of data. The generated queries that have various combinations of BETWEEN predicates on the columns included in the model are run against the sample. The predicate pattern and the selectivity estimate for each query is encoded and used to train and build the model. Trained models are stored in system catalogs. To determine if a particular table has trained models, query the SYSCAT.AIOPT_TABLECARDMODELS catalog view.
The retraining of models is determined using the same table data change analysis performed by automatic statistics collection. When a statistics collection is required, it will also drive a retrain of the model.
Automatic statistics collection and the statistics cache
You can use the statistics cached to make synchronously collected statistics available to all queries. This cache is part of the catalog cache. In a partitioned database environment, the statistics cache is on only the catalog database partition even though each database partition has a catalog cache. If you enable real-time statistics collection, catalog cache requirements are higher. Consider tuning the value of the catalogcache_sz database configuration parameter if you enable real-time statistics collection.
Automatic statistics collection and statistical profiles
You can customize the type of statistics that are collected by creating your own statistics profile for a particular table. For details, see collecting statistics using a statistics profile.
- To minimize the memory usage of synchronous statistics collection, the database manager might collect statistics by using sampling. In this case, the sampling rate and method might be different from those rates and methods that you specified in the statistical profile.
- RTS collection might fabricate statistics, but it might not be possible to fabricate all types of statistics that you specified in the statistical profile. For example, column statistics such as COLCARD, HIGH2KEY, and LOW2KEY cannot be fabricated unless the column is the primary column in some index.
If RTS statistics collection cannot gather all the statistics that you specified in the statistical profile, an asynchronous collection request is submitted.
The following sections explain different operating characteristics of automatic statistics collection.