DB2 Version 9.7 for Linux, UNIX, and Windows

Collecting statistics on a sample of the table data

Table statistics are used by the query optimizer to select the best access plan for a query, so it is important that statistics remain current. With the ever-increasing size of databases, efficient statistics collection becomes more challenging.

An effective approach is to collect statistics on a random sample of table data. For I/O-bound or processor-bound systems, the performance benefits can be enormous.

The DB2® product enables you to efficiently sample data for statistics collection, potentially improving the performance of the runstats utility by orders of magnitude, while maintaining a high degree of accuracy.

Two methods of sampling are available: row-level sampling and page-level sampling. For a description of these sampling methods, see "Data sampling in queries".

Performance of page-level sampling is excellent, because only one I/O operation is required for each selected page. With row-level sampling, I/O costs are not reduced, because every table page is retrieved in a full table scan. However, row-level sampling provides significant performance improvements, even if the amount of I/O is not reduced, because collecting statistics is processor-intensive.

Row-level sampling provides a better sample than page-level sampling in situations where the data values are highly clustered. Compared to page-level sampling, the row-level sample set will likely be a better reflection of the data, because it will include P percent of the rows from each data page. With page-level sampling, all the rows of P percent of the pages will be in the sample set. If the rows are distributed randomly over the table, the accuracy of row-sampled statistics will be similar to the accuracy of page-sampled statistics.

Each sample is randomly generated across repeated invocations of the RUNSTATS command, unless the REPEATABLE option is used, in which case the previous sample is regenerated. This option can be useful in cases where consistent statistics are required for tables whose data remains constant.

For collecting statistics, you can use the task assistant available in IBM® Data Studio Version 3.1 or later. Task assistants can guide you through the process of setting options, reviewing the automatically generated commands to perform the task, and running these commands. For more details, see Administering databases with task assistants.