The focus for this article is the collection of information for the optimizer, understanding what is collected, new performance enhancement added and a practical example of performance tuning.

Share:

John Miller, III (miller3@us.ibm.com), Senior technical staff, IBM Informix Technical Support, IBM

John F. Miller, III , a senior technical staff member on the Resolution Team for IBM Informix Technical Support, a team responsible for handling critical customer situations and developing support tools for use by the Technical Support Organization. During John's 12 years at Informix, he has specialized in Archive/Restore issues, including the development of the archecker tool, along with performance benchmarks including participation in several benchmarks for SAP and BAAN. He can be reached at miller3@us.ibm.com



25 March 2002

Also available in Russian

© 2002 International Business Machines Corporation. All rights reserved.

Introduction

The UPDATE STATISTICS command has three major functions:

  • Compiling stored procedures.
  • Dropping data distributions.
  • Collecting information for the optimizer.

This article focuses on the task of collecting information for the optimizer, including:

  • Information collected by UPDATE STATISTICS
  • Performance and usability improvements in UPDATE STATISTICS
  • Tuning UPDATE STATISTICS to run more efficiently.

Information collected by UPDATE STATISTICS

There are two types of data the optimizer will make use of when generating a query plan. The first is called statistics and the second is called data distributions. Statistics , introduced in IBM Informix © Dynamic Server™, Version 4.00, are generated any time UPDATE STATISTICS in low mode is run, or when UPDATE STATISTICS medium or high mode <ditto> is run and the phrase "distributions only" is omitted. The statistical information is stored in the system catalog tables systables, sysindexes, and syscolumns.

Data distributions , which were introduced in Version 6, provide more sophisticated information to the optimizer. Distributions are built any time the command UPDATE STATISTICS medium or UPDATE STATISTICS high is executed. Both statistics and data distributions are not collected in real time, but are only refreshed when the UPDATE STATISTICS command is executed. The difference between UPDATE STATISTICS high and medium is the number of rows sampled: high samples the entire table, while medium samples only a subset of rows.

Due to the size of todays large systems and the length of time it takes to run update statistics DBAs have reduce the frequency at which update statistics is run. Only through accurate and recent statistics and distributions can the query optimizer correctly estimate the cost of all possible query plans enabling it to select the optimial plan for the given query.

Statistics are generated by scanning tables and indexes. When walking an index, the entire leaf level is walked. While walking the leaf pages, if an index item is marked for deletion, it will be submitted to the b-tree cleaner. This ensures the indexes remain balanced and compact. After the scan of the table or index is complete, the data collected is inserted in the appropriate system tables as shown in Table 1.

Table 1. Catalog data updated by UPDATE STATISTICS command

ColumnDESCRIPTION

systables.nrows

Number of Rows

systables.npused

Number of pages to store the data

syscolumns.colmax

Second largest value for a column

syscolumns.colmin

Second smallest value for a column

sysindexes.nunique

The number of unique values for the lead key

sysindexes.clust

How highly clustered the values for the lead key

Data distributions are created by sampling a column's data, sorting the data, building distributions bins, and inserting the results into the sysdistrib system catalog table. You can control the sample size for the scan, although this fact is not documented clearly:

  • If you specify UPDATE STATISTICS high, the sample size is the entire table,

  • If you specify UPDATE STATISTICS medium, the sample size is based on the confidence and resolution used by UPDATE STATISTICS.

One misconception held by many DBAs is that the sample size for UPDATE STATISTICS medium is based on the table size. The sample size is solely based on the confidence and resolution. A range of resolutions, confidences and the associated sample size is listed below in Table 2. The default resolution of 2.5 and confidence of .95 yields a sample size of 2963 rows, regardless of the table size.

Table 2. Sample size is based on resolution and confidence

ResolutionConfidenceSamplesResolutionConfidenceSamples

2.5

.95

2,963

.25

.95

296,255

2.5

.99

4,273

.25

.99

425,104

1.0

.95

18,516

.1

.95

1,851,593

1.0

.99

26,569

.1

.99

2,656,900

.5

.95

74,064

.05

.95

7,406,375

.5

.99

106,276

.05

.99

10,627,600


Performance and diagnostic improvements in UPDATE STATISTICS

A project to improve UPDATE STATISTICS performance and to maximize resource utilization was undertaken with the assistance of a few customers. These changes were introduced into IBM Informix Dynamic Server, version 7.31.UD2 and 9.30.UC2 and later. Most of the changes to UPDATE STATISTICS are internal algorithm modification, so you don't have to do anything different, but a few changes can benefit by tuning the resources given to UPDATE STATISTICS.

The major areas of improvement in UPDATE STATISTICS are:

Improved parallelism

Parallelism in UPDATE STATISTICS was improved in two ways:

  • Allow one scan of the table to build multiple column distributions. This is valid for both medium and high modes

  • Scan fragmented indexes in parallel in low mode.

Reducing the number of table scans required to produce the data distributions provided the largest improvement. This wasaccomplished by enabling a a single table scan to produce distributions for several columns. As shown in Figure 1, a table scan distributes each column to an independent invocation of the parallel sort package. The data distributions are built from the result of the sort and inserted into the system catalog. A significant reduction in I/O results, by virtue of the reduced number of full table scans.

Figure 1. Sample size is based on resolution and confidence
Figure 1

The second parallelism feature can be used by those customers who have fragmented indexes and run UPDATE STATISTICS low. The previous version of UPDATE STATISTICS would scan each index fragment in serial counting various statistics. The newly improved version scans each fragment in parallel and then sums up each set of statistics. The percentage of index fragments scanned at one time is 10 times the PDQPRIORITY value, up to a maximum of 100 percent. A PDQPRIORITY of 0 will perform a serial scan while a PDQPRIORITY of 4 will scan 40 percent of the index fragments at one time (i.e., 4 *10 = 40 percent). Any PDQPRIORITY of 10 or higher will cause all index fragments to be scanned at once.

Use of light scan to improve I/O throughput

The I/O throughput of UPDATE STATISTICS was improved by using some data warehouse technology. This includes taking advantage of light scan technology when building data distributions. Light scans allow data to be scanned asynchronously into one or more private buffers. In addition, set-oriented reads are also put to use, allowing blocks of rows to be processed at one time.


Tuning UPDATE STATISTICS to run more efficiently

In order to improve the speed at which data distributions are built it is imperative to tune the sorting of data, especially on large tables. The default sort memory for UPDATE STATISTICS was previously set at 4 MB, but was raised to 15 MB as a more realistic default given today's larger computer systems. If you want to change this default, use the environment variable DBUPSPACE. Previously, DBUPSPACE controlled the maximum amount of disk space used by UPDATE STATISTICS. Now DBUPSPACE also controls the default memory used by UPDATE STATISTICS when PDQPRIORITY is not set. The maximum amount of memory you can request without using PDQPRIORITY is 50 MB.

The format of DBUPSPACE environment variable is {max disk space}:{default memory}. To increase the memory to 35 MB, for example, set DBUPSPACE to the value of 0:35.

The second way to improve the sort speed is by giving UPDATE STATISTICS more than 50 MB of memory to use. You can do this by by setting the PDQPRIORITY for UPDATE STATISTICS, allowing UPDATE STATISTICS to use the Memory Grant Manager (MGM) memory that you have configured.

Attention: When using PDQPRIORITY with UPDATE STATISTICS, you do not want to accidentally compile your stored procedures with the PDQPRIORITY turned on. If you run UPDATE STATISTICS with PDQPRORITY set to 50 to build data distributions and compile store procedures, the stored procedures execute with the compiled time PDQPRIORITY, not the PDQPRIORITY of the user who executes the procedure. This is why it is recommend to always use the phrase "for table" when building distributions. The following statement will build distributions for the entire database: UPDATE STATISTICS high for table.

The third way of improving the sort speed of UPDATE STATISTICS is to take advantage of the parallel sort package. To accomplish this, ensure the environment variables PSORT_NPROCS and DBSPACETEMP are properly configured for your computer.

When running medium or high on large tables consider lowering the resolution. This will increase the number of bins storing the data giving the optimizer more exacting information. In addition, for medium distributions, it will increase the sample size.

Viewing the UPDATE STATISTICS plansOne of the best tools for tuning the UPDATE STATISTICS processing is being able to see what UPDATE STATISTICS is doing. When the enhanced version of UPDATE STATISTICS is executing, and if you have enabled set explain, the scan plan and resource information are placed into the sqexplain.out file.

An example of this new output is shown in Figure 2. The set explain output shows how the new UPDATE STATISTICS will perform on a table with 215,000 rows and varying column sizes. From the output, we can see that UPDATE STATISTICS high was executed on the table, cust_tab, where UPDATE STATISTICS has 101 MB of data to sort and 15 MB of memory will be used. Based on the different column sizes in the table, UPDATE STATISTICS tries to group columns together in order to maximum the sort memory for a single scan of the table. UPDATE STATISTICS determined that five scans of the table would be optimal for the 10 columns in the table.

Figure 2 shows that the first table scan processes columns c9, c8, c10, c5, and c7. The second scan processes columns c6 and c1. The last three columns are done individually, because their size prevents them from being combined with other columns. The total time to process the UPDATE STATISTICS high command is 98 seconds. With previous versions the time was 146 seconds.

Figure 2. SET EXPLAIN output
Figure 2

Although we reduced the time by 33 percent, we now can tune the system for another large gain. After examining Figure 2, we see the sort memory was 15 MB, but the data available to sort was 101.4 MB. Buffering more of the sort data will increase the sort's performance. To accomplish this, we enabled PDQPRIORITY. Figure 3 shows that 106 MB of memory was allocated by the memory grant manager. The increased memory allows all columns to be processed with a single scan of the table. In addition, light scans were enabled and index scans are disabled. With these changes, the total time to process UPDATE STATISTICS now is only 29 seconds.

Figure 3. SET EXPLAIN output after tuning parameters for greater efficiency
Figure 3

Conclusion

Improvements to the UPDATE STATISTICS command enables more control over the resources, improved performance, and enhanced diagnostics. The days of minimal documentation, lack of understanding, and unusable performances for large databases system is a way of the past for UPDATE STATISTICS.

Comments

developerWorks: Sign in

Required fields are indicated with an asterisk (*).


Need an IBM ID?
Forgot your IBM ID?


Forgot your password?
Change your password

By clicking Submit, you agree to the developerWorks terms of use.

 


The first time you sign into developerWorks, a profile is created for you. Information in your profile (your name, country/region, and company name) is displayed to the public and will accompany any content you post, unless you opt to hide your company name. You may update your IBM account at any time.

All information submitted is secure.

Choose your display name



The first time you sign in to developerWorks, a profile is created for you, so you need to choose a display name. Your display name accompanies the content you post on developerWorks.

Please choose a display name between 3-31 characters. Your display name must be unique in the developerWorks community and should not be your email address for privacy reasons.

Required fields are indicated with an asterisk (*).

(Must be between 3 – 31 characters.)

By clicking Submit, you agree to the developerWorks terms of use.

 


All information submitted is secure.

Dig deeper into Information management on developerWorks


static.content.url=http://www.ibm.com/developerworks/js/artrating/
SITE_ID=1
Zone=Information Management
ArticleID=13780
ArticleTitle=Understanding and Tuning Update Statistics
publish-date=03252002