Skip to main content

Get the most out of Informix Dynamic Server optimizer through UPDATE STATISTICS

Bharath Sriram (bhsriram@in.ibm.com), Associate Software Engineer, IBM Informix
Bharath Sriram
Bharath Sriram is an Associate Software Engineer at India Software Labs, IBM. He works for the IDS Integration Team across various IDS Products and also works on the OpenAdmin Tool for IDS 11.
Harshavardhan Changappa (vardhan.harsha@in.ibm.com), Software Engineer, IBM Informix
Harshavardhan Changappa
Harshavardhan Changappa is working as a Software Engineer for the IDS Integration Team at India Software Labs, IBM. He has worked on IDS certification of various Linux OS levels such as Ubuntu, Asianux, Debian etc. Harshavardhan works for the IDS Integration Team across various IDS products. He is also involved in developing automated test cases for various features of IDS.
Priyambada Behera (prbehera@in.ibm.com), Systems Software Engineer, IBM India Software Labs
Priyambada Behera
Priyambada Behera is a Systems Software Engineer at the IBM India Software Labs, India. She works for the IDS Integration Team across various IDS Products. She is certified in "Managing and Optimizing Informix Dynamic Server Databases".

Summary:  Use the UPDATE STATISTICS statement in IBM® Informix® Dynamic Server (IDS) to get the most out of the database optimizer. Get a brief overview of this SQL statement and see how it can be used for a variety of problems. Discover the importance of updating statistics and view illustrations that show how statistics are gathered. Finally, browse a FAQ section at the end of the article to answer any lingering questions you might have about this important SQL statement.

Date:  06 Mar 2008
Level:  Intermediate PDF:  A4 and Letter (54KB | 14 pages)Get Adobe® Reader®
Activity:  951 views

Introduction

UPDATE STATISTICS is a proprietary Informix SQL command that updates the facts about each table and its columns by analyzing the data and storing this information in the system catalogs that are used to estimate the cost of subsequent queries. To truly understand the importance of UPDATE STATISTICS, you need to understand what exactly happens when an SQL query is entered by the user for execution. Every SQL query that is entered must be parsed, optimized and executed.

The optimizer is the component that prepares a query plan. A query plan ideally is an optimal plan to execute the given query — that is, it determines the most efficient way to fetch the data. For this purpose, it makes use of a collection of statistical data; however, this statistical data cannot be assumed to be accurate. The preciseness of this data depends upon a large number of factors such as the type of sampling algorithm used, the number of samples taken, and the skew of the data.

The query optimizer does not recalculate the profile for tables automatically. In some cases, gathering the statistics might take longer than executing the query. To ensure that the optimizer selects a query plan that best reflects the current state of your tables, run UPDATE STATISTICS at regular intervals.

How UPDATE STATISTICS works

UPDATE STATISTICS should be run after initially loading the data and creating indexes. It should also be run after any significant changes to the database tables including any large scale insert, update or delete operations. If you do not implement UPDATE STATISTICS, the optimizer has inaccurate data to determine access paths.

Now, learn about the working of the command. The query optimizer bases query-cost estimates on the number of rows to be retrieved from each table. In turn, the estimated number of rows is based on the selectivity of each conditional expression that is used within the WHERE clause. A filter is a conditional expression that is used to select rows. The selectivity is a value between 0 and 1 that indicates the percentage of rows within the table that the filter can pass. A selective filter, one that passes few rows, has selectivity towards 0, and a filter that passes almost all rows has selectivity towards 1.

The optimizer can use data distributions to calculate selectivity for the filters in a query. However, in the absence of data distributions, the database server calculates selectivity for filters of different types based on table indexes.

The correctness of estimating the selectivity highly influences the cost of each execution plan. So, the accuracy of getting an optimal plan is entirely dependant on the up-to-date statistical information about the database objects involved in the query.

Each time the UPDATE STATISTICS query is run, the following system catalog tables are refreshed. The items listed under each heading are columns in tables.

  • In SYSTABLES
    • nindexes - Number of indexes in the table
    • nrows - Number of rows in the table
    • npused - Number of pages used to store 'nrows' of Tupule
  • In SYSCOLUMNS
    • colmin - Second minimum value of the column
    • colmax - Second maximum value of the column
  • In SYSINDEXES and SYSFRAGMENTS
    • levels - Number of levels in the B-tree
    • leaves - Number of leaves contained in the B-tree
    • nunique - Number of distinct values
    • clust - Clusterdness relative to the table
  • In SYSDISTRIB
    • Rows containing information about how the values of user data table are distributed. (So all the columns are affected)
    • It also maintains the resolution and confidence level of the distribution.

Modes of UPDATE STATISTICS

Resolution and confidence

Two important terms to understand in relation to UPDATE STATISTICS are resolution and confidence. Resolution refers to the percentage of data that is put in each bin. A resolution is a number between 0.005 and 10. Confidence is a measure of the equivalence between the estimates obtained and the actual values retrieved. It is expressed as a value between 0.80 and 0.99. Ideally, it is required for the query to have a high confidence value.

The default resolution for the high mode is 0.5, and, for the medium mode, it is 2.5. The default confidence for the high mode is 0.99. It is between 0.85 and 0.99 for the medium mode.

The three modes updating the statistical data are low, medium and high.

The low mode populates only the scalar statistical values (that is, no distribution information) of a table. All the information like the levels of B-tree index, number of pages consumed by the table, number of distinct values of a column are stored.

In medium mode, in addition to low mode statistics, the statement sorts a sample set of column values and populates the distribution data in the sysdistrib system catalog table that represents this sample. It generally has a confidence in the range of 85% to 99%.

In high mode, in addition to low mode statistics, the statement sorts all the column values and populates the sysdistrib table with exact distribution information on all values in the table at the time of execution. While it may seem like a good idea to always use the UPDATE STATISTICS in high mode, this is not actually the case. The high mode uses too many bins to achieve higher accuracy owing to its low resolution of 0.5. Therefore, it consumes a lot of disk space, especially when the table is too big. Also, it is quite intensive to run UPDATE STATISTICS in the high mode ina production system that runs continually. In addition, the high mode does not always guarantee a perfect estimate since no execution plan is guaranteed to be 100% optimal since the plan is based on an estimate.

UPDATE STATISTICS and histograms

Informix uses histograms to store the distribution information of the data. Histograms are used to calculate the selectivity of the predicates. They are proven to produce the low-error estimate and incur almost no run-time overhead.

A histogram is used to graphically summarize and display the distribution of a process data set. Informix uses height-balanced histograms (that is, equi-height/equi-depth) with overflow buckets as opposed to the width-balanced histograms (equi-width). Having the height balanced means the same number of values in each bucket is maintained. Depending on the values, the range is determined, and a bucket represents a range. Below is an example of a histogram:


Histogram
Graph

Assume that there are 1000 rows in a table. If the number of buckets is fixed at 10, the proportion will be 1000/10; thus there are 100 rows in each bucket. This 100 represents the height. In case of a height-balanced histogram, this value (100) is fixed.

Data distributions are generated by the server for individual columns of a table, and are stored as encoded histograms in the encdat column of the sysdistrib system catalog. The default number of bins may be adequate if the data is uniformly distributed across the domain of values for that column. However, if the data is highly skewed, then a larger number of bins (smaller resolution) are required to ensure the data is not too skewed.

The magic of UPDATE STATISTICS

Basically, whenever the UPDATE STATISTICS command is run either in the medium or the high mode, two types of bins are created, namely the distribution bin and the overflow bin. Each entry in the bin is made up of the following:

  1. Distribution bin
    • Number of rows in the bin
    • Number of distinct values in the bin
    • Highest data value in the bin
  2. Overflow bin
    • Data value
    • Frequency of data value

Illustration



Now, let us consider an inventory table consisting of three columns — item_num, customer_name and amount. We will perform UPDATE STATISTICS on the column item_num and then see how the two bins are constructed.



Table 1. Inventory table
item_numcustomer_nameamount
1manoj92.5
1prasanna43.75
1bharath90
1ranjani78.5
1priya23.5
1radhika126.75
1vaibhav75
1harsha300.50
1vishwas20
2deepak32.5
2vinay90
2archit65.20
2vishnu32.75
2samir48.3
2ravi49.5
3srini67.5
4rahul56.0
5sudev73.0

To populate the distribution of data, run UPDATE STATISTICS in the medium or high mode. The histogram information can be retrieved from Informix's utility dbschema by invoking the following command:

$ dbschema -d <dbname> -hd [ <table> ]

The following listing shows the distribution of the above data when the UPDATE STATISTICS is run in medium mode with a resolution of 10.


Listing 1. Distribution output
$ dbschema -d newdb -hd Inventory

DBSCHEMA Schema Utility INFORMIX-SQL Version 11.10.FC2
Copyright IBM Corporation 1996, 2006 All rights reserved
Software Serial Number AAA#B000000

{

Distribution for informix.Inventory.item_num

Constructed on 2007-11-09 04:47:00.00000

Medium Mode, 10.000000 Resolution, 0.950000 Confidence


--- DISTRIBUTION ---

    (                     1)
 1: (  2,   2,            4)
 2: (  1,   1,            5)


--- OVERFLOW ---

 1: ( 9,                 1)
 2: ( 6,                 2)
          

Understanding the distribution bin

Setting the number of bins

The formula for calculating the number of bins is:
100/resolution = number of bins
A resolution of 1 means that 1% of the data goes into each bin (100/1 = 100 bins).
A resolution of 10 means that 10% of the data goes into each bin (10 bins are created).

The bin always starts with a bin number or an identifier which is sequentially incremented for every row. Each row has three columns associated with it. All the three columns are represented with the parentheses. The first column specifies the bin size. The second column specifies the number of distinct elements within the range, and the third column specifies the highest values within the range.

For example, taking the above Distribution output into consideration, the first row can be read as follows:

  • The bin size = 2
  • There are two distinct elements between the range of 1 and 4. You can see this when you look at the Inventory table. The two distinct values are 3 and 4. A really keen observer may have noticed that the values 1 and 2 are not taken into consideration. The count for this is included in the overflow bin and is not to be taken into consideration for this as well.
  • The range is between the values 1 and 4.

Understanding the overflow bin

Like the distribution bin, the overflow bin is also represented in rows. Each row begins with an identifier which gets incremented for every row. There are only two columns in this bin, each represented within the parenthesis. The first column specifies the number of times the value in the second column is repeated. The second column specifies the column value itself.

For example, in the above table, it can be seen that the C1=1 is repeated 9 times and C1=2 is repeated 6 times. Note that only those column values which cross 25% of the bin size are put into the overflow bin. The counts obtained here are absolute sure shot values which can be used for estimates.

NOTE: The output of dbschema cannot be compared to the regular notion of rows and columns. The terms row and column here are literally taken and have no connection with the row and column conception used in databases.

Illustration of a query plan

Consider two tables, customer and orders. The customer table has the attributes customer_num, zipcode and customer_name; orders has the attributes customer_num, quantity, and item_num.

Below is an example of how a query plan would look like for the query:

 select * 
from orders a, customer b 
where a.customer_num > 435 and b.zipcode > "65*" ;

When the UPDATE STATISTICS is run in the high mode, here are the results:


Listing 2. SQEXPLAIN output
QUERY:
------
Completed pass 1 in 0 minutes 0 seconds

UPDATE STATISTICS:
==================

Table:          informix.customer
Mode:           HIGH
Number of Bins:      288        Bin size       11
Sort data            0.2 MB     Sort memory granted          0.2 MB
Estimated number of table scans 1
PASS #1 zipcode
Light scans enabled
Scan 0 Sort 0 Build 0 Insert 0 Close 0 Total 0
Completed pass 1 in 0 minutes 0 seconds
QUERY:
------
select * from orders a, customer b where a.customer_num > 435 and b.zipcode > "65*"

Estimated Cost: 9805
Estimated # of Rows Returned: 244530
Maximum Threads: 1

  1) informix.b: INDEX PATH

    (1) Index Keys: zipcode   (Parallel, fragments: ALL)
        Lower Index Filter: informix.b.zipcode > '65*'

  2) informix.a: INDEX PATH

    (1) Index Keys: customer_num   (Parallel, fragments: ALL)
        Lower Index Filter: informix.a.customer_num > 435
NESTED LOOP JOIN

          

Life without UPDATE STATISTICS

Consider a typical scenario where the importance of UPDATE STATISTICS can be demonstrated. Consider three tables, T1, T2 and T3. Assume they have 10, 100 and 1000 rows respectively. Now suppose you have to perform an 'EQUI-JOIN' operation on all the three tables. The optimizer would chalk out a query plan before performing the actual join operation itself by making use of the statistical data. You assume that this statistical data is up-to-date and move on.

Now, how many ways can you perform this join operation? Since you know that T1 * T2 is not the same as T2 * T1, it's a matter of simple permutation to find out. You have three tables and hence you would have 3! * 2! * 1! ways of performing a join operation. That is 12 different ways of doing it. For this example, just 5 cases from the 12 are used.

Case 1 -- (T1 * T2) * T3 results in 1,011,010 row accesses.
Case 2 -- (T2 * T1) * T3 results in 1,101,100 row accesses.
Case 3 -- (T1 * T3) * T2 results in 1,011,010 row accesses.
Case 4 -- T1 * (T2 * T3) results in 1,001,010 row accesses.
Case 5 -- T3 * (T2 * T1) results in 1,011,000 row accesses.

Now, suppose there were a large number of transactions that took place, and these tables were manipulated. The current scenario would result in T1, T2 and T3 having 1000, 100 and 10 rows respectively. Since the UPDATE STATISTICS are not run automatically, the System Catalog tables would not be updated and would still be retaining the old statistical data. Now, you should perform the EQUI-JOIN operation once again.

The optimizer chooses Case 4 since it determined that it has the minimum number of row accesses and therefore should be efficient. But, the plan no longer produces the best result. According to the current situation, an efficient plan would be to use Case 5 — (T3 * (T2 * T1) — whereas the optimizer chooses Case 4 — T1 * (T2 * T3). This results in a whopping '9990' extra row accesses. This figure can grow to trillions for bulky databases which drastically reduces the efficiency of the query. Had you executed the UPDATE STATISTICS statement, it would have been much more efficient. This example shows why UPDATE STATISTICS has to be run periodically.

To sum it all up, use the UPDATE STATISTICS statement to perform any of the following tasks:

  • Calculate the distribution of column values.
  • Update system catalog tables that the database server uses to optimize queries.
  • Force re-optimization of SPL routines.
  • Convert existing indexes when you upgrade the database server.

Syntax of UPDATE STATISTICS

Version 1: To UPDATE STATISTICS for the entire database

UPDATE STATISTICS [LOW | MEDIUM | HIGH];

Version 2: To UPDATE STATISTICS for a particular table in a database. All columns are updated in this case.

UPDATE STATISTICS [LOW | MEDIUM | HIGH] FOR TABLE <table_name> ;

Version 3: To UPDATE STATISTICS for a particular column of a particular table in a database.

          UPDATE STATISTICS [LOW | MEDIUM | HIGH] FOR TABLE <table_name> (<column_name>);

Version 4: To UPDATE STATISTICS for a stored procedure in a database.

UPDATE STATISTICS [LOW | MEDIUM | HIGH] FOR PROCEDURE;

Version 5: To perform UPDATE STATISTICS by setting your own resolution.

          UPDATE STATISTICS [LOW | MEDIUM | HIGH] FOR TABLE <table_name> RESOLUTION 10;

Tips and Tricks

The UPDATE STATISTICS command is not executed automatically. It is the responsibility of the user to run this statement manually. This command has to be executed periodically so that the optimizer has the latest data with it to come to an effective plan to fetch the data.

Ideally, it is preferred to run this statement when a bulk of data is manipulated in the database. By default, the UPDATE STATISTICS runs in the low mode. In this mode, data distributions are not generated. Never use the high mode for an entire database or all tables in a database or even all columns in a table unless the database or the tables are of a considerably smaller size.
Always remember that distributions are not created on text or byte columns.

UPDATE STATISTICS FAQs

  • What is the ideal mode that I should use for the Update Statistics command?

    There is no such thing as an "ideal" mode. The DBA should look at the current situation and then choose the mode for UPDATE STATISTICS. However, the following list should give you some hints to help you choose the best mode:

    Use UPDATE STATISTICS LOW if the number of rows has changed significantly or after migration from previous version of database server. Also use it for all columns that are not the leading column of any index.
    Use UPDATE STATISTICS MEDIUM DISTRIBUTIONS ONLY if queries have non-indexed join columns or filter columns.
    Use UPDATE STATISTICS HIGH <table> if queries have a multi-column indexed defined on join columns or filter columns.
    Use UPDATE STATISTICS HIGH ON <small tables> if queries have many small tables (fit into one extent).

  • Can I set the resolution and confidence values myself when I run the statement?

    Yes, the syntax is as follows: - UPDATE STATISTICS MEDIUM FOR TABLE <tabname> RESOLUTION 1 0.99-----> confidence

  • I find the whole process leading to a severe time and resource crunch. Don't you think it would be better to execute the query directly?

    We should always remember that only the samples of the rows are considered while preparing the statistical data and not all rows are read. Therefore, UPDATE STATISTICS cannot be compared to executing the query itself unless it is run in the high mode.

  • What should be an ideal resolution value?

    There is no such thing as an "ideal" resolution value. The value entirely depends upon the data and the application.

  • What is the default resolution and confidence for all the three modes?

    The default resolution is 0.5 for the HIGH mode and 2.5 for the MEDIUM mode.
    The default confidence is 0.99 for the HIGH mode and between 0.85 and 0.99 for the MEDIUM mode.

  • What is new regarding this feature in Cheetah?

    In Cheetah (IDS 11), IDS automatically collects index statistics on the leading index key while the index is being created. This eliminates the need to manually execute a statistics gathering command. With this feature, the query optimizer can immediately consider the index when determining an access plan. You can view statistics about completed queries in the new Query Statistics section in SET EXPLAIN. In explicit UPDATE STATISTICS operations in MEDIUM mode, a new SAMPLING SIZE option in the resolution clause specifies the minimum number of rows to sample for column distributions. The SET EXPLAIN statement now supports an optional parameter to override the default name and location of the output file.

  • What do you mean by executing UPDATE STATISTICS on a stored procedure?

    The database server re-optimizes SQL statements in the named procedure. The database server does not update the statistics in the system catalog tables.

  • What happens when I set the resolution to 0.5 and confidence to 0.99 and run UPDATE STATISTICS in the MEDIUM mode? Is this equivalent to running it in the HIGH mode?

    Yes.

  • What is the maximum number of bins that I can have?

    Ideally, the resolution is a value between 0.005 and 10. Therefore the number of bins ranges from 10 to 20,000. But practically, the maximum number of bins depends upon the disk space and any limitations imposed by the IDS.

Conclusion

The Informix Dynamic Server is a powerful database server equipped with many powerful features — UPDATE STATISTICS is one such essential feature. The article has explained the need for UPDATE STATISTICS in today's world where the databases are highly volatile due to the millions of transactions occurring in split seconds. Query optimization is crucial in dealing with databases, and UPDATE STATISTICS does just that. Since UPDATE STATISTICS is not a perfect solution, it is important for the DBA to perform trade-offs and choose the best possible solution depending on the current situation. This article has given you the information you need to use UPDATE STATISTICS in a way that will make your life much easier.


Resources

Learn

Get products and technologies

Discuss

About the authors

Bharath Sriram

Bharath Sriram is an Associate Software Engineer at India Software Labs, IBM. He works for the IDS Integration Team across various IDS Products and also works on the OpenAdmin Tool for IDS 11.

Harshavardhan Changappa

Harshavardhan Changappa is working as a Software Engineer for the IDS Integration Team at India Software Labs, IBM. He has worked on IDS certification of various Linux OS levels such as Ubuntu, Asianux, Debian etc. Harshavardhan works for the IDS Integration Team across various IDS products. He is also involved in developing automated test cases for various features of IDS.

Priyambada Behera

Priyambada Behera is a Systems Software Engineer at the IBM India Software Labs, India. She works for the IDS Integration Team across various IDS Products. She is certified in "Managing and Optimizing Informix Dynamic Server Databases".

Comments (Undergoing maintenance)



Trademarks  |  My developerWorks terms and conditions

Help: Update or add to My dW interests

What's this?

This little timesaver lets you update your My developerWorks profile with just one click! The general subject of this content (AIX and UNIX, Information Management, Lotus, Rational, Tivoli, WebSphere, Java, Linux, Open source, SOA and Web services, Web development, or XML) will be added to the interests section of your profile, if it's not there already. You only need to be logged in to My developerWorks.

And what's the point of adding your interests to your profile? That's how you find other users with the same interests as yours, and see what they're reading and contributing to the community. Your interests also help us recommend relevant developerWorks content to you.

View your My developerWorks profile

Return from help

Help: Remove from My dW interests

What's this?

Removing this interest does not alter your profile, but rather removes this piece of content from a list of all content for which you've indicated interest. In a future enhancement to My developerWorks, you'll be able to see a record of that content.

View your My developerWorks profile

Return from help

static.content.url=http://www.ibm.com/developerworks/js/artrating/
SITE_ID=1
Zone=Information Management
ArticleID=293581
ArticleTitle=Get the most out of Informix Dynamic Server optimizer through UPDATE STATISTICS
publish-date=03062008
author1-email=bhsriram@in.ibm.com
author1-email-cc=
author2-email=vardhan.harsha@in.ibm.com
author2-email-cc=
author3-email=prbehera@in.ibm.com
author3-email-cc=

My developerWorks community

Tags

Help
Use the search field to find all types of content in My developerWorks with that tag.

Use the slider bar to see more or fewer tags.

Popular tags shows the top tags for this particular content zone (for example, Java technology, Linux, WebSphere).

My tags shows your tags for this particular content zone (for example, Java technology, Linux, WebSphere).

Use the search field to find all types of content in My developerWorks with that tag. Popular tags shows the top tags for this particular content zone (for example, Java technology, Linux, WebSphere). My tags shows your tags for this particular content zone (for example, Java technology, Linux, WebSphere).

Special offers