Skip to main content

skip to main content

developerWorks  >  Information Management  >

Get the most out of Informix Dynamic Server optimizer through UPDATE STATISTICS

developerWorks
Document options
PDF format - Fits A4 and Letter

PDF - Fits A4 and Letter
54KB (14 pages)

Get Adobe® Reader®

Document options requiring JavaScript are not displayed

Discuss


My developerWorks needs you!

Connect to your technical community


Rate this page

Help us improve this content


Level: Intermediate

Bharath Sriram (bhsriram@in.ibm.com), Associate Software Engineer, IBM Informix
Harshavardhan Changappa (vardhan.harsha@in.ibm.com), Software Engineer, IBM Informix
Priyambada Behera (prbehera@in.ibm.com), Systems Software Engineer, IBM India Software Labs

06 Mar 2008

Use the UPDATE STATISTICS statement in IBM® Informix® Dynamic Server (IDS) to get the most out of the database optimizer. Get a brief overview of this SQL statement and see how it can be used for a variety of problems. Discover the importance of updating statistics and view illustrations that show how statistics are gathered. Finally, browse a FAQ section at the end of the article to answer any lingering questions you might have about this important SQL statement.

Introduction

UPDATE STATISTICS is a proprietary Informix SQL command that updates the facts about each table and its columns by analyzing the data and storing this information in the system catalogs that are used to estimate the cost of subsequent queries. To truly understand the importance of UPDATE STATISTICS, you need to understand what exactly happens when an SQL query is entered by the user for execution. Every SQL query that is entered must be parsed, optimized and executed.

The optimizer is the component that prepares a query plan. A query plan ideally is an optimal plan to execute the given query — that is, it determines the most efficient way to fetch the data. For this purpose, it makes use of a collection of statistical data; however, this statistical data cannot be assumed to be accurate. The preciseness of this data depends upon a large number of factors such as the type of sampling algorithm used, the number of samples taken, and the skew of the data.

The query optimizer does not recalculate the profile for tables automatically. In some cases, gathering the statistics might take longer than executing the query. To ensure that the optimizer selects a query plan that best reflects the current state of your tables, run UPDATE STATISTICS at regular intervals.

How UPDATE STATISTICS works

UPDATE STATISTICS should be run after initially loading the data and creating indexes. It should also be run after any significant changes to the database tables including any large scale insert, update or delete operations. If you do not implement UPDATE STATISTICS, the optimizer has inaccurate data to determine access paths.

Now, learn about the working of the command. The query optimizer bases query-cost estimates on the number of rows to be retrieved from each table. In turn, the estimated number of rows is based on the selectivity of each conditional expression that is used within the WHERE clause. A filter is a conditional expression that is used to select rows. The selectivity is a value between 0 and 1 that indicates the percentage of rows within the table that the filter can pass. A selective filter, one that passes few rows, has selectivity towards 0, and a filter that passes almost all rows has selectivity towards 1.

The optimizer can use data distributions to calculate selectivity for the filters in a query. However, in the absence of data distributions, the database server calculates selectivity for filters of different types based on table indexes.

The correctness of estimating the selectivity highly influences the cost of each execution plan. So, the accuracy of getting an optimal plan is entirely dependant on the up-to-date statistical information about the database objects involved in the query.

Each time the UPDATE STATISTICS query is run, the following system catalog tables are refreshed. The items listed under each heading are columns in tables.

  • In SYSTABLES
    • nindexes - Number of indexes in the table
    • nrows - Number of rows in the table
    • npused - Number of pages used to store 'nrows' of Tupule
  • In SYSCOLUMNS
    • colmin - Second minimum value of the column
    • colmax - Second maximum value of the column
  • In SYSINDEXES and SYSFRAGMENTS
    • levels - Number of levels in the B-tree
    • leaves - Number of leaves contained in the B-tree
    • nunique - Number of distinct values
    • clust - Clusterdness relative to the table
  • In SYSDISTRIB
    • Rows containing information about how the values of user data table are distributed. (So all the columns are affected)
    • It also maintains the resolution and confidence level of the distribution.

Modes of UPDATE STATISTICS

Resolution and confidence

Two important terms to understand in relation to UPDATE STATISTICS are resolution and confidence. Resolution refers to the percentage of data that is put in each bin. A resolution is a number between 0.005 and 10. Confidence is a measure of the equivalence between the estimates obtained and the actual values retrieved. It is expressed as a value between 0.80 and 0.99. Ideally, it is required for the query to have a high confidence value.

The default resolution for the high mode is 0.5, and, for the medium mode, it is 2.5. The default confidence for the high mode is 0.99. It is between 0.85 and 0.99 for the medium mode.

The three modes updating the statistical data are low, medium and high.

The low mode populates only the scalar statistical values (that is, no distribution information) of a table. All the information like the levels of B-tree index, number of pages consumed by the table, number of distinct values of a column are stored.

In medium mode, in addition to low mode statistics, the statement sorts a sample set of column values and populates the distribution data in the sysdistrib system catalog table that represents this sample. It generally has a confidence in the range of 85% to 99%.

In high mode, in addition to low mode statistics, the statement sorts all the column values and populates the sysdistrib table with exact distribution information on all values in the table at the time of execution. While it may seem like a good idea to always use the UPDATE STATISTICS in high mode, this is not actually the case. The high mode uses too many bins to achieve higher accuracy owing to its low resolution of 0.5. Therefore, it consumes a lot of disk space, especially when the table is too big. Also, it is quite intensive to run UPDATE STATISTICS in the high mode ina production system that runs continually. In addition, the high mode does not always guarantee a perfect estimate since no execution plan is guaranteed to be 100% optimal since the plan is based on an estimate.

UPDATE STATISTICS and histograms

Informix uses histograms to store the distribution information of the data. Histograms are used to calculate the selectivity of the predicates. They are proven to produce the low-error estimate and incur almost no run-time overhead.

A histogram is used to graphically summarize and display the distribution of a process data set. Informix uses height-balanced histograms (that is, equi-height/equi-depth) with overflow buckets as opposed to the width-balanced histograms (equi-width). Having the height balanced means the same number of values in each bucket is maintained. Depending on the values, the range is determined, and a bucket represents a range. Below is an example of a histogram:


Histogram
Graph

Assume that there are 1000 rows in a table. If the number of buckets is fixed at 10, the proportion will be 1000/10; thus there are 100 rows in each bucket. This 100 represents the height. In case of a height-balanced histogram, this value (100) is fixed.

Data distributions are generated by the server for individual columns of a table, and are stored as encoded histograms in the encdat column of the sysdistrib system catalog. The default number of bins may be adequate if the data is uniformly distributed across the domain of values for that column. However, if the data is highly skewed, then a larger number of bins (smaller resolution) are required to ensure the data is not too skewed.

The magic of UPDATE STATISTICS

Basically, whenever the UPDATE STATISTICS command is run either in the medium or the high mode, two types of bins are created, namely the distribution bin and the overflow bin. Each entry in the bin is made up of the following:

  1. Distribution bin
    • Number of rows in the bin
    • Number of distinct values in the bin
    • Highest data value in the bin
  2. Overflow bin
    • Data value
    • Frequency of data value

Illustration

Now, let us consider an inventory table consisting of three columns item_num, customer_name and amount. We will perform UPDATE STATISTICS on the column item_num and then see how the two bins are constructed.

Table 1. Inventory table
item_numcustomer_nameamount
1manoj92.5
1prasanna43.75
1bharath90
1ranjani78.5
1priya23.5
1radhika126.75
1vaibhav75
1harsha300.50
1vishwas20
2deepak32.5
2vinay90
2archit65.20
2vishnu32.75
2samir48.3
2ravi49.5
3srini67.5
4rahul56.0
5sudev73.0

To populate the distribution of data, run UPDATE STATISTICS in the medium or high mode. The histogram information can be retrieved from Informix's utility dbschema by invoking the following command:


$ dbschema -d <dbname> -hd [ <table> ]

The following listing shows the distribution of the above data when the UPDATE STATISTICS is run in medium mode with a resolution of 10.


Listing 1. Distribution output
                
$ dbschema -d newdb -hd Inventory

DBSCHEMA Schema Utility INFORMIX-SQL Version 11.10.FC2
Copyright IBM Corporation 1996, 2006 All rights reserved
Software Serial Number AAA#B000000

{

Distribution for informix.Inventory.item_num

Constructed on 2007-11-09 04:47:00.00000

Medium Mode, 10.000000 Resolution, 0.950000 Confidence


--- DISTRIBUTION ---

    (                     1)
 1: (  2,   2,            4)
 2: (  1,   1,            5)


--- OVERFLOW ---

 1: ( 9,                 1)
 2: ( 6,                 2)





Understanding the distribution bin

Setting the number of bins

The formula for calculating the number of bins is:
100/resolution = number of bins
A resolution of 1 means that 1% of the data goes into each bin (100/1 = 100 bins).
A resolution of 10 means that 10% of the data goes into each bin (10 bins are created).

The bin always starts with a bin number or an identifier which is sequentially incremented for every row. Each row has three columns associated with it. All the three columns are represented with the parentheses. The first column specifies the bin size. The second column specifies the number of distinct elements within the range, and the third column specifies the highest values within the range.

For example, taking the above Distribution output into consideration, the first row can be read as follows:

  • The bin size = 2
  • There are two distinct elements between the range of 1 and 4. You can see this when you look at the Inventory table. The two distinct values are 3 and 4. A really keen observer may have noticed that the values 1 and 2 are not taken into consideration. The count for this is included in the overflow bin and is not to be taken into consideration for this as well.
  • The range is between the values 1 and 4.

Understanding the overflow bin

Like the distribution bin, the overflow bin is also represented in rows. Each row begins with an identifier which gets incremented for every row. There are only two columns in this bin, each represented within the parenthesis. The first column specifies the number of times the value in the second column is repeated. The second column specifies the column value itself.

For example, in the above table, it can be seen that the C1=1 is repeated 9 times and C1=2 is repeated 6 times. Note that only those column values which cross 25% of the bin size are put into the overflow bin. The counts obtained here are absolute sure shot values which can be used for estimates.

NOTE: The output of dbschema cannot be compared to the regular notion of rows and columns. The terms row and column here are literally taken and have no connection with the row and column conception used in databases.

Illustration of a query plan

Consider two tables, customer and orders. The customer table has the attributes customer_num, zipcode and customer_name; orders has the attributes customer_num, quantity, and item_num.

Below is an example of how a query plan would look like for the query:

 select * 
from orders a, customer b 
where a.customer_num > 435 and b.zipcode > "65*" ;

When the UPDATE STATISTICS is run in the high mode, here are the results:

Listing 2. SQEXPLAIN output
                    
QUERY:
------
Completed pass 1 in 0 minutes 0 seconds

UPDATE STATISTICS:
==================

Table:          informix.customer
Mode:           HIGH
Number of Bins:      288        Bin size       11
Sort data            0.2 MB     Sort memory granted          0.2 MB
Estimated number of table scans 1
PASS #1 zipcode
Light scans enabled
Scan 0 Sort 0 Build 0 Insert 0 Close 0 Total 0
Completed pass 1 in 0 minutes 0 seconds
QUERY:
------
select * from orders a, customer b where a.customer_num > 435 and b.zipcode > "65*"

Estimated Cost: 9805
Estimated # of Rows Returned: 244530
Maximum Threads: 1

  1) informix.b: INDEX PATH

    (1) Index Keys: zipcode   (Parallel, fragments: ALL)
        Lower Index Filter: informix.b.zipcode > '65*'

  2) informix.a: INDEX PATH

    (1) Index Keys: customer_num   (Parallel, fragments: ALL)
        Lower Index Filter: informix.a.customer_num > 435
NESTED LOOP JOIN


Life without UPDATE STATISTICS

Consider a typical scenario where the importance of UPDATE STATISTICS can be demonstrated. Consider three tables, T1, T2 and T3. Assume they have 10, 100 and 1000 rows respectively. Now suppose you have to perform an 'EQUI-JOIN' operation on all the three tables. The optimizer would chalk out a query plan before performing the actual join operation itself by making use of the statistical data. You assume that this statistical data is up-to-date and move on.

Now, how many ways can you perform this join operation? Since you know that T1 * T2 is not the same as T2 * T1, it's a matter of simple permutation to find out. You have three tables and hence you would have 3! * 2! * 1! ways of performing a join operation. That is 12 different ways of doing it. For this example, just 5 cases from the 12 are used.

Case 1 -- (T1 * T2) * T3 results in 1,011,010 row accesses.
Case 2 -- (T2 * T1) * T3 results in 1,101,100 row accesses.
Case 3 -- (T1 * T3) * T2 results in 1,011,010 row accesses.
Case 4 -- T1 * (T2 * T3) results in 1,001,010 row accesses.
Case 5 -- T3 * (T2 * T1) results in 1,011,000 row accesses.

Now, suppose there were a large number of transactions that took place, and these tables were manipulated. The current scenario would result in T1, T2 and T3 having 1000, 100 and 10 rows respectively. Since the UPDATE STATISTICS are not run automatically, the System Catalog tables would not be updated and would still be retaining the old statistical data. Now, you should perform the EQUI-JOIN operation once again.

The optimizer chooses Case 4 since it determined that it has the minimum number of row accesses and therefore should be efficient. But, the plan no longer produces the best result. According to the current situation, an efficient plan would be to use Case 5 — (T3 * (T2 * T1) — whereas the optimizer chooses Case 4 — T1 * (T2 * T3). This results in a whopping '9990' extra row accesses. This figure can grow to trillions for bulky databases which drastically reduces the efficiency of the query. Had you executed the UPDATE STATISTICS statement, it would have been much more efficient. This example shows why UPDATE STATISTICS has to be run periodically.

To sum it all up, use the UPDATE STATISTICS statement to perform any of the following tasks:

  • Calculate the distribution of column values.
  • Update system catalog tables that the database server uses to optimize queries.
  • Force re-optimization of SPL routines.
  • Convert existing indexes when you upgrade the database server.

Syntax of UPDATE STATISTICS

Version 1: To UPDATE STATISTICS for the entire database

UPDATE STATISTICS [LOW | MEDIUM | HIGH];

Version 2: To UPDATE STATISTICS for a particular table in a database. All columns are updated in this case.
UPDATE STATISTICS [LOW | MEDIUM | HIGH] FOR TABLE <table_name> ;

Version 3: To UPDATE STATISTICS for a particular column of a particular table in a database.
UPDATE STATISTICS [LOW | MEDIUM | HIGH] FOR TABLE <table_name> (<column_name>);

Version 4: To UPDATE STATISTICS for a stored procedure in a database.
UPDATE STATISTICS [LOW | MEDIUM | HIGH] FOR PROCEDURE;

Version 5: To perform UPDATE STATISTICS by setting your own resolution.
UPDATE STATISTICS [LOW | MEDIUM | HIGH] FOR TABLE <table_name> RESOLUTION 10;

Tips and Tricks

The UPDATE STATISTICS command is not executed automatically. It is the responsibility of the user to run this statement manually. This command has to be executed periodically so that the optimizer has the latest data with it to come to an effective plan to fetch the data.

Ideally, it is preferred to run this statement when a bulk of data is manipulated in the database. By default, the UPDATE STATISTICS runs in the low mode. In this mode, data distributions are not generated. Never use the high mode for an entire database or all tables in a database or even all columns in a table unless the database or the tables are of a considerably smaller size.
Always remember that distributions are not created on text or byte columns.

UPDATE STATISTICS FAQs

  • What is the ideal mode that I should use for the Update Statistics command?

    There is no such thing as an "ideal" mode. The DBA should look at the current situation and then choose the mode for UPDATE STATISTICS. However, the following list should give you some hints to help you choose the best mode:

    Use UPDATE STATISTICS LOW if the number of rows has changed significantly or after migration from previous version of database server. Also use it for all columns that are not the leading column of any index.
    Use UPDATE STATISTICS MEDIUM DISTRIBUTIONS ONLY if queries have non-indexed join columns or filter columns.
    Use UPDATE STATISTICS HIGH <table> if queries have a multi-column indexed defined on join columns or filter columns.
    Use UPDATE STATISTICS HIGH ON <small tables> if queries have many small tables (fit into one extent).

  • Can I set the resolution and confidence values myself when I run the statement?

    Yes, the syntax is as follows: - UPDATE STATISTICS MEDIUM FOR TABLE <tabname> RESOLUTION 1 0.99-----> confidence

  • I find the whole process leading to a severe time and resource crunch. Don't you think it would be better to execute the query directly?

    We should always remember that only the samples of the rows are considered while preparing the statistical data and not all rows are read. Therefore, UPDATE STATISTICS cannot be compared to executing the query itself unless it is run in the high mode.

  • What should be an ideal resolution value?

    There is no such thing as an "ideal" resolution value. The value entirely depends upon the data and the application.

  • What is the default resolution and confidence for all the three modes?

    The default resolution is 0.5 for the HIGH mode and 2.5 for the MEDIUM mode.
    The default confidence is 0.99 for the HIGH mode and between 0.85 and 0.99 for the MEDIUM mode.

  • What is new regarding this feature in Cheetah?

    In Cheetah (IDS 11), IDS automatically collects index statistics on the leading index key while the index is being created. This eliminates the need to manually execute a statistics gathering command. With this feature, the query optimizer can immediately consider the index when determining an access plan. You can view statistics about completed queries in the new Query Statistics section in SET EXPLAIN. In explicit UPDATE STATISTICS operations in MEDIUM mode, a new SAMPLING SIZE option in the resolution clause specifies the minimum number of rows to sample for column distributions. The SET EXPLAIN statement now supports an optional parameter to override the default name and location of the output file.

  • What do you mean by executing UPDATE STATISTICS on a stored procedure?

    The database server re-optimizes SQL statements in the named procedure. The database server does not update the statistics in the system catalog tables.

  • What happens when I set the resolution to 0.5 and confidence to 0.99 and run UPDATE STATISTICS in the MEDIUM mode? Is this equivalent to running it in the HIGH mode?

    Yes.

  • What is the maximum number of bins that I can have?

    Ideally, the resolution is a value between 0.005 and 10. Therefore the number of bins ranges from 10 to 20,000. But practically, the maximum number of bins depends upon the disk space and any limitations imposed by the IDS.

Conclusion

The Informix Dynamic Server is a powerful database server equipped with many powerful features — UPDATE STATISTICS is one such essential feature. The article has explained the need for UPDATE STATISTICS in today's world where the databases are highly volatile due to the millions of transactions occurring in split seconds. Query optimization is crucial in dealing with databases, and UPDATE STATISTICS does just that. Since UPDATE STATISTICS is not a perfect solution, it is important for the DBA to perform trade-offs and choose the best possible solution depending on the current situation. This article has given you the information you need to use UPDATE STATISTICS in a way that will make your life much easier.



Resources

Learn

Get products and technologies

Discuss


About the authors

Bharath Sriram

Bharath Sriram works as an Associate Software Engineer at India Software Labs, IBM. He works for the IDS Integration Team across various IDS products, and is also working on the Open Admin Tool for Cheetah.


Harshavardhan Changappa

Harshavardhan Changappa is working as a Software Engineer for the IDS Integration Team at India Software Labs, IBM. He has worked on IDS certification of various Linux OS levels such as Ubuntu, Asianux, Debian etc. Harshavardhan works for the IDS Integration Team across various IDS products. He is also involved in developing automated test cases for various features of IDS.


Priyambada Behera

Priyambada Behera is a Systems Software Engineer at the IBM India Software Labs, India. She works for the IDS Integration Team across various IDS Products. She is certified in "Managing and Optimizing Informix Dynamic Server Databases".




Rate this page


Please take a moment to complete this form to help us better serve you.



 


 


Not
useful
Extremely
useful
 


Share this....

digg Digg this story del.icio.us del.icio.us Slashdot Slashdot it!



Back to top


© Copyright IBM Corporation 1994, 2008. All rights reserved. Other company, product, or service names may be trademarks or service marks of others.