UPDATE STATISTICS is a proprietary Informix SQL command that updates the facts about each table and its columns by analyzing the data and storing this information in the system catalogs that are used to estimate the cost of subsequent queries. To truly understand the importance of UPDATE STATISTICS, you need to understand what exactly happens when an SQL query is entered by the user for execution. Every SQL query that is entered must be parsed, optimized and executed.
The optimizer is the component that prepares a query plan. A query plan ideally is an optimal plan to execute the given query — that is, it determines the most efficient way to fetch the data. For this purpose, it makes use of a collection of statistical data; however, this statistical data cannot be assumed to be accurate. The preciseness of this data depends upon a large number of factors such as the type of sampling algorithm used, the number of samples taken, and the skew of the data.
The query optimizer does not recalculate the profile for tables automatically. In some cases, gathering the statistics might take longer than executing the query. To ensure that the optimizer selects a query plan that best reflects the current state of your tables, run UPDATE STATISTICS at regular intervals.
UPDATE STATISTICS should be run after initially loading the data and creating indexes. It should also be run after any significant changes to the database tables including any large scale insert, update or delete operations. If you do not implement UPDATE STATISTICS, the optimizer has inaccurate data to determine access paths.
Now, learn about the working of the command. The query optimizer bases query-cost estimates on the number of rows to be retrieved from each table. In turn, the estimated number of rows is based on the selectivity of each conditional expression that is used within the WHERE clause. A filter is a conditional expression that is used to select rows. The selectivity is a value between 0 and 1 that indicates the percentage of rows within the table that the filter can pass. A selective filter, one that passes few rows, has selectivity towards 0, and a filter that passes almost all rows has selectivity towards 1.
The optimizer can use data distributions to calculate selectivity for the filters in a query. However, in the absence of data distributions, the database server calculates selectivity for filters of different types based on table indexes.
The correctness of estimating the selectivity highly influences the cost of each execution plan. So, the accuracy of getting an optimal plan is entirely dependant on the up-to-date statistical information about the database objects involved in the query.
Each time the UPDATE STATISTICS query is run, the following system catalog tables are refreshed. The items listed under each heading are columns in tables.
- In SYSTABLES
- nindexes - Number of indexes in the table
- nrows - Number of rows in the table
- npused - Number of pages used to store 'nrows' of Tupule
- In SYSCOLUMNS
- colmin - Second minimum value of the column
- colmax - Second maximum value of the column
- In SYSINDEXES and SYSFRAGMENTS
- levels - Number of levels in the B-tree
- leaves - Number of leaves contained in the B-tree
- nunique - Number of distinct values
- clust - Clusterdness relative to the table
- In SYSDISTRIB
- Rows containing information about how the values of user data table are distributed. (So all the columns are affected)
- It also maintains the resolution and confidence level of the distribution.
The three modes updating the statistical data are low, medium and high.
The low mode populates only the scalar statistical values (that is, no distribution information) of a table. All the information like the levels of B-tree index, number of pages consumed by the table, number of distinct values of a column are stored.
In medium mode, in addition to low mode statistics, the statement sorts a sample set of column values and populates the distribution data in the sysdistrib system catalog table that represents this sample. It generally has a confidence in the range of 85% to 99%.
In high mode, in addition to low mode statistics, the statement sorts all the column values and populates the sysdistrib table with exact distribution information on all values in the table at the time of execution. While it may seem like a good idea to always use the UPDATE STATISTICS in high mode, this is not actually the case. The high mode uses too many bins to achieve higher accuracy owing to its low resolution of 0.5. Therefore, it consumes a lot of disk space, especially when the table is too big. Also, it is quite intensive to run UPDATE STATISTICS in the high mode ina production system that runs continually. In addition, the high mode does not always guarantee a perfect estimate since no execution plan is guaranteed to be 100% optimal since the plan is based on an estimate.
UPDATE STATISTICS and histograms
Informix uses histograms to store the distribution information of the data. Histograms are used to calculate the selectivity of the predicates. They are proven to produce the low-error estimate and incur almost no run-time overhead.
A histogram is used to graphically summarize and display the distribution of a process data set. Informix uses height-balanced histograms (that is, equi-height/equi-depth) with overflow buckets as opposed to the width-balanced histograms (equi-width). Having the height balanced means the same number of values in each bucket is maintained. Depending on the values, the range is determined, and a bucket represents a range. Below is an example of a histogram:
Histogram

Assume that there are 1000 rows in a table. If the number of buckets is fixed at 10, the proportion will be 1000/10; thus there are 100 rows in each bucket. This 100 represents the height. In case of a height-balanced histogram, this value (100) is fixed.
Data distributions are generated by the server for individual columns of a table, and are stored as encoded histograms in the encdat column of the sysdistrib system catalog. The default number of bins may be adequate if the data is uniformly distributed across the domain of values for that column. However, if the data is highly skewed, then a larger number of bins (smaller resolution) are required to ensure the data is not too skewed.
The magic of UPDATE STATISTICS
Basically, whenever the UPDATE STATISTICS command is run either in the medium or the high mode, two types of bins are created, namely the distribution bin and the overflow bin. Each entry in the bin is made up of the following:
- Distribution bin
- Number of rows in the bin
- Number of distinct values in the bin
- Highest data value in the bin
- Overflow bin
- Data value
- Frequency of data value
Illustration
Now, let us consider an inventory table consisting of three columns — item_num, customer_name and amount. We will perform UPDATE STATISTICS on the column item_num and then see how the two bins are constructed.
Table 1. Inventory table
| item_num | customer_name | amount |
|---|---|---|
| 1 | manoj | 92.5 |
| 1 | prasanna | 43.75 |
| 1 | bharath | 90 |
| 1 | ranjani | 78.5 |
| 1 | priya | 23.5 |
| 1 | radhika | 126.75 |
| 1 | vaibhav | 75 |
| 1 | harsha | 300.50 |
| 1 | vishwas | 20 |
| 2 | deepak | 32.5 |
| 2 | vinay | 90 |
| 2 | archit | 65.20 |
| 2 | vishnu | 32.75 |
| 2 | samir | 48.3 |
| 2 | ravi | 49.5 |
| 3 | srini | 67.5 |
| 4 | rahul | 56.0 |
| 5 | sudev | 73.0 |
To populate the distribution of data, run UPDATE STATISTICS in the medium or high mode.
The histogram information can be retrieved from Informix's utility dbschema by invoking
the following command:
$ dbschema -d <dbname> -hd [ <table> ] |
The following listing shows the distribution of the above data when the UPDATE STATISTICS is run in medium mode with a resolution of 10.
Listing 1. Distribution output
$ dbschema -d newdb -hd Inventory
DBSCHEMA Schema Utility INFORMIX-SQL Version 11.10.FC2
Copyright IBM Corporation 1996, 2006 All rights reserved
Software Serial Number AAA#B000000
{
Distribution for informix.Inventory.item_num
Constructed on 2007-11-09 04:47:00.00000
Medium Mode, 10.000000 Resolution, 0.950000 Confidence
--- DISTRIBUTION ---
( 1)
1: ( 2, 2, 4)
2: ( 1, 1, 5)
--- OVERFLOW ---
1: ( 9, 1)
2: ( 6, 2)
|
Understanding the distribution bin
The bin always starts with a bin number or an identifier which is sequentially incremented for every row. Each row has three columns associated with it. All the three columns are represented with the parentheses. The first column specifies the bin size. The second column specifies the number of distinct elements within the range, and the third column specifies the highest values within the range.
For example, taking the above Distribution output into consideration, the first row can be read as follows:
- The bin size = 2
- There are two distinct elements between the range of 1 and 4. You can see this when you look at the Inventory table. The two distinct values are 3 and 4. A really keen observer may have noticed that the values 1 and 2 are not taken into consideration. The count for this is included in the overflow bin and is not to be taken into consideration for this as well.
- The range is between the values 1 and 4.
Understanding the overflow bin
Like the distribution bin, the overflow bin is also represented in rows. Each row begins with an identifier which gets incremented for every row. There are only two columns in this bin, each represented within the parenthesis. The first column specifies the number of times the value in the second column is repeated. The second column specifies the column value itself.
For example, in the above table, it can be seen that the C1=1 is repeated 9 times and C1=2 is repeated 6 times. Note that only those column values which cross 25% of the bin size are put into the overflow bin. The counts obtained here are absolute sure shot values which can be used for estimates.
NOTE: The output of dbschema cannot be compared to the regular notion of rows and columns. The terms row and column here are literally taken and have no connection with the row and column conception used in databases.
Consider two tables, customer and orders. The customer table has the
attributes customer_num, zipcode and customer_name; orders has the
attributes customer_num, quantity, and item_num.
Below is an example of how a query plan would look like for the query:
select * from orders a, customer b where a.customer_num > 435 and b.zipcode > "65*" ; |
When the UPDATE STATISTICS is run in the high mode, here are the results:
Listing 2. SQEXPLAIN output
QUERY:
------
Completed pass 1 in 0 minutes 0 seconds
UPDATE STATISTICS:
==================
Table: informix.customer
Mode: HIGH
Number of Bins: 288 Bin size 11
Sort data 0.2 MB Sort memory granted 0.2 MB
Estimated number of table scans 1
PASS #1 zipcode
Light scans enabled
Scan 0 Sort 0 Build 0 Insert 0 Close 0 Total 0
Completed pass 1 in 0 minutes 0 seconds
QUERY:
------
select * from orders a, customer b where a.customer_num > 435 and b.zipcode > "65*"
Estimated Cost: 9805
Estimated # of Rows Returned: 244530
Maximum Threads: 1
1) informix.b: INDEX PATH
(1) Index Keys: zipcode (Parallel, fragments: ALL)
Lower Index Filter: informix.b.zipcode > '65*'
2) informix.a: INDEX PATH
(1) Index Keys: customer_num (Parallel, fragments: ALL)
Lower Index Filter: informix.a.customer_num > 435
NESTED LOOP JOIN
|
Life without UPDATE STATISTICS
Consider a typical scenario where the importance of UPDATE STATISTICS can be demonstrated. Consider three tables, T1, T2 and T3. Assume they have 10, 100 and 1000 rows respectively. Now suppose you have to perform an 'EQUI-JOIN' operation on all the three tables. The optimizer would chalk out a query plan before performing the actual join operation itself by making use of the statistical data. You assume that this statistical data is up-to-date and move on.
Now, how many ways can you perform this join operation? Since you know that T1 * T2 is not the same as T2 * T1, it's a matter of simple permutation to find out.
You have three tables and hence you would have 3! * 2! * 1! ways of performing a join operation.
That is 12 different ways of doing it. For this example, just 5 cases from the 12 are used.
Case 1 -- (T1 * T2) * T3 results in 1,011,010 row accesses.
Case 2 -- (T2 * T1) * T3 results in 1,101,100 row accesses.
Case 3 -- (T1 * T3) * T2 results in 1,011,010 row accesses.
Case 4 -- T1 * (T2 * T3) results in 1,001,010 row accesses.
Case 5 -- T3 * (T2 * T1) results in 1,011,000 row accesses.
Now, suppose there were a large number of transactions that took place, and these tables
were manipulated. The current scenario would result in T1, T2 and T3 having 1000, 100 and
10 rows respectively. Since the UPDATE STATISTICS are not run automatically, the System
Catalog tables would not be updated and would still be retaining the old statistical data.
Now, you should perform the EQUI-JOIN operation once again.
The optimizer chooses Case 4 since it determined that it has the minimum number of row accesses and therefore should be efficient.
But, the plan no longer produces the best result.
According to the current situation, an efficient plan would be to use Case 5
— (T3 * (T2 *
T1) — whereas the optimizer chooses Case 4 — T1 * (T2 * T3).
This results in a whopping '9990' extra row accesses. This figure can grow to trillions
for bulky databases which drastically reduces the efficiency of the query. Had you executed the UPDATE
STATISTICS statement, it would have been much more efficient. This example shows why UPDATE
STATISTICS has to be run periodically.
To sum it all up, use the UPDATE STATISTICS statement to perform any of the following tasks:
- Calculate the distribution of column values.
- Update system catalog tables that the database server uses to optimize queries.
- Force re-optimization of SPL routines.
- Convert existing indexes when you upgrade the database server.
Version 1: To UPDATE STATISTICS for the entire database
UPDATE STATISTICS [LOW | MEDIUM | HIGH]; |
Version 2: To UPDATE STATISTICS for a particular table in a database. All columns are updated in this case.
UPDATE STATISTICS [LOW | MEDIUM | HIGH] FOR TABLE <table_name> ; |
Version 3: To UPDATE STATISTICS for a particular column of a particular table in a database.
UPDATE STATISTICS [LOW | MEDIUM | HIGH] FOR TABLE <table_name> (<column_name>); |
Version 4: To UPDATE STATISTICS for a stored procedure in a database.
UPDATE STATISTICS [LOW | MEDIUM | HIGH] FOR PROCEDURE; |
Version 5: To perform UPDATE STATISTICS by setting your own resolution.
UPDATE STATISTICS [LOW | MEDIUM | HIGH] FOR TABLE <table_name> RESOLUTION 10; |
The UPDATE STATISTICS command is not executed automatically. It is the responsibility of the user to run this statement manually. This command has to be executed periodically so that the optimizer has the latest data with it to come to an effective plan to fetch the data.
Ideally, it is preferred to run this statement when a bulk of data is manipulated in the database.
By default, the UPDATE STATISTICS runs in the low mode. In this mode, data distributions are not generated.
Never use the high mode for an entire database or all tables in a database or even all columns in a table unless the database or the tables are of a considerably smaller size.
Always remember that distributions are not created on text or byte columns.
- What is the ideal mode that I should use for the Update Statistics command?
There is no such thing as an "ideal" mode. The DBA should look at the current situation and then choose the mode for UPDATE STATISTICS. However, the following list should give you some hints to help you choose the best mode:
Use UPDATE STATISTICS LOW if the number of rows has changed significantly or after migration from previous version of database server. Also use it for all columns that are not the leading column of any index.
Use UPDATE STATISTICS MEDIUM DISTRIBUTIONS ONLY if queries have non-indexed join columns or filter columns.
Use UPDATE STATISTICS HIGH <table> if queries have a multi-column indexed defined on join columns or filter columns.
Use UPDATE STATISTICS HIGH ON <small tables> if queries have many small tables (fit into one extent). - Can I set the resolution and confidence values myself when I run the statement?
Yes, the syntax is as follows: - UPDATE STATISTICS MEDIUM FOR TABLE <tabname> RESOLUTION 1 0.99-----> confidence - I find the whole process leading to a severe time and resource crunch. Don't you think it would be better to execute the query directly?
We should always remember that only the samples of the rows are considered while preparing the statistical data and not all rows are read. Therefore, UPDATE STATISTICS cannot be compared to executing the query itself unless it is run in the high mode. - What should be an ideal resolution value?
There is no such thing as an "ideal" resolution value. The value entirely depends upon the data and the application. - What is the default resolution and confidence for all the three modes?
The default resolution is 0.5 for the HIGH mode and 2.5 for the MEDIUM mode.
The default confidence is 0.99 for the HIGH mode and between 0.85 and 0.99 for the MEDIUM mode. - What is new regarding this feature in Cheetah?
In Cheetah (IDS 11), IDS automatically collects index statistics on the leading index key while the index is being created. This eliminates the need to manually execute a statistics gathering command. With this feature, the query optimizer can immediately consider the index when determining an access plan. You can view statistics about completed queries in the new Query Statistics section in SET EXPLAIN. In explicit UPDATE STATISTICS operations in MEDIUM mode, a new SAMPLING SIZE option in the resolution clause specifies the minimum number of rows to sample for column distributions. The SET EXPLAIN statement now supports an optional parameter to override the default name and location of the output file. - What do you mean by executing UPDATE STATISTICS on a stored procedure?
The database server re-optimizes SQL statements in the named procedure. The database server does not update the statistics in the system catalog tables. - What happens when I set the resolution to 0.5 and confidence to 0.99 and run
UPDATE STATISTICS in the MEDIUM mode? Is this equivalent to running it in the HIGH mode?
Yes. - What is the maximum number of bins that I can have?
Ideally, the resolution is a value between 0.005 and 10. Therefore the number of bins ranges from 10 to 20,000. But practically, the maximum number of bins depends upon the disk space and any limitations imposed by the IDS.
The Informix Dynamic Server is a powerful database server equipped with many powerful features — UPDATE STATISTICS is one such essential feature. The article has explained the need for UPDATE STATISTICS in today's world where the databases are highly volatile due to the millions of transactions occurring in split seconds. Query optimization is crucial in dealing with databases, and UPDATE STATISTICS does just that. Since UPDATE STATISTICS is not a perfect solution, it is important for the DBA to perform trade-offs and choose the best possible solution depending on the current situation. This article has given you the information you need to use UPDATE STATISTICS in a way that will make your life much easier.
Learn
- Informix Dynamic Server v11 Information Center:
Access detailed information about IDS.
- IBM Informix Guide to SQL, Reference, V11.1:
Get details on the system catalog tables.
- IBM Informix Guide to SQL, Syntax, V11.1:
Get more information on syntax for UPDATE STATISTICS and other SQL statements.,
- IBM Informix Dynamic Server Performance Guide, V11.1:
Get more information on how to configure and operate IDS to achieve optimum performance.
-
"An Overview of the IBM Informix Dynamic Server Optimizer"
(developerWorks, Jul 2005) gives you a basic understanding of the IDS optimizer.
- Informix Unleashed by John McNally, Glenn Miller, Jim Prajesh,Jose Fortuny, et al: Get up to speed with the knowledge and information you need to use Informix products successfully.
Get products and technologies
-
Download a free trial version of Informix Dynamic Server.
Discuss

Bharath Sriram is an Associate Software Engineer at India Software Labs, IBM. He works for the IDS Integration Team across various IDS Products and also works on the OpenAdmin Tool for IDS 11.

Harshavardhan Changappa is working as a Software Engineer for the IDS Integration Team at India Software Labs, IBM. He has worked on IDS certification of various Linux OS levels such as Ubuntu, Asianux, Debian etc. Harshavardhan works for the IDS Integration Team across various IDS products. He is also involved in developing automated test cases for various features of IDS.
Comments (Undergoing maintenance)






