Skip to main content

By clicking Submit, you agree to the developerWorks terms of use.

The first time you sign into developerWorks, a profile is created for you. Select information in your developerWorks profile is displayed to the public, but you may edit the information at any time. Your first name, last name (unless you choose to hide them), and display name will accompany the content that you post.

All information submitted is secure.

  • Close [x]

The first time you sign in to developerWorks, a profile is created for you, so you need to choose a display name. Your display name accompanies the content you post on developerworks.

Please choose a display name between 3-31 characters. Your display name must be unique in the developerWorks community and should not be your email address for privacy reasons.

By clicking Submit, you agree to the developerWorks terms of use.

All information submitted is secure.

  • Close [x]

When We Think That the Optimizer Doesn't Get It Right

Alexander Kuznetsov, Software designer, Chicago, IL
Alexander Kuznetsov has 15 years of experience in software design, development and database administration. Currently he designs a multi-terabyte clustered database in DB2 UDB EEE . Alexander is an IBM Certified Advanced Technical Expert (DB2 Cluster) and an IBM Certified Solutions Expert (Database Administration and Application Development). He can be reached at alkuzo at mindspring.com and the comp.databases.ibm-db2 newsgroup.

Summary:  Taken from real life experience, author Alexander Kuznetsov shows you some simple tips to make sure the optimizer makes good decisions for your queries.

Date:  09 Mar 2004
Level:  Introductory

Activity:  2447 views
Comments:  

Introduction

IBM® DB2® Universal Database™ comes with a very intelligent optimizer, yet sometimes its choices may seem off the mark. The optimizer, however complex it might be, is still a program. It processes input data, such as physical database structure and statistics, and generates execution plans. If we think the optimizer just isn't getting it right, let's try to provide it with some better input and see what happens. Maybe the optimizer's choice is correct after all (it usually is). This article provides several examples in which collecting current and complete statistics, adding appropriate constraints and setting proper optimization level results in better execution plans.


Why two nearly identical queries can run very differently

Let's consider a very typical scenario: The query SELECT * FROM CUSTOMER WHERE STATE = 'IN' runs very slowly. A very similar query, SELECT * FROM CUSTOMER WHERE STATE = 'MI', repeatedly runs more than 10 times faster. We check the obvious first: Is there an index on the STATE column? There is. Next, we check to see if there is a significant difference in the number of customers in each state. The following queries show that there is a seemingly insignificant difference in the number of customers in each state:

SELECT STATE, COUNT(*) AS NUM_CUST FROM CUSTOMER GROUP BY STATE WHERE
                   STATE IN('IN', 'MI')  

STATE NUM_CUST 
----- ----------
 IN    19071
 MI    18554

SELECT COUNT(*) AS NUM_CUST FROM CUSTOMER

NUM_CUST 
---------
 2007931 

When we look at the execution plans, we discover that the slower query is executed as a tablespace scan, and the faster one uses the index. What is the difference? Why doesn't the optimizer choose the same efficient plan for both queries?

Let's have a closer look at the statistics relevant to this table. In DB2, statistics are very conveniently exposed by system views in SYSCAT and SYSSTAT schemas. (For more details on statistics refer to the DB2 Administration Guide section on Performance). In our case, the statistics aren't very recent (in the listing below, see STATS_TIME, which is two months old). Also the total amount of records (now 2007931) has dramatically increased after the statistics were last collected (in the listing below, see CARD):

SELECT STATS_TIME, CARD FROM SYSCAT.TABLES WHERE TABNAME = 'CUSTOMER' 

STATS_TIME                   CARD 
--------------------------   ------
2002-10-01-08.49.49.117405    59616 

While the table has already exceeded two million rows, the optimizer looks up the statistics and estimates that the table still has less than 60,000 records. On top of that, the distribution of values in the STATE column has significantly changed, too (TYPE = 'F' stands for most frequent values):

SELECT TYPE,SEQNO, VALCOUNT, CAST(COLVALUE AS CHAR(30))  
AS COLVALUE FROM SYSSTAT.COLDIST WHERE TABSCHEMA='DB2INST1'
  AND TABNAME='CUSTOMER_DATA' AND COLNAME='STATE' AND TYPE = 'F'  

TYPE SEQNO  VALCOUNT             COLVALUE
---- ------ -------------------- --------------------------
F         1                19071 'IN'                           
F         2                18554 'SC'                           
F         3                11061 'CA'                           
F         4                 5857 'TN'                           
F         5                 2741 'KY'                           
F         6                 1065 'MO'                           
F         7                  220 'IL'                           
F         8                   90 'WI'                           
F         9                   26 'MI'                           
F        10                    4 'FL'                           

(This query retrieves the 10 most frequent values in this column.) Note that at the time the statistics were last collected, customers from Indiana (STATE='IN') accounted for over 30 percent of all the customers, compared to 0.05 percent customers for the state of Michigan (STATE='MI'). Let's refresh the statistics, including distributions:

 RUNSTATS ON TABLE MYSCHEMA.CUSTOMER WITH DISTRIBUTION AND DETAILED INDEXES ALL 

After we did that, the optimizer used index access to execute the formerly slow query, which now ran much faster. (For the full syntax of the RUNSTATS command, refer to the (Command Reference). As we have seen, it is essential to keep statistics current.

But let us not think it is always better to use an index if an appropriate one exists.


Why a tablespace scan is sometimes preferable to an index scan

Consider the same query, SELECT * FROM CUSTOMER WHERE STATE = ?. Believe it or not, sometimes it is more efficient to perform a tablespace scan than to access records via an index. Sounds surprising? Yes, it might take some analysis to arrive at this conclusion. Let's begin with some benchmarking; the explanations will follow later.

Some 18 percent of the records in the CUSTOMER table match the criteria WHERE STATE='IL'. Having analyzed the query SELECT * FROM CUSTOMER WHERE STATE = 'IL', the optimizer has chosen a tablespace scan to execute it. Let's save this query in a file named select.sql, and use a benchmarking tool (db2batch) to measure the real costs of executing the query.

db2batch -d MY_DB -f select.sql -r benchmark.txt -o p3  

Number of rows retrieved is:    19998 
Number of rows sent to output is:   19998  
Elapsed Time is:           5.540      seconds    
Locks held currently                         = 0 
Lock escalations                             = 0 
Total sorts                                  = 0 
Total sort time (ms)                         = 0 
Sort overflows                               = 0 
Buffer pool data logical reads               = 2721 
Buffer pool data physical reads              = 2580 
Buffer pool data writes                      = 0 
Buffer pool index logical reads              = 0 
Buffer pool index physical reads             = 0  

(For more details on db2batch see the Command Reference.)

Now let's cheat the optimizer into choosing index scan for executing the same query. Let's make the statistics look as if there were an imaginary state (STATE='IM') with many customers, and let's make it appear that the number of customers from Illinois are negligible, if any. Because the views in SYSSTAT schema are updateable, let's update one of them:

UPDATE SYSSTAT.COLDIST SET COLVALUE='IM' WHERE  
TABSCHEMA='DB2INST1' AND TABNAME='CUSTOMER_DATA' AND COLNAME='STATE'   
--AND TYPE = 'F'  
AND COLVALUE='''IL''' 

Tip: In the WHERE clause you must wrap the column value in quotes (COLVALUE='''IL''').

Now, based on these statistics, the optimizer concludes that only a tiny number of records might have STATE='IL'. Consequently, it chooses an access plan using the index on STATE (see the small estimated cardinality):

 SQL Statement:

    SELECT *    FROM CUSTOMER_DATA
    WHERE STATE='IL'  

Estimated Cost        = 50 
Estimated Cardinality = 1

Access Table Name = DB2INST1.CUSTOMER_DATA  ID = 2,5 
|  #Columns = 13 
|  Index Scan:  Name = DB2INST1.CUST_STATE  ID = 5 
|  |  Index Columns: 
|  |  |  1: STATE (Ascending) 
|  |  #Key Columns = 1 
|  |  |  Start Key: Inclusive Value 
|  |  |  |  1: 'IL' 
|  |  |  Stop Key: Inclusive Value 
|  |  |  |  1: 'IL' 
|  |  Data Prefetch: None 
|  |  Index Prefetch: None 

Now let's use db2batch to do the second benchmark:

Number of rows retrieved is:    19998 
Number of rows sent to output is:   19998  
Elapsed Time is:           5.976      seconds    
Locks held currently                         = 0 
Lock escalations                             = 1 
Total sorts                                  = 0 
Total sort time (ms)                         = 0 
Sort overflows                               = 0 
Buffer pool data logical reads               = 19998 
Buffer pool data physical reads              = 2614 
Buffer pool data writes                      = 0 
Buffer pool index logical reads              = 138 
Buffer pool index physical reads             = 28 

Obviously, the cheating did not do us any good. The elapsed time is actually smaller when using the tablespace scan over the index in this situation.

Important: We have manually updated statistics to perform some "what if" analysis on a test database. This is a perfectly legitimate usage of updateable views in SYSSTAT schema. However, we should not under normal circumstances ever update statistics in production databases.

Now let's explain what's happening. I once heard a 5-year old boy say, "It won't hurt to spend just one second in jail, so it won't hurt to spend just two seconds in jail, so it won't hurt to spend just three seconds in jail ..." Similarly, it's faster to read just one record via an index, it's faster to read just two records via an index, so on, but only up to a certain point and no more.

Based on the statistics, the optimizer estimates that 18 percent of the records would match the criteria STATE='IL'. Also it expects these records to be distributed more or less evenly across the whole table, because the cluster ratio of the index on STATE is very low, less than 0.1. (For more information on cluster ratio, refer to DB2 Administration Guide section on Performance.) This means that there is going to be at least one matching record on almost every page in the table, anyway. Tablespace scans use pre-fetching, which means that the database engine reads several adjacent pages in one efficient read operation. Tablespace scan is the most efficient way to read all the pages in the table. However efficient an index scan might be, there is still the extra work of scanning through the index.

For more information on pre-fetching, refer to:

  • Syntax of CREATE TABLESPACE statement and its PREFETCHSIZE option in the SQL Reference.
  • The default prefetch size (DFT_PREFETCH_SZ) configuration parameter in the DB2 Administration Guide section on Performance.

So the optimizer's choice of a tablespace scan was correct after all, however surprising that might seem. We have seen that in this situation index access would be definitely less efficient.


Why MIN is sometimes calculated much faster than MAX

The query, SELECT MIN(TOTAL_AMOUNT) FROM CUSTOMER, looked up the value in an existing index on TOTAL_AMOUNT and returned the answer instantly. However, a very similar query, SELECT MAX(TOTAL_AMOUNT) FROM CUSTOMER, took much longer to complete. The execution plan indicated that the optimizer had chosen to scan the whole index to calculate MAX. Why?

In this particular case, there were no better choices. The index on TOTAL_AMOUNT did not allow reverse scans:

SELECT REVERSE_SCANS FROM SYSCAT.INDEXES WHERE  
INDNAME = 'CUSTOMER_AMT'  

REVERSE_SCANS 
-------------
N 

After dropping the index and recreating it with the option ALLOW REVERSE SCANS, both queries started running equally fast.

CREATE INDEX CUSTOMER_AMT ON CUSTOMER(TOTAL_AMOUNT) ALLOW REVERSE SCANS 
RUNSTATS ON TABLE MYSCHEMA.CUSTOMER FOR INDEX MYSCHEMA. CUSTOMER_AMT 

By default, DB2 indexes do not allow reverse scans.

Tip: Whenever you create a PRIMARY KEY, FOREIGN KEY, or UNIQUE constraint inside a CREATE TABLE statement, an index is implicitly created. That index does not allow reverse scans.

You can override the default behavior:

  1. Create a table with no constraints (or drop the existing constraints).
  2. Create an appropriate index.
  3. Create the constraint using ALTER TABLE SQL statement.

For example:

ALTER TABLE CUSTOMER DROP PRIMARY KEY; 
--or create a table not defining a primary key 
CREATE UNIQUE INDEX CUSTOMER_ID ON CUSTOMER(ID) ALLOW REVERSE SCANS; 
ALTER TABLE CUSTOMER ADD PRIMARY KEY(ID); 

DB2 gives you a warning and reuses the index created in step 2.

As we have seen, indexes that allow reverse scans are a must in some very common situations.


Eliminating unnecessary joins

Let's consider the following view:

CREATE VIEW CUSTOMER_ORDER_LIST 
AS 
SELECT 
CUSTOMER_ORDER.CUSTOMER_ID 
CUSTOMER.LAST_NAME 
CUSTOMER.FIRST_NAME 
CUSTOMER.PHONE 
CUSTOMER.EMAIL 
CUSTOMER_ORDER.ORDER_DT 
CUSTOMER_ORDER.AMOUNT 
CUSTOMER_ORDER.STATUS 
FROM CUSTOMER JOIN CUSTOMER_ORDER  
ON CUSTOMER.ID = CUSTOMER_ORDER.CUSTOMER_ID 

All the records in the CUSTOMER_ORDER table have parent records in the CUSTOMER table. This business rule is maintained by triggers rather than by a foreign key constraint. (Don't ask me why. All I can say is I've seen it many times in production databases.)

Consider the query:

 SELECT CUSTOMER_ID, ORDER_DT, AMOUNT, STATUS FROM CUSTOMER_ORDER_LIST 

You would think that there is no need to access CUSTOMER table at all, because all the necessary information is in the view from the CUSTOMER_ORDER table, correct?

Not really. For some reason, the optimizer chooses to access an index on the CUSTOMER table:

Estimated Cost        = 25693  

Access Table Name = DB2INST1.CUSTOMER  ID = 2,5 
|  #Columns = 1 
|  Index Scan:  Name = SYSIBM.SQL021126111001110  ID = 3 
|  |  Index Columns: 
|  |  |  1: ID (Ascending) 
|  |  #Key Columns = 0 
|  |  |  Start Key: Beginning of Index 
|  |  |  Stop Key: End of Index 
|  |  Index-Only Access 
|  |  Index Prefetch: Eligible 199 
|  Lock Intents 
|  |  Table: Intent Share 
|  |  Row  : Next Key Share Merge Join 
|  Access Table Name = DB2INST1.CUSTOMER_ORDER  ID = 2,6 

(This is only partial output.)

Why access the CUSTOMER table at all? The optimizer's choice actually makes perfect sense: You could easily drop a trigger, insert a record violating referential integrity into the CUSTOMER_ORDER table and recreate the trigger. The record would remain in the CUSTOMER_ORDER table, which means that presence of such a trigger does not guarantee referential integrity. Which means that the optimizer must assume that there could be some records in the CUSTOMER_ORDER table that do not have matching ones in the CUSTOMER table, so the lookup in the index on CUSTOMER table is mandatory.

Now let's create an appropriate constraint and see what happens:

 ALTER TABLE CUSTOMER_ORDER ADD FOREIGN KEY(CUSTOMER_ID) REFERENCES CUSTOMER(ID) 

Should there be any records violating this constraint, this statement would fail. Now the optimizer is able to eliminate an unnecessary join and the query runs faster:

Estimated Cost        = 18067  

Access Table Name = DB2INST1.CUSTOMER_ORDER  ID = 2,6 
|  #Columns = 1 
|  Relation Scan 
|  |  Prefetch: Eligible 
|  Lock Intents 
|  |  Table: Intent Share 
|  |  Row  : Next Key Share 
|  Return Data to Application 
|  |  #Columns = 1 Return Data Completion 

As we have seen, adding a foreign key constraint provided the optimizer with some very useful data. The optimizer rewarded us with a much more efficient execution plan.


When good decisions are better than quick ones

Once when I was looking for a new job I came across two open positions from one and the same employer for one and the same project. For that project, they needed a project manager and a technical leader. Among the requirements they listed were:

  • For the project manager: "The ability to make quick decisions."
  • For the technical leader: "The ability to make good decisions."

How very true!

For low optimization levels, the optimizer must act quickly. Whatever up-to-date and detailed statistics we have managed to provide, the optimizer simply might not have enough time to analyze it. All the examples in the previous chapters were run under the default optimization level 5. Should we reconsider our previous example with a low optimization level 1, adding a referential integrity constraint would not result in a better plan.

If you want good decisions rather than quick ones, set the optimization level accordingly.

For more information on optimization levels, refer to the DB2 Administration Guide section on Performance and Implementation.


Conclusion

The DB2 optimizer is very intelligent. However, based on incorrect information, it may come to less than optimal conclusions. We have seen how to:

  • Detect incorrect or incomplete statistics.
  • Provide the optimizer with correct and complete statistics.
  • Update views in SYSSTATS schema and perform "what if" experiments in test environments.

Performance tuning has never been easy. There are no hard and fast rules in query optimization. Check the optimization level, keep statistics current and make sure business rules are implemented as constraint whenever possible. I hope this article will help in the many challenges database developers handle.

Good luck!


Acknowledgement

The author would like to thank Mike Pittinger for his help.


DB2, DB2 Universal Database, and IBM are trademarks or registered trademarks of IBM Corporation in the United States, other countries, or both.

IBM copyright and trademark information


Resources

About the author

Alexander Kuznetsov

Alexander Kuznetsov has 15 years of experience in software design, development and database administration. Currently he designs a multi-terabyte clustered database in DB2 UDB EEE . Alexander is an IBM Certified Advanced Technical Expert (DB2 Cluster) and an IBM Certified Solutions Expert (Database Administration and Application Development). He can be reached at alkuzo at mindspring.com and the comp.databases.ibm-db2 newsgroup.

Report abuse help

Report abuse

Thank you. This entry has been flagged for moderator attention.


Report abuse help

Report abuse

Report abuse submission failed. Please try again later.


developerWorks: Sign in


Need an IBM ID?
Forgot your IBM ID?


Forgot your password?
Change your password

By clicking Submit, you agree to the developerWorks terms of use.

 


The first time you sign into developerWorks, a profile is created for you. Select information in your developerWorks profile is displayed to the public, but you may edit the information at any time. Your first name, last name (unless you choose to hide them), and display name will accompany the content that you post.

Choose your display name

The first time you sign in to developerWorks, a profile is created for you, so you need to choose a display name. Your display name accompanies the content you post on developerWorks.

Please choose a display name between 3-31 characters. Your display name must be unique in the developerWorks community and should not be your email address for privacy reasons.

(Must be between 3 – 31 characters.)

By clicking Submit, you agree to the developerWorks terms of use.

 


Rate this article

Comments

Help: Update or add to My dW interests

What's this?

This little timesaver lets you update your My developerWorks profile with just one click! The general subject of this content (AIX and UNIX, Information Management, Lotus, Rational, Tivoli, WebSphere, Java, Linux, Open source, SOA and Web services, Web development, or XML) will be added to the interests section of your profile, if it's not there already. You only need to be logged in to My developerWorks.

And what's the point of adding your interests to your profile? That's how you find other users with the same interests as yours, and see what they're reading and contributing to the community. Your interests also help us recommend relevant developerWorks content to you.

View your My developerWorks profile

Return from help

Help: Remove from My dW interests

What's this?

Removing this interest does not alter your profile, but rather removes this piece of content from a list of all content for which you've indicated interest. In a future enhancement to My developerWorks, you'll be able to see a record of that content.

View your My developerWorks profile

Return from help

static.content.url=http://www.ibm.com/developerworks/js/artrating/
SITE_ID=1
Zone=Information Management
ArticleID=14092
ArticleTitle=When We Think That the Optimizer Doesn't Get It Right
publish-date=03092004
author1-email=
author1-email-cc=

Tags

Help
Use the search field to find all types of content in My developerWorks with that tag.

Use the slider bar to see more or fewer tags.

For articles in technology zones (such as Java technology, Linux, Open source, XML), Popular tags shows the top tags for all technology zones. For articles in product zones (such as Info Mgmt, Rational, WebSphere), Popular tags shows the top tags for just that product zone.

For articles in technology zones (such as Java technology, Linux, Open source, XML), My tags shows your tags for all technology zones. For articles in product zones (such as Info Mgmt, Rational, WebSphere), My tags shows your tags for just that product zone.

Use the search field to find all types of content in My developerWorks with that tag. Popular tags shows the top tags for this particular content zone (for example, Java technology, Linux, WebSphere). My tags shows your tags for this particular content zone (for example, Java technology, Linux, WebSphere).

Try IBM PureSystems. No charge.

Special offers