IBM® DB2® Universal Database™ comes with a very intelligent optimizer, yet sometimes its choices may seem off the mark. The optimizer, however complex it might be, is still a program. It processes input data, such as physical database structure and statistics, and generates execution plans. If we think the optimizer just isn't getting it right, let's try to provide it with some better input and see what happens. Maybe the optimizer's choice is correct after all (it usually is). This article provides several examples in which collecting current and complete statistics, adding appropriate constraints and setting proper optimization level results in better execution plans.
Why two nearly identical queries can run very differently
Let's consider a very typical scenario: The query SELECT *
FROM CUSTOMER WHERE STATE = 'IN' runs very slowly. A very
similar query, SELECT * FROM CUSTOMER WHERE STATE =
'MI', repeatedly runs more than 10 times faster. We check
the obvious first: Is there an index on the STATE column? There is.
Next, we check to see if there is a significant difference in the
number of customers in each state. The following queries show that
there is a seemingly insignificant difference in the number of
customers in each state:
SELECT STATE, COUNT(*) AS NUM_CUST FROM CUSTOMER GROUP BY STATE WHERE
STATE IN('IN', 'MI')
STATE NUM_CUST
----- ----------
IN 19071
MI 18554
SELECT COUNT(*) AS NUM_CUST FROM CUSTOMER
NUM_CUST
---------
2007931
|
When we look at the execution plans, we discover that the slower query is executed as a tablespace scan, and the faster one uses the index. What is the difference? Why doesn't the optimizer choose the same efficient plan for both queries?
Let's have a closer look at the statistics relevant to this
table. In DB2, statistics are very conveniently exposed by system
views in SYSCAT and SYSSTAT schemas. (For
more details on statistics refer to the
DB2 Administration Guide section on Performance). In our
case, the statistics aren't very recent (in the listing below, see
STATS_TIME, which is two months old). Also the total
amount of records (now 2007931) has dramatically increased after
the statistics were last collected (in the listing below, see
CARD):
SELECT STATS_TIME, CARD FROM SYSCAT.TABLES WHERE TABNAME = 'CUSTOMER' STATS_TIME CARD -------------------------- ------ 2002-10-01-08.49.49.117405 59616 |
While the table has already exceeded two million rows, the
optimizer looks up the statistics and estimates that the table
still has less than 60,000 records. On top of that, the
distribution of values in the STATE column has
significantly changed, too (TYPE = 'F' stands for most
frequent values):
SELECT TYPE,SEQNO, VALCOUNT, CAST(COLVALUE AS CHAR(30)) AS COLVALUE FROM SYSSTAT.COLDIST WHERE TABSCHEMA='DB2INST1' AND TABNAME='CUSTOMER_DATA' AND COLNAME='STATE' AND TYPE = 'F' TYPE SEQNO VALCOUNT COLVALUE ---- ------ -------------------- -------------------------- F 1 19071 'IN' F 2 18554 'SC' F 3 11061 'CA' F 4 5857 'TN' F 5 2741 'KY' F 6 1065 'MO' F 7 220 'IL' F 8 90 'WI' F 9 26 'MI' F 10 4 'FL' |
(This query retrieves the 10 most frequent values in this
column.) Note that at the time the statistics were last collected,
customers from Indiana (STATE='IN') accounted for over
30 percent of all the customers, compared to 0.05 percent customers
for the state of Michigan (STATE='MI'). Let's refresh
the statistics, including distributions:
RUNSTATS ON TABLE MYSCHEMA.CUSTOMER WITH DISTRIBUTION AND DETAILED INDEXES ALL |
After we did that, the optimizer used index access to execute
the formerly slow query, which now ran much faster. (For the full
syntax of the RUNSTATS command, refer to the
(Command Reference). As we have seen, it is essential to keep
statistics current.
But let us not think it is always better to use an index if an appropriate one exists.
Why a tablespace scan is sometimes preferable to an index scan
Consider the same query, SELECT * FROM CUSTOMER WHERE
STATE = ?. Believe it or not, sometimes it is more efficient
to perform a tablespace scan than to access records via an index.
Sounds surprising? Yes, it might take some analysis to arrive at
this conclusion. Let's begin with some benchmarking; the
explanations will follow later.
Some 18 percent of the records in the CUSTOMER
table match the criteria WHERE STATE='IL'. Having
analyzed the query SELECT * FROM CUSTOMER WHERE STATE =
'IL', the optimizer has chosen a tablespace scan to execute
it. Let's save this query in a file named select.sql,
and use a benchmarking tool (db2batch) to measure the real costs of
executing the query.
db2batch -d MY_DB -f select.sql -r benchmark.txt -o p3 Number of rows retrieved is: 19998 Number of rows sent to output is: 19998 Elapsed Time is: 5.540 seconds Locks held currently = 0 Lock escalations = 0 Total sorts = 0 Total sort time (ms) = 0 Sort overflows = 0 Buffer pool data logical reads = 2721 Buffer pool data physical reads = 2580 Buffer pool data writes = 0 Buffer pool index logical reads = 0 Buffer pool index physical reads = 0 |
(For more details on db2batch see the Command Reference.)
Now let's cheat the optimizer into choosing index scan for
executing the same query. Let's make the statistics look as if
there were an imaginary state (STATE='IM') with many
customers, and let's make it appear that the number of customers
from Illinois are negligible, if any. Because the views in
SYSSTAT schema are updateable, let's update one of
them:
UPDATE SYSSTAT.COLDIST SET COLVALUE='IM' WHERE TABSCHEMA='DB2INST1' AND TABNAME='CUSTOMER_DATA' AND COLNAME='STATE' --AND TYPE = 'F' AND COLVALUE='''IL''' |
Tip: In the WHERE clause you must wrap the
column value in quotes (COLVALUE='''IL''').
Now, based on these statistics, the optimizer concludes that
only a tiny number of records might have STATE='IL'.
Consequently, it chooses an access plan using the index on
STATE (see the small estimated cardinality):
SQL Statement:
SELECT * FROM CUSTOMER_DATA
WHERE STATE='IL'
Estimated Cost = 50
Estimated Cardinality = 1
Access Table Name = DB2INST1.CUSTOMER_DATA ID = 2,5
| #Columns = 13
| Index Scan: Name = DB2INST1.CUST_STATE ID = 5
| | Index Columns:
| | | 1: STATE (Ascending)
| | #Key Columns = 1
| | | Start Key: Inclusive Value
| | | | 1: 'IL'
| | | Stop Key: Inclusive Value
| | | | 1: 'IL'
| | Data Prefetch: None
| | Index Prefetch: None
|
Now let's use db2batch to do the second
benchmark:
Number of rows retrieved is: 19998 Number of rows sent to output is: 19998 Elapsed Time is: 5.976 seconds Locks held currently = 0 Lock escalations = 1 Total sorts = 0 Total sort time (ms) = 0 Sort overflows = 0 Buffer pool data logical reads = 19998 Buffer pool data physical reads = 2614 Buffer pool data writes = 0 Buffer pool index logical reads = 138 Buffer pool index physical reads = 28 |
Obviously, the cheating did not do us any good. The elapsed time is actually smaller when using the tablespace scan over the index in this situation.
Important: We have manually updated statistics to perform some "what if" analysis on a test database. This is a perfectly legitimate usage of updateable views in SYSSTAT schema. However, we should not under normal circumstances ever update statistics in production databases.
Now let's explain what's happening. I once heard a 5-year old boy say, "It won't hurt to spend just one second in jail, so it won't hurt to spend just two seconds in jail, so it won't hurt to spend just three seconds in jail ..." Similarly, it's faster to read just one record via an index, it's faster to read just two records via an index, so on, but only up to a certain point and no more.
Based on the statistics, the optimizer estimates that 18 percent
of the records would match the criteria STATE='IL'.
Also it expects these records to be distributed more or less evenly
across the whole table, because the cluster ratio of the index on
STATE is very low, less than 0.1. (For more
information on cluster ratio, refer to
DB2 Administration Guide section on Performance.) This
means that there is going to be at least one matching record on
almost every page in the table, anyway. Tablespace scans use
pre-fetching, which means that the database engine reads several
adjacent pages in one efficient read operation. Tablespace scan is
the most efficient way to read all the pages in the table. However
efficient an index scan might be, there is still the extra work of
scanning through the index.
For more information on pre-fetching, refer to:
- Syntax of CREATE TABLESPACE statement and its PREFETCHSIZE option in the SQL Reference.
- The default prefetch size (DFT_PREFETCH_SZ) configuration parameter in the DB2 Administration Guide section on Performance.
So the optimizer's choice of a tablespace scan was correct after all, however surprising that might seem. We have seen that in this situation index access would be definitely less efficient.
Why MIN is sometimes calculated much faster than MAX
The query, SELECT MIN(TOTAL_AMOUNT) FROM CUSTOMER,
looked up the value in an existing index on
TOTAL_AMOUNT and returned the answer instantly.
However, a very similar query, SELECT MAX(TOTAL_AMOUNT) FROM
CUSTOMER, took much longer to complete. The execution plan
indicated that the optimizer had chosen to scan the whole index to
calculate MAX. Why?
In this particular case, there were no better choices. The index
on TOTAL_AMOUNT did not allow reverse scans:
SELECT REVERSE_SCANS FROM SYSCAT.INDEXES WHERE INDNAME = 'CUSTOMER_AMT' REVERSE_SCANS ------------- N |
After dropping the index and recreating it with the option
ALLOW REVERSE SCANS, both queries started running
equally fast.
CREATE INDEX CUSTOMER_AMT ON CUSTOMER(TOTAL_AMOUNT) ALLOW REVERSE SCANS RUNSTATS ON TABLE MYSCHEMA.CUSTOMER FOR INDEX MYSCHEMA. CUSTOMER_AMT |
By default, DB2 indexes do not allow reverse scans.
Tip: Whenever you create a PRIMARY KEY, FOREIGN KEY, or UNIQUE constraint inside a CREATE TABLE statement, an index is implicitly created. That index does not allow reverse scans.
You can override the default behavior:
- Create a table with no constraints (or drop the existing constraints).
- Create an appropriate index.
- Create the constraint using ALTER TABLE SQL statement.
For example:
ALTER TABLE CUSTOMER DROP PRIMARY KEY; --or create a table not defining a primary key CREATE UNIQUE INDEX CUSTOMER_ID ON CUSTOMER(ID) ALLOW REVERSE SCANS; ALTER TABLE CUSTOMER ADD PRIMARY KEY(ID); |
DB2 gives you a warning and reuses the index created in step 2.
As we have seen, indexes that allow reverse scans are a must in some very common situations.
Let's consider the following view:
CREATE VIEW CUSTOMER_ORDER_LIST AS SELECT CUSTOMER_ORDER.CUSTOMER_ID CUSTOMER.LAST_NAME CUSTOMER.FIRST_NAME CUSTOMER.PHONE CUSTOMER.EMAIL CUSTOMER_ORDER.ORDER_DT CUSTOMER_ORDER.AMOUNT CUSTOMER_ORDER.STATUS FROM CUSTOMER JOIN CUSTOMER_ORDER ON CUSTOMER.ID = CUSTOMER_ORDER.CUSTOMER_ID |
All the records in the CUSTOMER_ORDER table have
parent records in the CUSTOMER table. This business
rule is maintained by triggers rather than by a foreign key
constraint. (Don't ask me why. All I can say is I've seen it many
times in production databases.)
Consider the query:
SELECT CUSTOMER_ID, ORDER_DT, AMOUNT, STATUS FROM CUSTOMER_ORDER_LIST |
You would think that there is no need to access
CUSTOMER table at all, because all the necessary
information is in the view from the CUSTOMER_ORDER
table, correct?
Not really. For some reason, the optimizer chooses to access an
index on the CUSTOMER table:
Estimated Cost = 25693 Access Table Name = DB2INST1.CUSTOMER ID = 2,5 | #Columns = 1 | Index Scan: Name = SYSIBM.SQL021126111001110 ID = 3 | | Index Columns: | | | 1: ID (Ascending) | | #Key Columns = 0 | | | Start Key: Beginning of Index | | | Stop Key: End of Index | | Index-Only Access | | Index Prefetch: Eligible 199 | Lock Intents | | Table: Intent Share | | Row : Next Key Share Merge Join | Access Table Name = DB2INST1.CUSTOMER_ORDER ID = 2,6 |
(This is only partial output.)
Why access the CUSTOMER table at all? The optimizer's choice
actually makes perfect sense: You could easily drop a trigger,
insert a record violating referential integrity into the
CUSTOMER_ORDER table and recreate the trigger. The
record would remain in the CUSTOMER_ORDER table, which
means that presence of such a trigger does not guarantee
referential integrity. Which means that the optimizer must
assume that there could be some records in the
CUSTOMER_ORDER table that do not have matching ones in
the CUSTOMER table, so the lookup in the index on
CUSTOMER table is mandatory.
Now let's create an appropriate constraint and see what happens:
ALTER TABLE CUSTOMER_ORDER ADD FOREIGN KEY(CUSTOMER_ID) REFERENCES CUSTOMER(ID) |
Should there be any records violating this constraint, this statement would fail. Now the optimizer is able to eliminate an unnecessary join and the query runs faster:
Estimated Cost = 18067 Access Table Name = DB2INST1.CUSTOMER_ORDER ID = 2,6 | #Columns = 1 | Relation Scan | | Prefetch: Eligible | Lock Intents | | Table: Intent Share | | Row : Next Key Share | Return Data to Application | | #Columns = 1 Return Data Completion |
As we have seen, adding a foreign key constraint provided the optimizer with some very useful data. The optimizer rewarded us with a much more efficient execution plan.
When good decisions are better than quick ones
Once when I was looking for a new job I came across two open positions from one and the same employer for one and the same project. For that project, they needed a project manager and a technical leader. Among the requirements they listed were:
- For the project manager: "The ability to make quick decisions."
- For the technical leader: "The ability to make good decisions."
How very true!
For low optimization levels, the optimizer must act quickly. Whatever up-to-date and detailed statistics we have managed to provide, the optimizer simply might not have enough time to analyze it. All the examples in the previous chapters were run under the default optimization level 5. Should we reconsider our previous example with a low optimization level 1, adding a referential integrity constraint would not result in a better plan.
If you want good decisions rather than quick ones, set the optimization level accordingly.
For more information on optimization levels, refer to the DB2 Administration Guide section on Performance and Implementation.
The DB2 optimizer is very intelligent. However, based on incorrect information, it may come to less than optimal conclusions. We have seen how to:
- Detect incorrect or incomplete statistics.
- Provide the optimizer with correct and complete statistics.
- Update views in SYSSTATS schema and perform "what if" experiments in test environments.
Performance tuning has never been easy. There are no hard and fast rules in query optimization. Check the optimization level, keep statistics current and make sure business rules are implemented as constraint whenever possible. I hope this article will help in the many challenges database developers handle.
Good luck!
The author would like to thank Mike Pittinger for his help.
DB2, DB2 Universal Database, and IBM are trademarks or registered trademarks of IBM Corporation in the United States, other countries, or both.
IBM copyright and trademark information
- Command Reference at
ftp://ftp.software.ibm.com/ps/products/db2/info/vr8/pdf/letter/db2n0e80.pdf - Administration Guide: Implementation at
ftp://ftp.software.ibm.com/ps/products/db2/info/vr8/pdf/letter/db2d2e80.pdf - Administration Guide: Performance at
ftp://ftp.software.ibm.com/ps/products/db2/info/vr8/pdf/letter/db2d3e80.pdf

Alexander Kuznetsov has 15 years of experience in software design, development and database administration. Currently he designs a multi-terabyte clustered database in DB2 UDB EEE . Alexander is an IBM Certified Advanced Technical Expert (DB2 Cluster) and an IBM Certified Solutions Expert (Database Administration and Application Development). He can be reached at alkuzo at mindspring.com and the comp.databases.ibm-db2 newsgroup.




