Part 2: Tuning Informix SQL

Examples and analysis

This is the second of a two-part series on tuning Informix® SQL. Part 1 introduced guidelines for SQL query tuning for Informix, looked at factors you need to consider when tuning SQL queries, and examined some real-life tuning examples gathered from the author's experience. Now the author discusses the impact on performance of access methods, join methods, subqueries, and extent management, and explains how you can affect performance by understanding index levels, uniqueness, fragmentation, and PDQ priority. Examples illustrate how you can apply these principles.

Jianing Fan (cjf035@motorola.com), Senior Software Engineer, Motorola

Jianing Fan is a senior software engineer at Motorola specializing in relational database management systems. He is an Informix Certified Professional, Oracle Certified Professional, and has over 10 years of database and system experience as a developer, system administrator, and DBA. Jianing can be reached at cjf035@email.mot.com.



07 October 2004

Introduction

There are many performance issues to consider when tuning Informix SQL. Two of the major issues are access methods and join methods. In this article I'll discuss what the choices are, and how you can determine the cost of each choice. Once you understand your options, you can influence the optimizer to select the best query execution plan, either by adjusting the OPTCOMPIND parameter or by adding query directives or indexes.
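For reference, OPTCOMPIND is set in the ONCONFIG file or as an environment variable for a client session; the value shown below is only illustrative:

# 0 - prefer nested-loop joins whenever an index exists on the join columns
# 1 - behave like 0 under Repeatable Read isolation, like 2 otherwise
# 2 - let the optimizer choose the cheapest plan regardless of indexes
OPTCOMPIND 2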


Access methods

Access methods are the ways the database server reads (or rather retrieves) data from database tables; the optimizer chooses between them. There are basically two:

  • The simplest way is to read the table's data sequentially, performing what is usually called a table scan. The optimizer chooses a table scan when most of the table's data must be read anyway, or when the table has no indexes.
  • The other way is to use indexes. If the columns the query needs are indexed, the optimizer may be able to use a key-only index scan.

The optimizer will compare the cost of each method to determine the best one to use. It will evaluate all aspects of the costs to perform the query, such as the number of disk I/O operations required, the number of rows that would be retrieved, the costs of sorting, and so forth.

The costs of table scans can be fairly low if the data in the table is clustered or if it is in physical order. When the first row on a disk page is requested, the disk page is read into a buffer page in memory. The next time the data on that page is requested, it need not be read from disk again. Requests for subsequent rows on that page are filled from the buffer until all the rows on that page are processed. When one page is exhausted, the page for the next set of rows must be read in.

To speed up the process, Informix can read the next data page into memory while the first one is still being processed. To use this ability effectively, two parameters, RA_PAGES and RA_THRESHOLD, must be properly set in the configuration file. For details on how to set these parameters, refer to the IBM Informix Dynamic Server Administrator's Guide, Version 9.4. If the tables in your database have few extents, the disk pages of consecutive rows are placed in consecutive locations on the disk, so the disk access arm does not have to move much when reading data pages, and latency costs are also low.
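As a rough sketch (the values below are illustrative only; see the Administrator's Guide for how to size them for your system), the two read-ahead parameters live in the ONCONFIG file:

RA_PAGES      32   # number of pages to read ahead during sequential scans
RA_THRESHOLD  24   # when this many read-ahead pages remain unprocessed,
                   # the server starts reading the next batch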

For small tables (with 1,000 or fewer rows), the cost of a table scan is not very high, especially when the table is used repeatedly in your SQL or application and thus stays completely in memory. In fact, a table scan of an in-memory table is much faster than searching the same table through an index, especially with the Informix light scan mechanism. But if a table has 100,000 or more rows, repeated sequential scans are deadly to performance. You may need to consider adding indexes to the table and using them to access its data.

In reality, except for clustered tables, the data in almost all tables is not in order. This is especially true for systems with heavy transaction volumes, such as banking systems. In such cases a table scan is very expensive, because the Informix engine has to access data randomly. Disk I/O costs are higher when the rows of a table are read in a sequence unrelated to their physical order on disk. Because the pages are not read sequentially from the disk, both seek and rotational delays occur before each page can be read. As a result, disk access time is much longer when reading data in random order than when reading it in sequential order. A better approach is to use a key-index scan.

Indexes are collections of sequentially ordered pointers or keys used for locating data pages. An index scan works down from the root page to a leaf page. Because the root page is used so often, it always resides in the memory buffer. The cost of finding a leaf page depends on the size of the index, the filter of the query, and the uniqueness of the index. If each value occurs only once in the index and the query is a join, each row to be joined requires a non-sequential lookup into the index, followed by a non-sequential access to the associated row in the table. However, if there are many duplicate rows per distinct index value, and the associated table is highly clustered, the added cost of joining through the index can be slight.

Although index entries are sequential, there is no guarantee that rows with adjacent index entries reside on the same data pages. In many cases, a separate disk access must be made to fetch the page for each row located through an index. If a row is larger than a page (that is, larger than 2 KB), then the page containing that row might be kicked out of memory; in other words, it may be removed from the buffer and written back to disk before subsequent data pages can be read into memory and processed. That page then has to be read into memory again for processing later.

The optimizer will decide whether an index can be used to evaluate a filter. For this purpose, an indexed column is any single column with an index or the first column named in a composite index. If the values contained in the index are all that the query requires, the rows themselves are not read. It is much faster to omit data page lookups whenever values can be read directly from the index.

The optimizer will use indexes in the following cases:

  • When a column is indexed. The database server locates the relevant rows in the table by first finding the keys in the index. In this case a key-index scan is much faster than a table scan.
  • When a column is indexed and the value to be compared is a column in another table (a join expression). The database server uses the index to find matching values.
  • When processing an ORDER BY clause. If all the columns in the clause appear in the required sequence within a single index, the database server uses the index to read the rows in the ordered sequence, thus avoiding a sort (see the sketch after this list).
  • When processing a GROUP BY clause. If all the columns in the clause appear in one index, the database server reads groups with equal keys from the index without requiring additional processing after the rows are retrieved from their tables.
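Here is a minimal sketch of the ORDER BY case combined with a key-only scan; the table, column, and index names are hypothetical. Because the filter column leads the composite index and every column the query touches is in the index, the server can satisfy the query with a key-only scan and return the rows already ordered, with no sort step:

create index ix_cust_ws_name on customer (workspace_id, companyname);

select companyname
from customer
where workspace_id = 41
order by companyname;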

How do you decide which is the most effective access method for the query? There is no fixed rule, because business requirements differ from one organization to another. However, there are some general guidelines. If the tables in your query are small (usually fewer than 1,000 rows) or clustered, and your query is not at all selective, a table scan is usually the better choice. But if the tables are very large (say, a few million rows) and your query is fairly selective, it is much better to use indexes.

The optimizer is usually pretty smart in choosing access methods, but if it does not behave in the way you expect, you can influence its behavior by using query directives to force the optimizer to choose what you think is the best access method. You can analyze the query execution plan output to decide whether or not this is necessary.
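For instance, you can turn on plan output and steer the access method with a directive embedded in a comment immediately after the SELECT keyword. This sketch borrows the table and column names from Example 4 later in this article; AVOID_FULL simply tells the optimizer not to use a table scan on customer:

set explain on;   -- the plan is written to sqexplain.out in the current directory

select {+AVOID_FULL(c)} companyname, phone
from customer c
where custid = 'WILMK';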


Join methods

If your query contains more than one table, those tables should be joined using filters in order to avoid a Cartesian join. The optimizer calculates and compares the costs of each join method and selects the best one when evaluating query execution plans. The most commonly used join methods are the nested-loop join, the sort merge join, and the dynamic hash join. Let's look at each of these in detail.

Nested-loop join

In a nested-loop join, the first (or outer) table is scanned for rows that meet the query criteria. For every row found in the outer table, the database server searches for its corresponding row in the second (or inner) table. The outer table can be accessed by an index scan or by a table scan, depending on the table, and the database server applies any filters first. If the inner table does not have an index, the database server compares the cost of building one with the cost of a sequential scan and chooses the cheaper option. The overall cost depends on whether there is an index on the join columns: if there is, the cost is significantly lower; otherwise the database server must perform table scans on all tables, both outer and inner.

Sort merge join

This join method is usually used when no index is available on the join columns of the joining tables. Before the join begins, the database server first applies any filters and then sorts the rows of each table on the join column. Once the rows are sorted, the algorithm for joining the two tables is simple: the database server reads both sorted tables sequentially and merges the rows that match. Because this method has to sort all the joining tables before they are joined, the costs are usually quite high. The sort merge join is largely replaced by the dynamic hash join in IDS version 7 and above.

Dynamic hash join

As mentioned above, the dynamic hash join is a newer join method, available in IDS version 7 and above. It is used when there is no index on one or more of the joining tables, or when the database server must read a large number of rows from all joining tables. In this method, one of the tables, usually the smaller one, is scanned and used to build a hash table in memory: the hash function puts each row into a bucket with other rows that have the same hash value. After the first table has been scanned and placed in the hash table, the second table is scanned and each of its rows is looked up in the hash table to see whether a join can be made. If there are more tables in the join, the database server does the same thing for each of them.

The dynamic hash join consists of two activities: building the hash table (the build phase) and probing it (the probe phase). In the build phase, the database server reads one table and, after applying any existing filters, creates a hash table in memory. You can think of a hash table conceptually as a series of buckets, each with an address that is derived from the key value by applying the hash function. The database server does not sort keys within a particular hash bucket. In the probe phase, the database server reads the other tables in the join, applying filters if there are any. For each row that satisfies the filter condition, the database server applies the hash function to the key and probes the hash table to find a match. A hash join is usually faster than a sort merge join because no sort operation is involved.
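If the optimizer picks a join method you do not want, you can override it with a join-method directive. In this hypothetical sketch (the orders and customer tables are made up), USE_HASH asks for a dynamic hash join involving the customer table; USE_NL would request a nested-loop join instead:

select {+USE_HASH(c)}
       o.order_num, c.companyname
from orders o, customer c
where o.custid = c.custid;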


Subqueries

Subqueries are SQL statements nested within other SQL statements. They can be used in many kinds of statements, such as SELECT, DELETE, UPDATE, and INSERT. For example, the following SQL statement uses a subquery to count the number of employees who earn the minimum salary:

select count(*) from employees
where salary = (select min(salary) from employees);

There are two types of subqueries, uncorrelated and correlated. An uncorrelated subquery does not rely on the main query for information: it can be evaluated once on its own, as the MIN(salary) subquery above can, and it increases the power and flexibility of your main query. A correlated subquery, by contrast, references columns from the outer query, which means the database server has to re-run the subquery for every row the main query examines. Correlated subqueries can greatly degrade the speed of data retrieval, especially for non-selective queries on huge tables with millions of rows and no index. Pay attention to correlated subqueries and try to avoid them, typically by rewriting them as joins, whenever possible.
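As a simple illustration (the orders and customer tables here are hypothetical, and the rewrite assumes custid is unique in the customer table), the correlated form below forces the server to re-run the inner query for every order row, while the join form lets the optimizer pick an efficient join method:

-- correlated subquery: the inner query references o.custid from the outer query
select o.order_num
from orders o
where exists (select 1 from customer c
              where c.custid = o.custid
              and c.state = 'FL');

-- equivalent join
select o.order_num
from orders o, customer c
where o.custid = c.custid
and c.state = 'FL';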


Extents for tables and indexes

An extent is a block of physically contiguous disk space. There is, however, no guarantee that the extents themselves are contiguous with one another. Thus, if a table or index has only one extent, it keeps all its data or keys in one physical place; otherwise its data or keys are scattered across the extents the table or index occupies. Contiguity of physical data or keys matters to the speed of table scans and key-index scans: when data is contiguous, the time needed to access it on disk is minimized and the database server can read it faster. The same is true for keys. If a table has too many extents, it is very likely that those extents are interleaved with extents of other tables. This generally harms performance, because when you retrieve data from the table, the disk head must seek across multiple noncontiguous extents belonging to the table instead of one large extent with contiguous physical pages, which slows down disk seeking rather significantly.

If any table or index has ten or more extents (with the exception of large fragmented tables), consider rebuilding it to consolidate its extents for better performance. For details on how to estimate and allocate extent sizes for tables and indexes, refer to the IBM Informix Dynamic Server Performance Guide, Version 9.4.
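As a minimal sketch, with made-up names and sizes (both sizes are in kilobytes), allocate a first extent large enough for the expected data when you create or rebuild a table, and check the extent count afterwards with oncheck:

create table perf_data
(
    ne_inst    integer,
    sample_ts  datetime year to second,
    value      decimal(16,2)
)
extent size 8000 next size 2000;

-- from the shell, report the tblspace layout, including the number of extents:
--     oncheck -pt mydb:perf_data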


Index levels and uniqueness

The number of levels and the uniqueness of an index also affect the speed of accessing data. The more levels an index has, the more reads Informix Dynamic Server must perform to reach the index leaf nodes, and the longer it takes to get to the real data. Furthermore, if a leaf node is split or merged during an insert, delete, or update, it may take longer for the whole index to adjust to the change. For example, if an index has only two levels, only two levels need to be adjusted, but if it has four levels then all four levels may need to be adjusted when a node splits or merges in any part of the index. The time needed for this adjustment is, of course, much longer for a four-level index than for a two-level one. This is especially true in an OLTP environment, where transaction volumes are huge and data is constantly inserted, deleted, and updated. Thus, if any index has more than four levels, consider dropping and rebuilding it with fewer levels for better performance.
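You can check how many levels an index has (along with its leaf-page and unique-key counts) in the sysindexes system catalog table. The statistics are maintained by UPDATE STATISTICS, and the table name below is hypothetical:

select b.idxname, b.levels, b.leaves, b.nunique
from systables a, sysindexes b
where a.tabid = b.tabid
and a.tabname = 'customer';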

The uniqueness of an index refers to how many duplicate keys it has. A highly duplicated index can severely impact the speed of accessing, updating, and deleting data. Suppose you have an index on the customer_type column of the customer table, and there are only five possible customer_type codes. If the table has one million rows, there will be on average 200,000 rows with the same customer_type code. The B-tree stores each key value followed by a list of pointers to the physical rows. The problem occurs when you have to delete or update any of those key values: Informix Dynamic Server must search through all the duplicate keys to find the correct one to delete or update, and that is 200,000 keys to search!

Ideally, every key value in the index is unique. How do we find out how unique our indexes are? We can get this information by querying two system catalog tables, systables and sysindexes. Here is the query (substitute the name of the index you are interested in for 'index_name'):

select nrows, nunique from systables a, sysindexes b
where a.tabid = b.tabid
and idxname = 'index_name'
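One way to extend that query, as a rough sketch, is to rank all user indexes by their average number of duplicates per key (tabid values above 99 are user tables, and the statistics must be current, so run UPDATE STATISTICS first):

select a.tabname, b.idxname, a.nrows, b.nunique,
       round(a.nrows / b.nunique) as avg_dups_per_key
from systables a, sysindexes b
where a.tabid = b.tabid
and a.tabid > 99
and b.nunique > 0
order by 5 desc;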

Fragmentation and PDQ priority

Fragmentation and PDQ priority also have a huge effect on the speed of data retrieval, and they can particularly help table scans. Fragmentation is a method of intelligently partitioning database tables and indexes into smaller units called fragments. Each fragment is a group of rows or index keys within a table or index. The algorithm used for fragmentation is called a distribution scheme. Informix Dynamic Server supports two types of distribution schemes: round-robin and expression-based. The round-robin distribution scheme uses Informix internal rules to spread rows across fragments, whereas the expression-based distribution scheme employs user-defined rules to fragment tables and indexes.

The ultimate goal of fragmentation is to reduce retrieval time by directly accessing the fragment that contains the data that satisfies the SQL query. For example, if a huge table of 100,000 rows is fragmented into 10 fragments, each containing 10,000 rows, Informix Dynamic Server can go directly to the fragment that contains the data your query requested, eliminating scans of the other fragments. That limits IDS to scanning only the 10,000 relevant rows out of 100,000. That is a big saving, isn't it? For details on fragmentation, refer to the IBM Informix Dynamic Server Administrator's Guide, Version 9.4.
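Here is a minimal sketch of an expression-based scheme; the table, columns, and dbspaces dbs1 through dbs3 are all hypothetical. Rows are routed to a dbspace according to the fragment expressions, so a query that filters on workspace_id only has to scan the matching fragment:

create table perf_event
(
    workspace_id  integer,
    ne_inst       integer,
    event_time    datetime year to second
)
fragment by expression
    workspace_id < 20 in dbs1,
    workspace_id >= 20 and workspace_id < 40 in dbs2,
    remainder in dbs3;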

Fragmentation works best when PDQ priority is applied. PDQ (parallel database query) priority works on the principle of 'divide and conquer': Informix Dynamic Server breaks a complicated query into much smaller pieces, assigns each piece to a separate thread, and the threads work on their parts of the query in parallel. When they finish, the database server reassembles and processes the resulting data sets before presenting the final answer to the end user. Because PDQ uses more threads to process a query than normal processing does, it shortens data access time. The question, then, is how to balance system and Informix resources so that PDQ queries do not slow down the execution of non-PDQ queries. This is especially important in an OLTP environment, where transaction volumes are huge and query response time is critical.
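In practice the DBA caps what PDQ can consume with the MAX_PDQPRIORITY configuration parameter, and each session requests only the share it needs with the SET PDQPRIORITY statement. A minimal sketch (the value 40 is purely illustrative):

set pdqpriority 40;   -- this session may use up to 40% of the PDQ resources
                      -- that the server makes available

-- run the large decision-support query here

set pdqpriority 0;    -- switch parallel processing back off for OLTP work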


Examples and analysis

Let's now get to some concrete SQL tuning examples I have collected in my work and see how you can apply the tuning guidelines discussed above.

Our company is a communications company, and our database is used to store communication performance management data, event subscription lists, and network element configuration information. The database consists of several hundred tables and holds about 40 GB of data. It runs on the Sun Solaris 2.8 operating system with six CPUs and 4 GB of RAM. The hardware platform is a Sun Enterprise 3500 with six 18 GB disks and two 36 GB disks, each of which is mirrored to another disk.

Example 1

The original query and its execution plan:

QUERY:
------
SELECT     UNIQUE MNE.MSO_ID, MNE.NE_INST, ANE.MSO_ID, 1
   FROM      NE MNE, NE ANE, CELL, NEIGH, CELL NCELL
 WHERE      MNE.NE_TYPE = 0 {ACG}
         AND  CELL.ACG_INSTANCE   = MNE.NE_INST
         AND  NEIGH.ADV_CELL_INST = CELL.CELL_INSTANCE
         AND  NCELL.CELL_INSTANCE = NEIGH.NEIGH_CELL_INST
         AND  ANE.NE_INST = NCELL.ACG_INSTANCE
         AND  ANE.MSO_ID != MNE.MSO_ID
         AND  MNE.WORKSPACE_ID   = 41
         AND  ANE.WORKSPACE_ID   = 41
         AND  CELL.WORKSPACE_ID  = 41
         AND  NEIGH.WORKSPACE_ID = 41
         AND  NCELL.WORKSPACE_ID = 41

Estimated Cost: 13555
Estimated # of Rows Returned: 38

  1) root.cell: INDEX PATH

    (1) Index Keys: workspace_id tz_inst   (Serial, fragments: ALL)
        Lower Index Filter: root.cell.workspace_id = 41

  2) omcadmin.mne: INDEX PATH

        Filters: omcadmin.mne.ne_type = 0

    (1) Index Keys: workspace_id ne_inst   (Serial, fragments: ALL)
        Lower Index Filter: (root.cell.acg_instance = omcadmin.mne.ne_inst AND omcadmin.mne.workspace_id = 41 )
NESTED LOOP JOIN

  3) root.neigh: INDEX PATH

    (1) Index Keys: workspace_id adv_cell_inst neigh_cell_inst   (Key-Only)  (Serial, fragments: ALL)
        Lower Index Filter: (root.neigh.adv_cell_inst = root.cell.cell_instance AND root.neigh.workspace_id = 41 )
NESTED LOOP JOIN

  4) omcadmin.ncell: INDEX PATH

    (1) Index Keys: workspace_id cell_instance   (Serial, fragments: ALL)
        Lower Index Filter: (omcadmin.ncell.cell_instance = root.neigh.neigh_cell_inst AND omcadmin.ncell.workspace_id = 41 )
NESTED LOOP JOIN

  5) omcadmin.ane: INDEX PATH

        Filters: omcadmin.ane.mso_id != omcadmin.mne.mso_id

    (1) Index Keys: workspace_id ne_inst   (Serial, fragments: ALL)
        Lower Index Filter: (omcadmin.ane.ne_inst = omcadmin.ncell.acg_instance AND omcadmin.ane.workspace_id = 41 )
NESTED LOOP JOIN

The optimized query and its execution plan:

QUERY:
------
select       unique mne.mso_id, mne.ne_inst, ane.mso_id, 1
 from        ne mne, ne ane, cell, neigh, cell ncell
where      cell.workspace_id = 41
         and  cell.acg_instance = mne.ne_inst
         and  cell.workspace_id = mne.workspace_id
         and  mne.ne_type = 0
         and  mne.mso_id !=ane.mso_id
         and  mne.workspace_id=ane.workspace_id
         and  ane.ne_inst=ncell.acg_instance
         and  cell.cell_instance = neigh.adv_cell_inst
         and  ncell.cell_instance = neigh.neigh_cell_inst
         and  cell.workspace_id = ncell.workspace_id
         and  ncell.workspace_id = neigh.workspace_id

Estimated Cost: 6555
Estimated # of Rows Returned: 38

  1) root.cell: INDEX PATH

    (1) Index Keys: workspace_id tz_inst   (Serial, fragments: ALL)
        Lower Index Filter: root.cell.workspace_id = 41

  2) omcadmin.mne: INDEX PATH

        Filters: omcadmin.mne.ne_type = 0

    (1) Index Keys: workspace_id ne_inst   (Serial, fragments: ALL)
        Lower Index Filter: (root.cell.acg_instance = omcadmin.mne.ne_inst AND root.cell.workspace_id = omcadmin.mne.workspace_id )
NESTED LOOP JOIN

  3) root.neigh: INDEX PATH

    (1) Index Keys: workspace_id adv_cell_inst neigh_cell_inst   (Key-Only)  (Serial, fragments: ALL)
        Lower Index Filter: (root.cell.cell_instance = root.neigh.adv_cell_inst AND omcadmin.mne.workspace_id = root.neigh.workspace_id )
NESTED LOOP JOIN

  4) omcadmin.ncell: INDEX PATH

    (1) Index Keys: workspace_id cell_instance   (Serial, fragments: ALL)
        Lower Index Filter: (omcadmin.ncell.cell_instance = root.neigh.neigh_cell_inst AND omcadmin.mne.workspace_id = omcadmin.ncell.workspace_id )
NESTED LOOP JOIN

  5) omcadmin.ane: INDEX PATH

        Filters: omcadmin.mne.mso_id != omcadmin.ane.mso_id

    (1) Index Keys: workspace_id ne_inst   (Serial, fragments: ALL)
        Lower Index Filter: (omcadmin.ane.ne_inst = omcadmin.ncell.acg_instance AND root.cell.workspace_id = omcadmin.ane.workspace_id )
NESTED LOOP JOIN

As you can see from the output above, the only changes we made were to restrict the query with a single literal condition (specifying the workspace_id for the cell table) and to join all the tables in the query together on workspace_id. Those changes cut the cost of the original query roughly in half and also greatly reduced the query response time.

Example 2

The original query and its execution plan:

QUERY:
------
select min(ds0_start)
   from srate
  where srate.line_instance = 99930
    and srate.workspace_id = 41
    and srate.ds_instance in ( select ds_inst from ds
                                    where stream_type = 10
                                      and workspace_id = 41)

Estimated Cost: 5175
Estimated # of Rows Returned: 1

  1) root.srate: INDEX PATH

        Filters: root.srate.ds_instance = ANY <subquery>

    (1) Index Keys: workspace_id line_instance   (Serial, fragments: ALL)
        Lower Index Filter: (root.srate.line_instance = 99930 AND root.srate.workspace_id = 41 )

    Subquery:
    ---------
    Estimated Cost: 5171
    Estimated # of Rows Returned: 6400

      1) root.ds: INDEX PATH

            Filters: root.ds.stream_type = 10

        (1) Index Keys: workspace_id owner_ne   (Serial, fragments: ALL)
            Lower Index Filter: root.ds.workspace_id = 41

The optimized query and its execution plan:

QUERY:
------
select       min(ds0_start)
   from      srate , ds
 where     srate.line_instance = 99930
        and   srate.workspace_id = 41
        and   srate.ds_instance = ds.ds_inst
        and   ds.stream_type = 10
        and   srate.workspace_id = ds.workspace_id

Estimated Cost: 7
Estimated # of Rows Returned: 1

  1) root.srate: INDEX PATH

    (1) Index Keys: workspace_id line_instance   (Serial, fragments: ALL)
        Lower Index Filter: (root.srate.line_instance = 99930 AND root.srate.workspace_id = 41 )

  2) root.ds: INDEX PATH

        Filters: root.ds.stream_type = 10

    (1) Index Keys: workspace_id ds_inst   (Serial, fragments: ALL)
        Lower Index Filter: (root.srate.ds_instance = root.ds.ds_inst AND root.srate.workspace_id = root.ds.workspace_id )
NESTED LOOP JOIN

The only change we made here was to get rid of the subquery in the original query: instead of using a subquery, we join the two tables directly. Because the join is performed on indexed columns, the cost is much lower and data access is much faster.

Example 3

The original query and its execution plan:

QUERY:
------
SELECT workspace_impctlog.ne_instance, 1, br_status
        FROM   workspace_impctlog, ne
       WHERE   workspace_impctlog.workspace_id = 40
         AND   ((workspace_impctlog.impct_type = 10
                 AND workspace_impctlog.trans_type != 2)
                  OR workspace_impctlog.impct_type = 30)
         AND   workspace_impctlog.ne_instance = ne.ne_inst
         AND   ne.workspace_id = 40
         AND   ne.ne_type != 8
         AND   ne.ne_inst NOT IN
              (SELECT ne_inst from ne
               WHERE  ne.workspace_id = 40
                        and ne_type = 8)

Estimated Cost: 7880
Estimated # of Rows Returned: 1

  1) root.workspace_impctlog: SEQUENTIAL SCAN

        Filters: (root.workspace_impctlog.workspace_id = 40 AND ((root.workspace_impctlog.impct_type = 10 AND root.workspace_impctlog.trans_type != 2 ) OR root.workspace_impctlog.impct_type = 30 ) )

  2) root.ne: INDEX PATH

        Filters: root.ne.ne_type != 8

    (1) Index Keys: workspace_id ne_inst   (Key-First)  (Serial, fragments: ALL)
        Lower Index Filter: (root.workspace_impctlog.ne_instance = root.ne.ne_inst AND root.ne.workspace_id = 40 )
        Key-First Filters:  (root.ne.ne_inst != ALL <subquery> )
NESTED LOOP JOIN

    Subquery:
    ---------
    Estimated Cost: 7878
    Estimated # of Rows Returned: 16

      1) root.ne: INDEX PATH

            Filters: root.ne.ne_type = 8

        (1) Index Keys: workspace_id manager_ne_inst   (Serial, fragments: ALL)
            Lower Index Filter: root.ne.workspace_id = 40

The optimized query and its execution plan:

QUERY:
------
SELECT ne_inst from ne
               WHERE  ne.workspace_id = 40
union
SELECT ne_inst from ne
               WHERE  ne.workspace_id = 40
                 AND  ne_type = 8
                into  temp ne_temp1

Estimated Cost: 103
Estimated # of Rows Returned: 10957

  1) root.ne: INDEX PATH

    (1) Index Keys: workspace_id ne_inst   (Key-Only)  (Serial, fragments: ALL)
        Lower Index Filter: root.ne.workspace_id = 40


Union Query:
------------

  1) root.ne: INDEX PATH

        Filters: root.ne.ne_type = 8

    (1) Index Keys: workspace_id manager_ne_inst   (Serial, fragments: ALL)
        Lower Index Filter: root.ne.workspace_id = 40


QUERY:
------
SELECT workspace_impctlog.ne_instance, 1, br_status
        FROM   workspace_impctlog, ne
        WHERE  workspace_impctlog.workspace_id = 40
          AND  workspace_impctlog.impct_type = 10
          AND  workspace_impctlog.trans_type != 2
          AND  workspace_impctlog.ne_instance = ne.ne_inst
          AND  ne.workspace_id = 40
          AND  ne.ne_type != 8
          AND  ne.ne_inst NOT IN
               (SELECT * from ne_temp1)
union
SELECT workspace_impctlog.ne_instance, 1, br_status
        FROM   workspace_impctlog, ne
        WHERE  workspace_impctlog.workspace_id = 40
          AND  workspace_impctlog.impct_type = 30
          AND  workspace_impctlog.ne_instance = ne.ne_inst
          AND  ne.workspace_id = 40
          AND  ne.ne_type != 8
          AND  ne.ne_inst NOT IN
               (SELECT * from ne_temp1)

Estimated Cost: 5
Estimated # of Rows Returned: 2

  1) root.workspace_impctlog: SEQUENTIAL SCAN

        Filters: ((root.workspace_impctlog.impct_type = 10 AND root.workspace_impctlog.workspace_id = 40 ) AND root.workspace_impctlog.trans_type != 2 )

  2) root.ne: INDEX PATH

        Filters: root.ne.ne_type != 8

    (1) Index Keys: workspace_id ne_inst   (Key-First)  (Serial, fragments: ALL)
        Lower Index Filter: (root.workspace_impctlog.ne_instance = root.ne.ne_inst AND root.ne.workspace_id = 40 )
        Key-First Filters:  (root.ne.ne_inst != ALL <subquery> )
NESTED LOOP JOIN

    Subquery:
    ---------
    Estimated Cost: 1
    Estimated # of Rows Returned: 1

      1) omcadmin.ne_temp1: SEQUENTIAL SCAN

Union Query:
------------

  1) root.workspace_impctlog: SEQUENTIAL SCAN

        Filters: (root.workspace_impctlog.impct_type = 30 AND root.workspace_impctlog.workspace_id = 40 )

  2) root.ne: INDEX PATH

        Filters: root.ne.ne_type != 8

    (1) Index Keys: workspace_id ne_inst   (Key-First)  (Serial, fragments: ALL)
        Lower Index Filter: (root.workspace_impctlog.ne_instance = root.ne.ne_inst AND root.ne.workspace_id = 40 )
        Key-First Filters:  (root.ne.ne_inst != ALL <subquery> )
NESTED LOOP JOIN

    Subquery:
    ---------
    Estimated Cost: 1
    Estimated # of Rows Returned: 1

      1) omcadmin.ne_temp1: SEQUENTIAL SCAN

We first pulled the subquery out of the original query into a temporary table (ne_temp1) and then used the UNION set operator in place of the OR operator. The execution plan output after the optimization shows a significant improvement.

Example 4

The following query finds out how many orders the Willson Market Company has placed and retrieves its contact information, such as address, phone, and so on. The query and its execution plan are shown below:

QUERY:
------
select companyname, address, city, state, country, phone, fax, email, count(*)
from customer a, order b
where a.custid = b.custid
and a.custid = 'WILMK'
group by 1,2,3,4,5,6,7,8

Estimated Cost: 1135675
Estimated # of Rows Returned: 1
Temporary Files Required For: Group By

  1) omcadmin.a: SEQUENTIAL SCAN

        Filters: informix.a.custid = 'WILMK'

  2) omcadmin.b: SEQUENTIAL SCAN

        Filters:
        Table Scan Filters: informix.b.custid = 'WILMK'


DYNAMIC HASH JOIN
    Dynamic Hash Filters: informix.a.custid = informix.b.custid

The cost is far too high. When we examined the tables, we found that the two joining columns, customer.custid and order.custid, did not have indexes. This forced the Informix server to perform sequential scans, which are much more expensive than index key scans in this case, because the customer and order tables are huge; each contains several million records. So we added indexes on the two joining columns.
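The index-creation statements would have been along these lines (the index names are made up; the table and column names follow the query above):

create index ix_customer_custid on customer (custid);
create index ix_order_custid on order (custid);

As the following query execution plan shows, the query cost is now greatly reduced: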

QUERY:
------
select companyname, address, city, state, country, phone, fax, email, count(*)
from customer a, order b
where a.custid = b.custid
and a.custid = 'WILMK'
group by 1,2,3,4,5,6,7,8

Estimated Cost: 15
Estimated # of Rows Returned: 1
Temporary Files Required For: Group By

  1) omcadmin.a: INDEX PATH

    (1) Index Keys: custid   (Serial, fragments: ALL)
        Lower Index Filter: informix.a.custid = 'WILMK'

  2) omcadmin.b: INDEX PATH

    (1) Index Keys: custid   (Key-Only)  (Serial, fragments: ALL)
        Lower Index Filter: informix.a.custid = informix.b.custid
NESTED LOOP JOIN

The response time for the query was also reduced from 10 minutes to 30 seconds. This is a great performance improvement!


Conclusion

Tuning SQL queries is not simple; it involves a well-planned and well-designed test strategy, careful observation, and in-depth analysis. Furthermore, test results can differ across platforms, so your test environment needs to reflect your production environment. But "no pain, no gain": you will be quite excited when you see the big performance improvements you have made. I hope the examples and analysis above prove helpful and serve as an introduction to SQL query tuning.
