Impacts of correlated columns

Db2 might not determine the best access path, table order, or join method when your query uses columns that are highly correlated.

Begin program-specific programming interface information. Column correlation can make the estimated cost of operations cheaper than they actually are. Correlated columns affect both single-table queries and join queries.

Column correlation on the best matching columns of an index

The following query selects rows with females in department A345 from Fresno, California. Two indexes are defined on the table, Index 1 (CITY,STATE,ZIPCODE) and Index 2 (DEPTNO,SEX).

Query 1
 
SELECT ... FROM CREWINFO WHERE
  CITY = 'FRESNO' AND STATE = 'CA'         (PREDICATE1)
  AND DEPTNO = 'A345' AND SEX = 'F';       (PREDICATE2)

Consider the two compound predicates (labeled PREDICATE1 and PREDICATE2), their actual filtering effects (the proportion of rows they select), and their Db2 filter factors. Unless the proper catalog statistics are gathered, the filter factors are calculated as if the columns of the predicate are entirely independent (not correlated).

When the columns in a predicate correlate but the correlation is not reflected in catalog statistics, the actual filtering effect to be significantly different from the Db2 filter factor. The following table shows how the actual filtering effect and the Db2 filter factor can differ, and how that difference can affect index choice and performance.

Table 1. Effects of column correlation on matching columns
INDEX 1
INDEX 2
Matching predicates
Predicate1
CITY=FRESNO AND STATE=CA
Predicate2
DEPTNO=A345 AND SEX=F
Matching columns
2
2
Db2 estimate for
matching columns
(Filter Factor)
column=CITY, COLCARDF=4
Filter Factor=1/4
column=STATE, COLCARDF=3
Filter Factor=1/3
column=DEPTNO,
COLCARDF=4
Filter Factor=1/4
column=SEX, COLCARDF=2
Filter Factor=1/2
Compound filter factor
for matching columns
1/4 × 1/3 = 0.083
1/4 × 1/2 = 0.125
Qualified leaf pages
based on Db2 estimations
0.083 × 10 = 0.83
INDEX CHOSEN (.8 < 1.25)
0.125 × 10 = 1.25
Actual filter factor based on data distribution
4/10
2/10
Actual number of qualified leaf pages based on compound predicate
4/10 × 10 = 4
2/10 × 10 = 2
BETTER INDEX CHOICE
(2 < 4)

Db2 chooses an index that returns the fewest rows, partly determined by the smallest filter factor of the matching columns. Assume that filter factor is the only influence on the access path. The combined filtering of columns CITY and STATE seems very good, whereas the matching columns for the second index do not seem to filter as much. Based on those calculations, Db2 chooses Index 1 as an access path for Query 1.

The problem is that the filtering of columns CITY and STATE should not look good. Column STATE does almost no filtering. Since columns DEPTNO and SEX do a better job of filtering out rows, Db2 should favor Index 2 over Index 1.

Column correlation on index screening columns of an index

Correlation might also occur on nonmatching index columns, used for index screening. See Nonmatching index scan (ACCESSTYPE='I' and MATCHCOLS=0) for more information. Index screening predicates help reduce the number of data rows that qualify while scanning the index. However, if the index screening predicates are correlated, they do not filter as many data rows as their filter factors suggest. To illustrate this, use Query 1 with the following indexes on Table 1:

Index 3 (EMPNO,CITY,STATE)
Index 4 (EMPNO,DEPTNO,SEX)

In the case of Index 3, because the columns CITY and STATE of Predicate 1 are correlated, the index access is not improved as much as estimated by the screening predicates and therefore Index 4 might be a better choice. (Note that index screening also occurs for indexes with matching columns greater than zero.)

Multiple table joins

In Query 2, the data shown in the following table is added to the original query (see Query 1) to show the impact of column correlation on join queries.
Table 2. Data from the DEPTINFO table
CITY STATE MANAGER DEPT DEPTNAME
Fresno CA Smith J123 ADMIN
Los Angeles CA Jones A345 LEGAL
Query 2
SELECT ... FROM CREWINFO T1,DEPTINFO T2
   WHERE T1.CITY = 'FRESNO' AND T1.STATE='CA'          (PREDICATE 1)
   AND T1.DEPTNO = T2.DEPT AND T2.DEPTNAME = 'LEGAL';

The order that tables are accessed in a join statement affects performance. The estimated combined filtering of Predicate1 is lower than its actual filtering. So table CREWINFO might look better as the first table accessed than it should.

Also, due to the smaller estimated size for table CREWINFO, a nested loop join might be chosen for the join method. But, if many rows are selected from table CREWINFO because Predicate1 does not filter as many rows as estimated, then another join method or join sequence might be better. End program-specific programming interface information.