# One value problem in DB2 query optimization

## Abstract

One value problem in DB2 query optimization

## Body

Column cardinality affects filer factor estimation of the predicate when distribution stats is not collected.
If your query includes a predicate like "WHERE COL != <value> " for a flag type column which typically has low cardinality, you may encounter a sudden plan change depending on the environment. Typically, COLCARD=1 is tricky, hence I call it one value problem in this blog entry.

For example, when COLCARD = 2 for column COL2, meaning number of distinct values in the COL2 is 2,  filter Factor of (Q2.COL2 <> 0) = 0.5.
This is because ( filter factor of "COL2 = <value>" ) = 1/(COLCARD) = 1/2 = 0.5, as the result, ( filter factor of "COL2 != <value>" ) = 1 - 1/(COLCARD) = 1 - 1/2 = 0.5

-Initial case: COLCARD= 2 and Filter Factor of (Q2.COL2 <> 0) = 0.5
UPDATE SYSSTAT.COLUMNS
SET COLCARD=2,
...
WHERE COLNAME = 'COL2' AND TABNAME = 'B' AND TABSCHEMA = 'B       ';

Predicates:
----------
5) Sargable Predicate,
Comparison Operator:            Not Equal (<>)
Subquery Input Required:        No
Filter Factor:                  0.5

Predicate Text:
--------------
(Q2.COL2 <> 0)

Similarly, if COLCARD varies from 2 to 10, FF (Filter factor) varies accordingly as follows.

COLCARD  FF of COL2 = <value>    FF of COL2 != <value>
------- ---------------------    ---------------------
2       1/2 ( = 0.5 )            1 - 1/2 ( = 0.5)
3       1/3 ( = 0.3333)          1 - 1/3 ( = 0.6667)
4       1/4 ( = 0.25)            1 - 1/4 ( = 0.76)
5       1/5 ( = 0.2)             1 - 1/5 ( = 0.8)
10      1/10( = 0.1)             1 - 1/10( = 0.9)
20      1/20( = 0.05)            1 - 1/20( = 0.95)

However, if COLCARD of COL2 = 1, DB2 does not apply the formula of ( 1 - 1/(COLCARD) ) for "COL2 !"= <value>" predicate, because FF = 0 makes cost comparison meaningless.
In stead, DB2 chooses 1/CARD for the "COL2 !"= <value>" predicate as a positive small number less than 0.5, where CARD if the table cardinality, i.e. number of rows in the table.

-Plan change: COLCARD=1 and Filter Factor of (Q2.COL2 <> 0) = 1e-06
UPDATE SYSSTAT.TABLES
SET CARD=1000000,
...
WHERE TABNAME = 'B' AND TABSCHEMA = 'B       ';

UPDATE SYSSTAT.COLUMNS
SET COLCARD=1,
...
WHERE COLNAME = 'COL2' AND TABNAME = 'B' AND TABSCHEMA = 'B       ';

Predicates:
----------
5) Sargable Predicate,
Comparison Operator:            Not Equal (<>)
Subquery Input Required:        No
Filter Factor:                  1e-06

Predicate Text:
--------------
(Q2.COL2 <> 0)

This is an assumption based on available stats, however, changing FF from 0.5 to 0.000001 is fairly drastic and may cause plan change.

Here's a sample scenario to demonstrate that just updating a row caused the plan change from HSJOIN to NLJOIN which may be sub-optimal.
The query text is:
\$ cat s1.sql
select x.key from b.b x where x.col1 = 1 and x.col2 != 0 and x.key in ( select y.key from a.a y where y.col1 = 1 )
;

The initial plan with COLCARD=1 for COl2 is like
<HSJOIN><IXSCAN TABLE='B.B'/><IXSCAN TABLE='A.A'/></HSJOIN>

The plan after chainging to COLCARD=1 for COl2 by an UPDATE is like:
<NLJOIN><IXSCAN TABLE='B.B'/><IXSCAN TABLE='A.A'/></NLJOIN>

The command used for the test.

db2 update db cfg for sample using AUTO_SAMPLING off AUTO_STATS_VIEWS off AUTO_STMT_STATS off AUTO_RUNSTATS off AUTO_TBL_MAINT off AUTO_MAINT off
db2 terminate
db2stop force
db2start
db2 connect to sample
db2 -tvf exp1.sql
db2 "drop table a.a"
db2 "create table a.a ( key integer not null, col1 smallint not null) in userspace1"
db2 "create unique index a.ix0a on a.a (key)"
db2 "create        index a.ix1a on a.a (col1)"
db2 "drop table b.b"
db2 "create table b.b ( key integer not null, col1 smallint not null, col2 smallint not null) in userspace1"
db2 "create unique index b.ix0b on b.b (key)"
db2 "create        index b.ix1b on b.b (col1)"
db2 "load from expa.del of del insert into a.a"
db2 "load from expb.del of del insert into b.b"
db2 "runstats on table a.a and indexes all"
db2 "runstats on table b.b and indexes all"
db2look -d sample -e -z b -t b -m -o db2look.1.out
mkdir s1caemout
db2caem -d sample -sf s1.sql -o s1caemout

db2 "update b.b set col2=1 where key = 1"
db2 "runstats on table b.b and indexes all"
db2look -d sample -e -z b -t b -m -o db2look.2.out
mkdir s1caemout2
db2caem -d sample -sf s1.sql -o s1caemout2

The test data files are generated with following export commands:
\$ cat exp1.sql
export to expa.del of del
with w1 as (
select row_number() over( order by a.tabname ) number from syscat.columns a, syscat.columns b
fetch first 1000000 rows only
)
, w2 as (
select
number key
,mod(number,1000) col1
from w1
)
select key, col1 from w2 order by col1, key;
;
export to expb.del of del
with w1 as (
select row_number() over( order by a.tabname ) number from syscat.columns a, syscat.columns b
fetch first 1000000 rows only
)
, w2 as (
select
number key
,mod(number,100) col1
,case number
when 1 then 0
else        1
end              col2
from w1
)
select key, col1, col2 from w2 order by col1, key, col2
;

The access plan graph with Rows Actual for after initial runstats. DB2 estimated number of rows from FETCH(3) = 5000 where Actual number of rows = 10000

Access Plan:
-----------
Total Cost:             1583.12
Query Degree:           1

Rows
Rows Actual
RETURN
(   1)
Cost
I/O
|
5
999
^HSJOIN
(   2)
1583.12
NA
/---------+----------\
5000                     1000
10000                    1000
FETCH                    FETCH
(   3)                   (   5)
1385.76                  137.25
NA                       NA
/---+---\                /---+---\
10000       1e+06        1000        1e+06
10000        NA          1000         NA
IXSCAN  TABLE: B         IXSCAN  TABLE: A
(   4)         B         (   6)         A
780.87        Q2         86.361        Q1
NA                       NA
|                        |
1e+06                    1e+06
NA                       NA
INDEX: B                 INDEX: A
IX1B                     IX1A
Q2                       Q1

3) FETCH : (Fetch)

Filter Factor:                  0.5

Predicate Text:
--------------
(Q2.COL2 <> 0)

Output Streams:
--------------
4) To Operator #2

Estimated number of rows:       5000
Actual number of rows:          10000

- After running "update b.b set col2=1 where key = 1" which made COL2 all value 1 and running runstats. The plan was changed because the filter factor was changed from 0.5 to 0.000001.
The big difference between estimated number of rows and actual number of rows was in FETCH(3) due to the filter factor 0.000001 whereas all the rows satisfy (Q2.COL2 <> 0).

Access Plan:
-----------
Total Cost:             1402.79
Query Degree:           1

Rows
Rows Actual
RETURN
(   1)
Cost
I/O
|
1e-05
1000
^NLJOIN
(   2)
1402.79
NA
/---------+----------\
0.01                     0.001
10000                     0.1
FETCH                    FETCH
(   3)                   (   5)
1385.76                  23.9671
NA                       NA
/---+---\                /---+----\
10000       1e+06          1          1e+06
10000        NA            1           NA
IXSCAN  TABLE: B         IXSCAN   TABLE: A
(   4)         B         (   6)          A
780.87        Q2         16.5197        Q1
NA                       NA
|                        |
1e+06                    1e+06
NA                       NA
INDEX: B                 INDEX: A
IX1B                     IX0A
Q2                       Q1

3) FETCH : (Fetch)

Filter Factor:                  1e-06

Predicate Text:
--------------
(Q2.COL2 <> 0)

Output Streams:
--------------
4) To Operator #2

Estimated number of rows:       0.01
Actual number of rows:          10000

- The solution is to collect DISTRUBITION stats either by manually or by AUTO_RUNSTATS configuration parameter. Distributions stats avoids the one value problem.

Q1. Why the subquery "select x.key from b.b x where x.col1 = 1 and x.col2 != 0 and x.key in ( select y.key from a.a y where y.col1 = 1 )" results the joins.
A1. Because, DB2 rewrites the query to join as follows, then chooses optimal the plan. Optimized statement is the result of query rewrite.

Original Statement:
------------------
select
x.key
from
b.b x
where
x.col1 = 1 and
x.col2 != 0 and
x.key in
(select
y.key
from
a.a y
where
y.col1 = 1
)

Optimized Statement:
-------------------
SELECT
Q2.KEY AS "KEY"
FROM
"A ".A AS Q1,
"B ".B AS Q2
WHERE
(Q1.COL1 = 1) AND
(Q2.KEY = Q1.KEY) AND
(Q2.COL2 <> 0) AND
(Q2.COL1 = 1)

Q2. What are the characteristics of the two input data files ?
A2. data for table a.a: key column is a sequential number from 1 to 1000000, col1 is modulo 1000 of key, ordered by col1
data for table b.b: key column is a sequential number from 1 to 1000000, col1 is modulo 100 of key, col2 is 1 except key = 1, ordered by col1.
So, 999999 rows had col2=1, col2=0 only for key=1 initially.

Q3. Why index A.IX1A on COL1 was used for table A.A in HSJOIN and why index A.IX0A on KEY was used for table A.A in NLJOIN ?
A3. These indexes were chosen for cheaper cost for each join.

Q4. Is there a way to disable subquery to join rewrite answered in A1?
A4. Yes, SUBQ2JOIN OPTION in OPTGUIDELINES allows you to do.

Q5. Does DISTRIBUTION stats always avoid the problem ?
A5. Yes, it does, if it is used in optimization, but, queries which do NOT respect distribution stats will still hit the problem. They are:
- The predicate in question uses parameter markers or host variables, without reopt enabled. e.x. ( COL2 != ? )
- Statement concentrator enabled databases. i.e.  STMT_CONC = LITERALS
Note, these type of queries are optimized for query compilation time. This is a trade off, less query compilation time, less optimization.

[{"Business Unit":{"code":"BU029","label":"Data and AI"}, "Product":{"code":"SSEPGG","label":"DB2 for Linux- UNIX and Windows"},"Component":"","Platform":[{"code":"PF025","label":"Platform Independent"}],"Version":"","Edition":""}]

ibm11140454