Scenario: Improving cardinality estimates using statistical views
In a data warehouse, fact table information often changes quite dynamically, whereas dimension table data is static. This means that dimension attribute data might be positively or negatively correlated with fact table attribute data.
Traditional base table statistics currently available to the optimizer do not allow it to discern relationships across tables. Column and table distribution statistics on statistical views (and MQTs) can be used to give the optimizer the necessary information to correct these types of cardinality estimation errors.
select sum(f.sales_price), d2.year
from product d1, period d2, daily_sales f
where d1.prodkey = f.prodkey
and d2.perkey = f.perkey
and d1.item_desc = 'golf club'
and d2.month = 'JUL'
group by d2.year
A star join query execution plan can be an excellent choice for this query, provided that the optimizer can determine whether the semi-join involving PRODUCT and DAILY_SALES, or the semi-join involving PERIOD and DAILY_SALES, is the most selective. To generate an efficient star join plan, the optimizer must be able to choose the most selective semi-join for the outer leg of the index ANDing operation.
Data warehouses often contain records for products that are no
longer on store shelves. This can cause the distribution of PRODUCT
columns after the join to appear dramatically different than their
distribution before the join. Because the optimizer, for lack of better
information, will determine the selectivity of local predicates based
solely on base table statistics, the optimizer might become overly
optimistic regarding the selectivity of the predicate item_desc
= 'golf club'
For example, if golf clubs historically represent 1% of the products
manufactured, but now account for 20% of sales, the optimizer would
likely overestimate the selectivity of item_desc = 'golf club'
,
because there are no statistics describing the distribution of item_desc after
the join. And if sales in all twelve months are equally likely, the
selectivity of the predicate month = 'JUL'
would
be around 8%, and thus the error in estimating the selectivity of
the predicate item_desc = 'golf club'
would mistakenly
cause the optimizer to perform the seemingly more selective semi-join
between PRODUCT and DAILY_SALES as the outer leg of the star join
plan's index ANDing operation.
The following example provides a step-by-step illustration of how to set up statistical views to solve this type of problem.
Column | storekey | store_number | city | state | district | ... |
---|---|---|---|---|---|---|
Attribute | integer
not null primary key |
char(2) | char(20) | char(5) | char(14) | ... |
Column | custkey | name | address | age | gender | ... |
---|---|---|---|---|---|---|
Attribute | integer
not null primary key |
char(30) | char(40) | smallint | char(1) | ... |
Column | prodkey | category | item_desc | price | cost | ... |
---|---|---|---|---|---|---|
Attribute | integer
not null primary key |
integer | char(30) | decimal(11) | decimal(11) | ... |
Column | promokey | promotype | promodesc | promovalue | ... |
---|---|---|---|---|---|
Attribute | integer
not null primary key |
integer | char(30) | decimal(5) | ... |
Column | perkey | calendar_date | month | period | year | ... |
---|---|---|---|---|---|---|
Attribute | integer
not null primary key |
date | char(3) | smallint | smallint | ... |
Column | storekey | custkey | prodkey | promokey | perkey | sales_price | ... |
---|---|---|---|---|---|---|---|
Attribute | integer | integer | integer | integer | integer | decimal(11) | ... |
Suppose the company managers want to determine whether or not consumers will buy a product again if they are offered a discount on a return visit. Moreover, suppose this study is done only for store '01', which has 18 locations nationwide. Table 7 shows information about the different categories of promotion that are available.
promotype | promodesc | COUNT (promotype) | percentage of total |
---|---|---|---|
1 | Return customers | 1 | 2.86% |
2 | Coupon | 15 | 42.86% |
3 | Advertisement | 5 | 14.29% |
4 | Manager's special | 3 | 8.57% |
5 | Overstocked items | 4 | 11.43% |
6 | End aisle display | 7 | 20.00% |
select count(*)
from store d1, promotion d2, daily_sales f
where d1.storekey = f.storekey
and d2.promokey = f.promokey
and d1.store_number = '01'
and d2.promotype = 1
6.15567e+06
IXAND
( 8)
/------------------+------------------\
2.15448e+07 2.15448e+08
NLJOIN NLJOIN
( 9) ( 13)
/---------+--------\ /---------+--------\
1 2.15448e+07 18 1.19694e+07
FETCH IXSCAN FETCH IXSCAN
( 10) ( 12) ( 14) ( 16)
/---+---\ | /---+---\ |
35 35 7.54069e+08 18 63 7.54069e+08
IXSCAN TABLE: DB2DBA INDEX: DB2DBA IXSCAN TABLE: DB2DBA INDEX: DB2DBA
( 11) PROMOTION PROMO_FK_IDX ( 15) STORE STORE_FK_IDX
| |
35 63
INDEX: DB2DBA INDEX: DB2DBA
PROMOTION_PK_IDX STOREX1
Before Join | After Join | |||
---|---|---|---|---|
Predicate | count | percentage of rows qualified | count | percentage of rows qualified |
store_number = '01' | 18 | 28.57% | 2.15448e+08 | 28.57% |
promotype = 1 | 1 | 2.86% | 2.15448e+07 | 2.86% |
Because the probability of promotype = 1
is less
than that of store_number = '01'
, the optimizer chooses
the semi-join between PROMOTION and DAILY_SALES as the outer leg of
the star join plan's index ANDing operation. This leads to an estimated
count of approximately 6 155 670 products sold using promotion type
1 - an incorrect cardinality estimate that is off by a factor of 2.09
(12 889 514 ÷ 6 155 670 ≈ 2.09).
- Create a view representing the join of STORE with DAILY_SALES.
create view sv_store_dailysales as (select s.* from store s, daily_sales ds where s.storekey = ds.storekey)
- Create a view representing the join of PROMOTION with DAILY_SALES.
create view sv_promotion_dailysales as (select p.* from promotion.p, daily_sales ds where p.promokey = ds.promokey)
- Make the views statistical views by enabling them for query optimization:
alter view sv_store_dailysales enable query optimization alter view sv_promotion_dailysales enable query optimization
- Execute the RUNSTATS command to collect statistics
on the views:
runstats on table db2dba.sv_store_dailysales with distribution runstats on table db2dba.sv_promotion_dailysales with distribution
- Run the query again so that it can be re-optimized. Upon reoptimization,
the optimizer will match SV_STORE_DAILYSALES and SV_PROMOTION_DAILYSALES
with the query, and will use the view statistics to adjust the cardinality
estimate of the semi-joins between the fact and dimension tables,
causing a reversal of the original order of the semi-joins chosen
without these statistics. The new plan is as follows:
1.04627e+07 IXAND ( 8) /------------------+------------------\ 6.99152e+07 1.12845e+08 NLJOIN NLJOIN ( 9) ( 13) /---------+--------\ /---------+--------\ 18 3.88418e+06 1 1.12845e+08 FETCH IXSCAN FETCH IXSCAN ( 10) ( 12) ( 14) ( 16) /---+---\ | /---+---\ | 18 63 7.54069e+08 35 35 7.54069e+08 IXSCAN TABLE:DB2DBA INDEX: DB2DBA IXSCAN TABLE: DB2DBA INDEX: DB2DBA DB2DBA ( 11) STORE STORE_FK_IDX ( 15) PROMOTION PROMO_FK_IDX | | 63 35 INDEX: DB2DBA INDEX: DB2DBA STOREX1 PROMOTION_PK_IDX
Before Join | After Join (no statistical views) | After Join (with statistical views) | ||||
---|---|---|---|---|---|---|
Predicate | count | percentage of rows qualified | count | percentage of rows qualified | count | percentage of rows qualified |
store_number = '01' | 18 | 28.57% | 2.15448e+08 | 28.57% | 6.99152e+07 | 9.27% |
promotype = 1 | 1 | 2.86% | 2.15448e+07 | 2.86% | 1.12845e+08 | 14.96% |
store_number = '01'
will filter more rows
than promotype = 1
. This time, the optimizer estimates
that there are approximately 10 462 700 products sold. This estimate
is off by a factor of 1.23 (12 889 514 ÷ 10 462 700 ≈ 1.23), which
is a significant improvement over the estimate without statistical
views (in Table 8).