Scenario: Improving cardinality estimates using statistical views

In a data warehouse, fact table information often changes quite dynamically, whereas dimension table data is static. This means that dimension attribute data might be positively or negatively correlated with fact table attribute data.

Traditional base table statistics currently available to the optimizer do not allow it to discern relationships across tables. Column and table distribution statistics on statistical views (and MQTs) can be used to give the optimizer the necessary information to correct these types of cardinality estimation errors.

Consider the following query that computes annual sales revenue for golf clubs sold during July of each year:
   select  sum(f.sales_price), d2.year
     from  product d1, period d2, daily_sales f
     where  d1.prodkey = f.prodkey
       and  d2.perkey = f.perkey
       and  d1.item_desc = 'golf club'
       and  d2.month = 'JUL'
     group by  d2.year

A star join query execution plan can be an excellent choice for this query, provided that the optimizer can determine whether the semi-join involving PRODUCT and DAILY_SALES, or the semi-join involving PERIOD and DAILY_SALES, is the most selective. To generate an efficient star join plan, the optimizer must be able to choose the most selective semi-join for the outer leg of the index ANDing operation.

Data warehouses often contain records for products that are no longer on store shelves. This can cause the distribution of PRODUCT columns after the join to appear dramatically different than their distribution before the join. Because the optimizer, for lack of better information, will determine the selectivity of local predicates based solely on base table statistics, the optimizer might become overly optimistic regarding the selectivity of the predicate item_desc = 'golf club'

For example, if golf clubs historically represent 1% of the products manufactured, but now account for 20% of sales, the optimizer would likely overestimate the selectivity of item_desc = 'golf club', because there are no statistics describing the distribution of item_desc after the join. And if sales in all twelve months are equally likely, the selectivity of the predicate month = 'JUL' would be around 8%, and thus the error in estimating the selectivity of the predicate item_desc = 'golf club' would mistakenly cause the optimizer to perform the seemingly more selective semi-join between PRODUCT and DAILY_SALES as the outer leg of the star join plan's index ANDing operation.

The following example provides a step-by-step illustration of how to set up statistical views to solve this type of problem.

Consider a database from a typical data warehouse, where STORE, CUSTOMER, PRODUCT, PROMOTION, and PERIOD are the dimension tables, and DAILY_SALES is the fact table. The following tables provide the definitions for these tables.
Table 1. STORE (63 rows)
Column storekey store_number city state district ...
Attribute
integer
not null
primary key
char(2) char(20) char(5) char(14) ...
Table 2. CUSTOMER (1 000 000 rows)
Column custkey name address age gender ...
Attribute
integer
not null
primary key
char(30) char(40) smallint char(1) ...
Table 3. PRODUCT (19 450 rows)
Column prodkey category item_desc price cost ...
Attribute
integer
not null
primary key
integer char(30) decimal(11) decimal(11) ...
Table 4. PROMOTION (35 rows)
Column promokey promotype promodesc promovalue ...
Attribute
integer
not null
primary key
integer char(30) decimal(5) ...
Table 5. PERIOD (2922 rows)
Column perkey calendar_date month period year ...
Attribute
integer
not null
primary key
date char(3) smallint smallint ...
Table 6. DAILY_SALES (754 069 426 rows)
Column storekey custkey prodkey promokey perkey sales_price ...
Attribute integer integer integer integer integer decimal(11) ...

Suppose the company managers want to determine whether or not consumers will buy a product again if they are offered a discount on a return visit. Moreover, suppose this study is done only for store '01', which has 18 locations nationwide. Table 7 shows information about the different categories of promotion that are available.

Table 7. PROMOTION (35 rows)
promotype promodesc COUNT (promotype) percentage of total
1 Return customers 1 2.86%
2 Coupon 15 42.86%
3 Advertisement 5 14.29%
4 Manager's special 3 8.57%
5 Overstocked items 4 11.43%
6 End aisle display 7 20.00%
The table indicates that discounts for return customers represents only 2.86% of the 35 kinds of promotions that were offered.
The following query returns a count of 12 889 514:
   select  count(*)
     from  store d1, promotion d2, daily_sales f
     where  d1.storekey = f.storekey
       and  d2.promokey = f.promokey
       and  d1.store_number = '01'
       and  d2.promotype = 1
This query executes according to the following plan that is generated by the optimizer. In each node of this diagram, the first row is the cardinality estimate, the second row is the operator type, and the third row (the number in parentheses) is the operator ID.
                                          6.15567e+06
                                             IXAND
                                             (   8)
                             /------------------+------------------\
                      2.15448e+07                                2.15448e+08
                        NLJOIN                                     NLJOIN
                        (   9)                                     (  13)
                 /---------+--------\                       /---------+--------\
               1                  2.15448e+07            18                1.19694e+07
            FETCH                   IXSCAN             FETCH                   IXSCAN
            (  10)                  (  12)             (  14)                  (  16)
           /---+---\                  |               /---+---\                  |
         35         35           7.54069e+08        18         63          7.54069e+08
      IXSCAN  TABLE: DB2DBA  INDEX: DB2DBA       IXSCAN   TABLE: DB2DBA       INDEX: DB2DBA
      (  11)       PROMOTION     PROMO_FK_IDX     (  15)         STORE             STORE_FK_IDX
        |                                          |
        35                                         63
  INDEX: DB2DBA                                INDEX: DB2DBA
 PROMOTION_PK_IDX                              STOREX1
At the nested loop join (number 9), the optimizer estimates that around 2.86% of the product sold resulted from customers coming back to buy the same products at a discounted price (2.15448e+07 ÷ 7.54069e+08 ≈ 0.0286). Note that this is the same value before and after joining the PROMOTION table with the DAILY_SALES table. Table 8 summarizes the cardinality estimates and their percentage (the filtering effect) before and after the join.
Table 8. Cardinality estimates before and after joining with DAILY_SALES.
  Before Join After Join
Predicate count percentage of rows qualified count percentage of rows qualified
store_number = '01' 18 28.57% 2.15448e+08 28.57%
promotype = 1 1 2.86% 2.15448e+07 2.86%

Because the probability of promotype = 1 is less than that of store_number = '01', the optimizer chooses the semi-join between PROMOTION and DAILY_SALES as the outer leg of the star join plan's index ANDing operation. This leads to an estimated count of approximately 6 155 670 products sold using promotion type 1 - an incorrect cardinality estimate that is off by a factor of 2.09 (12 889 514 ÷ 6 155 670 ≈ 2.09).

What causes the optimizer to only estimate half of the actual number of records satisfying the two predicates? Store '01' represents about 28.57% of all the stores. What if other stores had more sales than store '01' (less than 28.57%)? Or what if store '01' actually sold most of the product (more than 28.57%)? Likewise, the 2.86% of products sold using promotion type 1 shown in Table 8 can be misleading. The actual percentage in DAILY_SALES could very well be a different figure than the projected one.
You can use statistical views to help the optimizer correct its estimates. First, you need to create two statistical views representing each semi-join in the previous query. The first statistical view provides the distribution of stores for all daily sales. The second statistical view represents the distribution of promotion types for all daily sales. Note that each statistical view can provide the distribution information for any particular store number or promotion type. In this example, you use a 10% sample rate to retrieve the records in DAILY_SALES for the respective views and save them in global temporary tables. You can then query those tables to collect the necessary statistics to update the two statistical views.
  1. Create a view representing the join of STORE with DAILY_SALES.
       create view sv_store_dailysales as
         (select s.*
           from store s, daily_sales ds
           where s.storekey = ds.storekey)
  2. Create a view representing the join of PROMOTION with DAILY_SALES.
       create view sv_promotion_dailysales as
         (select p.*
           from promotion.p, daily_sales ds
           where p.promokey = ds.promokey)
  3. Make the views statistical views by enabling them for query optimization:
       alter view sv_store_dailysales enable query optimization
       alter view sv_promotion_dailysales enable query optimization
  4. Execute the RUNSTATS command to collect statistics on the views:
       runstats on table db2dba.sv_store_dailysales with distribution
       runstats on table db2dba.sv_promotion_dailysales with distribution
  5. Run the query again so that it can be re-optimized. Upon reoptimization, the optimizer will match SV_STORE_DAILYSALES and SV_PROMOTION_DAILYSALES with the query, and will use the view statistics to adjust the cardinality estimate of the semi-joins between the fact and dimension tables, causing a reversal of the original order of the semi-joins chosen without these statistics. The new plan is as follows:
                                              1.04627e+07
                                                IXAND
                                                (   8)
                                /------------------+------------------\
                         6.99152e+07                                1.12845e+08
                           NLJOIN                                     NLJOIN
                           (   9)                                     (  13)
                    /---------+--------\                       /---------+--------\
                 18                 3.88418e+06             1                  1.12845e+08
               FETCH                   IXSCAN             FETCH                   IXSCAN
               (  10)                  (  12)             (  14)                  (  16)
              /---+---\                  |               /---+---\                  |
           18           63           7.54069e+08      35           35          7.54069e+08
         IXSCAN   TABLE:DB2DBA     INDEX: DB2DBA   IXSCAN    TABLE: DB2DBA    INDEX: DB2DBA DB2DBA
        (  11)      STORE            STORE_FK_IDX    (  15)       PROMOTION        PROMO_FK_IDX
           |                                          |
           63                                         35
       INDEX: DB2DBA                              INDEX: DB2DBA
       STOREX1                                    PROMOTION_PK_IDX
Table 9 summarizes the cardinality estimates and their percentage (the filtering effect) before and after the join for each semi-join.
Table 9. Cardinality estimates before and after joining with DAILY_SALES.
  Before Join After Join (no statistical views) After Join (with statistical views)
Predicate count percentage of rows qualified count percentage of rows qualified count percentage of rows qualified
store_number = '01' 18 28.57% 2.15448e+08 28.57% 6.99152e+07 9.27%
promotype = 1 1 2.86% 2.15448e+07 2.86% 1.12845e+08 14.96%
Note that this time, the semi-join between STORE and DAILY_SALES is performed on the outer leg of the index ANDing plan. This is because the two statistical views essentially tell the optimizer that the predicate store_number = '01' will filter more rows than promotype = 1. This time, the optimizer estimates that there are approximately 10 462 700 products sold. This estimate is off by a factor of 1.23 (12 889 514 ÷ 10 462 700 ≈ 1.23), which is a significant improvement over the estimate without statistical views (in Table 8).