# Get the most out of DB2 optimizer: Leveraging statistical views to improve query execution performance

The DB2® cost-based optimizer (referred to as optimizer) estimates the execution costs of each access plan and selects an optimal access plan based on its estimates. The cardinality, or number of rows, estimated to be processed at each operator in an access plan is a significant factor in estimating the cost. The accuracy of the cardinality estimates is dependent on the statistics collected. Typically, users focus on collecting statistics on individual tables, including basic single-column statistics, distribution statistics on individual columns, and multi-column statistics covering sets of two or more columns. These statistics can handle most requirements, but they do not capture complex relationships such as those across joins. Statistical views provide the statistics required to capture complex relationships.

This article provides examples to show how to identify these complex relationships and how to capture them using statistical views. Additional information (listed in Related topics) includes links to articles and documentation on the statistics that can be collected, including statistical views. This article assumes that you have a basic understanding of statistical views.

## Schema layout

In this article, a retail product supplier business model is used, which is described by the TPC decision support benchmark.

##### Figure 1. Schema

The retail business model consists of four subject areas—catalog, web, store, and inventory— and all except inventory are comprised of two fact tables. The majority of the example scenarios discussed in this article are based on the web subject area, with a focus on the star schema that includes the WEB_SALES fact table.

##### Figure 2. WEB_SALES star schema

## Statistical views capture skew in a join

The optimizer computes the cardinality estimate of equality joins assuming the data is uniformly distributed and the domain of values from one side of the join exist in the other. Using these assumptions, the optimizer computes the filter factor of the equality join predicate as 1/(max (colcard(Table1.Column1), colcard(Table2.Column2)), where colcard(Table.Column) represents the number of distinct values in the column. If the data in one or both sides of the join is not uniformly distributed, this uniformity assumption can introduce an error in the cardinality estimate. A statistical view on the join can capture the skew in the join, resulting in a more accurate cardinality estimate.

In this example, you can use the query in Listing 1. It is a simple query with a single join between the two fact tables STORE_SALES and STORE_RETURNS.

##### Listing 1. Query with single join

select * from tpcds.store_sales s, tpcds.store_returns r where r.sr_item_sk=s.ss_item_sk ;

##### Listing 2. Access plan for join query from Listing 1

Rows RETURN ( 1) Cost I/O | 4.62396e+07 DTQ ( 2) 1.07504e+07 11975 | 2.31198e+07 HSJOIN ( 3) 261742 11975 /-------+-------\ 144857 1.45096e+06 TBSCAN TBSCAN ( 4) ( 5) 13146.3 142221 972 11003 | | 144857 1.45096e+06 DP-TABLE: TPCDS DP-TABLE: TPCDS STORE_RETURNS STORE_SALES Q1 Q2

The query in Listing 1 returns 58 million rows, but the access plan in Listing 2 shows the optimizer estimates 46 million rows (23 million rows per database partition), under-estimating the cardinality of the join by 12 million rows. The query only consists of a single join predicate, so the error in the cardinality results from the non-uniform distribution in the join columns. You can create the statistical view shown in Listing 3 to improve the estimate.

##### Listing 3. Statistical view to handle skew

create view tpcds.skew1 (c1) as ( select 1 from tpcds.store_sales s, tpcds.store_returns r where r.sr_item_sk=s.ss_item_sk ); alter view tpcds.skew1 enable query optimization; runstats on view tpcds.skew1 tablesample bernoulli(10);

The view definition in Listing 3 does not include any columns in the select list. Typically, the view definition will include columns to provide the distribution of the columns after the join, allowing the optimizer to capture other relationships with the join. You will see examples of this later in this article. In this scenario only the result of the join is captured, so the cardinality statistic of the view is the only information required. Not including columns in the select list of the view helps to reduce the time to collect statistics on the view.

When collecting statistics, you will not be able to apply system-level sampling because the join is not loss-less. As a result, the full join will be materialized, but the use of bernoulli sampling will reduce the amount of work RUNSTATS needs to perform on the result set. If the result is large for the join predicate, statistics collection can be expensive, even with bernoulli sampling, especially if the join is expanding. To improve the time to collect statistics, you can update the statistics manually and sampling the larger of the two tables involved in the join, as shown in Listing 4.

##### Listing 4. Update statistics using sampling

update sysstat.tables set card = 10 * (select count_big(*) from tpcds.store_returns, tpcds.store_sales s tablesample system(10) where r.sr_item_sk=s.ss_item_sk) where tabschema='TPCDS' and tabname='SKEW1';

The update statement in Listing 4 applies system-level sampling on the store_sales table, which is the larger of the two, using a sampling rate of 10%, so you must extrapolate by a factor of 10 when updating the cardinality statistic. Because the join is not loss-less, system level sampling can introduce a large error, but experience has shown that for joins with significant skew in the distribution, the error is not a significant factor.

For joins where both tables are large, or the result is expanding, sampling only one side of the join might still be expensive to collect the cardinality statistic. Under these conditions, you can consider system-level sampling on each table. The result of the join with both sides sampled will contain at most 1/sampling_rate(store_returns) * 1/sampling_rate(store_returns), so the update statement must extrapolate the result using the inverse of this result. If the sampling rate is 10 for both tables, you must extrapolate by a factor of 100 (computed as1/ (1/sampling_rate(store_returns) * 1/sampling_rate(store_returns)) = 1/(1/10 * 1/10) = 100. You must be very careful when sampling both sides of the join because it is highly likely that the result will produce no rows for low sampling rates.

Listing 5 shows the access plan after updating the statistics of the view. With this extra statistic on the join available, the optimizer is able to calculate a better estimate

##### Listing 5. Access plan after view statistics are updated

Rows RETURN ( 1) Cost I/O | 5.82537e+07 DTQ ( 2) 1.34868e+07 11975 | 2.91269e+07 HSJOIN ( 3) 273042 11975 /-------+-------\ 144857 1.45096e+06 TBSCAN TBSCAN ( 4) ( 5) 13146.3 142221 972 11003 | | 144857 1.45096e+06 DP-TABLE: TPCDS DP-TABLE: TPCDS STORE_RETURNS STORE_SALES Q1 Q2

If you see that the error in the cardinality estimate is still too large, you can experiment by increasing the sampling rate.

The access plan in Listing 5 shows that the resulting cardinality improved and is very accurate, but the access plan did not change. This is a simple example to illustrate the concepts. The optimizer can tolerate some error in the cardinality estimates, and the above scenario is such an example.

In the next example, you can use the query in Listing 6 involving another single
join, this time between the fact tables WEB_RETURNS and WEB_SALES, where the
non-uniform distribution of the join columns can impact the optimizer's access plan
decision. In this join, the tables are related on two columns, as well as an extra
range predicate in the `WHERE`

clause applied on WEB_RETURNS.

##### Listing 6. Query with singe join, multiple predicates

select * from tpcds.web_returns r, tpcds.web_sales s where r.wr_web_page_sk=s.ws_web_page_sk and s.ws_item_sk=r.wr_item_sk and r.wr_return_amt > 10000;

Two indexes of interest are defined on the WEB_SALES table:

- SALES_PAGE_ITEM, on columns (WS_WEB_PAGE_SK, WS_ITEM_SK), covers both join predicates
- Primary key index on (WS_ITEM_SK, WS_ORDER_NUMBER), which covers one of the join columns

As you did for the query in Listing 1, you can create a statistical view on the join between WEB_SALES and WEB_RETURNS, as shown in Listing 7.

##### Listing 7. Statistical view to handle skew covering two join predicates

create view tpcds.skew2 (c1) as ( select r.* from tpcds.web_returns r, tpcds.web_sales s where r.wr_web_page_sk=s.ws_web_page_sk and s.ws_item_sk=r.wr_item_sk);

The view definition in Listing 7 includes the columns from web_returns in the SELECT list. This provides the distribution of the columns after the join and will allow the optimizer to capture the relationship between the range predicate on WR_RETURN_AMT and the join.

The access plan for the query in Listing 6, after creating the statistical view, is shown in Listing 8.

##### Listing 8. Access plan for join query from Listing 6

Rows RETURN ( 1) Cost I/O | 5533.93 DTQ ( 2) 29322.2 6447.38 | 2766.96 NLJOIN ( 3) 27991.4 6447.38 /-------+--------\ 1297.64 2.1323 TBSCAN FETCH ( 4) ( 5) 1340.55 69.4394 297 34.2444 | /----+-----\ 36212 39.8974 362707 DP-TABLE: TPCDS RIDSCN DP-TABLE: TPCDS WEB_RETURNS ( 6) WEB_SALES Q2 17.4891 Q1 1 | 39.8974 SORT ( 7) 17.4376 1 | 39.8974 IXSCAN ( 8) 16.6369 1 | 362707 INDEX: SYSIBM SQL130403122909700 Q1

The access plan shows a nested loop join (NLJOIN) operation used to perform the join, with WEB_SALES on the inner of the join, using the primary key index to access the data from the WEB_SALES table. Because the primary key index only includes one of the columns referenced by the join predicates, the index scan IXSCAN(8) only applies the join predicate s.WS_ITEM_SK=r.WR_ITEM_SK. The other join predicate is applied in FETCH(5).

The query performs better if index SALES_PAGE_ITEM is selected by the optimizer as it covers both columns in both join predicates. In this example, we do account for the non-uniform distribution in the join columns, for both join predicates combined, but each individual join predicate can also reference columns with non-uniformly distributed data. As a result, the cardinality estimate at IXSCAN(8) might be too low because the optimizer will compute the filter factor of the join predicate using the uniform distribution assumption described above. Furthermore, because a statistical view is created that covers both join predicates, the cardinality for the index access of the SALES_PAGE_ITEM index will be accurately computed and this partial knowledge can lead the optimizer to bias toward an index access with a cost estimated using less knowledge.

You can create an additional statistical view to fill in the gap using the definition in Listing 9.

##### Listing 9. Statistical view to handle skew for one of the join predicates

create view tpcds.skew3 (c1) as ( select r.* from tpcds.web_returns r, tpcds.web_sales s where s.WS_ITEM_SK=r.WR_ITEM_SK);

Listing 10 shows the access plan after collecting the statistics on the view. With the additional statistical view collected to fill in the missing statistics, the optimizer chooses the expected, better performing access plan.

##### Listing 10. Access plan after creating additional statistical view

Rows RETURN ( 1) Cost I/O | 5533.93 DTQ ( 2) 50711.7 4975 | 2766.96 NLJOIN ( 3) 49050.3 4975 /-------+--------\ 1297.64 2.1323 TBSCAN FETCH ( 4) ( 5) 1340.55 79.1295 297 9.57933 | /----+-----\ 36212 2.1323 362707 DP-TABLE: TPCDS IXSCAN DP-TABLE: TPCDS WEB_RETURNS ( 6) WEB_SALES Q2 66.9975 Q1 8 | 362707 DP-INDEX: TPCDS SALES_PAGE_ITEM Q1

Comparing the access plans in Listings 8 and 10, you will notice that the output cardinality at the FETCH operator on the inner of the NLJOIN, and the NLJOIN operator as well, is the same. The statistical view, SKEW2, provides the appropriate statistic to compute the cardinality estimate for the complete join, so this will remain the same after creating the statistical view SKEW3. You can use the optimization guideline described in Listing 11, applied using an optimization profile, to force the optimizer to choose the primary key index access to better understand the difference after the SKEW3 statistical view is created.

##### Listing 11. Optimization guideline and resulting access plan

<OPTGUIDELINES> <IXSCAN TABID='S' INDEX='SQL130403122909700'/> </OPTGUIDELINES> Rows RETURN ( 1) Cost I/O | 5533.93 DTQ ( 2) 62374.8 4808 | 2766.97 NLJOIN ( 3) 60713.4 4808 /-------+--------\ 1297.64 2.1323 TBSCAN FETCH ( 4) ( 5) 1340.55 261.726 297 34.2004 | /----+-----\ 36212 52.334 362707 DP-TABLE: TPCDS IXSCAN DP-TABLE: TPCDS WEB_RETURNS ( 6) WEB_SALES Q2 11.9133 Q1 1 | 362707 INDEX: SYSIBM SQL130403122909700 Q1

Comparing the IXSCAN(6) in Listing 11 to IXSCAN(8) in Listing 8, you can confirm that the output cardinality of the primary key index access increased. This results in a larger cost estimate for the inner input to the NLJOIN, as well as the NLJOIN itself, leading to the optimizer favouring the better performing access plan in Listing 10.

## Improving cardinality for a query with multiple query blocks

In this example, you will look at a more complex scenario (shown in Listing 12) involving multiple query blocks.

##### Listing 12. Scenario with multiple query blocks

SELECT WS.WS_NET_PROFIT, WS.WS_EXT_SALES_PRICE, WS.WS_ORDER_NUMBER, WS.WS_ITEM_SK FROM (SELECT WS.WS_NET_PROFIT, WS.WS_EXT_SALES_PRICE, WS.WS_ORDER_NUMBER, WS.WS_ITEM_SK, WS.WS_SHIP_ADDR_SK, W.WEB_SITE_ID FROM TPCDS.WEB_SALES AS WS, TPCDS.PROMOTION AS P, TPCDS.DATE_DIM AS D, TPCDS.WEB_SITE AS W, TPCDS.ITEM AS I WHERE 'N' = P.P_CHANNEL_TV AND P.P_PROMO_SK = WS.WS_PROMO_SK AND D.D_DATE_SK = WS.WS_SOLD_DATE_SK AND W.WEB_SITE_SK = WS.WS_WEB_SITE_SK AND I.I_ITEM_SK = WS.WS_ITEM_SK AND I.I_CURRENT_PRICE >= 1 AND D.D_DATE BETWEEN '12/01/1998' and '12/31/1998' FETCH FIRST 5000 ROWS ONLY ) AS WS, TPCDS.WEB_RETURNS WR, TPCDS.CUSTOMER_ADDRESS CA WHERE WS.WS_SHIP_ADDR_SK = CA.CA_ADDRESS_SK AND CA.CA_STATE in ('NM') AND CA.CA_COUNTRY = 'United States' AND WR.wr_returning_addr_sk = CA.CA_ADDRESS_SK AND WS.WS_ITEM_SK = WR.WR_ITEM_SK

##### Listing 13. Access plan for query from Listing 12

Rows RETURN ( 1) Cost I/O | 0.000340365 DTQ ( 2) 3019.24 912.709 | 0.000170183 NLJOIN ( 3) 3016.66 912.709 /--------+--------\ 1.56409 0.000108806 DTQ FETCH ( 4) ( 22) 2972.54 28.2322 906.964 3.67285 | ...-+-----\ 3.12818 36212 ^NLJOIN DP-TABLE: TPCDS ( 5) WEB_RETURNS 2969.54 Q2 906.964 /--------+--------\ 235.248 0.0132974 TBSCAN FETCH ( 6) ( 20) 1807.02 17.0631 360.796 2 | /----+----\ 235.248 1 50000 SORT IXSCAN TABLE: TPCDS ( 7) ( 21) CUSTOMER_ADDRESS 1804.61 9.53623 Q1 360.796 1 | | 235.248 50000 DTQ INDEX: SYSIBM ( 8) SQL130423185609760 1789.43 Q1 360.796 | 117.624 ^HSJOIN ( 9) 1773.45 360.796 /-----------+------------\ 117.624 30 HSJOIN^ IXSCAN ( 10) ( 19) 1768.53 3.66352 360.796 0 /-------------+-------------\ | 7736.27 138.222 30 TBSCAN ^HSJOIN INDEX: TPCDS ( 11) ( 12) R_WSITE_PK 1159.05 552.793 Q4 296 64.7961 | /----------+-----------\ 9091 141.042 294 TABLE: TPCDS NLJOIN TBSCAN ITEM ( 13) ( 18) Q3 491.967 57.1966 60.7961 4 /----------+-----------\ | 28.4059 4.96526 300 FETCH FETCH TABLE: TPCDS ( 14) ( 16) R_PROMOTION 20.1213 16.6765 Q6 2 2.06986 /---+----\ /----+-----\ 28.4059 73049 4.96526 362707 IXSCAN TABLE: TPCDS IXSCAN DP-TABLE: TPCDS ( 15) R_DATE_DIM ( 17) WEB_SALES 11.6459 Q5 8.02307 Q7 1 1 | | 73049 362707 INDEX: TPCDS DP-INDEX: TPCDS R_D_DATE WS_SOLDDATE Q5 Q7

The access plan in Listing 13 shows the cardinality drops to a value below 1 at the NLJOIN(3) operator. A cardinality estimate below 1 that is not in a correlation path, such as the inner input to a NLJOIN operator, is a good indicator of an error in the cardinality estimate, and a potentially sub-optimal access plan. You can consider the join between the query block WS and the table WEB_RETURNS as the starting point for defining a statistical view, but WS represents the result from a separate query block. Statistical views are not supported on joins across separate query blocks.

Because you cannot take advantage of a statistical view for the join, you can take a look at the inputs to the join to determine whether the cardinality can be improved. If so, this improvement in cardinality will usually propagate up the access plan. The root query involves a join between WEB_RETURNS, CUSTOMER_ADDRESS, and the query block WS. The join order begins with WS at TBSCAN(6), so you can start improving the cardinality by creating a statistical view that covers the joins in WS, as shown in Listing 14.

##### Listing 14. Statistical view definition for query block WS

CREATE VIEW TPCDS.DERIVEDSV AS SELECT P_CHANNEL_TV, I_CURRENT_PRICE, D_DATE FROM TPCDS.WEB_SALES AS WS, TPCDS.PROMOTION AS P, TPCDS.DATE_DIM AS D, TPCDS.WEB_SITE AS W, TPCDS.ITEM AS I WHERE P.P_PROMO_SK = WS.WS_PROMO_SK AND D.D_DATE_SK = WS.WS_SOLD_DATE_SK AND W.WEB_SITE_SK = WS.WS_WEB_SITE_SK AND I.I_ITEM_SK = WS.WS_ITEM_SK; ALTER VIEW TPCDS.DERIVEDSV ENABLE QUERY OPTIMIZATION; RUNSTATS ON TABLE TPCDS.DERIVEDSV WITH DISTRIBUTION;

Listing 15 shows the access plan chosen after the statistical view is created.

##### Listing 15. Access plan for query from Listing 12

Rows RETURN ( 1) Cost I/O | 0.00723418 DTQ ( 2) 3929.33 659.796 | 0.00361709 ^NLJOIN ( 3) 3926.75 659.796 /------------+-------------\ 0.272015 0.0132974 HSJOIN FETCH ( 4) ( 19) 3909.71 17.0436 657.796 2 /-------+--------\ /----+-----\ 36212 5000 1 50000 TBSCAN BTQ IXSCAN TABLE: TPCDS ( 5) ( 6) ( 20) R_CUSTOMER_ADDRESS 1236.09 2204.35 9.51674 Q1 297 360.796 1 | | | 36212 5000 50000 DP-TABLE: TPCDS ^NLJOIN INDEX: TPCDS WEB_RETURNS ( 7) R_CA_PK Q2 1795.75 Q1 360.796 /----+-----\ 235.248 100.306 BTQ IXSCAN ( 8) ( 18) 1785.52 8.67816 360.796 0 | | 117.624 30 HSJOIN^ INDEX: SYSIBM ( 9) SQL130423185656050 1768.53 Q4 360.796 /-------------+-------------\ 7736.27 138.222 TBSCAN ^HSJOIN ( 10) ( 11) 1159.05 552.793 296 64.7961 | /----------+-----------\ 9091 141.042 294 TABLE: TPCDS NLJOIN TBSCAN ITEM ( 12) ( 17) Q3 491.967 57.1966 60.7961 4 /----------+-----------\ | 28.4059 4.96526 300 FETCH FETCH TABLE: TPCDS ( 13) ( 15) R_PROMOTION 20.1213 16.6765 Q6 2 2.06986 /---+----\ /----+-----\ 28.4059 73049 4.96526 362707 IXSCAN TABLE: TPCDS IXSCAN DP-TABLE: TPCDS ( 14) R_DATE_DIM ( 16) WEB_SALES 11.6459 Q5 8.02307 Q7 1 1 | | 73049 362707 INDEX: TPCDS DP-INDEX: TPCDS R_D_DATE WS_SOLDDATE Q5 Q7

Comparing the access plans in Listings 13 and 15, you can see that the final cardinality has increased and the join order has changed, joining WEB_RETURNS to the WS query block first, using a HSJOIN operation instead of a NLJOIN operation. The cardinality and operations below HSJOIN(9) is the same between the two access plans. The statistical view only provides statistics on the full join. You will also notice that the output cardinality of IXSCAN(18), 100.306, is larger than the input cardinality, 30. This is a side effect of creating a single statistical view on the full join. Multiple statistical views must be created covering each join within the WS query block to correct any errors in the cardinality estimates below the NLJOIN(7) operator. This example is revisited later in this article to show how you can simplify the collection of the multiple statistical views required to cover all the joins in the WS query block.

## Simplifying the collection of statistical views

DB2 version 10.1 introduces improved functionality related to statistical views. One of those improvements can significantly simplify the collection of a statistical view for star schemas if referential integrity (RI) constraints can be defined on the joins. Prior to version 10.1, there are two types of RI constraints that can be defined: ENFORCED or NOT ENFORCED. Version 10.1 added a third type, which you can refer to as statistical RI constraints. A statistical RI constraint is defined using the NOT ENFORCED NOT TRUSTED options of the ALTER TABLE command when defining a FOREIGN KEY.

We will use the WEB_SALES star schema to illustrate how we can take advantage of statistical referential. Each of the dimensions join to the WEB_SALES fact table on the primary key of the dimension table, so you can define statistical RI for each join key in the WEB_SALES table. Listing 16 provides the list of statistical RI defined on the WEB_SALES star schema.

##### Listing 16. Statistical RI covering the WEB_SALES star schema

alter table tpcds.web_sales add foreign key (ws_sold_date_sk) references tpcds.date_dim(d_date_sk) not enforced not trusted; alter table tpcds.web_sales add foreign key (ws_item_sk) references tpcds.item(i_item_sk) not enforced not trusted; alter table tpcds.web_sales add foreign key (ws_bill_customer_sk) references tpcds.customer(c_customer_sk) not enforced not trusted; alter table tpcds.web_sales add foreign key (ws_promo_sk) references tpcds.promotion(p_promo_sk) not enforced not trusted; alter table tpcds.web_sales add foreign key (ws_sold_time_sk) references tpcds.time_dim(t_time_sk) not enforced not trusted; alter table tpcds.web_sales add foreign key (ws_bill_cdemo_sk) references tpcds.customer_demographics(cd_demo_sk) not enforced not trusted; alter table tpcds.web_sales add foreign key (ws_bill_hdemo_sk) references tpcds.household_demographics(hd_demo_sk) not enforced not trusted; alter table tpcds.web_sales add foreign key (ws_ship_mode_sk) references tpcds.ship_mode(sm_ship_mode_sk) not enforced not trusted; alter table tpcds.web_sales add foreign key (ws_warehouse_sk) references tpcds.warehouse(w_warehouse_sk) not enforced not trusted; alter table tpcds.web_sales add foreign key (ws_web_site_sk) references tpcds.web_site(web_site_sk) not enforced not trusted;

The query in Listing 17 and its associated access plan in Listing 18 are used to illustrate how these statistical RI constraints can be useful

##### Listing 17. Query on star schema that includes the WEB_SALES fact table

SELECT W_WAREHOUSE_NAME, W_WAREHOUSE_SQ_FT, W_CITY, W_COUNTY, W_STATE, W_COUNTRY, D_YEAR, AVG(WS_SALES_PRICE * WS_QUANTITY) FROM TPCDS.WEB_SALES, TPCDS.WAREHOUSE, TPCDS.DATE_DIM, TPCDS.CUSTOMER, TPCDS.CUSTOMER_ADDRESS WHERE WS_WAREHOUSE_SK = W_WAREHOUSE_SK AND WS_SOLD_DATE_SK = D_DATE_SK AND WS_BILL_CUSTOMER_SK = C_CUSTOMER_SK AND C_CURRENT_ADDR_SK = CA_ADDRESS_SK AND D_YEAR IN (1998,1999,2000,2001,2002) AND CA_CITY = 'Lakeside' AND CA_ZIP = '69532' AND GROUP BY W_WAREHOUSE_NAME, W_WAREHOUSE_SQ_FT, W_CITY, W_COUNTY, W_STATE, W_COUNTRY, D_YEAR

##### Listing 18. Access plan for query in Listing 17 before any statistical views are created

Rows RETURN ( 1) Cost I/O | 0.0513182 GRPBY ( 2) 5815.77 538.128 | 0.0513182 MDTQ ( 3) 5815.76 538.128 | 0.0256591 GRPBY ( 4) 5812.94 538.128 | 0.0256591 TBSCAN ( 5) 5812.93 538.128 | 0.0256591 SORT ( 6) 5812.87 538.128 | 0.0256591 ^NLJOIN ( 7) 5812.76 538.128 /------------+------------\ 0.0256591 1 ^NLJOIN FETCH ( 8) ( 19) 5803.63 9.12772 537.128 1 /---------+---------\ ...-+----\ 1.06421 0.024111 5 NLJOIN FETCH TABLE: TPCDS ( 9) ( 17) R_WAREHOUSE 5785.47 17.1562 Q6 535 2 /------------+-------------\ ...-+----\ 0.295203 3.60501 73049 BTQ FETCH TABLE: TPCDS ( 10) ( 15) R_DATE_DIM 5713.2 72.2721 Q5 526 9 | /----+-----\ 0.147601 3.60501 362707 NLJOIN IXSCAN DP-TABLE: TPCDS ( 11) ( 16) WEB_SALES 5710.24 64.3587 Q7 526 8 /-------+-------\ | 0.147601 1 362707 TBSCAN FETCH DP-INDEX: TPCDS ( 12) ( 13) WS_BILLCUST 5693.24 17.001 Q7 524 2 | /---+----\ 50000 1 50306 TABLE: TPCDS IXSCAN TABLE: TPCDS R_CUSTOMER_ADDRESS ( 14) CUSTOMER Q3 9.52322 Q4 1 | 50306 INDEX: TPCDS C_CURADDR Q4

The access plan in Listing 18 shows that the cardinality estimate is below 1 for most operations, which is a good indication that there is an error in the cardinality estimate. Prior to version 10.1, you could define a statistical view for each join. This would require a minimum of four statistical views to cover each pair-wise join. Covering the pair-wise joins only can remove a significant portion of the error from the cardinality estimate, but in many cases you would also require covering each combination of joins. This can add up to a large number of statistical views to create, which is not only time consuming to evaluate which ones are necessary, but a large number of statistical views can lead to increased query compilation time.

In version 10.1, if any form of RI constraints are defined between the fact table and its dimensions, then you only need to create a single statistical view that covers all the joins in the star. Listing 19 shows a statistical view that covers all the joins between WEB_SALES and its dimensions.

##### Listing 19. Single statistical view covering the WEB_SALES star

CREATE VIEW TPCDS.WS_GENERAL_SV AS (SELECT WS.*, D.*, I.*, SM.*, C.*, CA.*, P.*, T.*, CD.*, HD.*, IB.*, W.*, WES.* FROM TPCDS.WEB_SALES WS, TPCDS.DATE_DIM D, TPCDS.ITEM I, TPCDS.SHIP_MODE SM, TPCDS.CUSTOMER C, TPCDS.CUSTOMER_ADDRESS CA, TPCDS.PROMOTION P, TPCDS.TIME_DIM T, TPCDS.CUSTOMER_DEMOGRAPHICS CD, TPCDS.HOUSEHOLD_DEMOGRAPHICS HD, TPCDS.INCOME_BAND IB, TPCDS.WAREHOUSE W, TPCDS.WEB_SITE WES WHERE WS.WS_SOLD_DATE_SK = D.D_DATE_SK AND WS.WS_ITEM_SK = I.I_ITEM_SK AND WS.WS_SHIP_MODE_SK = SM.SM_SHIP_MODE_SK AND WS.WS_BILL_CUSTOMER_SK = C.C_CUSTOMER_SK AND C.C_CURRENT_ADDR_SK = CA.CA_ADDRESS_SK AND WS.WS_PROMO_SK = P.P_PROMO_SK AND WS.WS_SOLD_TIME_SK = T.T_TIME_SK AND WS.WS_BILL_CDEMO_SK = CD.CD_DEMO_SK AND WS.WS_BILL_HDEMO_SK = HD.HD_DEMO_SK AND HD.HD_INCOME_BAND_SK = IB.IB_INCOME_BAND_SK AND WS.WS_WAREHOUSE_SK = W.W_WAREHOUSE_SK AND WS.WS_WEB_SITE_SK = WES.WEB_SITE_SK); ALTER VIEW TPCDS.WS_GENERAL_SV ENABLE QUERY OPTIMIZATION; RUNSTATS ON VIEW TPCDS.WS_GENERAL_SV WITH DISTRIBUTION TABLESAMPLE SYSTEM(10);

The query in Listing 18 only covers a subset of the joins defined in the statistical view, but the statistical RI defined in Listing 16 allows the optimizer to use the statistics from this statistical view for any subset of joins covered by the view. You will notice the RUNSTATS command on the statistical view includes the TABLESAMPLE SYSTEM clause. If all the joins in the view are covered by any form of RI constraint, then RUNSTATS will allow system-level sampling on the fact (or child) table as described by the RI constraints. Listing 20 shows the access plan after the statistical view is created.

##### Listing 20. Access plan for query in Listing 17 after statistical view created

0.320232 ^NLJOIN ( 7) 5811.22 537.976 /-----------+------------\ 0.329753 0.971127 ^NLJOIN FETCH ( 8) ( 19) 5794.25 16.9719 536 1.97552 /----------+--------------\ +----\ 0.329753 1 73049 NLJOIN FETCH TABLE: TPCDS ( 9) ( 17) R_DATE_DIM 5785.12 9.12772 Q5 535 1 /------------+-------------\ +----\ 0.295203 1.11704 5 BTQ FETCH TABLE: TPCDS ( 10) ( 15) R_WAREHOUSE 5713.2 71.9233 Q6 526 9 | /----+-----\ 0.147601 1.11704 362707 NLJOIN IXSCAN DP-TABLE: TPCDS ( 11) ( 16) WEB_SALES 5710.24 64.3587 Q7 526 8 /-------+-------\ | 0.147601 1 362707 TBSCAN FETCH DP-INDEX: TPCDS ( 12) ( 13) WS_BILLCUST 5693.24 17.001 Q7 524 2 | /---+----\ 50000 1 50306 TABLE: TPCDS IXSCAN TABLE: TPCDS R_CUSTOMER_ADDRESS ( 14) CUSTOMER Q3 9.52322 Q4 1 | 50306 INDEX: TPCDS C_CURADDR Q4

You will notice the increase in some of the cardinality estimates in the access plan in Listing 20, but most of them are still below 1. The next section explains how to improve this. You will also notice that the join order changed slightly at the top of the access plan, so although the statistical view did not significantly reduce the errors in the cardinality estimates, it did have some influence on the access plan chosen.

## Collecting column group statistics on statistical views

DB2 version 10.1 also extended the functionality in the optimizer to make use of column group statistics defined on statistical views. Additional information sources are listed in Related topics, which includes links to articles to help you understand how to take advantage of column group statistics.

If you examine the query in Listing 17, there are 2 local equal predicates applied on the CUSTOMER_ADDRESS table, and an IN predicate with a values list on the DATE_DIM table. You can collect a column group statistic on the set of columns referenced by these predicates, as shown in Listing 21. Listing 22 shows the access plan after the column group statistics are collected.

##### Listing 21. RUNSTATS command to collect column group statistic on the statistical view

RUNSTATS ON VIEW TPCDS.WS_GENERAL_SV ON ALL COLUMNS AND COLUMNS ((D_YEAR,CA_CITY,CA_ZIP)) WITH DISTRIBUTION TABLESAMPLE SYSTEM(10);

##### Listing 22. Access plan after column group statistic collected

73.6448 ^NLJOIN ( 7) 5804.36 537 /----------+-----------\ 0.351539 209.492 ^NLJOIN FETCH ( 8) ( 19) 5794.26 16.7146 536 1.94134 /---------+--------------\ ...+----\ 0.351539 1 73049 NLJOIN FETCH TABLE: TPCDS ( 9) ( 17) R_DATE_DIM 5785.13 9.12772 Q5 535 1 /------------+-------------\ ...+----\ 0.295203 1.19084 5 BTQ FETCH TABLE: TPCDS ( 10) ( 15) R_WAREHOUSE 5713.2 71.9337 Q6 526 9 | /----+-----\ 0.147601 1.19084 362707 NLJOIN IXSCAN DP-TABLE: TPCDS ( 11) ( 16) WEB_SALES 5710.24 64.3587 Q7 526 8 /-------+-------\ | 0.147601 1 362707 TBSCAN FETCH DP-INDEX: TPCDS ( 12) ( 13) WS_BILLCUST 5693.24 17.001 Q7 524 2 | /---+----\ 50000 1 50306 TABLE: TPCDS IXSCAN TABLE: TPCDS R_CUSTOMER_ADDRESS ( 14) CUSTOMER Q3 9.52322 Q4 1 | 50306 INDEX: TPCDS C_CURADDR Q4

Notice in Listing 22 that the cardinality estimates below NLJOIN(7) are nearly identical to the estimates in Listing 20. Because the RUNSTATS is performed with sampling, you might see a slight variation in some of the cardinality estimates (as well as the statistics on the view). The main difference between the two access plans is at NLJOIN(7) and its inner input FETCH(19). Because the column group statistic covered all three columns referenced by the local predicates, the optimizer cannot make use of the statistic until all the predicates have been applied in the access plan operators. The DATE_DIM table is joined in last, so it is at that point that the statistical view column group statistic can be used to correct errors in the estimate.

To take better advantage of the column group statistics on statistical view, each combination of two or more local predicate columns, per table, should be collected. In this scenario, you will have two local predicates on CUSTOMER_ADDRESS and one on DATE_DIM, so you only need to collect one extra column group statistic covering the two local predicates on CUSTOMER_ADDRESS. The RUNSTATS command is shown in Listing 23 and the resulting access plan is shown in Listing 24.

##### Listing 23. RUNSTATS command to collect column group statistics on the statistical view

RUNSTATS ON VIEW TPCDS.WS_GENERAL_SV ON ALL COLUMNS AND COLUMNS ((D_YEAR,CA_CITY,CA_ZIP),(CA_CITY,CA_ZIP)) WITH DISTRIBUTION TABLESAMPLE SYSTEM(10);

##### Listing 24. Access plan after column group statistics collected

143.391 ^NLJOIN ( 7) 6411.26 662.715 /--------+---------\ 149.71 0.957793 ^HSJOIN FETCH ( 8) ( 20) 6027.83 16.8384 644.006 1.95779 /-------------+-------------\ ...-+----\ 149.71 5 73049 NLJOIN TBSCAN TABLE: TPCDS ( 9) ( 19) R_DATE_DIM 6017.12 9.322 Q5 643.006 1 /---------------+----------------\ | 0.295203 507.142 5 BTQ FETCH TABLE: TPCDS ( 10) ( 15) R_WAREHOUSE 5713.2 303.927 Q6 526 117.006 | /----+-----\ 0.147601 507.142 362707 NLJOIN RIDSCN DP-TABLE: TPCDS ( 11) ( 16) WEB_SALES 5710.24 122.472 Q7 526 8 /-------+-------\ | 0.147601 1 507.142 TBSCAN FETCH SORT ( 12) ( 13) ( 17) 5693.24 17.001 122.42 524 2 8 | /---+----\ | 50000 1 50306 507.142 TABLE: TPCDS IXSCAN TABLE: TPCDS IXSCAN R_CUSTOMER_ADDRESS ( 14) CUSTOMER ( 18) Q3 9.52322 Q4 103.093 1 8 | | 50306 362707 INDEX: TPCDS DP-INDEX: TPCDS C_CURADDR WS_BILLCUST Q4 Q7

You will see in Listing 24 that the cardinality estimate is further improved starting at NLJOIN(9) because you have column group statistics on the CUSTOMER_ADDRESS predicate columns. This can be used to reduce the error in the estimates at the joins with WEB_SALES, as well as the subsequent joins.

You will also notice that the join between CUSTOMER and CUSTOMER_ADDRESS did not change. This join is a snowflake in the star schema, with the CUSTOMER table as the root of the snowflake, joining to the WEB_SALES fact table. The statistical view on the WEB_SALES star does not consider the snowflakes. You will need to create a statistical view for each snowflake as well, with the root of the snowflake as the fact table for that view. As with the WEB_SALES star, RI constraints will be required if the snowflake consists of more than two joins and you want to only create a single statistical view for the snowflake.

Listing 25 shows the statistical view for the join between CUSTOMER and CUSTOMER_ADDRESS. As with the statistical view on WEB_SALES, a column group statistic on the CUSTOMER_ADDRESS predicate columns is collected to reduce the error in the cardinality estimates.

##### Listing 25. Statistical view on snowflake with CUSTOMER at root

CREATE VIEW TPCDS.CUSTOMER_SV AS (SELECT C.*, CA.* FROM TPCDS.CUSTOMER C, TPCDS.CUSTOMER_ADDRESS CA WHERE C.C_CURRENT_ADDR_SK = CA.CA_ADDRESS_SK ); ALTER VIEW TPCDS.CUSTOMER_SV ENABLE QUERY OPTIMIZATION; RUNSTATS ON VIEW TPCDS.CUSTOMER_SV ON ALL COLUMNS AND COLUMNS ((CA_CITY, CA_ZIP)) WITH DISTRIBUTION DEFAULT NUM_FREQVALUES 100 TABLESAMPLE SYSTEM(10);

##### Listing 26. Access plan after statistical view on snowflake with CUSTOMER at root is created

138.242 ^NLJOIN ( 7) 9294.65 1032.23 /--------+---------\ 154.291 0.895979 ^HSJOIN FETCH ( 8) ( 20) 8908.38 16.3731 1013.52 1.89598 /-----------+------------\ ...--+----\ 154.291 5 73049 NLJOIN TBSCAN TABLE: TPCDS ( 9) ( 19) R_DATE_DIM 8897.64 9.322 Q5 1012.52 1 /-------------+-------------\ | 42.1599 3.65967 5 BTQ FETCH TABLE: TPCDS S ( 10) ( 17) R_WAREHOUSE 5867.52 72.2798 Q6 633.082 9 | /----+-----\ 21.0799 3.65967 362707 NLJOIN IXSCAN DP-TABLE: TPCDS ( 11) ( 18) WEB_SALES 5862.88 64.3587 Q7 633.082 8 /-------+-------\ | 0.147601 142.817 362707 TBSCAN FETCH DP-INDEX: TPCDS ( 12) ( 13) WS_BILLCUST 5693.24 169.634 Q7 524 109.082 | /---+----\ 50000 142.817 50306 TABLE: TPCDS RIDSCN TABLE: TPCDS R_CUSTOMER_ADDRESS ( 14) CUSTOMER Q3 23.395 Q4 1 | 142.817 SORT ( 15) 23.3435 1 | 142.817 IXSCAN ( 16) 19.2859 1 | 50306 INDEX: TPCDS C_CURADDR Q4

In the section Improving cardinality for a query with multiple query blocks, you saw that the cardinality estimates in the portion of the access plan representing the query block WS in the query in Listing 12 could be improved. The query covers joins from the WEB_SALES star, so the statistical view collected in Listing 19 is used by the optimizer when compiling the query. Listing 27 shows its effect on the access plan.

##### Listing 27. Access plan for query in Listing 12 after statistical view on WEB_SALES is created

Rows RETURN ( 1) Cost I/O | 0.00723418 DTQ ( 2) 7267.07 1396.98 | 0.00361709 ^NLJOIN ( 3) 7264.49 1396.98 /--------+---------\ 0.272015 ... HSJOIN R_CUSTOMER_ADDRESS ( 4) Q1 7247.45 1394.98 /-------------------------+-------------------------\ 36212 5000 TBSCAN BTQ ( 5) ( 6) 1236.09 5542.09 297 1097.98 | | 36212 5000 DP-TABLE: TPCDS DTQ WEB_RETURNS ( 7) Q2 5133.49 1097.98 | 11798.4 ^HSJOIN ( 8) 4411.98 1097.98 /----------+-----------\ 11902.7 30 ^HSJOIN IXSCAN ( 9) ( 18) 4305.59 3.66352 1097.98 0 /----------+-----------\ | 12145.4 294 30 ^HSJOIN TBSCAN INDEX: TPCDS ( 10) ( 17) R_WSITE_PK 4141.57 57.1966 Q4 1093.98 4 /----------+-----------\ | 14229.1 7736.27 300 NLJOIN TBSCAN TABLE: TPCDS ( 11) ( 16) R_PROMOTION 2799.28 1159.05 Q6 797.981 296 /----------+-----------\ | 28.4059 500.921 9091 FETCH FETCH TABLE: TPCDS ( 12) ( 14) ITEM 20.1213 235.905 Q3 2 108.895 /---+----\ /----+-----\ 28.4059 73049 500.921 362707 IXSCAN TABLE: TPCDS IXSCAN DP-TABLE: TPCDS ( 13) R_DATE_DIM ( 15) WEB_SALES 11.6459 Q5 10.0069 Q7 1 1 | | 73049 362707 INDEX: TPCDS DP-INDEX: TPCDS R_D_DATE WS_SOLDDATE Q5 Q7

## Conclusion

The optimizer is dependent on accurate cardinality estimates to accurately compute the cost of each query access plan considered. You can leverage statistical views to provide the optimizer more information to more accurately model complex relationships in order to choose an optimal query access plan.

#### Downloadable resources

#### Related topics

- Refer to the developerWorks article "Understand column group statistics in DB2" (developerWorks, December 2006) to improve cardinality estimates in an access plan.
- Refer to "Further understand column group statistics in DB2" (developerWorks, December 2008) to learn how to leverage the extended use of multi-column statistics in DB2 9.5 to improve cardinality estimates.
- Refer to the growing list of Database-related best practice papers available on the DB2 for Linux®, UNIX®, and Windows™ Best Practices developerWorks website.
- See the current list of IBM DB2 for Linux, UNIX, and Windows Information Centers. Also, see "Accessing different versions of the DB2 Information Center" in the IBM DB2 Version 10.1 Information Center.