Improve DB2 query performance in a business intelligence environment

Specialized query tuning methods for data warehousing and BI

Running large queries efficiently is a top performance challenge in a business intelligence environment. Learn techniques for improving IBM® DB2® data server query performance in this environment. Walk through various methods, step-by-step, then experiment on your own system. Each method is applied to a single SQL statement, using the db2batch tool to measure performance.

Vincent Corvinelli, Advisory Software Developer, IBM

Vincent Corvinelli is an advisory software developer in the DB2® for Linux®, UNIX®, and Windows® (DB2) Optimizer group in IBM Canada. He has extensive experience in the diagnosis and resolution of customer problems.



Samir Kapoor, DB2 UDB Advanced Support Analyst, IBM

Samir Kapoor is an IBM Certified Advanced Technical Expert for DB2. Samir currently works with the DB2 UDB Advanced Support -- Down System Division (DSD) team and has in-depth knowledge in the engine area.



22 March 2007


Acknowledgements:

  • Xiaomei Wang
  • Wenbin Ma

Introduction

This article focuses on methods that enable large queries to perform efficiently in decision support systems (DSS). These are usually select-only queries accessing a considerable amount of data. Here are the specific methods we'll discuss:

  1. Establishing appropriate referential integrity constraints
  2. Using materialized query tables (MQTs) to replicate tables to other database partitions, enabling collocated joins on non-partitioning key columns
  3. Using multidimensional clustering (MDC)
  4. Using table partitioning (a new feature of DB2® 9)
  5. Using table partitioning in combination with multidimensional clustering
  6. Using an MQT to pre-compute the aggregated results

The examples in this article are specific to DB2 9 running on the Windows platform. However, the concepts and information provided are relevant to any platform. Our examples also use the DB2 Database Partitioning Feature (DPF) to divide the data across multiple physical and logical partitions, since most business intelligence (BI) environments use DPF.


Database layout and setup

This section describes the physical and logical layout of the database used to perform the tests on our system.

Star schema layout details

The following star-schema is used:

Listing 1. Star schema
                                  PRODUCT_DIM             DATE_DIM
                                            \            /
                                             \          /
                                              SALES_FACT
                                                  |
                                                  |
                                              STORE_DIM

where the tables are defined as follows:

TABLE NAME   TYPE             COLUMN NAME          DATA TYPE  COLUMN DESCRIPTION
SALES_FACT   FACT TABLE       DATE_ID              DATE       Date when product sold.
                              PRODUCT_ID           INT        Identifier of product purchased.
                              STORE_ID             INT        Identifier of store where product was purchased.
                              QUANTITY             INT        Number of products sold in this transaction.
                              PRICE                INT        Product purchase price. [For simplicity, an
                                                              integer field is used, but a decimal field
                                                              would be more relevant.]
                              TRANSACTION_DETAILS  CHAR(100)  Description/details about the transaction.
DATE_DIM     DIMENSION TABLE  DATE_ID NOT NULL     DATE       Unique identifier.
                              MONTH                INT        Month that the date record belongs to.
                              QUARTER              INT        Quarter (1st, 2nd, 3rd, or 4th) that the
                                                              date record belongs to.
                              YEAR                 INT        Year that the date record belongs to.
PRODUCT_DIM  DIMENSION TABLE  PRODUCT_ID NOT NULL  INT        Unique product identifier.
                              PRODUCT_DESC         CHAR(20)   Description of the product.
                              MODEL                CHAR(200)  The model of the product.
                              MAKE                 CHAR(50)   The make of the product.
STORE_DIM    DIMENSION TABLE  STORE_ID NOT NULL    INT        Unique store identifier.
                              LOCATION             CHAR(15)   Store location.
                              DISTRICT             CHAR(15)   District that the store belongs to.
                              REGION               CHAR(15)   Region that the store belongs to.

The fact table, SALES_FACT, contains overall sales information for the year 2006. It includes the date when the product was sold, the ID of the product, the ID of the store that sold the product, the quantity of the specific product sold and the price of the product. The column TRANSACTION_DETAILS is added to the fact table to generate more I/O when accessing data from the fact table.

The dimension table, DATE_DIM, contains unique dates along with corresponding month, quarter and year information for when the stores were open.

The dimension table, PRODUCT_DIM, contains the different products that the company sells. Each product has a unique product ID as well as a product description, model, and make.

The dimension table, STORE_DIM, contains the different store IDs along with their location, district, and region information.

Database partition information
DATABASE PARTITION GROUP NAME  DATABASE PARTITION NUMBERS
FACT_GROUP                     0,1,2,3
DATE_GROUP                     1
PRODUCT_GROUP                  2
STORE_GROUP                    3

Each table resides in its own partition group. The three dimension tables are small, so each resides on a single database partition; the fact table is divided across all four partitions.

Tablespace information
TABLESPACE NAME  DATABASE PARTITION GROUP  TABLE
FACT_SMS         FACT_GROUP                SALES_FACT
DATE_SMS         DATE_GROUP                DATE_DIM
PRODUCT_SMS      PRODUCT_GROUP             PRODUCT_DIM
STORE_SMS        STORE_GROUP               STORE_DIM

Each table resides in its own tablespace. A common alternative is to group the three dimension tables into the same tablespace.
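
For example, a minimal sketch of that alternative (the DIM_GROUP and DIM_SMS names are hypothetical and are not used elsewhere in this article):

-- A hypothetical single partition group and tablespace for all three dimension tables
CREATE DATABASE PARTITION GROUP "DIM_GROUP" ON DBPARTITIONNUMS (1);

CREATE REGULAR TABLESPACE DIM_SMS IN DATABASE PARTITION GROUP DIM_GROUP
    PAGESIZE 4096 MANAGED BY SYSTEM
    USING ('d:\database\dim_tbsp') ON DBPARTITIONNUMS (1)
    EXTENTSIZE 32
    PREFETCHSIZE AUTOMATIC
    BUFFERPOOL IBMDEFAULTBP;

-- Each dimension table would then specify IN "DIM_SMS" in its CREATE TABLE statement.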

Bufferpool information

The default bufferpool, IBMDEFAULTBP, consisting of 1,000 4K pages, was used in the tests, and all tablespaces shared it. In a typical BI environment, separate bufferpools are created.
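
If you want to experiment with dedicated bufferpools, the following is a minimal sketch (the bufferpool name FACT_BP is hypothetical; the tests in this article use only IBMDEFAULTBP):

-- Hypothetical: a dedicated bufferpool for the fact table's partition group
CREATE BUFFERPOOL FACT_BP DATABASE PARTITION GROUP FACT_GROUP SIZE 1000 PAGESIZE 4K;

-- Point the fact tablespace at the new bufferpool (page sizes must match)
ALTER TABLESPACE FACT_SMS BUFFERPOOL FACT_BP;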

Main query

The following query is used to test the various methods discussed in this article. It performs an outer join to compare the sales information for ten particular stores between the months of January and November.

Listing 2. Main Query [Query1.sql]
WITH TMP1 (MONTH_1,STORE,REGION,DISTRICT,AMOUNT_1) AS
( SELECT
     D.MONTH AS MONTH, 
     S.STORE_ID AS STORE_ID,
     S.REGION AS REGION,
     S.DISTRICT AS DISTRICT,
     SUM(F1.QUANTITY * F1.PRICE) AS AMOUNT

  FROM
     SKAPOOR.SALES_FACT F1,
     SKAPOOR.DATE_DIM D,
     SKAPOOR.PRODUCT_DIM P,
     SKAPOOR.STORE_DIM S
     
  WHERE
     P.MODEL LIKE '%model%' AND
     F1.DATE_ID=D.DATE_ID AND
     F1.PRODUCT_ID=P.PRODUCT_ID AND
     F1.STORE_ID=S.STORE_ID AND
     F1.DATE_ID BETWEEN '2006-01-01' AND '2006-01-31' AND
     F1.STORE_ID IN (29, 30, 42, 55, 67, 72, 82, 99, 123, 199) AND		
     D.MONTH = 1 

  GROUP BY
     S.STORE_ID,S.DISTRICT,S.REGION,D.MONTH) ,
     
TMP2 (MONTH_11,STORE,REGION,DISTRICT,AMOUNT_11) AS
( SELECT 
     D1.MONTH AS MONTH,
     S1.STORE_ID AS STORE_ID,
     S1.REGION AS REGION,
     S1.DISTRICT AS DISTRICT,
     SUM(F2.QUANTITY * F2.PRICE) AS AMOUNT

  FROM
     SKAPOOR.SALES_FACT F2,
     SKAPOOR.DATE_DIM D1,
     SKAPOOR.PRODUCT_DIM P1,
     SKAPOOR.STORE_DIM S1

  WHERE
     P1.MODEL LIKE '%model%' AND
     F2.DATE_ID=D1.DATE_ID AND
     F2.PRODUCT_ID=P1.PRODUCT_ID AND
     F2.STORE_ID=S1.STORE_ID AND
     F2.DATE_ID BETWEEN '2006-11-01' AND '2006-11-30' AND
     F2.STORE_ID IN (29, 30, 42, 55, 67, 72, 82, 99, 123, 199) AND		
     D1.MONTH=11

  GROUP BY
     S1.STORE_ID,S1.DISTRICT,S1.REGION,D1.MONTH) 

SELECT 
     A.*, 
     B.*
FROM
     TMP1 A LEFT OUTER JOIN TMP2 B ON
       (A.STORE=B.STORE AND A.REGION=B.REGION AND A.DISTRICT=B.DISTRICT)
ORDER BY A.AMOUNT_1 DESC, B.AMOUNT_11 DESC;

Environment Setup

The tests were performed using the following environment:

Listing 3. db2level
DB2 9 Enterprise Edition:

DB21085I  Instance "DB2" uses "32" bits and DB2 code release "SQL09010" with
level identifier "02010107".
Informational tokens are "DB2 9.1.0.356", "s060629", "NT32", and Fix Pack "0".
Product is installed at "C:\PROGRA~1\IBM\SQLLIB" with DB2 Copy Name "DB2COPY1".
Listing 4. Operating system
System: WIN32_NT SKAPOOR Service Pack 2 5.1 x86 Family 15, model 4, stepping 3
Listing 5. Hardware
CPU: total:2 online:2 Cores per socket:1 Threading degree per core:2
Physical Memory(MB): total:2551 free:1988 available:1949
Virtual  Memory(MB): total:4950 free:6575
Swap     Memory(MB): total:2399 free:4587
1 Physical disk Size 100GB

Space requirement

Up to 20 GB of disk space is required, for both data and log files, to recreate all the test cases described in this article. Close to 13 GB of that space is allotted for log files. Circular logging was used, with 100 primary logs allocated:

Listing 6. Database configuration for logs
Log file size (4KB)                         (LOGFILSIZ) = 8192
Number of primary log files                (LOGPRIMARY) = 100
Number of secondary log files               (LOGSECOND) = 150
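
If your database is not already configured this way, the values in Listing 6 can be set with a command along the lines of the following sketch (mirroring the db2_all usage in Listing 10):

db2_all update db cfg for DSS_DB using LOGFILSIZ 8192 LOGPRIMARY 100 LOGSECOND 150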

You can reduce the amount of disk space required for log files with some slight modifications to the scripts that populate the fact table, as discussed later in this article.


Setting up the database

The first step is to create a test database.

In our tests, four logical data partitions were created. In the etc\services file, make sure you have enough ports available to create the four data partitions. In our test environment, the file C:\WINDOWS\system32\drivers\etc\services contained the following for the instance "DB2":

Listing 7. Contents of services file
DB2_DB2		60000/tcp
DB2_DB2_1		60001/tcp
DB2_DB2_2		60002/tcp
DB2_DB2_END	60003/tcp
DB2c_DB2		50000/tcp

To add the additional database partitions to the instance, the following commands were issued from a DB2 command window:

Listing 8. Creating database partitions using the db2ncrt command
db2stop 
db2ncrt /n:1 /u:username,password /i:DB2 /m:machine /p:1
db2ncrt /n:2 /u:username,password /i:DB2 /m:machine /p:2
db2ncrt /n:3 /u:username,password /i:DB2 /m:machine /p:3

The username and password for the /u option, the machine name for the /m option, and the instance name for the /i option should be modified to fit your environment.

Creating the database

Create the database DSS_DB. The D: drive was used to store the database. Please adjust this to fit your environment.

Listing 9. Command to create the database
db2 create database dss_db on D:\;

The database and database manager are configured using the settings below. The db2_all tool is used to apply the database configuration and database manager configuration changes on all database partitions.

Listing 10. Statements to update database manager configuration
db2_all update dbm cfg \
     using cpuspeed 2.282997e-007 intra_parallel NO comm_bandwidth 100.000000

db2_all update db cfg for DSS_DB \
     using locklist 2450 dft_degree 1 maxlocks 60 \
           avg_appls 1 stmtheap 16384 dft_queryopt 5

Creating the database partition groups and tablespaces

Use the following statements to create the database partition groups and tablespaces. You can copy them into a file named STORAGE.ddl and execute them using:

db2 -tvf STORAGE.ddl -z storage.log
Listing 11. Statements to create the database partition groups and tablespaces
CONNECT TO DSS_DB;

--------------------------------------------------
-- DDL Statements for DATABASE PARTITION GROUPS --
--------------------------------------------------
 
CREATE DATABASE PARTITION GROUP "FACT_GROUP" ON DBPARTITIONNUMS 
		(0,
		 1,
		 2,
		 3);

CREATE DATABASE PARTITION GROUP "DATE_GROUP" ON DBPARTITIONNUMS 
		(1);

CREATE DATABASE PARTITION GROUP "PRODUCT_GROUP" ON DBPARTITIONNUMS 
		(2);

CREATE DATABASE PARTITION GROUP "STORE_GROUP" ON DBPARTITIONNUMS 
		(3);

COMMIT WORK;

------------------------------------
-- DDL Statements for TABLESPACES --
------------------------------------

CREATE REGULAR TABLESPACE FACT_SMS IN DATABASE PARTITION GROUP FACT_GROUP
    PAGESIZE 4096 MANAGED BY SYSTEM 
	 USING ('d:\database\fact_tbsp0') ON DBPARTITIONNUMS (0)
	 USING ('d:\database\fact_tbsp1') ON DBPARTITIONNUMS (1)
	 USING ('d:\database\fact_tbsp2') ON DBPARTITIONNUMS (2)
	 USING ('d:\database\fact_tbsp3') ON DBPARTITIONNUMS (3)
	 EXTENTSIZE 32
	 PREFETCHSIZE AUTOMATIC
	 BUFFERPOOL IBMDEFAULTBP
	 OVERHEAD 7.500000
	 TRANSFERRATE 0.060000 
	 NO FILE SYSTEM CACHING  
	 DROPPED TABLE RECOVERY ON;

CREATE REGULAR TABLESPACE DATE_SMS IN DATABASE PARTITION GROUP
DATE_GROUP PAGESIZE 4096 MANAGED BY SYSTEM 
	USING ('d:\database\date_group') ON DBPARTITIONNUMS (1)
	EXTENTSIZE 32
	 PREFETCHSIZE AUTOMATIC
	 BUFFERPOOL IBMDEFAULTBP
	 OVERHEAD 7.500000
	 TRANSFERRATE 0.060000 
	 NO FILE SYSTEM CACHING  
	 DROPPED TABLE RECOVERY ON;

CREATE REGULAR TABLESPACE PRODUCT_SMS IN DATABASE PARTITION GROUP
PRODUCT_GROUP PAGESIZE 4096 MANAGED BY SYSTEM 
	USING ('d:\database\product_group') ON DBPARTITIONNUMS (2)
	EXTENTSIZE 32
	 PREFETCHSIZE AUTOMATIC
	 BUFFERPOOL IBMDEFAULTBP
	 OVERHEAD 7.500000
	 TRANSFERRATE 0.060000 
	 NO FILE SYSTEM CACHING  
	 DROPPED TABLE RECOVERY ON;

CREATE REGULAR TABLESPACE STORE_SMS IN DATABASE PARTITION GROUP
STORE_GROUP PAGESIZE 4096 MANAGED BY SYSTEM 
	USING ('d:\database\store_group') ON DBPARTITIONNUMS (3)
	EXTENTSIZE 32
	 PREFETCHSIZE AUTOMATIC
	 BUFFERPOOL IBMDEFAULTBP
	 OVERHEAD 7.500000
	 TRANSFERRATE 0.060000 
	 NO FILE SYSTEM CACHING  
	 DROPPED TABLE RECOVERY ON;

COMMIT WORK;

-- Mimic tablespace

ALTER TABLESPACE SYSCATSPACE
      PREFETCHSIZE AUTOMATIC
      OVERHEAD 7.500000
      TRANSFERRATE 0.060000;


ALTER TABLESPACE TEMPSPACE1
      PREFETCHSIZE AUTOMATIC
      OVERHEAD 7.500000
      TRANSFERRATE 0.060000;


ALTER TABLESPACE USERSPACE1
      PREFETCHSIZE AUTOMATIC
      OVERHEAD 7.500000
      TRANSFERRATE 0.060000;

COMMIT WORK;

------------------------------------------------
-- Update the bufferpool to use 1000 4K pages --
------------------------------------------------

ALTER BUFFERPOOL IBMDEFAULTBP SIZE 1000;

COMMIT WORK;
CONNECT RESET;

Note: The tablespaces are defined with "NO FILE SYSTEM CACHING" to prevent file system caching from skewing the measured results of the various methods.

Using the db2batch tool to measure performance

The db2batch program is used to run the main query in Listing 2. To run the query using the db2batch command, save the query in a file, terminated with a semicolon, and use the following options so that the db2batch tool reports the timings:

Listing 12. Using db2batch to measure the performance of the query
db2batch -d <dbname> -f <input_file> -i <elapsed_time> -iso <isolation level>
-o p <perf_detail> o <optlevel> r <rows_out> -r <result_file>

Where <dbname> is the database name and <input_file> is the file containing the query, ending with a semicolon.

  • -iso <isolation level>:
    In our tests, the default isolation level is CS, but the db2batch tool uses isolation level RR by default. MQTs created using isolation level CS won't be considered when a query executes under isolation level RR. To bypass this problem, use the -iso option with isolation level CS in the db2batch command so that the MQT is chosen. Also, if your application uses the default CS isolation level, running db2batch without the -iso option causes it to use RR isolation, with possible locking contention.
  • -o options:
    • p <perf_detail>: Performance details. Return a snapshot for the database manager, the database, the application, and the statement (the latter is returned only if autocommit is off, and single statements, not blocks of statements, are being processed). Also return a snapshot for the buffer pools, table spaces, and FCM (an FCM snapshot is only available in a multi-database-partition environment). We used the most detailed output, p 5, but you could use a different level of performance output.
    • o <optlevel>: Query optimization level. (Optimization 5 is used in this article, which doesn't need to be explicitly specified here since it is the database's default optimization level, as described in Listing 10.)
    • r <rows_out>: Number of fetched rows to be sent to the output. In our case, we sent no rows, for example r 0.
  • -r <result_file>: Results file. In our case, results.txt is the output filename in which db2batch outputs the results.

In this article, we used the following:

db2batch -d DSS_DB -f QUERY1.SQL -i complete -iso CS -o p 5 o 5 r 0 -r <output filename>

Methods to improve query performance

In this section, let's step through various methods to improve the performance of the query described in Listing 2. Before discussing any of the methods, the base fact and dimension tables must be created.

STEP A: Once the tablespaces have been created, create the fact and dimension tables as follows. You can adjust the schema name from SKAPOOR to a schema that fits your environment. If you do so, ensure that the query in Listing 2 is updated to reflect the appropriate schema name. You can copy the statements below into a file, TEST1.ddl, and execute it using:

db2 -tvf TEST1.ddl -z test1.log
Listing 13. Contents of TEST1.ddl
CONNECT TO DSS_DB;

---------------------------------------------------
-- DDL Statements for table "SKAPOOR "."SALES_FACT"
---------------------------------------------------
 

CREATE TABLE "SKAPOOR "."SALES_FACT"  (
		  "DATE_ID" DATE , 
		  "PRODUCT_ID" INTEGER , 
		  "STORE_ID" INTEGER , 
		  "QUANTITY" INTEGER , 
		  "PRICE" INTEGER , 
		  "TRANSACTION_DETAILS" CHAR(100) )   
		 DISTRIBUTE BY HASH("DATE_ID")   
		   IN "FACT_SMS" ;

-------------------------------------------------
-- DDL Statements for table "SKAPOOR "."DATE_DIM"
-------------------------------------------------
 

CREATE TABLE "SKAPOOR "."DATE_DIM"  (
		  "DATE_ID" DATE NOT NULL , 
		  "MONTH" INTEGER , 
		  "QUARTER" INTEGER , 
		  "YEAR" INTEGER )   
		 IN "DATE_SMS" ; 


-- DDL Statements for primary key on Table "SKAPOOR "."DATE_DIM"
-- DATE_ID is the unique identifier
ALTER TABLE "SKAPOOR "."DATE_DIM" 
	ADD PRIMARY KEY
		("DATE_ID");



----------------------------------------------------
-- DDL Statements for table "SKAPOOR "."PRODUCT_DIM"
----------------------------------------------------
 

CREATE TABLE "SKAPOOR "."PRODUCT_DIM"  (
		  "PRODUCT_ID" INTEGER NOT NULL , 
		  "PRODUCT_DESC" CHAR(20) , 
		  "MODEL" CHAR(10) , 
		  "MAKE" CHAR(10) )   
		 IN "PRODUCT_SMS" ; 


-- DDL Statements for primary key on Table "SKAPOOR "."PRODUCT_DIM"
-- PRODUCT_ID is the unique identifier
ALTER TABLE "SKAPOOR "."PRODUCT_DIM" 
	ADD PRIMARY KEY
		("PRODUCT_ID");



--------------------------------------------------
-- DDL Statements for table "SKAPOOR "."STORE_DIM"
--------------------------------------------------
 

CREATE TABLE "SKAPOOR "."STORE_DIM"  (
		  "STORE_ID" INTEGER NOT NULL , 
		  "LOCATION" CHAR(15) , 
		  "DISTRICT" CHAR(15) , 
		  "REGION" CHAR(15) )   
		 IN "STORE_SMS" ; 


-- DDL Statements for primary key on Table "SKAPOOR "."STORE_DIM"
-- STORE_ID is the unique identifier
ALTER TABLE "SKAPOOR "."STORE_DIM" 
	ADD PRIMARY KEY
		("STORE_ID");


COMMIT WORK;

CONNECT RESET;

STEP B: Once you create the tables, insert data into the three dimension tables as follows and adjust the schema to fit your environment:

Listing 14. Populating the DATE_DIM table
db2 -td@ -vf date_insert.txt -z date_insert.log
Listing 15. Populating the PRODUCT_DIM table
db2 -td@ -vf product_insert.txt -z product_insert.log
Listing 16. Populating the STORE_DIM table
db2 -td@ -vf store_insert.txt -z store_insert.log

The contents of the three files are:

The DATE_DIM table is populated with values from all 365 days in 2006.

Listing 17. Contents of date_insert.txt
connect to dss_db@

begin atomic
  declare cnt INT default 0;
  declare dat DATE default '01/01/2006';
  declare yer INT default 2006;
  declare quart INT default 1;

while (cnt <= 364) do
    if (int(dat + cnt DAYS)/100) between 200601 and 200603 then
           set quart=1;
    elseif (int(dat + cnt DAYS)/100) between 200604 and 200606 then
           set quart=2;
    elseif (int(dat + cnt DAYS)/100) between 200607 and 200609 then
           set quart=3;
    elseif (int(dat + cnt DAYS)/100) between 200610 and 200612 then
           set quart=4;    
    end if;
    
    insert into SKAPOOR.DATE_DIM values (
	dat + cnt DAYS,
	(int(dat + cnt DAYS)/100) - 200600,
	quart,
	yer
    );
       
    set cnt=cnt+1;
end while;

end@

connect reset@

The PRODUCT_DIM table is populated with 60,000 products.

Listing 18. Contents of product_insert.txt
connect to dss_db@

drop sequence seq1@
drop sequence seq2@

create sequence seq1 as integer start with 1 increment by 1@
create sequence seq2 as integer start with 1 increment by 1@

begin atomic
   declare cnt INT default 1;

   while (cnt < 60001) do
	insert into SKAPOOR.PRODUCT_DIM values (
	    nextval for SEQ2,
	    'product desc' concat char(nextval for SEQ1),
            'model ' concat char(integer(rand()*1000)),
	    'maker ' concat char(integer(rand()*500))
	);
 	set cnt=cnt+1;
    	end while;
end@

drop sequence seq1@
drop sequence seq2@

connect reset@

The STORE_DIM table is populated with 201 stores.

Listing 19. Contents of store_insert.txt
connect to dss_db@

drop sequence seq2@

create sequence seq2 as integer start with 0 increment by 1@

begin atomic
    declare cnt INT default 1;

    while (cnt < 202) do
       insert into SKAPOOR.STORE_DIM values (
	    nextval for SEQ2,
	    'location' concat char(integer(rand()*500)),
	    'district' concat char(integer(rand()*10)),
	    'region' concat char(integer(rand()*5))
       );
       set cnt=cnt+1;

    end while;
end@

drop sequence seq2@

connect reset@

STEP C: Insert data into the SALES_FACT table. Adjust the schema to fit your environment. It took approximately 1.5 hours to insert data into the fact table in our test environment.

Listing 20. Populating the SALES_FACT table
db2 -td@ -vf sales_fact_insert.ddl -z sales_fact_insert.log
Listing 21. Contents of sales_fact_insert.ddl
connect to dss_db@

VALUES (CURRENT TIMESTAMP)@

begin atomic

   declare cnt INT default 1;
   declare cnt1 INT default 0;
   declare dat DATE default '01/01/2006';

   while (cnt <= 365) do    
    
    INSERT INTO SKAPOOR.SALES_FACT
    with v(DATE_ID, PRODUCT_ID, STORE_ID, QUANTITY, PRICE, TRANSACTION_DETAILS, U_ID) as
         (values(dat + CNT1 DAYS, 1, 1, 1, 1, '', 1) union all
          select DATE_ID, int(rand()*59999) + 1, int(rand()*200),  int(rand()*50) + 1,
          INT(RAND()*200 + 1), TRANSACTION_DETAILS, U_ID + 1
          from   v
          where  U_ID < 60000)
     select date_id, product_id, store_id, quantity, price, transaction_details from v;

     set cnt1 = cnt1 + 1;
     set cnt  = cnt + 1;
   end while;

end@

VALUES (CURRENT TIMESTAMP)@

connect reset@

Note: In Listing 21, the SALES_FACT table is populated in a single transaction, which will require a large amount of disk space for logging. To reduce the impact of logging, you can create a stored procedure and commit the insert in stages:

Listing 22. Alternative method to populate the SALES_FACT table
connect to dss_db@

VALUES (CURRENT TIMESTAMP)@

-- Create a procedure to populate the SALES_FACT table
-- committing the inserts in stages to reduce the impact
-- of logging

create procedure salesFactPopulate()
specific salesFactPopulate
language sql

begin

   declare cnt INT default 1;
   declare cnt1 INT default 0;
   declare dat DATE default '01/01/2006';

   while (cnt <= 365) do    
    
    INSERT INTO SKAPOOR.SALES_FACT
    with v(DATE_ID, PRODUCT_ID, STORE_ID, QUANTITY, PRICE, TRANSACTION_DETAILS, U_ID) as
    (
       values(dat + CNT1 DAYS, 1, 1, 1, 1, '', 1)
      union all
       select DATE_ID, int(rand()*59999) + 1, int(rand()*200),  int(rand()*50) + 1,
              INT(RAND()*200 + 1), TRANSACTION_DETAILS, U_ID + 1
         from v
        where U_ID < 60000
    )
    select date_id, product_id, store_id, quantity, price, TRANSACTION_DETAILS from v;

    commit work;

    set cnt1 = cnt1 + 1;
    set cnt=cnt+1;

   end while;

end@

-- populate the SALES_FACT table
call salesFactPopulate()@

VALUES (CURRENT TIMESTAMP)@

connect reset@

STEP D: To understand how the various methods affect the chosen query access plan, we need to explain the queries in order to view the access plans chosen by the DB2 query optimizer. To do this, use the Explain facility, which requires the EXPLAIN tables to exist. To create the EXPLAIN tables, perform the following steps:

  1. Change the directory to the location of your sqllib\misc directory.
    In our test environment, it is "C:\Program Files\IBM\SQLLIB\MISC".
  2. Execute db2 connect to dss_db.
  3. Execute db2 -tvf EXPLAIN.DDL.

METHOD 1: Define appropriate referential integrity constraints between the fact table and the three dimension tables

In DB2, primary key and foreign key constraints can be defined to allow the database manager to enforce the referential integrity of the data. Referential constraints such as foreign keys can also help improve query performance. For example, if the subexpression TMP1 in the query described in Listing 2 is modified to remove the local predicate on the PRODUCT_DIM table, the optimizer can eliminate the join between SALES_FACT and PRODUCT_DIM, provided a foreign key constraint is created on SALES_FACT.PRODUCT_ID. With the foreign key constraint in place, the join is lossless and can be removed from the query: all the data required from PRODUCT_DIM to satisfy the query is available from the SALES_FACT table, because only the primary key of PRODUCT_DIM is referenced in the join and no other PRODUCT_DIM columns are used.

In the star schema described in the section Star schema layout details, each DATE_ID, PRODUCT_ID, and STORE_ID that exists in the fact table must also exist in the corresponding dimension table. Each ID is unique in the dimension tables, enforced by the primary key constraints created for each dimension table. The fact table keeps the historical data (the quantitative measurements) for the specific dates on which products were sold. The following table describes the primary keys and foreign keys that should be created for this schema. Each unique ID in the dimensions has a corresponding foreign key constraint in the fact table.

TABLE        COLUMN      PK/FK  TARGET TABLE (COLUMN)
DATE_DIM     DATE_ID     PK     none
PRODUCT_DIM  PRODUCT_ID  PK     none
STORE_DIM    STORE_ID    PK     none
SALES_FACT   DATE_ID     FK     DATE_DIM (DATE_ID)
SALES_FACT   PRODUCT_ID  FK     PRODUCT_DIM (PRODUCT_ID)
SALES_FACT   STORE_ID    FK     STORE_DIM (STORE_ID)

STEP 1A: ALTER the fact table and create the proper FK relationships with the dimension tables. Refer to the above table to see the relationships between the fact and dimension tables. We also create an index on SALES_FACT columns (DATE_ID,STORE_ID) to properly compare against the MDC method described in Method 3, which uses a block index on (DATE_ID,STORE_ID).

Listing 23. Creating foreign key constraints and indexes in the SALES_FACT table
db2 -tvf alter_sales_fact.txt -z alter_sales_fact.log
Listing 24. Contents of the alter_sales_fact.txt file
CONNECT TO DSS_DB;
ALTER TABLE SKAPOOR.SALES_FACT ADD CONSTRAINT DATE_DIM_FK FOREIGN KEY
(DATE_ID) REFERENCES SKAPOOR.DATE_DIM;
ALTER TABLE SKAPOOR.SALES_FACT ADD CONSTRAINT STORE_DIM_FK FOREIGN KEY
(STORE_ID) REFERENCES SKAPOOR.STORE_DIM;
ALTER TABLE SKAPOOR.SALES_FACT ADD CONSTRAINT PRODUCT_DIM_FK FOREIGN KEY
(PRODUCT_ID) REFERENCES SKAPOOR.PRODUCT_DIM;
CREATE INDEX SKAPOOR.IND_DATE_STORE ON SKAPOOR.SALES_FACT (DATE_ID, STORE_ID);
CONNECT RESET;

STEP 1B: Collect statistics on all the tables:

The optimizer relies on statistics to properly cost alternative query execution plans (QEPs) and choose the optimal plan. Before proceeding, we'll need to collect statistics.

Listing 25. Collecting statistics on all tables
db2 -tvf runstats.ddl -z runstats.log
Listing 26. Contents of runstats.ddl
CONNECT TO DSS_DB;
RUNSTATS ON TABLE SKAPOOR.DATE_DIM WITH DISTRIBUTION AND INDEXES ALL;
RUNSTATS ON TABLE SKAPOOR.STORE_DIM WITH DISTRIBUTION AND INDEXES ALL;
RUNSTATS ON TABLE SKAPOOR.PRODUCT_DIM WITH DISTRIBUTION AND INDEXES ALL;
RUNSTATS ON TABLE SKAPOOR.SALES_FACT WITH DISTRIBUTION AND INDEXES ALL;
CONNECT RESET;

With the foreign keys now created, we can take a look at how the DB2 optimizer can use the referential integrity constraints to eliminate joins.

STEP 1C: Explain the query:

Listing 27. Query with lossless join
SELECT
		D.MONTH AS MONTH, 
		S.STORE_ID AS STORE_ID,
		S.DISTRICT AS DISTRICT,
		S.REGION AS REGION,
		SUM(F1.QUANTITY * F1.PRICE) AS AMOUNT

  FROM
		SKAPOOR.SALES_FACT F1,
		SKAPOOR.DATE_DIM D,
		SKAPOOR.PRODUCT_DIM P,
		SKAPOOR.STORE_DIM S
  WHERE
		F1.DATE_ID=D.DATE_ID AND
		F1.PRODUCT_ID=P.PRODUCT_ID AND
		F1.STORE_ID=S.STORE_ID AND
		F1.DATE_ID BETWEEN '2006-01-01' AND '2006-01-31' AND
		F1.STORE_ID IN (29, 30, 42, 55, 67, 72, 82, 99, 123, 199) AND		
		D.MONTH = 1 
  GROUP BY
		S.STORE_ID,S.DISTRICT,S.REGION,D.MONTH;

One method to explain the query is as follows:

db2 connect to dss_db
db2 set current explain mode explain
db2 -tvf JOIN_ELIM_QUERY.SQL -z JOIN_ELIM_QUERY.log
db2 set current explain mode no
db2 connect reset

Where the contents of JOIN_ELIM_QUERY.SQL include just the query in Listing 27, ending with a semicolon.

The query execution plan can be viewed using the db2exfmt tool as follows:

db2exfmt -d dss_db -g TIC -w -1 -n % -s % -# 0 -o join_elim.exfmt

The output is in join_elim.exfmt. For details about the db2exfmt tool, use the -h option.

Please open the JOIN_ELIM file in the Downloads section to see an access plan generated by the query optimizer where the join with PRODUCT_DIM is eliminated.

You can also look at the "Optimized Statement" section in the db2exfmt output and notice that the PRODUCT_DIM table was removed from the query.

Note: Insert, delete, and update operations may not perform as well when referential constraints such as foreign keys are used. If performance is critical for these operations, but the join elimination optimization can be useful in your queries, you can define the foreign key constraints as informational only. The exercises at the end of this method look into this option.

STEP 1D: Explain and run the full query.

To explain the query, follow the same steps as in STEP 1C:

db2 connect to dss_db
db2 set current explain mode explain
db2 -tvf QUERY1.SQL -z QUERY1.log
db2 set current explain mode no
db2 connect reset

Where the contents of QUERY1.SQL include just the query in Listing 2, ending with a semicolon.

The query execution plan can be viewed using the db2exfmt tool:

db2exfmt -d dss_db -g TIC -w -1 -n % -s % -# 0 -o test1.txt

Your query execution plan should be similar to the one represented in Test 1, found in the Downloads section.

To run the query, the db2batch tool is used to measure the performance. Before doing so, the DB2 instance should be recycled so that each method can be compared fairly, avoiding other factors affecting performance (for instance, later methods may benefit from a primed buffer pool, skewing the measured results).

Note: Our test system was idle, with no other activity running, when we ran these tests.

Stop db2 using db2stop force and restart it using db2start. Use db2batch to get the elapsed time information, as follows:

db2batch -d DSS_DB -f QUERY1.SQL -i complete -iso CS -o p 5 o 5 r 0 -r test1.results.txt

The file test1.results.txt will contain the elapsed time to compile and run the query similar to the following:

* Prepare Time is:       7.278206 seconds
* Execute Time is:     107.729436 seconds
* Fetch Time is:         0.000102 seconds
* Elapsed Time is:     115.007744 seconds (complete)

Exercises:

  1. In STEP 1A, foreign key constraints were created on the SALES_FACT table, but they could impact the performance of insert, update, and delete operations, since the database manager must enforce the referential integrity. If the performance of these operations is critical, informational constraints can be created to still take advantage of join elimination, under the assumption that referential integrity is enforced by other means. If this is not the case, defining informational constraints could lead to incorrect results.

    Informational constraints are defined in the same manner as referential constraints, but with the NOT ENFORCED keywords added at the end, for example:

    ALTER TABLE SKAPOOR.SALES_FACT ADD CONSTRAINT DATE_DIM_FK FOREIGN KEY (DATE_ID) REFERENCES DATE_DIM NOT ENFORCED;

    To complete this exercise, drop the foreign key constraints created on the SALES_FACT table and repeat steps 1A-1D using informational constraints instead.


METHOD 2: Replicated Materialized Query Tables on dimension tables

This test involves the same query and same tables as in Method 1, but this method further creates replicated MQTs on the dimension tables.

In Method 1, the dimension tables reside on different partitions, and the data has to be shipped between partitions at query time. MQTs can be used to replicate the dimension tables to the other partitions, enabling collocated joins that avoid shipping the data between partitions and potentially improving query execution performance.

STEP 2A: Create the replicated MQTs as follows:

db2 -tvf replicated.ddl
Listing 28. Contents of replicated.ddl file
connect to dss_db;

drop table skapoor.store_dim_rep;
drop table skapoor.product_dim_rep;
drop table skapoor.date_dim_rep;

create table skapoor.store_dim_rep as (select * from skapoor.store_dim)
data initially deferred refresh deferred in FACT_SMS replicated;
create table skapoor.product_dim_rep as (select * from skapoor.product_dim)
data initially deferred refresh deferred in FACT_SMS replicated;
create table skapoor.date_dim_rep as (select * from skapoor.date_dim)
data initially deferred refresh deferred in FACT_SMS replicated;

refresh table skapoor.store_dim_rep;
refresh table skapoor.product_dim_rep;
refresh table skapoor.date_dim_rep;

create index skapoor.store_dim_id_rep on skapoor.store_dim_rep (store_id);
create index skapoor.product_dim_id_rep on skapoor.product_dim_rep (product_id);
create index skapoor.date_dim_id_rep on skapoor.date_dim_rep (date_id);

runstats on table skapoor.store_dim_rep with distribution and indexes all;
runstats on table skapoor.product_dim_rep with distribution and indexes all;
runstats on table skapoor.date_dim_rep with distribution and indexes all;

connect reset;

To ensure collocation is achievable, the replicated dimensions must reside in the same database partition group as the fact table. For simplicity, we used the same tablespace as the fact table, but they could be different as long as the tablespace shares the same database partition group. Furthermore, to make the replicated tables comparable to the underlying tables when the optimizer costs different access alternatives, the replicated tables should have the same indexes defined as the underlying tables, and the same statistics should be collected as well. Unique indexes cannot be created on an MQT, so regular indexes are created on the primary key columns of the underlying tables.

Replicating the dimension tables results in an extra copy of each table. DB2 9 added row compression to yield storage savings, and the extra copies can be compressed to reduce their overhead. If you compress the underlying dimension tables, we recommend compressing the replicated MQTs as well. Otherwise, the optimizer could decide to perform a non-collocated join with the underlying dimension tables, since they are compressed and thus smaller than the replicated MQTs.
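
As a rough sketch of how that might look for one dimension table and its replicated MQT (an offline REORG is one way to build the compression dictionary; repeat for the other dimensions):

-- Enable row compression on a dimension table and its replicated MQT
ALTER TABLE SKAPOOR.PRODUCT_DIM COMPRESS YES;
ALTER TABLE SKAPOOR.PRODUCT_DIM_REP COMPRESS YES;

-- A classic (offline) REORG builds the compression dictionary and compresses existing rows
REORG TABLE SKAPOOR.PRODUCT_DIM;
REORG TABLE SKAPOOR.PRODUCT_DIM_REP;

-- Recollect statistics so the optimizer sees the new sizes
RUNSTATS ON TABLE SKAPOOR.PRODUCT_DIM WITH DISTRIBUTION AND INDEXES ALL;
RUNSTATS ON TABLE SKAPOOR.PRODUCT_DIM_REP WITH DISTRIBUTION AND INDEXES ALL;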

STEP 2B: Update the database configuration for database DSS_DB to set dft_refresh_age to "ANY" in order for the replicated MQTs to be selected by the optimizer:

Listing 29. Update the database configuration
db2_all db2 update db cfg for DSS_DB using dft_refresh_age any
db2 terminate

STEP 2C: Generate the db2exfmt output for the main query as in STEP 1C in Method 1. Examine the access plan and verify that the replicated MQTs are accessed (that is, date_dim_rep, product_dim_rep, and store_dim_rep are chosen). Open Test 2 in the Downloads section to see an example of how this access plan should look.

In the above access plan, table queue (TQ) operators do not exist between the joins as in Method 1 since the optimizer chose to use the replicated dimension tables, which enable the joins to be collocated.

STEP 2D: Once you have confirmed that the MQTs are being accessed in the access plan, proceed to measure the performance using the db2batch tool as in STEP 1D in Method 1. Ensure the db2 instance is recycled prior to running db2batch. Then, record your results.

Note: For this method, the database configuration parameter DFT_REFRESH_AGE is set to ANY on all database partitions. If you want to run the test in Method 1 again, you will need to set the DFT_REFRESH_AGE database configuration parameter back to "0". Otherwise, the replicated MQTs will be used instead of the base dimension tables.
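
For example, mirroring Listing 29:

db2_all db2 update db cfg for DSS_DB using dft_refresh_age 0
db2 terminate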

Exercises

  1. Use row compression to compress the base dimension tables STORE_DIM, PRODUCT_DIM, and DATE_DIM. You will need to recollect statistics on all three dimension tables. Once statistics are recollected, repeat steps 2C-2D.

  2. If the optimizer didn't choose to access the replicated MQT in the first exercise, repeat it and compress the replicated MQTs as well.


METHOD 3: MDC fact table with replicated MQTs on dimensions

This test is similar to Method 2, but replaces the SALES_FACT table with an MDC fact table. MDC provides an automatic method for clustering data in tables along multiple dimensions and can significantly improve query performance if appropriate dimension columns and extentsize are chosen.

STEP 3A: Calculate the extentsize.

An EXTENTSIZE of 12 is chosen for the tablespace and is calculated as follows:

  1. Using the guidelines from the Info Center for help with MDC table dimensions, the (date_id, store_id) columns were selected as the dimensions for the MDC table.

    The following query was used to compute the number of unique combinations of (date_id, store_id) in the sales_fact table:

    Listing 30. Query to compute number of unique combinations of (date_id, store_id)
    WITH TMP (DATE_ID, STORE_ID) AS 
      (SELECT DISTINCT DATE_ID, STORE_ID FROM SALES_FACT)
    SELECT COUNT(*) AS CELL_COUNT FROM TMP;
    
    CELL_COUNT
    -----------
          73097
  2. The query below computes the average number of rows per cell (RPC), the minimum rows per cell, and maximum rows per cell.

    Listing 31. Determining average number of rows
    WITH CELL_TABLE(DATE_ID,STORE_ID,RPC) AS 
    (
       SELECT DISTINCT DATE_ID,STORE_ID, COUNT(*) 
         FROM SALES_FACT 
       GROUP BY DATE_ID,STORE_ID
    )
    SELECT 
    	AVG(RPC) AS RPC, MIN(RPC) AS MINRPC, MAX(RPC) AS MAXRPC 
    FROM CELL_TABLE;
    
    RPC         MINRPC      MAXRPC
    ----------- ----------- -----------
            298           1         380
    
      1 record(s) selected.
  3. To calculate space per cell, we used the following formula from the administration guide for DB2 9, in the Space requirements for user table data section.

    The number of 4KB pages for each user table in the database can be estimated as follows. First, determine the average row length. In our case, the columns are of fixed data types, and we can add each column's length to find the row length. The column lengths can be retrieved using the DESCRIBE statement below:

    Listing 32. DESCRIBE statement
    DB2 DESCRIBE SELECT * FROM SALES_FACT
    
    SQLDA Information
    
     sqldaid : SQLDA     sqldabc: 896  sqln: 20  sqld: 6
    
     Column Information
    
     sqltype               sqllen  sqlname.data                    sqlname.length
     --------------------  ------  ------------------------------  --------------
     385   DATE                10  DATE_ID                                      7
     497   INTEGER              4  PRODUCT_ID                                  10
     497   INTEGER              4  STORE_ID                                     8
     497   INTEGER              4  QUANTITY                                     8
     497   INTEGER              4  PRICE                                        5
     453   CHARACTER          100  TRANSACTION_DETAILS                         19

    The "sqllen" column in the results from the DESCRIBE statement denotes the length of each column.

    The formula to compute the average number of records per page is:

    RECORDS_PER_PAGE = ROUND DOWN( 4028 / (AVG ROW SIZE + 10)).

    In our case, AVG ROW SIZE = 126 bytes (the sum of the column lengths: 10+4+4+4+4+100).

    Hence, RECORDS_PER_PAGE = ROUND DOWN (4028 / (126+10)) = 29.

    The extra 10 bytes in the RECORDS_PER_PAGE formula are for overhead.

    The number of 4K pages needed to store an average of 298 records, (RPC as in Listing 31), can be calculated as follows:

    NUMBER_OF_PAGES = (NUMBER_OF_RECORDS / RECORDS_PER_PAGE) * 1.1, where NUMBER_OF_RECORDS = RPC = 298

    NUMBER_OF_PAGES = ( 298 records / 29 records per page ) * 1.1 = 11.3 ~ 12 4K pages

    Hence, the EXTENTSIZE is 12.
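
Putting the steps above together, the suggested extent size can also be computed with a single query (a sketch; the 126-byte average row size and 10-byte overhead come from the calculation above):

WITH CELL_TABLE(DATE_ID, STORE_ID, RPC) AS
(
   SELECT DISTINCT DATE_ID, STORE_ID, COUNT(*)
     FROM SALES_FACT
   GROUP BY DATE_ID, STORE_ID
)
SELECT CEILING( AVG(DOUBLE(RPC)) / FLOOR(4028.0 / (126 + 10)) * 1.1 ) AS EXTENTSIZE
FROM CELL_TABLE;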

STEP 3B: Create the MDC tablespace with extentsize of 12:

Listing 33. Creating MDC tablespace
db2 -tvf mdc_tablespace.ddl -z mdc_tablespace.log
Listing 34. Contents of mdc_tablespace.ddl
CREATE REGULAR TABLESPACE FACT_SMS_MDC_EX IN DATABASE PARTITION GROUP
FACT_GROUP PAGESIZE 4096 MANAGED BY SYSTEM
USING ('d:\database\fact_tbsp_mdc_EX120') ON DBPARTITIONNUMS (0)
USING ('d:\database\fact_tbsp_mdc_EX121') ON DBPARTITIONNUMS (1)
USING ('d:\database\fact_tbsp_mdc_EX122') ON DBPARTITIONNUMS (2)
USING ('d:\database\fact_tbsp_mdc_EX123') ON DBPARTITIONNUMS (3)
	 EXTENTSIZE 12
	 PREFETCHSIZE 24
	 BUFFERPOOL IBMDEFAULTBP
	 OVERHEAD 7.500000
	 TRANSFERRATE 0.060000 
	 NO FILE SYSTEM CACHING  
	 DROPPED TABLE RECOVERY ON;

STEP 3C: Create the MDC table.

Listing 35. Creating MDC table
db2 -tvf sales_fact_mdc.ddl -z sales_fact_mdc.log
Listing 36. Contents of sales_fact_mdc.ddl file
CONNECT TO DSS_DB;

---------------------------------------------------------
-- DDL Statements for table "SKAPOOR "."SALES_FACT_MDC_1"
---------------------------------------------------------
 

CREATE TABLE "SKAPOOR "."SALES_FACT_MDC_1"  (
		  "DATE_ID" DATE , 
		  "PRODUCT_ID" INTEGER , 
		  "STORE_ID" INTEGER , 
		  "QUANTITY" INTEGER , 
		  "PRICE" INTEGER , 
		  "RESERVE" CHAR(100) )   
		 DISTRIBUTE BY HASH("DATE_ID")   
		   IN "FACT_SMS_MDC_EX"  
		 ORGANIZE BY ( 
		  ( "DATE_ID" ) , 
		  ( "STORE_ID" ) ) 
		 ; 

COMMIT WORK;
CONNECT RESET;

Note: The block indexes for the MDC table are automatically created on the fact table's dimension columns (date_id, store_id).
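
One way to verify this is to query the system catalog (a sketch; INDEXTYPE values such as 'BLOK' and 'DIM' denote the composite and dimension block indexes):

-- List the indexes DB2 created automatically for the MDC table
SELECT INDNAME, INDEXTYPE, COLNAMES
  FROM SYSCAT.INDEXES
 WHERE TABSCHEMA = 'SKAPOOR' AND TABNAME = 'SALES_FACT_MDC_1';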

STEP 3D: Insert data into the MDC table. It took approximately four hours to insert the data into the MDC table in our test environment.

Listing 37. Insertion of data into the MDC table
db2 -tvf sales_fact_mdc_insert_alter.ddl -z sales_fact_mdc_insert.log
Listing 38. Contents of sales_fact_mdc_insert_alter.ddl
CONNECT TO DSS_DB;

VALUES(CURRENT TIMESTAMP);

-----------------------------------
-- SET OPTLEVEL 0 TO FAVOUR INDEX ACCESS TO IMPROVE PERFORMANCE OF INSERT.
SET CURRENT QUERY OPTIMIZATION 0;
-----------------------------------

-- INSERTING THE DATA IN THE ORDER OF THE MDC COLUMNS IMPROVES 
-- THE PERFORMANCE OF THE INSERT.
INSERT INTO SKAPOOR.SALES_FACT_MDC_1 SELECT *
FROM SKAPOOR.SALES_FACT ORDER BY DATE_ID,STORE_ID;

ALTER TABLE SKAPOOR.SALES_FACT_MDC_1
ADD CONSTRAINT DATE_DIM_FK FOREIGN KEY (DATE_ID) REFERENCES SKAPOOR.DATE_DIM;
ALTER TABLE SKAPOOR.SALES_FACT_MDC_1
ADD CONSTRAINT STORE_DIM_FK FOREIGN KEY (STORE_ID) REFERENCES SKAPOOR.STORE_DIM;
ALTER TABLE SKAPOOR.SALES_FACT_MDC_1
ADD CONSTRAINT PRODUCT_DIM_FK FOREIGN KEY (PRODUCT_ID)
REFERENCES SKAPOOR.PRODUCT_DIM;

VALUES(CURRENT TIMESTAMP);

RUNSTATS ON TABLE SKAPOOR.SALES_FACT_MDC_1 WITH DISTRIBUTION AND INDEXES ALL;

CONNECT RESET;

STEP 3E: The query in Listing 2 was modified to change the table name from "SALES_FACT" to "SALES_FACT_MDC_1", in order to test the benefits of MDC. Below, Listing 39 describes the new query. Compile the query in explain mode and generate the db2exfmt output for the main query as in Method 1 STEP 1C. Verify that the access plan utilizes the MDC indexes and looks similar to Test 3, found in the Downloads section.

Listing 39. MDC Query
WITH TMP1 (MONTH_1,STORE,REGION,DISTRICT,AMOUNT_1) AS
( SELECT
		D.MONTH AS MONTH, 
		S.STORE_ID AS STORE_ID,
		S.REGION AS REGION,
		S.DISTRICT AS DISTRICT,
		SUM(F1.QUANTITY * F1.PRICE) AS AMOUNT

  FROM
		SKAPOOR.SALES_FACT_MDC_1 F1,
		SKAPOOR.DATE_DIM D,
		SKAPOOR.PRODUCT_DIM P,
		SKAPOOR.STORE_DIM S
  WHERE
		P.MODEL LIKE '%model%' AND
		F1.DATE_ID=D.DATE_ID AND
		F1.PRODUCT_ID=P.PRODUCT_ID AND
		F1.STORE_ID=S.STORE_ID AND
		F1.DATE_ID BETWEEN '2006-01-01' AND '2006-01-31' AND
		F1.STORE_ID IN (29, 30, 42, 55, 67, 72, 82, 99, 123, 199) AND
		D.MONTH = 1 
  GROUP BY
		S.STORE_ID,S.DISTRICT,S.REGION,D.MONTH) ,
TMP2 (MONTH_11,STORE,REGION,DISTRICT,AMOUNT_11) AS
(	SELECT 

		D1.MONTH AS MONTH,
		S1.STORE_ID AS STORE_ID,
		S1.REGION AS REGION,
		S1.DISTRICT AS DISTRICT,
		SUM(F2.QUANTITY * F2.PRICE) AS AMOUNT
		
	FROM
		SKAPOOR.SALES_FACT_MDC_1 F2,
		SKAPOOR.DATE_DIM D1,
		SKAPOOR.PRODUCT_DIM P1,
		SKAPOOR.STORE_DIM S1
	WHERE
		P1.MODEL LIKE '%model%' AND
		F2.DATE_ID=D1.DATE_ID AND
		F2.PRODUCT_ID=P1.PRODUCT_ID AND
		F2.STORE_ID=S1.STORE_ID AND
		F2.DATE_ID BETWEEN '2006-11-01' AND '2006-11-30' AND
		F2.STORE_ID IN (29, 30, 42, 55, 67, 72, 82, 99, 123, 199) AND
		D1.MONTH=11
	GROUP BY
		S1.STORE_ID,S1.DISTRICT,S1.REGION,D1.MONTH) 
SELECT 
	A.*, 
	B.*
FROM
	TMP1 A LEFT OUTER JOIN TMP2 B ON
          (A.STORE=B.STORE AND A.REGION=B.REGION AND A.DISTRICT=B.DISTRICT)
ORDER BY A.AMOUNT_1 DESC, B.AMOUNT_11 DESC;

STEP 3F: Recycle the instance and use the db2batch tool to measure the performance as in Method 1, STEP 1D.

Note: The query in the QUERY1.SQL file was changed to reflect the query in Listing 39.

Record your results.


METHOD 4: Table partitioning with replicated MQTs on dimensions

This test is similar to Method 2, but replaces the SALES_FACT table with a partitioned fact table. Table partitioning is a new feature in DB2 9. It is a data organization scheme in which table data is divided across multiple storage objects, called data partitions or ranges, according to values in one or more table columns. Each data partition is a separate physical entity that can reside in a different tablespace, in the same tablespace as other partitions, or a combination of both. This scheme is beneficial for very large tables in a BI environment, offering simplified roll-in and roll-out of data and efficient query execution by eliminating scans of partitions that don't need to be accessed, based on the predicates applied.

STEP 4A: Create the partitioned table.

The first step is to determine an appropriate range on which to partition. A date is a common partitioning range, so we partition the table on the DATE_ID column of SALES_FACT. The Info Center provides further details about defining ranges on partitioned tables. Since the SALES_FACT table consists of transactions for the full year, and our query compares sales within a one-month time frame, each range consists of data from a single month.

To illustrate the simplified roll-in of data with partitioned tables, we start with a partitioned table that contains a single range for the month of January, and attach partitions representing each subsequent month:

Listing 40. Create the partitioned SALES_FACT table
db2 -tvf tablepart.ddl -z tablepart.log
Listing 41. Contents of tablepart.ddl
CONNECT TO DSS_DB;

CREATE REGULAR TABLESPACE FACT_TPART_SMS IN DATABASE PARTITION GROUP FACT_GROUP
PAGESIZE 4096 MANAGED BY SYSTEM 
	 USING ('d:\database\fact_tpart_tbsp0') ON DBPARTITIONNUMS (0)
	 USING ('d:\database\fact_tpart_tbsp1') ON DBPARTITIONNUMS (1)
	 USING ('d:\database\fact_tpart_tbsp2') ON DBPARTITIONNUMS (2)
	 USING ('d:\database\fact_tpart_tbsp3') ON DBPARTITIONNUMS (3)
	 EXTENTSIZE 32
	 PREFETCHSIZE AUTOMATIC
	 BUFFERPOOL IBMDEFAULTBP
	 OVERHEAD 7.500000
	 TRANSFERRATE 0.060000 
	 NO FILE SYSTEM CACHING;

-- CREATE THE SALES_FACT TABLE PARTITIONED ON DATE_ID
-- WITH A SINGLE PARTITION TO START WITH, CONTAINING
-- ALL SALES FROM JANUARY
CREATE TABLE "SKAPOOR "."SALES_FACT_TPART"  (
		  "DATE_ID" DATE , 
		  "PRODUCT_ID" INTEGER , 
		  "STORE_ID" INTEGER , 
		  "QUANTITY" INTEGER , 
		  "PRICE" INTEGER , 
		  "TRANSACTION_DETAILS" CHAR(100) )
		 DISTRIBUTE BY HASH("DATE_ID")   
                   IN "FACT_TPART_SMS"
                 PARTITION BY ("DATE_ID")
                    (PART Jan STARTING ('1/1/2006') ENDING ('1/31/2006'))
;

VALUES (CURRENT TIMESTAMP);

-- POPULATE THE SALES FROM JAN
INSERT INTO "SKAPOOR"."SALES_FACT_TPART"
SELECT * FROM "SKAPOOR"."SALES_FACT"
WHERE DATE_ID BETWEEN '1/1/2006' AND '1/31/2006';

commit work;

VALUES (CURRENT TIMESTAMP);

-- CREATE TABLES FOR SALES FROM EACH MONTH
-- WHICH WILL THEN BE ATTACHED TO SALES_FACT_TPART TABLE
CREATE TABLE "SKAPOOR"."SALES_FEB" (
		  "DATE_ID" DATE , 
		  "PRODUCT_ID" INTEGER , 
		  "STORE_ID" INTEGER , 
		  "QUANTITY" INTEGER , 
		  "PRICE" INTEGER , 
		  "TRANSACTION_DETAILS" CHAR(100) )
		 DISTRIBUTE BY HASH("DATE_ID")
                   IN "FACT_TPART_SMS";
CREATE TABLE "SKAPOOR"."SALES_MAR" (
		  "DATE_ID" DATE , 
		  "PRODUCT_ID" INTEGER , 
		  "STORE_ID" INTEGER , 
		  "QUANTITY" INTEGER , 
		  "PRICE" INTEGER , 
		  "TRANSACTION_DETAILS" CHAR(100) )
		 DISTRIBUTE BY HASH("DATE_ID")
                   IN "FACT_TPART_SMS";
CREATE TABLE "SKAPOOR"."SALES_APR" (
		  "DATE_ID" DATE , 
		  "PRODUCT_ID" INTEGER , 
		  "STORE_ID" INTEGER , 
		  "QUANTITY" INTEGER , 
		  "PRICE" INTEGER , 
		  "TRANSACTION_DETAILS" CHAR(100) )
		 DISTRIBUTE BY HASH("DATE_ID")
                   IN "FACT_TPART_SMS";
CREATE TABLE "SKAPOOR"."SALES_MAY" (
		  "DATE_ID" DATE , 
		  "PRODUCT_ID" INTEGER , 
		  "STORE_ID" INTEGER , 
		  "QUANTITY" INTEGER , 
		  "PRICE" INTEGER , 
		  "TRANSACTION_DETAILS" CHAR(100) )
		 DISTRIBUTE BY HASH("DATE_ID")
                   IN "FACT_TPART_SMS";
CREATE TABLE "SKAPOOR"."SALES_JUN" (
		  "DATE_ID" DATE , 
		  "PRODUCT_ID" INTEGER , 
		  "STORE_ID" INTEGER , 
		  "QUANTITY" INTEGER , 
		  "PRICE" INTEGER , 
		  "TRANSACTION_DETAILS" CHAR(100) )
		 DISTRIBUTE BY HASH("DATE_ID")
                   IN "FACT_TPART_SMS";
CREATE TABLE "SKAPOOR"."SALES_JUL" (
		  "DATE_ID" DATE , 
		  "PRODUCT_ID" INTEGER , 
		  "STORE_ID" INTEGER , 
		  "QUANTITY" INTEGER , 
		  "PRICE" INTEGER , 
		  "TRANSACTION_DETAILS" CHAR(100) )
		 DISTRIBUTE BY HASH("DATE_ID")
                   IN "FACT_TPART_SMS";
CREATE TABLE "SKAPOOR"."SALES_AUG" (
		  "DATE_ID" DATE , 
		  "PRODUCT_ID" INTEGER , 
		  "STORE_ID" INTEGER , 
		  "QUANTITY" INTEGER , 
		  "PRICE" INTEGER , 
		  "TRANSACTION_DETAILS" CHAR(100) )
		 DISTRIBUTE BY HASH("DATE_ID")
                   IN "FACT_TPART_SMS";
CREATE TABLE "SKAPOOR"."SALES_SEP" (
		  "DATE_ID" DATE , 
		  "PRODUCT_ID" INTEGER , 
		  "STORE_ID" INTEGER , 
		  "QUANTITY" INTEGER , 
		  "PRICE" INTEGER , 
		  "TRANSACTION_DETAILS" CHAR(100) )
		 DISTRIBUTE BY HASH("DATE_ID")
                   IN "FACT_TPART_SMS";
CREATE TABLE "SKAPOOR"."SALES_OCT" (
		  "DATE_ID" DATE , 
		  "PRODUCT_ID" INTEGER , 
		  "STORE_ID" INTEGER , 
		  "QUANTITY" INTEGER , 
		  "PRICE" INTEGER , 
		  "TRANSACTION_DETAILS" CHAR(100) )
		 DISTRIBUTE BY HASH("DATE_ID")
                   IN "FACT_TPART_SMS";
CREATE TABLE "SKAPOOR"."SALES_NOV" (
		  "DATE_ID" DATE , 
		  "PRODUCT_ID" INTEGER , 
		  "STORE_ID" INTEGER , 
		  "QUANTITY" INTEGER , 
		  "PRICE" INTEGER , 
		  "TRANSACTION_DETAILS" CHAR(100) )
		 DISTRIBUTE BY HASH("DATE_ID")
                   IN "FACT_TPART_SMS";
CREATE TABLE "SKAPOOR"."SALES_DEC" (
		  "DATE_ID" DATE , 
		  "PRODUCT_ID" INTEGER , 
		  "STORE_ID" INTEGER , 
		  "QUANTITY" INTEGER , 
		  "PRICE" INTEGER , 
		  "TRANSACTION_DETAILS" CHAR(100) )
		 DISTRIBUTE BY HASH("DATE_ID")
                   IN "FACT_TPART_SMS";

VALUES (CURRENT TIMESTAMP);

-- POPULATE EACH TABLE WITH SALES FOR THE CORRESPONDING MONTH
INSERT INTO "SKAPOOR"."SALES_FEB"
SELECT * FROM "SKAPOOR"."SALES_FACT"
WHERE DATE_ID BETWEEN '2/1/2006' AND '2/28/2006';

commit work;

INSERT INTO "SKAPOOR"."SALES_MAR"
SELECT * FROM "SKAPOOR"."SALES_FACT"
WHERE DATE_ID BETWEEN '3/1/2006' AND '3/31/2006';

commit work;

INSERT INTO "SKAPOOR"."SALES_APR"
SELECT * FROM "SKAPOOR"."SALES_FACT"
WHERE DATE_ID BETWEEN '4/1/2006' AND '4/30/2006';

commit work;

INSERT INTO "SKAPOOR"."SALES_MAY"
SELECT * FROM "SKAPOOR"."SALES_FACT"
WHERE DATE_ID BETWEEN '5/1/2006' AND '5/31/2006';

commit work;

INSERT INTO "SKAPOOR"."SALES_JUN"
SELECT * FROM "SKAPOOR"."SALES_FACT"
WHERE DATE_ID BETWEEN '6/1/2006' AND '6/30/2006';

commit work;

INSERT INTO "SKAPOOR"."SALES_JUL"
SELECT * FROM "SKAPOOR"."SALES_FACT"
WHERE DATE_ID BETWEEN '7/1/2006' AND '7/31/2006';

commit work;

INSERT INTO "SKAPOOR"."SALES_AUG"
SELECT * FROM "SKAPOOR"."SALES_FACT"
WHERE DATE_ID BETWEEN '8/1/2006' AND '8/31/2006';

commit work;

INSERT INTO "SKAPOOR"."SALES_SEP"
SELECT * FROM "SKAPOOR"."SALES_FACT"
WHERE DATE_ID BETWEEN '9/1/2006' AND '9/30/2006';

commit work;

INSERT INTO "SKAPOOR"."SALES_OCT"
SELECT * FROM "SKAPOOR"."SALES_FACT"
WHERE DATE_ID BETWEEN '10/1/2006' AND '10/31/2006';

commit work;

INSERT INTO "SKAPOOR"."SALES_NOV"
SELECT * FROM "SKAPOOR"."SALES_FACT"
WHERE DATE_ID BETWEEN '11/1/2006' AND '11/30/2006';

commit work;

INSERT INTO "SKAPOOR"."SALES_DEC"
SELECT * FROM "SKAPOOR"."SALES_FACT"
WHERE DATE_ID BETWEEN '12/1/2006' AND '12/31/2006';

commit work;

VALUES (CURRENT TIMESTAMP);

-- Attach SALES from February and March
ALTER TABLE "SKAPOOR"."SALES_FACT_TPART" 
  ATTACH PARTITION Feb STARTING FROM '2/1/2006' ENDING AT '2/28/2006' 
  FROM "SKAPOOR"."SALES_FEB";
ALTER TABLE "SKAPOOR"."SALES_FACT_TPART" 
  ATTACH PARTITION Mar STARTING FROM '3/1/2006' ENDING AT '3/31/2006'
  FROM "SKAPOOR"."SALES_MAR";

-- Make the partitions visible
SET INTEGRITY FOR "SKAPOOR"."SALES_FACT_TPART"
IMMEDIATE CHECKED;

commit work;

-- Attach SALES from April to June
ALTER TABLE "SKAPOOR"."SALES_FACT_TPART" 
  ATTACH PARTITION Apr STARTING FROM '4/1/2006' ENDING AT '4/30/2006'
  FROM "SKAPOOR"."SALES_APR";
ALTER TABLE "SKAPOOR"."SALES_FACT_TPART" 
  ATTACH PARTITION May STARTING FROM '5/1/2006' ENDING AT '5/31/2006'
  FROM "SKAPOOR"."SALES_MAY";
ALTER TABLE "SKAPOOR"."SALES_FACT_TPART" 
  ATTACH PARTITION Jun STARTING FROM '6/1/2006' ENDING AT '6/30/2006'
  FROM "SKAPOOR"."SALES_JUN";

SET INTEGRITY FOR "SKAPOOR"."SALES_FACT_TPART"
IMMEDIATE CHECKED;

commit work;

-- Attach SALES from July to Dec
ALTER TABLE "SKAPOOR"."SALES_FACT_TPART" 
  ATTACH PARTITION Jul STARTING FROM '7/1/2006' ENDING AT '7/31/2006' 
  FROM "SKAPOOR"."SALES_JUL";
ALTER TABLE "SKAPOOR"."SALES_FACT_TPART" 
  ATTACH PARTITION Aug STARTING FROM '8/1/2006' ENDING AT '8/31/2006' 
  FROM "SKAPOOR"."SALES_AUG";
ALTER TABLE "SKAPOOR"."SALES_FACT_TPART" 
  ATTACH PARTITION Sep STARTING FROM '9/1/2006' ENDING AT '9/30/2006' 
  FROM "SKAPOOR"."SALES_SEP";
ALTER TABLE "SKAPOOR"."SALES_FACT_TPART" 
  ATTACH PARTITION Oct STARTING FROM '10/1/2006' ENDING AT '10/31/2006'
  FROM "SKAPOOR"."SALES_OCT";
ALTER TABLE "SKAPOOR"."SALES_FACT_TPART" 
  ATTACH PARTITION Nov STARTING FROM '11/1/2006' ENDING AT '11/30/2006'
  FROM "SKAPOOR"."SALES_NOV";
ALTER TABLE "SKAPOOR"."SALES_FACT_TPART" 
  ATTACH PARTITION Dec STARTING FROM '12/1/2006' ENDING AT '12/31/2006'
  FROM "SKAPOOR"."SALES_DEC";

SET INTEGRITY FOR "SKAPOOR"."SALES_FACT_TPART"
IMMEDIATE CHECKED;
commit work;

VALUES(CURRENT TIMESTAMP);

RUNSTATS ON TABLE SKAPOOR.SALES_FACT_TPART WITH DISTRIBUTION;

commit work;

CONNECT RESET;

In our test environment, it took approximately three hours to populate all the partitions; attaching each partition was quick. Once attached, a table becomes a physical part of the SALES_FACT_TPART table and can no longer be queried as a separate table. To query it separately again, it must be detached from SALES_FACT_TPART.
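
For example, a sketch of rolling out a month of data (SALES_DEC_OLD is a hypothetical name for the stand-alone table that receives the rows):

-- Roll out December by detaching its partition into a stand-alone table
ALTER TABLE "SKAPOOR"."SALES_FACT_TPART"
  DETACH PARTITION Dec INTO "SKAPOOR"."SALES_DEC_OLD";

COMMIT WORK;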

STEP 4B: The query in Listing 2 was modified to change the table name from "SALES_FACT" to "SALES_FACT_TPART", to test the benefits of partition elimination. Below, Listing 42 describes the new query. Compile the query in explain mode and generate the db2exfmt output for the main query as in Method 1, STEP 1C. Verify that the access plan utilizes the partitioned table and looks similar to TPART, found in the Downloads section.

Listing 42. Partitioned table query
WITH TMP1 (MONTH_1,STORE,REGION,DISTRICT,AMOUNT_1) AS
( SELECT
		D.MONTH AS MONTH, 
		S.STORE_ID AS STORE_ID,
		S.REGION AS REGION,
		S.DISTRICT AS DISTRICT,
		SUM(F1.QUANTITY * F1.PRICE) AS AMOUNT

  FROM
		SKAPOOR.SALES_FACT_TPART F1,
		SKAPOOR.DATE_DIM D,
		SKAPOOR.PRODUCT_DIM P,
		SKAPOOR.STORE_DIM S
  WHERE
		P.MODEL LIKE '%model%' AND
		F1.DATE_ID=D.DATE_ID AND
		F1.PRODUCT_ID=P.PRODUCT_ID AND
		F1.STORE_ID=S.STORE_ID AND
		F1.DATE_ID BETWEEN '2006-01-01' AND '2006-01-31' AND
		F1.STORE_ID IN (29, 30, 42, 55, 67, 72, 82, 99, 123, 199) AND
		D.MONTH = 1 
  GROUP BY
		S.STORE_ID,S.DISTRICT,S.REGION,D.MONTH) ,
TMP2 (MONTH_11,STORE,REGION,DISTRICT,AMOUNT_11) AS
(	SELECT 

		D1.MONTH AS MONTH,
		S1.STORE_ID AS STORE_ID,
		S1.DISTRICT AS DISTRICT,
		S1.REGION AS REGION,
		SUM(F2.QUANTITY * F2.PRICE) AS AMOUNT
		
	FROM
		SKAPOOR.SALES_FACT_TPART F2,
		SKAPOOR.DATE_DIM D1,
		SKAPOOR.PRODUCT_DIM P1,
		SKAPOOR.STORE_DIM S1
	WHERE
		P1.MODEL LIKE '%model%' AND
		F2.DATE_ID=D1.DATE_ID AND
		F2.PRODUCT_ID=P1.PRODUCT_ID AND
		F2.STORE_ID=S1.STORE_ID AND
		F2.DATE_ID BETWEEN '2006-11-01' AND '2006-11-30' AND
		F2.STORE_ID IN (29, 30, 42, 55, 67, 72, 82, 99, 123, 199) AND
		D1.MONTH=11
	GROUP BY
		S1.STORE_ID,S1.DISTRICT,S1.REGION,D1.MONTH) 
SELECT 
	A.*, 
	B.*
FROM
	TMP1 A LEFT OUTER JOIN TMP2 B ON
          (A.STORE=B.STORE AND A.REGION=B.REGION AND A.DISTRICT=B.DISTRICT)
ORDER BY A.AMOUNT_1 DESC, B.AMOUNT_11 DESC;

In the db2exfmt output, the details of the partitioned table access indicate whether partition elimination occurred and which partitions are accessed:

	14) TBSCAN: (Table Scan)
		Cumulative Total Cost: 		15378
		Cumulative CPU Cost: 		8.77067e+008
		Cumulative I/O Cost: 		15213
		Cumulative Re-Total Cost: 	15378
		Cumulative Re-CPU Cost: 	8.77065e+008
		Cumulative Re-I/O Cost: 	15213
		Cumulative First Row Cost: 	8.22883
		Cumulative Comm Cost:		0
		Cumulative First Comm Cost:	0
		Estimated Bufferpool Buffers: 	15213

		Arguments:
		---------
		DPESTFLG: (Number of data partitions accessed are Estimated)
			FALSE
		DPLSTPRT: (List of data partitions accessed)
			10
		DPNUMPRT: (Number of data partitions accessed)
			1
			
	...
		DP Elim Predicates:
		------------------
		Range 1)
			Stop  Predicate: (Q10.DATE_ID <= '11/30/2006')
			Start Predicate: ('11/01/2006' <= Q10.DATE_ID)
        ...

The DPESTFLG argument indicates whether partition elimination was estimated (TRUE) or computed exactly (FALSE) at compile time; if it was estimated, the actual partition elimination is determined at run time. The DPLSTPRT argument lists which partition(s) are accessed, and DPNUMPRT gives the number of partitions accessed. If DPESTFLG is TRUE, these values are only optimizer estimates. In this example, partition elimination was computed at compile time: only a single partition, partition 10, is accessed, and the remaining partitions are skipped.

The DP Elim Predicates section lists the predicates used to determine which partitions must be accessed.
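To see partition elimination on a simpler statement first, you can explain a probe like the one below; this query is our own illustration, not part of the test workload. With a November-only predicate, DPLSTPRT should list a single partition:

-- Illustrative probe (not part of the article's workload): with this
-- predicate, only the November partition should appear in DPLSTPRT.
EXPLAIN PLAN FOR
SELECT COUNT(*)
FROM SKAPOOR.SALES_FACT_TPART
WHERE DATE_ID BETWEEN '2006-11-01' AND '2006-11-30';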

STEP 4C: Recycle the instance and use the db2batch tool to measure the performance as in Method 1, STEP 1D.

Note: The query in the QUERY1.SQL file was changed to reflect the query in Listing 42.
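For reference, a minimal db2batch invocation would look like the following sketch; the output file name is a placeholder of our own, and the exact options from Method 1, STEP 1D should be used so the measurements remain comparable:

db2batch -d DSS_DB -f QUERY1.SQL -r results.out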

Record your results.

Exercises

  1. Indexes could help improve the performance of the query against the partitioned table SALES_FACT_TPART. Create one or more indexes you think might be useful and redo steps 4B and 4C; don't forget to collect statistics on the indexes (a possible index is sketched after this list).
  2. Try detaching one or more partitions from the SALES_FACT_TPART table to get a feel for how the roll-out of data can be easily and efficiently achieved using partitioned tables.
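As one possible starting point for Exercise 1 (the column choice is our own suggestion, not the article's), an index covering the fact table's local predicates and join columns could look like this:

-- A hypothetical index for Exercise 1; the column choice is ours.
CREATE INDEX SKAPOOR.SALES_TPART_IND1
  ON SKAPOOR.SALES_FACT_TPART (STORE_ID, DATE_ID, PRODUCT_ID)
  ALLOW REVERSE SCANS;

-- Collect statistics so the optimizer can cost the new index.
RUNSTATS ON TABLE SKAPOOR.SALES_FACT_TPART WITH DISTRIBUTION AND INDEXES ALL;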

METHOD 5: Table partitioning with MDC and replicated MQTs on dimensions

This test is similar to Method 4, but it replaces the SALES_FACT_TPART table with a partitioned MDC fact table. MDC and table partitioning can be combined to further improve query execution performance. As in Method 3, the DATE_ID and STORE_ID columns are used as the MDC dimension columns, and the table is partitioned on DATE_ID using the same ranges as in Method 4.

STEP 5A: Create the partitioned MDC table

Listing 43. Create the partitioned MDC SALES_FACT table
db2 -tvf tablepart_mdc.ddl -z tablepart_mdc.log
Listing 44. Contents of tablepart_mdc.ddl
CONNECT TO DSS_DB;

CREATE TABLE "SKAPOOR "."SALES_FACT_TPART_MDC" ( "DATE_ID" DATE ,
 "PRODUCT_ID" INTEGER ,
 "STORE_ID" INTEGER ,
 "QUANTITY" INTEGER ,
 "PRICE" INTEGER ,
 "TRANSACTION_DETAILS" CHAR(100))
DISTRIBUTE BY HASH("DATE_ID")  
  PARTITION BY RANGE("DATE_ID")
  (PART "JAN" STARTING('2006-01-01') ENDING('2006-01-31') IN "FACT_TPART_SMS",
   PART "FEB" STARTING('2006-02-01') ENDING('2006-02-28') IN "FACT_TPART_SMS",
   PART "MAR" STARTING('2006-03-01') ENDING('2006-03-31') IN "FACT_TPART_SMS",
   PART "APR" STARTING('2006-04-01') ENDING('2006-04-30') IN "FACT_TPART_SMS",
   PART "MAY" STARTING('2006-05-01') ENDING('2006-05-31') IN "FACT_TPART_SMS",
   PART "JUN" STARTING('2006-06-01') ENDING('2006-06-30') IN "FACT_TPART_SMS",
   PART "JUL" STARTING('2006-07-01') ENDING('2006-07-31') IN "FACT_TPART_SMS",
   PART "AUG" STARTING('2006-08-01') ENDING('2006-08-31') IN "FACT_TPART_SMS",
   PART "SEP" STARTING('2006-09-01') ENDING('2006-09-30') IN "FACT_TPART_SMS",
   PART "OCT" STARTING('2006-10-01') ENDING('2006-10-31') IN "FACT_TPART_SMS",
   PART "NOV" STARTING('2006-11-01') ENDING('2006-11-30') IN "FACT_TPART_SMS",
   PART "DEC" STARTING('2006-12-01') ENDING('2006-12-31') IN "FACT_TPART_SMS")
 ORGANIZE BY (
        DATE_ID,
        STORE_ID)
 ;

COMMIT WORK ;

INSERT INTO SKAPOOR.SALES_FACT_TPART_MDC
  SELECT * FROM SKAPOOR.SALES_FACT_MDC_1;

COMMIT WORK;

RUNSTATS ON TABLE SKAPOOR.SALES_FACT_TPART_MDC WITH DISTRIBUTION AND INDEXES ALL;

COMMIT WORK;
CONNECT RESET;
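To confirm that the dimension block indexes and the composite block index were created for the MDC table, you can query the catalog; this check is our own addition, not part of the article's scripts:

-- Our own catalog check: MDC tables carry dimension block indexes
-- (INDEXTYPE 'DIM') and a composite block index (INDEXTYPE 'BLOK').
SELECT INDNAME, INDEXTYPE
  FROM SYSCAT.INDEXES
  WHERE TABSCHEMA = 'SKAPOOR'
    AND TABNAME = 'SALES_FACT_TPART_MDC';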

STEP 5B: The query in Listing 2 was modified to change the table name from "SALES_FACT" to "SALES_FACT_TPART_MDC", to test the benefits of combining MDC with partition elimination. Listing 45 below shows the new query. Compile the query in explain mode and generate the db2exfmt output for the main query as in Method 1, STEP 1C. Verify that the access plan uses the partitioned table and its block indexes, and looks similar to TPART_MDC, found in the Downloads section.

Listing 45. Partitioned MDC table query
WITH TMP1 (MONTH_1,STORE,REGION,DISTRICT,AMOUNT_1) AS
( SELECT
		D.MONTH AS MONTH, 
		S.STORE_ID AS STORE_ID,
		S.DISTRICT AS DISTRICT,
		S.REGION AS REGION,
		SUM(F1.QUANTITY * F1.PRICE) AS AMOUNT

  FROM
		SKAPOOR.SALES_FACT_TPART_MDC F1,
		SKAPOOR.DATE_DIM D,
		SKAPOOR.PRODUCT_DIM P,
		SKAPOOR.STORE_DIM S
  WHERE
		P.MODEL LIKE '%model%' AND
		F1.DATE_ID=D.DATE_ID AND
		F1.PRODUCT_ID=P.PRODUCT_ID AND
		F1.STORE_ID=S.STORE_ID AND
		F1.DATE_ID BETWEEN '2006-01-01' AND '2006-01-31' AND
		F1.STORE_ID IN (29, 30, 42, 55, 67, 72, 82, 99, 123, 199) AND
		D.MONTH = 1 
  GROUP BY
		S.STORE_ID,S.DISTRICT,S.REGION,D.MONTH) ,
TMP2 (MONTH_11,STORE,REGION,DISTRICT,AMOUNT_11) AS
(	SELECT 

		D1.MONTH AS MONTH,
		S1.STORE_ID AS STORE_ID,
		S1.DISTRICT AS DISTRICT,
		S1.REGION AS REGION,
		SUM(F2.QUANTITY * F2.PRICE) AS AMOUNT
		
	FROM
		SKAPOOR.SALES_FACT_TPART_MDC F2,
		SKAPOOR.DATE_DIM D1,
		SKAPOOR.PRODUCT_DIM P1,
		SKAPOOR.STORE_DIM S1
	WHERE
		P1.MODEL LIKE '%model%' AND
		F2.DATE_ID=D1.DATE_ID AND
		F2.PRODUCT_ID=P1.PRODUCT_ID AND
		F2.STORE_ID=S1.STORE_ID AND
		F2.DATE_ID BETWEEN '2006-11-01' AND '2006-11-30' AND
		F2.STORE_ID IN (29, 30, 42, 55, 67, 72, 82, 99, 123, 199) AND
		D1.MONTH=11
	GROUP BY
		S1.STORE_ID,S1.DISTRICT,S1.REGION,D1.MONTH) 
SELECT 
	A.*, 
	B.*
FROM
	TMP1 A LEFT OUTER JOIN TMP2 B ON
          (A.STORE=B.STORE AND A.REGION=B.REGION AND A.DISTRICT=B.DISTRICT)
ORDER BY A.AMOUNT_1 DESC, B.AMOUNT_11 DESC;

STEP 5C: Recycle the instance and use the db2batch tool to measure the performance as in Method 1, STEP 1D.

Note: The query in the QUERY1.SQL file was changed to reflect the query in Listing 45.

Record your results.


METHOD 6: MQT to pre-compute the aggregated results

This test is similar to Method 1, with the addition of an MQT that pre-computes the aggregated results. Using MQTs to materialize results expressed as aggregates can significantly improve query performance. In the query described in Listing 2, each arm of the outer join consists of an aggregation over the same join; the only difference is the local predicates applied to the fact table. If the join and aggregation can be pre-computed before the query executes, query execution performance can improve significantly.

STEP 6A: Create and refresh the MQT:

Listing 46. Creating the MQT table
db2 -tvf mqt2.ddl -z mqt2.log
Listing 47. Contents of mqt2.ddl file
CONNECT TO DSS_DB;

------------------------------------------------
-- DDL STATEMENTS FOR TABLE "SKAPOOR "."MQT2"

CREATE TABLE SKAPOOR.MQT2 AS
( 
  SELECT 
     D.MONTH AS MONTH, 
     S.STORE_ID AS STORE_ID,
     S.DISTRICT AS DISTRICT, 
     S.REGION AS REGION, 
     SUM(F1.QUANTITY * F1.PRICE) AS AMOUNT , 
     F1.DATE_ID 
  FROM
     SKAPOOR.SALES_FACT F1, 
     SKAPOOR.DATE_DIM D,
     SKAPOOR.PRODUCT_DIM P, 
     SKAPOOR.STORE_DIM S 
  WHERE 
     F1.DATE_ID=D.DATE_ID AND 
     F1.PRODUCT_ID=P.PRODUCT_ID AND 
     F1.STORE_ID=S.STORE_ID AND 
     P.MODEL LIKE '%model%' 
  GROUP BY S.STORE_ID,S.DISTRICT,S.REGION,D.MONTH, F1.DATE_ID
) 
DATA INITIALLY DEFERRED REFRESH DEFERRED IN FACT_SMS;

REFRESH TABLE "SKAPOOR "."MQT2";


-- DDL STATEMENTS FOR INDEXES ON TABLE "SKAPOOR "."MQT2"

CREATE INDEX "SKAPOOR "."MQT2_IND3" ON "SKAPOOR "."MQT2" 
		("MONTH" ASC,
		 "DATE_ID" ASC)
		ALLOW REVERSE SCANS;


-- DDL STATEMENTS FOR INDEXES ON TABLE "SKAPOOR "."MQT2"

CREATE INDEX "SKAPOOR "."MQT2_IND4" ON "SKAPOOR "."MQT2" 
		("DATE_ID" ASC,
		 "STORE_ID" ASC,
		 "DISTRICT" ASC,
		 "REGION" ASC,
		 "MONTH" ASC,
		 "AMOUNT" ASC)
		ALLOW REVERSE SCANS;

The two indexes created in Listing 47, MQT2_IND3 and MQT2_IND4, improve the performance of accessing data from the MQT.
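To see roughly what the optimizer's rerouting accomplishes, consider how the TMP1 arm of Listing 2 collapses onto MQT2. The rewrite below is our own illustration of the effect, not a statement you need to run; the join and the MODEL predicate are already folded into the MQT, so only a re-aggregation over DATE_ID remains:

-- Our own illustration of MQT rerouting for the TMP1 arm of Listing 2:
SELECT MONTH, STORE_ID, DISTRICT, REGION, SUM(AMOUNT) AS AMOUNT
  FROM SKAPOOR.MQT2
  WHERE DATE_ID BETWEEN '2006-01-01' AND '2006-01-31'
    AND STORE_ID IN (29, 30, 42, 55, 67, 72, 82, 99, 123, 199)
    AND MONTH = 1
  GROUP BY STORE_ID, DISTRICT, REGION, MONTH;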

STEP 6B: Collect statistics on the MQT as follows, and adjust the schema to fit your environment:

Listing 48. Collecting statistics on the MQT table
DB2 RUNSTATS ON TABLE SKAPOOR.MQT2 WITH DISTRIBUTION AND INDEXES ALL

STEP 6C: Explain the query from Listing 2 and generate the db2exfmt output as in STEP 1C of Method 1. Then examine the access plan. You should see MQT2 chosen for both arms of the query, with a single join operator completing the outer join between the two accesses of the MQT. The plan should look similar to Test 6, found in the Downloads section.

If the MQT is not chosen, ensure that DFT_REFRESH_AGE is set to ANY in your database configuration on all database partitions; otherwise, the optimizer will not consider the MQT.
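Both of the following are standard ways to do this; run the configuration update on every database partition:

-- Allow the optimizer to consider REFRESH DEFERRED MQTs database-wide:
db2 UPDATE DB CFG FOR DSS_DB USING DFT_REFRESH_AGE ANY

-- Or, for the current session only, set the special register:
db2 SET CURRENT REFRESH AGE ANY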

STEP 6D: Recycle the instance and measure the performance using the db2batch tool as in STEP 1D in Method 1.

Record your results.


Examining the effects of each method on query execution performance

Note: All the tests were executed in an environment with no other activity running.

The following table lists the elapsed time (in seconds) measured for each method tested on our system using the db2batch tool.

METHOD                                               QUERY        ELAPSED TIME (SECS)
1. Referential Integrity Constraints                 Listing 2         115.00
2. Replicated MQTs                                   Listing 2         103.42
3. Multidimensional Clustering and Replicated MQTs   Listing 39         38.36
4. Table Partitioning and Replicated MQTs            Listing 42        197.74
5. Table Partitioning, MDC, and Replicated MQTs      Listing 45         32.21
6. MQT to Pre-compute Aggregation Results            Listing 2           7.61

The results show that using an MQT to pre-compute the results, expressed as aggregates, provided the most significant improvement in query performance. Multidimensional clustering, and table partitioning combined with MDC, also showed significant improvements compared to the base star-schema layout of Method 1.

Replicating the dimension tables improved performance only slightly in our test environment, because all four database partitions are logical partitions on a single physical machine. If the database partitions reside on multiple physical machines, this method should provide significant savings, especially when a large amount of data would otherwise be shipped between database partitions.

Table partitioning on its own actually made performance worse. In our test, no indexes were created on the partitioned table, so that partition elimination could be evaluated in isolation. The additional predicates on the fact table further filter the single partition accessed in each leg of the outer join, and one or more indexes would let DB2 exploit those predicates; improved performance is thus another benefit of indexing the partitioned table, which is presented as an exercise for the reader in Method 4.

These tests show that BI query performance can be improved using various features of DB2 9.


Conclusion

The methods discussed in this article are just some of the ways to improve query performance in BI environments. We encourage you to try the exercises in the next section, as well as those sprinkled throughout the article.


Exercises

Try this:

  1. You can use the DB2 Design Advisor to recommend indexes, multidimensional clustering, database partitioning, and materialized query tables. Use the Design Advisor to see which indexes, MDC dimensions, and MQTs it recommends for the query in Listing 2 against the base SALES_FACT table (a sample invocation is sketched after this list).
  2. Try some of the methods discussed in this article on a sample workload of queries.
  3. Repeat Exercise 1 on a workload of queries.
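As a starting point for Exercise 1, the Design Advisor can be driven from the command line. The invocation below is a sketch assuming the query has been saved in query1.sql; the -m MICP option asks for MQT (M), index (I), MDC (C), and partitioning (P) recommendations:

db2advis -d DSS_DB -i query1.sql -m MICP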

Downloads

Description   Name              Size
JOIN_ELIM     join_elim.exfmt    49KB
Test 1        test1.exfmt       117KB
Test 2        test2.exfmt       120KB
Test 3        test3.exfmt       117KB
Test 6        test6.exfmt        50KB
TPART         tpart.exfmt       102KB
TPART_MDC     tpart_mdc.exfmt   118KB
