Zigzag join enablement for DB2 star schema queries

IBM® DB2® 10.1 Linux®, UNIX®, and Windows® introduces a new join method called zigzag join. Zigzag join improves consistency in performance as well as reduces the execution time of queries in data warehouse or data mart environments with large volumes of potentially partitioned data, complex ad hoc queries, and where database design uses a star schema. This article provides insight into the eligibility criteria of selecting the zigzag plan, how the zigzag join method works, using a zigzag join with a gap in the fact table multi-column index, using the ZZJOIN operator, and using the index advisor for zigzag join.

Lee Jung Chu (leejung@us.ibm.com), DB2 Linux, UNIX, and Windows FVT, Next Generation Data Analytics, IBM

Author PhotoLeejung Chu is the team lead for DB2 Linux, UNIX, Windows FVT, next generation data analytics. She has more than 10 years of experience, and was the original Red Brick Warehouse QA who tested the first version of the zigzag algorithm.



Sudhir Singh (sudkuma9@in.ibm.com), Software Engineer, IBM

Author photoSudhir K Singh is a software engineer with IBM India Software Labs. He is a certified DB2 administrator and quality engineer with two years of experience with star schema query optimization in DB2.



21 March 2013

Also available in Chinese

Introduction

This article discusses an analytical process that will help you to understand the zigzag plan and exploit it for the star-shaped queries in a database having star schema design. You will be able to explain the optimizer plan and check the explain output to see the details about a zigzag join when it is being used to execute a query. You will also be able to troubleshoot the issues with multi-column indexes on a fact table with the help of index advisor.


Star schema

A star schema is a denormalization schema architecture used for data warehousing. Refer to the Resources section for more information on star schema. The star schema consists of a few fact tables (usually one) referencing any number of dimension tables.

The fact table holds the main data while the smaller dimension tables describe each value of a dimension.

Dimension tables have a simple primary key, while fact tables have a compound key consisting of the aggregate of relevant dimension keys. Figure 1 shows a simple star schema with one fact table and four dimension tables.

Figure 1. Star schema
This figure shows you how a star schema will look like

The primary key/foreign key relationships between Fact and Dim1 can be established by defining a foreign key constraint on Fact f1 referencing Dim1 d1, or by optimizer after analyzing the join predicate Fact.f1=Dim1.d1.


What is zigzag join?

A zigzag join is a join method used when a fact table and two or more dimension tables in a star schema are joined.

The combinations in the Cartesian product of matched dimension values do not always appear in the fact table. The zigzag join method calculates the Cartesian product of rows from the dimension tables, and probes the fact table through a multi-column index to find matching rows, then returns the next value from the multi-column index. This next value, referred to as feedback, is used to skip over the values provided by the Cartesian product of dimension tables that don't result in a match with the fact table.

The multi-column index helps to filter the fact table along two or more dimension tables simultaneously, and the feedback from multi-column index allows avoiding lookups in the index over the fact table for unproductive dimension key combinations. Combining these two features makes the zigzag join efficient for querying large fact tables.


Eligibility criteria to qualify for zigzag join

The eligibility criteria to qualify for the zigzag join method are as follows.

  • There must be at least one fact table and at least two dimension tables.
  • The dimension tables must have either a primary key, a unique constraint, or a unique index defined on their key columns.
  • The dimensions join with the fact table using equi-join predicates such that there are equi-join predicates on all of columns that compose the primary key in the dimensions.
  • There must be a suitable multi-column index on the fact table(s). The multi-column index must include all of the columns used in equality join predicates between the fact table(s) and dimension tables.

Query analysis for zigzag join method

Consider a simple star schema with the following description.

  • There is one fact table in the schema: aroma_sales.
  • There are more than two dimension tables: aroma_product, aroma_period, aroma_promotion, aroma_store.
  • There is one outboard table: aroma_market.
  • The aroma_sales references dimension table: aroma_store.
  • The aroma_store references outboard table: aroma_market.

The database objects tables, columns, and relationships in star schema are shown in Figure 2.

Figure 2. Diagram of database objects in star schema
This figure shows the database objects tables,columns,and relationships in star schema.

You can execute the query shown in Listing 1 to find the total sum of a sale on a particular date of a particular product.

Listing 1. Query to know the total sum of a sale
 SELECT date, prod_name, sum(dollars)
 FROM aroma_product, aroma_period, aroma_sales
 WHERE aroma_product.classkey = aroma_sales.classkey
 and aroma_period.perkey = aroma_sales.perkey
 and aroma_product.prodkey = aroma_sales.prodkey
 and prod_name like 'Darjeeling%'
 and day in ('SA', 'SU')
 and year = 2002
 group by date, prod_name

The query shown previously in Listing 1 has three tables and you need to decide which is the fact table and which are dimension tables.

Suppose there are unique indexes, a primary key, or a unique constraint defined on two tables called aroma_product and aroma_period using the statements shown in Listing 2.

Listing 2. Create unique index/unique constraint/primary key on dimension tables
db2 "alter table aroma_product add constraint product_uconst unique (classkey, prodkey)"

db2 "create unique index period_uidx on aroma_period (perkey)"

As shown, aroma_product and aroma_period have a unique constraint and unique index defined respectively. These columns are joined with aroma_sales using equi-join predicate. So, aroma_sales should be the fact table, and aroma_product and aroma_period are the dimensions.

Now you will need a multi-column index on aroma_sales, which will provide sorted access to the join columns. A multi-column index covers the join columns in a way that both dimensions tables should be covered, as shown in Listing 3.

Listing 3. Multi-column index on fact table
db2 "create index zz_idx on aroma_sales(prodkey , classkey , perkey)"

This query qualifies all the rules of zigzag join eligibility criteria, hence, the eligible for zigzag join method.

Note: aroma_product's key column is a composite key, so all of the columns in the composite key need to stay together in the multi-column index. In other words, the multi-column index in this example can be on (classkey, prodkey, perkey), or (prodkey, classkey, perkey), but can’t be (classkey, perkey, prodkey).


Working of zigzag algorithm

The working of zigzag algorithm is explained with the help of example data. The dimensions together form a multi-column key that is used to probe the fact table using the multi-column index on the fact table.

The order of the composite keys must match the order of the fact table index for the zigzag join. Dimension table aroma_product has a composite unique key (prodkey, classkey) which should be in same order in the fact table aroma_sales multi-column index.

The fact table index returns the rows matching the input dimension keys. Once all the matches are returned, the fact table index returns the next logical key (higher key for an ascending index, lower key for the descending index). This next logical key is used as the feedback by the dimension cursors to position on the key that is equal to or (logically) higher than the feedback key. Thus the feedback from the fact table index is used to skip some combinations of the dimensions, thus reducing the number of probes into the fact table. Reference Figures 3 and 4 with the following.

  1. The cursor is initially positioned on the dimensions table corresponding to keys at (1,1,1), so probe into multi-column index for this key. This is a match, so consume all duplicate keys.
  2. The next higher key from multi-column index is (1,1,2) returned. This is also called the feedback key, which helps to position the cursor according to key part to get the next candidate key to start the probe.
  3. (1,1,3) is the next valid candidate key logically higher from the feedback key (1,1,2) to start the probe. Probe for (1,1,3) in multi-column index. This is a miss.
  4. The next higher key (1,2,2) is returned. Position the cursor based on key parts of the feedback key to get the next candidate key. (1,2,3) is the next valid key to start the probe.
  5. Probe for (1,2,3) in multi-column index. This is also a miss.
  6. The next higher key (1,3,2) is returned. Position the cursor based on key parts of the feedback key to get the next candidate key. (1,3,3) is the next valid key to start the probe.
  7. Probe for (1,3,3) in multi-column index. This is also a miss.
  8. The next higher key (2,1,1) is returned. Position the cursor based on key parts of the feedback key to get the next candidate key. (2,1,1) is the next valid key to start the probe.
  9. Probe for (2,1,1) in multi-column index. This is a match, so consume all duplicate keys. This process will continue until all feedback keys are returned or all probes from dimensions are consumed and no records are left in dimensions.
Figure 3. Dimension tables key column
This figure shows the key column values in dimension tables aroma_product and aroma_period.
Figure 4. Zigzag algorithm to skip unproductive probes
This figure talks about zigzag algorithm in which It shows how cartesian product of dimension tables aroma_product and aroma_period key column values are being probed in fact table using multi-column index on aroma_sales(fact table).

You can see in the diagram in Figures 3 and 4 how many unproductive probes are skipped using the zigzag algorithm.


Performance improvement

Consider the query shown in Listing 4, which has one fact called table aroma_sales, and the remaining tables are dimension tables. Suppose there are unique indexes, a primary key, or a unique constraint defined on dimension tables.

Check the performance of the query shown in Listing 4 in the three scenarios shown in Listing 5, 6, and 7, and then compare the execution time taken in each case.

Listing 4. Query
SELECT date, prod_name, dollars
FROM aroma_promotion, aroma_product, aroma_period, aroma_store, aroma_sales
WHERE aroma_promotion.promokey = aroma_sales.promokey
and aroma_product.classkey = aroma_sales.classkey
and aroma_period.perkey = aroma_sales.perkey
and aroma_product.prodkey = aroma_sales.prodkey
and aroma_store.storekey = aroma_sales.storekey
and prod_name like 'Darjeeling%'
and day in ('SA', 'SU')
and year = 2002
and city in ('Los Gatos', 'San Jose')

You will see the performance of this query in the following three scenarios and do the comparison analysis of execution time taken by the query in all three cases.

As shown in Listing 5, set the DB2_REDUCED_OPTIMIZATION to ZZJN ON to turn on the zigzag join, which is enabled by default. First, you can see the execution time taken by the query without the fact table multi-column index.

Listing 5. Without multi-column index on fact table aroma_sales
* Elapsed Time is:       0.094289 seconds

* Summary Table:

Type      Number   Repetitions Total Time (s) Min Time (s)   Max Time (s)   Row(s) Fetched
--------- -------- ----------- -------------- -------------- -------------- --------------
Statement     1           1       0.094289     0.094289       0.094289           150      

* Total Entries:              1
* Total Time:                 0.094289 seconds
* Minimum Time:               0.094289 seconds
* Maximum Time:               0.094289 seconds
* Arithmetic Mean Time:       0.094289 seconds
* Geometric Mean Time:        0.094289 seconds
---------------------------------------------

As shown in Listing 6, create the multi-column index on fact table aroma_sales and execute the query to see the time taken, in this case db2 “create index zz_idx on aroma_sales(prodkey DESC, classkey, storekey, promokey, perkey)”.

Listing 6. With multi-column index on fact table aroma_sales
* Elapsed Time is:       0.022553 seconds

* Summary Table:

Type      Number      Repetitions Total Time (s) Min Time (s)  Max Time (s) Row(s) Fetched
--------- ----------- ----------- -------------- ------------ ------------- --------------
Statement     1           1       0.022553       0.022553       0.022553            150

* Total Entries:              1
* Total Time:                 0.022553 seconds
* Minimum Time:               0.022553 seconds
* Maximum Time:               0.022553 seconds
* Arithmetic Mean Time:       0.022553 seconds
* Geometric Mean Time:        0.022553 seconds
---------------------------------------------

As shown in Listing 7, turn off the zigzag join by setting DB2_REDUCED_OPTIMIZATION to ZZJN OFF, and execute the query to see the time taken.

Listing 7. With multi-column index on fact table aroma_sales
* Elapsed Time is:       0.159272 seconds

* Summary Table:

Type      Number      Repetitions Total Time (s) Min Time (s) Max Time (s) Row(s) Fetched
--------- ----------- ----------- -------------- ------------ -----------  --------------
Statement     1           1       0.159272       0.159272       0.159272           150

* Total Entries:              1
* Total Time:                 0.159272 seconds
* Minimum Time:               0.159272 seconds
* Maximum Time:               0.159272 seconds
* Arithmetic Mean Time:       0.159272 seconds
* Geometric Mean Time:        0.159272 seconds
---------------------------------------------

It is clear that when the multi-column index on the fact table exists, and zigzag join is turned on to exploit the new plan, execution time taken by star query is lower.


Gap in fact table multi-column index

A multi-column index on the fact table is necessary for the zigzag join. To exploit zigzag joins in queries that join different dimension tables with the fact table, a large number of multi-column indexes on the fact table would be needed. Because this would not be cost-effective, IBM DB2 10.1 Linux, UNIX and Windows supports the concept of gaps in the index, so that the number of multi-column indexes can be greatly reduced without sacrificing performance.

A gap in the index approach is also known as a jump scan. A database is said to support jump scans when it can use a multi-column index and query predicates that constrain only some of the columns in the index to retrieve table data.

Consider the query shown in Listing 8 to understand the concept of gap in index.

Listing 8. Query with gap
SELECT date, prod_name, sum(dollars)
FROM product, period, sales
WHERE sales.classkey = product.classkey
and sales.prodkey = product.prodkey 
and sales.perkey = period.perkey
and prod_name like 'Darjeeling%' 
and day in ('SA', 'SU') 
and year = 2002
group by date, prod_name

The available multi-column index on fact table is as follows: aroma_sales (prodkey DESC, classkey, storekey, promokey, perkey).

There is no join predicate on aroma_sales storekey and promokey column, so there is no need to create a separate index to cover just (prodkey, classkey, perkey) to qualify zigzag join criteria. DB2 Optimizer is able to recognize storekey and promokey as the gap in an existing index and still choose zigzag if it is the best plan.

A gap in an index allows you to create a small number of multi-column indexes and exploit them to answer a large number of queries.


Verification of zigzag plan

You can confirm if a query chooses zigzag join or not by using operator ZZJOIN. There are two ways to confirm if queries have selected the zigzag plan over other plans. They are as follows.

  • If db2exfmt shows the access plan of the query executed in which the presence of operator ZZJOIN confirms the selection of zigzag join.
    db2exfmt -d <database_name> -1 -g O -f O -o <output_file_name>
  • Querying explain tables for ZZJOIN operator which confirms the selection of zigzag join.
    db2 "select count(*) from explain_operator where operator_type ='ZZJOIN' and operator_id not in (select operator_id from explain_argument where argument_type = 'BACKJOIN' and argument_value = 'TRUE')"

Sample scenario: Enable zigzag plan for star schema query

Consider the star schema mentioned previously. As shown in Listing 9, you can see how to enable zigzag join for the query mentioned in Listing 4.

Listing 9. Create fact table and dimension tables
create db aroma using codeset UTF-8 territory US pagesize 32 k;
connect to aroma;
set schema aroma;

create table aroma_market(mktkey integer not null,hq_city char(20),
hq_state char(20),district char(20),region char(20));

create table aroma_period ( perkey integer , date date, day character(3), week integer, 
month character(5), qtr character(3), year integer, aroma_period_xml xml);

create table aroma_store(storekey integer not null,mktkey integer not null,
store_type char(10),store_name char(30),street char(30),city char(20),
state char(5),zip char (10),aroma_store_xml xml);

create table aroma_product ( classkey integer not null,  prodkey integer not null, 
prod_name char(30), pkg_type char(20), aroma_product_ts xml, aroma_product_xml xml);

create table aroma_promotion ( promokey integer not null, promo_type integer not null, 
promo_desc char(40), value dec(7,2), start_date date, end_date date);

create table aroma_sales ( perkey integer , classkey integer, prodkey integer , 
storekey integer , promokey integer, quantity integer, dollars dec(7,2));

As shown in Listings 10 and 11, create a primary key, unique constraint, or unique indexes on dimension tables identified by analysis of query mentioned previously in Listing 4. Also define primary key/foreign key relations among the tables if it exists.

Listing 10. Create primary key or unique constraint or unique indexes on dimension tables
ALTER TABLE aroma_store ADD constraint store_pkc primary key (storekey);

ALTER TABLE aroma_market ADD constraint market_pkc primary key(mktkey);

ALTER TABLE aroma_promotion ADD constraint promotion_pkc primary key (promokey);

alter table aroma_product add constraint product_uconst unique (classkey, prodkey);

create unique index period_uidx on aroma_period (perkey);

ALTER TABLE aroma_store ADD constraint store_fkc foreign key (mktkey) 
references aroma_market (mktkey);

ALTER TABLE aroma_sales ADD constraint sales_store_fkc foreign key (storekey) 
references aroma_store (storekey)
Listing 11. Populate the fact and dimension tables
load from aroma/aroma_market.txt  of del modified by coldel, 
method P (1, 2, 3, 4, 5) warningcount 100 replace into aroma_market;

load from aroma/aroma_store.txt of del xml from aroma/ 
warningcount 100  replace into aroma_store;

load from aroma/aroma_product.txt of del xml from aroma/ 
warningcount 100 replace into aroma_product;

load from aroma/aroma_promo.txt of del modified by coldel, 
method P (1, 2, 3, 4, 5, 6) warningcount 100 replace into aroma_promotion;

load from aroma/aroma_period.txt of del xml from aroma/ 
warningcount 100 replace into aroma_period;

load from aroma/aroma_sales.txt of del modified by coldel, 
method P (1, 2, 3, 4, 5, 6, 7) warningcount 100 replace into aroma_sales;

Explain tables store the information about the operators used in accessing data as well as other details about access plan. Refer to the Resources section for more information on explain tables. It is strongly recommended that you create explain tables using the EXPLAIN.DDL script to generate an access plan of any query, as shown in Listing 12.

Listing 12. Create explain tables
set schema current user;
db2 -tvf EXPLAIN.DDL;

Create a multi-column index required on fact table aroma_sales to get sorted access on join columns for aroma_sales data, as shown in Listings 13, 14, and 15.

Listing 13. Create multi-column index on fact table
set schema aroma;
create index zz_idx on aroma_sales(prodkey DESC, classkey, storekey, promokey, perkey);
Listing 14. Run the star schema query
SELECT date, prod_name, dollars
FROM aroma_promotion, aroma_product, aroma_period, aroma_store, aroma_sales
WHERE aroma_promotion.promokey = aroma_sales.promokey
and aroma_product.classkey = aroma_sales.classkey
and aroma_period.perkey = aroma_sales.perkey
and aroma_product.prodkey = aroma_sales.prodkey
and aroma_store.storekey = aroma_sales.storekey
and prod_name like 'Darjeeling%'
and day in ('SA', 'SU')
and year = 2002
and city in ('Los Gatos', 'San Jose')
Listing 15. Use db2exfmt tool to explain the access plan
db2exfmt -d aroma -1 -g O -f O -o zigzag.txt;

The db2exfmt tool shows the access plan used to access the data from the tables involved in a query. It also gives the order of operations for accessing data, view statistics for selected tables, indexes, or columns of properties for operators involved in query.

In the access plan shown in figure 5, the ZZJOIN operator Z16.S36 suggests zigzag plan is being used while accessing data. Also, the multi-column index on fact table 69941 INDEX: AROMA ZZ_I1 Q5 is used to have sorted access of data from fact table, Hence, the zigzag plan is used by star schema query.

Figure 5. Check the access plan for zigzag join operator (ZZJOIN)
This figure shows you how data access has been done using various operators for the query triggered above and It shows how to check access plan for zigzag join operator(ZZJOIN).

Querying the explain tables for ZZJOIN operator can be another way to know about the selection of a zigzag plan while accessing data, as shown in Listing 16. The only difference here is that it does not give access plan details.

Listing 16. COUNT(*) query to confirm ZZJOIN presence in access plan
db2 "select count(*) from explain_operator where operator_type ='ZZJOIN' 
and operator_id not in (select operator_id from explain_argument 
where argument_type = 'BACKJOIN' and argument_value = 'TRUE')"

1
-----------
          1

  1 record(s) selected.

Index advisor

"Where is my zigzag? I expect to see a zigzag join in an access plan but I don't see it!"

You might come across this kind of problem when you have issues in selecting a zigzag plan by star queries.

Let's concentrate on troubleshooting the issues with multi-column indexes on a fact table with the help of index advisor.

Users are often not aware of the multi-column indexes requirement on a fact table, which is one of the vital eligibility criteria of zigzag join selection. Having a wrong set of indexes is detrimental for query performance. Index advisor increases user awareness by providing an index recommendation to enable DB2 to use zigzag. The index recommendation is based on queries run by users, and it is for a fact table only. Index recommendations that are most likely to increase query performance are provided as an EXPLAIN diagnostic message. For example, run the query in Listing 17.

Listing 17. Query
SELECT date, prod_name, sum(dollars)
FROM product, period, sales
WHERE sales.classkey = product.classkey
and sales.prodkey = product.prodkey 
and sales.perkey = period.perkey
and prod_name like 'Darjeeling%' 
and day in ('SA', 'SU') 
and year = 2002
group by date, prod_name

After running the query, run the db2exfmt tool to get the diagnostic information suggesting on which columns multi-column index needs to be created.

db2exfmt -d <database_name> -1 -g O -f O -o <output_file_name>

The extended diagnostic information section of the db2exfmt output is shown in Listing 18.

Listing 18. Extended diagnostic information
Diagnostic Details: EXP0256I  Analysis of the query shows that the
                    query might execute faster if an additional index
                    was created to enable zigzag join. Schema name:
                    "SUDKUMA9". Table name: "AROMA_SALES". 
		    Column list: "(PRODKEY, CLASSKEY,PERKEY)".

So, the multi-column index recommendation on fact table is (PRODKEY, CLASSKEY, PERKEY). You should check if the query selects the zigzag join method using this multi-column index only. The most attractive point about index advisor is that it gives an index recommendation only if you have created the wrong index, or you have not created a multi-column index.

If you are not an expert in the optimizer area, then you may end up with the wrong set of multi-column indexes on fact tables for complex star queries. This may result in loss of time and effort invested in query analysis to come up with multi-column indexes. By using index advisor, you can get the correct set of multi-column indexes on fact tables, and enable zigzag by simply creating that index for star query. You need not invest time for complex query analysis to come up with a correct set of indexes.


Conclusion

It is the responsibility of DBAs to tune the query for the best plan and improve the query performance as well. It is possible only if you have good analytical skills with join queries and access plans. DBAs can use the tools and techniques discussed in this article to enable zigzag join and get faster execution of a star schema query wherever it is required. Although it is not possible to cover all kinds of queries, if you follow the approach and principles in this article, you will be able to solve the issues with a zigzag plan, and hence, you should be seeing a faster running query.


Download

DescriptionNameSize
Sample DB2 CLP scriptsenabling_zigzag.zip440KB

Resources

Learn

Get products and technologies

  • Build your next development project with IBM trial software, available for download directly from developerWorks.
  • Evaluate IBM products in the way that suits you best: Download a product trial, try a product online, use a product in a cloud environment, or spend a few hours in the SOA Sandbox learning how to implement Service Oriented Architecture efficiently.

Discuss

Comments

developerWorks: Sign in

Required fields are indicated with an asterisk (*).


Need an IBM ID?
Forgot your IBM ID?


Forgot your password?
Change your password

By clicking Submit, you agree to the developerWorks terms of use.

 


The first time you sign into developerWorks, a profile is created for you. Information in your profile (your name, country/region, and company name) is displayed to the public and will accompany any content you post, unless you opt to hide your company name. You may update your IBM account at any time.

All information submitted is secure.

Choose your display name



The first time you sign in to developerWorks, a profile is created for you, so you need to choose a display name. Your display name accompanies the content you post on developerWorks.

Please choose a display name between 3-31 characters. Your display name must be unique in the developerWorks community and should not be your email address for privacy reasons.

Required fields are indicated with an asterisk (*).

(Must be between 3 – 31 characters.)

By clicking Submit, you agree to the developerWorks terms of use.

 


All information submitted is secure.

Dig deeper into Information management on developerWorks


static.content.url=http://www.ibm.com/developerworks/js/artrating/
SITE_ID=1
Zone=Information Management
ArticleID=861764
ArticleTitle=Zigzag join enablement for DB2 star schema queries
publish-date=03212013