Star join optimization
IBM Informix Server V11.70 introduces a star join optimization feature that can significantly increase query performance for queries on star schemas and snowflake schemas.
A star join query typically involves a fact table and several dimension tables, with join predicates between the fact table and each dimension table. Table-level predicates typically exist on some dimension tables as well.
The star join plan utilizes push-down hash join technology. The primary purpose of a push-down hash join is to reduce the number of rows processed from the fact table, which probes each of the hash tables. Since each dimension table contains selectivity that can be used to reduce the number of fact table rows for processing, instead of applying the selectivity from different dimension tables in a step-wise fashion. If you can combine two or more to the fact table at the same time, it should help reduce the number of fact table rows to be processed. Since the fact table is the largest table in the query, reducing the number of fact table rows to be processed early on should improve query performance.
A star join plan utilizing push-down hash join technology makes hash joins more effective, primarily due to the reduction of fact table (probe table) rows early on. The larger the probe table is in a hash join, the slower the join is. When hash tables are built, join keys are pushed down and used to filter the rows of the fact table, and performance improves dramatically. The performance improvement depends on the selectivity of the filter on the dimension table from which the join keys are pushed down. If only a few rows are selected from the dimension table so that only a few keys are pushed down, then rows of the fact table might be reduced proportionally. The actual proportion depends on the distribution of rows with the pushed-down keys in the fact table.
The optimizer can choose a star join plan only if a query involves tables from a star or snowflake schema that meets the following requirements.
- Neither the fact table, nor any dimension table participates in an outer join (for example, LEFT OUTER JOIN, RIGHT OUTER JOIN, FULL OUTER JOIN, or Informix style outer join).
- No table is an external table, remote table, view, or derived table.
- The join between the fact table and any dimension table is an equi-join, which makes a hash join possible.
- The values in the join-key columns of the dimension table are unique. The optimizer uses data-distribution statistics to determine whether a column contains unique values. The schema does not need to specify explicitly that a column is a primary or unique key to produce a star join plan.
- The query includes one or more filters on at least one of the dimension tables.
- The number of rows selected from each dimension table is less than the number of rows selected from the fact table after all scan filters are applied.
- All tables involved in the query must have at least UPDATE STATISTICS LOW run against the table.
- PDQ priority must be set to ON, or a number between 1 and 100. Alternatively, implicit PDQ must be set in the current session.
Some other minor restrictions apply, such as rules governing subscripted keys.
The pushed-down join keys can be used in an index pushdown or a bit-vector push-down.
An index push-down uses the pushed-down keys to perform index lookups on a foreign key index of the fact table. Index push-downs require only a single-column index on the fact table, although they can use a multi-column index if the required foreign-key column is the first key in the index.
An index push-down has the following advantages.
- Because the index is used to access the table, I/O is more efficient. The database server optimizes I/O by reading the fact table in a skip scan, where the database server reads the index and sorts the row identifiers (RIDs) so rows can be read from the fact table pages in order, accessing only the table pages that contain required rows.
- For multiple index push-downs, multi-index scan is used to combine RIDs from different index scans and then skip-scan on the fact table is performed.
- The resulting list of RIDs contains exactly those tables that will join, which makes the join efficient.
In a bit-vector pushdown, for each push-down, the database server hashes the keys from the dimension table to create a bit vector, which is shipped to the fact table. Then for each row from the fact table scan, the foreign key is also hashed and probed into the bit vector to determine whether the current fact table row is filtered out. Multiple-bit vectors can be used for multiple bit-vector push-downs.
A bit-vector pushdown has the following advantages.
- It requires no indexes.
- It can be built fast with a low processing cost.
By reducing I/O costs, index push-downs eliminate fact table rows more efficiently than bit-vector push-downs, but their processing cost is higher.
- Orders fact table, containing 100,000,000 rows. The Orders table has a single-column index on each foreign key that is a primary key in the dimension tables: date_time, location_id, and cust_num.
- Time dimension table, containing 10,000 rows. The Time table has primary key date_time.
- Location dimension table, containing 10,000 rows. The Location table has primary key location_id.
- Customer dimension table, containing 10,000 rows. The Customer table has primary key cust_num.
To show how to interpret SET EXPLAIN output for star join plans, you can look at the example query shown in Listing 1.
Listing 1. Example query to interpret SET EXPLAIN output for star join plans
SELECT t.quarter, sum(o.quantity) FROM orders o, location l, customer c, time t WHERE o.location_id = l.location_id AND o.cust_num = c.cust_num AND o.date_time = t.date_time AND c.type <= 3 AND l.region_id = 4 AND t.quarter between 1 and 20 GROUP BY t.quarter
The query plan information shown in the following examples is taken from the sqexplain.out file for this query.
In step 1, the join keys from the Location and Time dimension tables are pushed down to filter the Orders table before any hash joins are performed.
- The location_id join keys that meet the filter requirements on the Location dimension table are pushed down to the index on the Orders table. The retrieved RIDs are sorted and used to read the Orders table in a skip scan.
- The time_id join keys that meet the filter requirements on the Time dimension table are hashed to create a bit vector that filters rows of the Orders table from the skip scan, as shown in Listing 2.
Listing 2. Step 1 in the query plan
1) stores.o: INDEX PATH (SKIP SCAN) (1) Index Name: stores.fk_location Index Keys: location_id (Parallel, fragments: ALL) Index Filter: stores.o.location_id = stream from stores.l.location_id Bit Vector Filter: stores.o.time_id = stream from stores.t.time_id
In steps 2-4, as shown in Listing 3, the hash joins use the filtered rows of the Orders table to probe the hash tables built on the three dimension tables: first the hash join on the Customer dimension table, then the hash join on the Time dimension table, then the hash join on the Location dimension table.
Listing 3. Steps 2-4 in the query plan
2) stores.c: SEQUENTIAL SCAN (Parallel, fragments: ALL) Filters: stores.c.cust_type <= 3 DYNAMIC HASH JOIN Dynamic Hash Filters: stores.o.cust_id = stores.c.cust_id 3) stores.t: SEQUENTIAL SCAN (Parallel, fragments: ALL) Filters: (stores.t.quarter >= 1 AND stores.t.quarter <= 20 ) DYNAMIC HASH JOIN (Bit Vector Push Down Key: stores.t.time_id to stores.o) Dynamic Hash Filters: stores.o.time_id = stores.t.time_id 4) stores.l: SEQUENTIAL SCAN (Parallel, fragments: ALL) Filters: stores.l.region_id = 4 DYNAMIC HASH JOIN (Index Push Down Key: stores.l.location_id to stores.o) Dynamic Hash Filters: stores.o.location_id = stores.l.location_id
Snowflake schema is an extension of star schema where there could be multiple levels of fact-dimension relationships. For example, the Location table is a dimension table for the Orders fact table. However, it could have its own dimension table, a Region table that separates different locations into distinct regions. The Location table contains a foreign key (region_id) that corresponds to the primary key of the Region table. The level of a dimension table corresponds to the number of links it needs to connect to the central fact table. For example, the Location table is a level-1 dimension table, while the Region table is a level-2 dimension table.
The following new optimizer directives are introduced for star join optimization.
- STAR_JOIN - The STAR_JOIN directive instructs the optimizer to favor a star join plan whenever possible.
- AVOID_STAR_JOIN - The AVOID_STAR_JOIN directive instructs the optimizer to skip consideration of star join plans.
- FACT(table) - The FACT directive specifies a fact table to be used during star join optimization. A single table can be specified.
- AVOID_FACT(tab1, tab2, ...) - The AVOID_FACT directive specifies one or more tables that should not be used as fact table during star join optimization. One or more tables can be specified (delimited by comma or whitespace).
All of the new star join directives are query level directives that affect the execution of a query.
New SET OPTIMIZATION ENVIRONMENT options are also introduced for influencing optimizer consideration of star join plans, as shown in Listing 4.
Listing 4. New SET OPTIMIZATION ENVIRONMENT options
SET OPTIMIZATION ENVIRONMENT STAR_JOIN 'enabled'|'disabled'|'forced' SET OPTIMIZATION ENVIRONMENT FACT '<table_list>' SET OPTIMIZATION ENVIRONMENT AVOID_FACT '<table_list>' SET OPTIMIZATION ENVIRONMENT NON_DIM '<table_list>'
The SET OPTIMIZATION ENVIRONMENT STAR_JOIN statement can be used to enable, disable, or force star join plans for all queries in the current session. The default is enabled. When the forced option is used, star join plans will be favored whenever possible for all queries in current session.
The SET OPTIMIZATION ENVIRONMENT FACT statement can be used to specify a list of tables that can be considered as fact tables during star join optimization for all queries in the current session. If this environment is set, only tables specified in the list can be used as fact table.
The SET OPTIMIZATION ENVIRONMENT AVOID_FACT statement can be used to specify a list of tables that should not be considered as fact tables during star join optimization for all queries in the current session.
The SET OPTIMIZATION ENVIRONMENT NON_DIM statement can be used to specify a list of tables that should not be considered in any star join optimization, for all queries in the current session.
The format for <table_list> used in FACT, AVOID_FACT, and NON_DIM option is as follows.
The database name and owner name are optional in table name specification. Multiple table names can be specified in the list, delimited by comma.
If star join directives are used in a query, directives override session environment setting.