Slower queries with unioning of two tables versus single table

Question & Answer

Question

Why does unioning two tables together result in slower query performance than a single table of the same size?

Answer

Although the extent varies depending on the type of query, there tends to be a performance drop off between two or more unioned tables versus a single table. For example, consider the following query, which targets a view that unions two tables of equal sizes:
select count(1) from fact_table_view;

The DDL of this view is:
CREATE or replace VIEW FACT_TABLE_VIEW ( INV_ID, DATE_ID ) AS SELECT FACT_TABLE_1.INV_ID, FACT_TABLE_1.DATE_ID FROM FACT_TABLE_1 UNION ALL SELECT FACT_TABLE_1.INV_ID, FACT_TABLE_1.DATE_ID FROM FACT_TABLE_2 ;

The plan produced to implement this query causes the system to:
1. Scan FACT_TABLE_1 and materialize (SaveTempNode).
2. Scan FACT_TABLE_2 and materialize (SaveTempNode).
3. Derive a count of each SPU's respective version of the table.
4. Send results to the host, where they are then combined into a single result.

When the system materializes the tables, it rewrites the relevant columns to the swap area of the data slice. Depending on the size of the tables involved, this can take considerably longer than running a similar query that targets a single table.

When targeting a single table, the plan produced would:

1. Scan FACT_TABLE, with each SPU returning its count.
2. Send results to the host, where they are then combined into a single result.

In this plan, the fact table is never materialized. This results in a query that runs much faster, even though both queries scan the same number of rows.

The next example illustrates the performance impact a union has when the fact tables are joined to smaller dimension tables.

Approach 1: Target single table with 5M rows

[nz@cs-spubox6 ~]$ time nzsql mydatabase -f 1.sql -e -o /dev/null

select     d.date_id, count(1)

from        fact_table f,

               dim_date   d

where     f.date_id  = d.date_id

and         d.date_val = '2007-11-30'

group by 1

real    0m1.089s

Approach 2: Target view that unions two 2.5M row tables

[nz@cs-spubox6 ~]$ time nzsql mydatabase -f 2.sql -e -o /dev/null

select      d.date_id, count(1)

from        fact_table_view f,

               dim_date           d

where     f.date_id        = d.date_id

and         d.date_val     = '2007-11-30'

group by 1

real    0m3.298s

As the output illustrates, the example using unioned tables took three times longer to process the same query. This can be explained by reviewing the order of operations for the two approaches.

Order of operations targeting a single fixed table

1. Scan the date dimension with restrictions on date_val and broadcast the result.
2. Scan the fact table, hashJoin to the broadcasted date dimension, get per-SPU count of the table, and re-distribute.
3. Perform a final aggregation (on date_id) and return the results to the host.

Order of operations targeting a view that unions two tables

1. Scan the date dimension with restrictions on the date_val and broadcast the result.
2. Scan the first table and materialize.
3. Scan the second table and materialize.
4. hashJoin the unioned result to the broadcasted date dimension, get per-SPU count of the table, and re-distribute.
5. Perform a final aggregation (on date_id) and return the results to the host.

[{"Product":{"code":"SSULQD","label":"IBM PureData System"},"Business Unit":{"code":"BU053","label":"Cloud & Data Platform"},"Component":null,"Platform":[{"code":"PF025","label":"Platform Independent"}],"Version":"1.0.0","Edition":"","Line of Business":{"code":"LOB10","label":"Data and AI"}}]

Historical Number

NZ477913

Was this topic helpful?

Document Information

Modified date:
17 October 2019

UID

swg21569666

Tips