Question & Answer
Question
Why does unioning two tables together result in slower query performance than a single table of the same size?
Answer
Although the extent varies depending on the type of query, there tends to be a performance drop off between two or more unioned tables versus a single table. For example, consider the following query, which targets a view that unions two tables of equal sizes:
select count(1) from fact_table_view;
The DDL of this view is:
CREATE or replace VIEW FACT_TABLE_VIEW
(
INV_ID,
DATE_ID
) AS
SELECT FACT_TABLE_1.INV_ID,
FACT_TABLE_1.DATE_ID
FROM FACT_TABLE_1
UNION ALL
SELECT FACT_TABLE_1.INV_ID,
FACT_TABLE_1.DATE_ID
FROM FACT_TABLE_2
;
The plan produced to implement this query causes the system to:
1. Scan FACT_TABLE_1 and materialize (SaveTempNode).
2. Scan FACT_TABLE_2 and materialize (SaveTempNode).
3. Derive a count of each SPU's respective version of the table.
4. Send results to the host, where they are then combined into a single result.
When the system materializes the tables, it rewrites the relevant columns to the swap area of the data slice. Depending on the size of the tables involved, this can take considerably longer than running a similar query that targets a single table.
When targeting a single table, the plan produced would:
1. Scan FACT_TABLE, with each SPU returning its count.
2. Send results to the host, where they are then combined into a single result.
In this plan, the fact table is never materialized. This results in a query that runs much faster, even though both queries scan the same number of rows.
The next example illustrates the performance impact a union has when the fact tables are joined to smaller dimension tables.
Approach 1: Target single table with 5M rows
[nz@cs-spubox6 ~]$ time nzsql mydatabase -f 1.sql -e -o /dev/null
select d.date_id, count(1)
from fact_table f,
dim_date d
where f.date_id = d.date_id
and d.date_val = '2007-11-30'
group by 1
real 0m1.089s
Approach 2: Target view that unions two 2.5M row tables
[nz@cs-spubox6 ~]$ time nzsql mydatabase -f 2.sql -e -o /dev/null
select d.date_id, count(1)
from fact_table_view f,
dim_date d
where f.date_id = d.date_id
and d.date_val = '2007-11-30'
group by 1
real 0m3.298s
As the output illustrates, the example using unioned tables took three times longer to process the same query. This can be explained by reviewing the order of operations for the two approaches.
Order of operations targeting a single fixed table
1. Scan the date dimension with restrictions on date_val and broadcast the result.
2. Scan the fact table, hashJoin to the broadcasted date dimension, get per-SPU count of the table, and re-distribute.
3. Perform a final aggregation (on date_id) and return the results to the host.
Order of operations targeting a view that unions two tables
1. Scan the date dimension with restrictions on the date_val and broadcast the result.
2. Scan the first table and materialize.
3. Scan the second table and materialize.
4. hashJoin the unioned result to the broadcasted date dimension, get per-SPU count of the table, and re-distribute.
5. Perform a final aggregation (on date_id) and return the results to the host.
Historical Number
NZ477913
Was this topic helpful?
Document Information
Modified date:
17 October 2019
UID
swg21569666