MERGE statement explained
Comments (6) Visits (76872)
Fact: "John's birthday is Oct 12, 1965"
How does this relate to databases, DB2 and SQL?
When ingesting new data into a table it is very common that you encounter duplicates and conflicting data.
Over the years different vendors have come up with various schemes to handle this.
in this post I'll discuss some the the legacy statements and the MERGE statement which is meant to solve the problem in ANSI SQL.
Imagine a product inventory. There are product ids, names and inventory levels.
Once a day sales are reported and inventory levels need to be adjusted for all products.
CREATE TABLE product(id INTEGER NOT NULL PRIMARY KEY,
A variety store, presumably....
Here are our sales for the day:
CREATE TABLE sales(id INTEGER NOT NULL PRIMARY KEY,
MS SQL Server and IBM IDS both support a feature called UPDATE FROM which allows to update the product table based on sales:
The SQL standard and DB2 do not support this syntax. In ANSI SQL such a statement gets written like this:
I can't count how often I have explained that the ANSI SQL statement on DB2 is just as efficient as the proprietary UPDATE FROM syntax.
Despite the fact that the update selects twice from sales, DB2 will actually rewrite the query to exactly what UPDATE FROM does.
One can, however hardly argue with the fact that the standard SQL version of this update is hard to understand and a lot more to type.
So why did DB2 not implement UPDATE FROM and bring the syntax forward in the standard?
The reason is that UPDATE FROM only deals with a fraction of the overall problem.
UPDATE FROM assumes that only updates are being done.
But what if we do not only sell inventory, but we also buy things?
Or what if, once the last item is sold, the product should be deleted?
in that case we need a statement that combines UPDATE, DELETE and INSERT into one.
The MERGE statement is meant to do that.
The MERGE statement was introduced in DB2 V8.1 FP1.
It was co-proposed by Oracle and IBM for ANSI SQL and has since been extended beyond the standard and implemented by other database vendors.
Rather than detailing the syntax here, let's build up examples.
UPDATE FROM using MERGE
At the core of the MERGE statement are four constructs:
Let's recode the UPDATE FROM example from above using MERGE
MERGE INTO product AS T USING sales AS S ON S.id = T.id WHEN MATCHED THEN UPDATE SET inventory = T.inventory - S.sold; SELECT * FROM product ORDER BY id; ID NAME INVENTORY ----------- ----
Personally I like to always use correlation names for target ("T") and source ("S"). Obviously there is a high likely hood for column name collisions in the join condition. But also on the right hand side of the UPDATE operation's SET clause you need to ensure to qualify column names since both source and target are within scope. The SET-clause can be of the same complexity as a regular SET clause in an UPDATE statement. You can set multiple column, use row and scalar sub queries and complex expressions.
This statement looks much denser and it is hopefully much easier to read than the UPDATE with the EXISTS predicate. But is it efficient?
Rows RETURN ( 1) Cost I/O | 3 UPDATE ( 2) 33.8903 NA /---+----\ 3 5 ^MSJOIN TABLE: SERGE ( 3) PRODUCT 13.5799 Q1 NA /-------+-------\ 5 0.6 FETCH FILTER ( 4) ( 6) 6.79033 6.78823 NA NA /----+----\ | 5 5 3 IXSCAN TABLE: SERGE TBSCAN ( 5) PRODUCT ( 7) 0.0183508 Q2 6.78823 NA NA | | 5 3 INDEX: SYSIBM SORT SQL120322101103640 ( 8) Q2 6.78774 NA | 3 TBSCAN ( 9) 6.78672 NA | 3 TABLE: SERGE SALES Q3
This is, in fact, an UPDATE FROM plan. DB2 chose a merge join, and it picked a SORT over a table scan instead of exploiting the index to get order.
But given the size of the source table it is quite possible that this is the cheapest option. So: Yes, this is an efficient way to update the target table.
Next let's investigate what DB2 does when there are duplicates in the source by adding an extra diaper into the mix and dropping the primary key.
ALTER TABLE sales DROP PRIMARY KEY; INSERT INTO sales VALUES(5, 1); MERGE INTO product AS T USING sales AS S ON S.id = T.id WHEN MATCHED THEN UPDATE SET inventory = T.inventory - S.sold; SQL0788N The same row of target table "SERGE.PRODUCT" was identified more
DB2 did not like that! The semantic of MERGE is that any target row can only be set once. From a relational perspective this makes sense. If we can update the diaper inventory twice the it is indeterminate which of the source row would win. Also note that the changes do not accumulate.
It is important to understand how DB2 detects the duplicate. The plan will show:
Rows RETURN ( 1) Cost I/O | 2.88 UPDATE ( 2) 52.5818 7.76 /---+----\ 2.88 5 FETCH TABLE: SERGE ( 3) PRODUCT 33.0845 Q1 4.88 /---+----\ 2.88 5 FILTER TABLE: SERGE ( 4) PRODUCT 13.5851 2 | 3 FILTER ( 5) 13.5846 2 | 3 TBSCAN ( 6) 13.5781 2 | 3 SORT ( 7) 13.5772 2 | 3 HSJOIN^ ( 8) 13.5754 2 /-----+------\ 5 3 TBSCAN TBSCAN ( 9) ( 10) 6.78802 6.78672 1 1 | | 5 3 TABLE: SERGE TABLE: SERGE PRODUCT SALES Q2 Q3
That is a very different plan! Aside form the fact that DB2 chose a hash join, the operators 4-7 are new.
With the primary key on SALES.ID and a join condition on that key DB2 was certain that only one row in the source could match to any row in the target.
Without the primary key DB2 cannot be sure.
So what it does is to count (using ROW_NUMBER) the number of rows with the same target rowid produced by the hash join.
If there is more than one for a given rowid the error is raised.
The FETCH operator (3) is there to reposition the cursor after having lost position due to the SORT.
Now, what can you do when you have duplicates in the source?
You need to either eliminate the duplicates or aggregate them.
In this case, since we are providing delta sales, we can aggregate the duplicates:
MERGE INTO product AS T USING (SELECT id, sum(sold) AS sold FROM sales GROUP BY id) AS S ON S.id = T.id WHEN MATCHED THEN UPDATE SET inventory = T.inventory - S.sold; SELECT * FROM product ORDER BY id; ID NAME INVENTORY ----------- ----
That worked. How did the aggregation change the plan?
Rows RETURN ( 1) Cost I/O | 3 UPDATE ( 2) 33.8924 5 /---+----\ 3 5 ^MSJOIN TABLE: SERGE ( 3) PRODUCT 13.582 Q1 2 /-------+-------\ 5 0.6 FETCH FILTER ( 4) ( 6) 6.79033 6.79033 1 1 /----+----\ | 5 5 3 IXSCAN TABLE: SERGE GRPBY ( 5) PRODUCT ( 7) 0.0183508 Q2 6.78966 0 1 | | 5 3 INDEX: SYSIBM TBSCAN SQL120322101103640 ( 8) Q2 6.78927 1 | 3 SORT ( 9) 6.78841 1 | 3 TBSCAN ( 10) 6.78672 1 | 3 TABLE: SERGE SALES Q3
The plan is pretty again! Why is that?
DB2 knows that the GROUP BY id will enforce uniqueness on ID and therefore the join will be safe.
You should always try to enforce uniqueness across the columns in the merge join condition.
GROUP BY is the best way to do that.
But, it seems we have oversold diapers. That was bound to happen eventually.
UPSERT using MERGE
Now is a good time to think about ordering in new items.
In a sense a negative sales constitutes a buy, so we can recycle the same table and simply use negative numbers.
But we also want to widen the selection in our store and buy new things from a catalog of available products.
For that we need to insert new rows into the product table.
We could do this in a separate insert statement.
But that would mean we have to scan the sales table twice: First do all the updates, then do the deletes.
Finally this is where the power of MERGE can be brought to bear.
DELETE FROM sales; INSERT INTO sales VALUES (5,-100), (5, 8), (3, -4), (4, 1), (6, -15); CREATE TABLE catalog(id INTEGER, name VARCHAR(20)); INSERT INTO catalog VALUES (1, 'Car'), (2, 'TV'), (3, 'House'), (4, 'Dog'), (5, 'Diapers'), (6, 'Milk'), (7, 'Book'); MERGE INTO product AS T USING (SELECT sales.id, sum(sold) AS sold, max(catalog.name) as name FROM sales, catalog WHERE sales.id = catalog.id GROUP BY sales.id) AS S ON S.id = T.id WHEN MATCHED THEN UPDATE SET inventory = T.inventory - S.sold WHEN NOT MATCHED THEN INSERT VALUES(S.id, S.name, -S.sold); SELECT * FROM product ORDER BY id; ID NAME INVENTORY ----------- ----
Note the query composing the source. By grouping on sales.id we ensure DB2 derives uniqueness to simplify the merge access plan.
To comply with GROUP BY rules a MAX has been added to the product name although we only max on by value.
Alternatively we could have added the name look-up as a scalar subquery in the insert values clause.
Generally, to get the best plan it is also recommended to specify INSERT always as the last phase.
Here is the access plan:
Rows RETURN ( 1) Cost I/O | 0.32 INSERT ( 2) 31.5008 4.64 /---+----\ 0.32 6 TBSCAN TABLE: SERGE ( 3) PRODUCT 24.7309 Q17 3.64 | 0.32 TEMP ( 4) 24.7186 3.64 | 0.32 UPDATE ( 5) 24.7141 3.64 /---+----\ 0.32 6 FETCH TABLE: SERGE ( 6) PRODUCT 22.5478 Q1 3.32 /---+----\ 0.32 6 NLJOIN TABLE: SERGE ( 7) PRODUCT 20.3807 3 /---
The TEMP(4) operator prevents Halloween style phenomenon similar to what I described in my post on INSENSITIVE cursors.
The truly interesting part however is the inner of NLJOIN(7)
That's a UNION ALL over two VALUES constructs.
One represents the UPDATE SET clause the other the INSERT VALUES clause.
If the outer join HSJOIN(8) produced a NULL row for the target the INSERT branch of the UNION ALL is executed.
If the join matches a row then the UPDATE branch is being executed.
The UPDATE and INSERT operators themselves also execute conditionally.
Most items in our store are selling well and we keep expanding the range of products.
But car's seem not to be our thing. Not everyone is cut to be a car sales man.
This brings up the question how we can shrink our product table.
Obviously we can simply DELETE items, but can it be done with MERGE?
For example, if we exactly sell out of a product (the stock drops to zero) we could delete the entry altogether.
When MERGE matches a source and target row we can DELETE that row.
But that collides with UPDATE.
DB2 supports an extension to the merge-when-clause for that purpose which allows us to refine the behavior:
DELETE FROM sales; INSERT INTO sales VALUES (1, 10), (5, 3), (2, -4); MERGE INTO product AS T USING (SELECT sales.id, sum(sold) AS sold, max(catalog.name) as name FROM sales, catalog WHERE sales.id = catalog.id GROUP BY sales.id) AS S ON S.id = T.id WHEN MATCHED AND T.inventory = S.sold THEN DELETE WHEN MATCHED AND T.inventory < S.sold THEN SIGNAL SQLSTATE '78000' SET MESSAGE_TEXT = 'Oversold: ' || S.name WHEN MATCHED THEN UPDATE SET inventory = T.inventory - S.sold WHEN NOT MATCHED THEN INSERT VALUES(S.id, S.name, -S.sold); SELECT * FROM product ORDER BY id; ID NAME INVENTORY ----------- ----
In addition to allowing deletes the previous statement also introduced an option to raise errors.
Note that we chose not to extend the WHEN MATCHED predicate for the UPDATE. The merge when clauses operate like a case-expression in that the first condition to be true for a row results in the execution of the matching then clause. So MERGE will only execute the UPDATE branch of the statement for rows that match and neither caused an error or a delete. The optimizer plan now shows another more branch in the UNION ALL reflecting the DELETE. In addition there is an extra JOIN which models the conditional SIGNAL:
Rows RETURN ( 1) Cost I/O | 0.36 INSERT ( 2) 36.9278 5.44 /---+----\ 0.36 5 TBSCAN TABLE: SERGE ( 3) PRODUCT 30.1578 Q25 4.44 | 0.36 TEMP ( 4) 30.1456 4.44 | 0.36 UPDATE ( 5) 30.1411 4.44 /---+----\ 0.36 5 FETCH TABLE: SERGE ( 6) PRODUCT 27.7039 Q1 4.08 /---+----\ 0.36 5 TBSCAN TABLE: SERGE ( 7) PRODUCT 25.2663 3.72 | 0.36 TEMP ( 8) 25.254 3.72 | 0.36 DELETE ( 9) 25.2495 3.72 /---+----\ 0.36 5 FETCH TABLE: SERGE ( 10) PRODUCT 22.8123 Q2 3.36 /---+----\ 0.36 5 NLJOIN TABLE: SERGE ( 11) PRODUCT 20.3747 3 /---
Single row MERGE using VALUES
so far we have been been using a query as a source. This is the most common scenario in a warehousing environment. But MERGE is also commonly used in OLTP type of scenarios. In that case A single row of variables represents the source. We refine our MERGE now to singleton modifications:
VARIABLE id INTEGER; VARIABLE sold INTEGER; VARIABLE name VARCHAR(20); BEGIN SET (:id, :name, :sold) = (3, 'House', 5); END; / MERGE INTO product AS T USING (VALUES(CAST(:id AS INTEGER), CAST(:name AS VARCHAR(20)), CAST(:sold AS INTEGER)) ) AS S(ID, NAME, SOLD) ON S.id = T.id WHEN MATCHED AND T.inventory = S.sold THEN DELETE WHEN MATCHED AND T.inventory < S.sold THEN SIGNAL SQLSTATE '78000' SET MESSAGE_TEXT = 'Oversold: ' || S.name WHEN MATCHED THEN UPDATE SET inventory = T.inventory - S.sold WHEN NOT MATCHED THEN INSERT VALUES(S.id, S.name, -S.sold); SQL0438N Application raised error or warning with diagnostic text: "Oversold: House".
DB2 had difficulty deriving the parameter types, so the input variables were CAST.
Looking at the plan we will see some simplifications. Specifically the TEMP below the INSERT has disappeared. The reason is that DB2 knows only one row will be processed. It will never try to match a second row which eliminates the Halloween scenario.
Rows RETURN ( 1) Cost I/O | 0.12 INSERT ( 2) 15.1827 2.24 /---+----\ 0.12 5 UPDATE TABLE: SERGE ( 3) PRODUCT 8.41275 Q23 1.24 /---+----\ 0.12 5 DELETE TABLE: SERGE ( 4) PRODUCT 7.60036 Q1 1.12 /---+----\ 0.12 5 NLJOIN TABLE: SERGE ( 5) PRODUCT 6.78797 Q2 1 /---
Note how the operators are stacked. No temps, no sorts, no repositioning. This is an optimal access path.
Good to know rules
Note that MERGE is a composite statement, very much in the same way as an inline atomic compound.
My apologies to those who awaited a middle-of-the week posting.
This one was long in the making.