DB2 10: Generated columns - More useful than ever.
Comments (5) Visits (14228)
As part of my job I work in a group called the SQL Language council (SLC).
The purpose of that group is, roughly three fold:
I like those.
On the flip side meeting scan be excruciatingly boring and be consumed by discussions of minute details.
I have a tendency not to be found in those.
But what is always the most intriguing about the SLC is the conscious attempt to not paint ourselves into a corner when it comes to defining language.
With product cycles of 1 1/2 to 3 years maintaining a vision across multiple releases is quite a challenge.
Therefore I love it when, at last, syntax and ideas that have been in place of over a decade finally become reality as features pile on top of features and they all lock into place adding to more than the sum of their individual parts,
Generated column support is a tale worth re-telling for that reason since it has gained more and more function over the releases culminating in DB2 10..
Setting the stage
As always let's take a look an example to motivate the problem generated columns are meant to solve.
First assume an expensive function.
Perhaps this is a function deriving your credit rating based on numerous properties of your financial standing.
As a proxy for such a real life expensive function I choose the "3a+1" once more.
The function will converge for any (let's hope) positive whole number eventually to 1 and return the number of steps it took.
CREATE OR REPLACE FUNCTION three_a_plus_one(IN n INTEGER) RETURNS INTEGER SPECIFIC three_a_plus_one CONTAINS SQL DETERMINISTIC NO EXTERNAL ACTION BEGIN DECLARE a INTEGER; DECLARE steps INTEGER DEFAULT 0; SET a = n; WHILE a <> 1 DO SET a = CASE WHEN (a / 2) * 2 = a THEN a / 2 ELSE 3 * a + 1 END, steps = steps + 1; END WHILE; RETURN steps; END / VALUES thre
Already for a small number such as 27 it takes 111 steps for the function to converge on 1.
Now, because these values are so expensive to produce we want to pre-compute them and store the result in a table.
CREATE TABLE three_a (n INTEGER NOT NULL PRIMARY KEY, steps INTEGER NOT NULL); INSERT INTO three_a WITH rec(n, steps) AS (VALUES (1, thre
We can query the table and for any n retrieve the number of steps.
But there are problems with such a table.
To deal with the first problem we can add a check constraint that ensures that the dependency is always true:
ALTER TABLE three_a ADD CONSTRAINT steps CHECK (steps = thre
As long as we add the right values everything is fine:
INSERT INTO three_a VALUES(300000, thre
Let's add one more twist to this before taking a closer look.
Having to add a value that is really generated and which we can only do wrong is annoying.
This is a good usage for a before trigger:
CREATE TRIGGER steps_gen BEFORE INSERT OR UPDATE ON three_a REFERENCING NEW AS n FOR EACH ROW BEGIN SET n.steps = thre
Now, independent of whether we use the trigger to insert the row or we specify the function directly, we are effectively executing this expensive function twice.
The first execution is within the trigger or the VALUES clause.
The second execution is by the check constraint which does nothing else but double check what must be true by design.
The second problem, as previously mentioned is that an existing application cannot exploit the precomputed values without being changed.
SELECT COUNT(*) FROM three_a WHERE three_a_plus_one(n) = 100; 1 ----------- 36 EXPLAIN PLAN FOR SELECT COUNT(*) FROM three_a WHERE three_a_plus_one(n) = 100; ID TYPE OBJECT_SCHEMA OBJECT_NAME PREDICATE_TEXT ------ --------------- ----
Expression Generated Columns
Generated columns are meant to solve both problems stated above:
Using an expression generated column the above table can get expressed like this:
DROP TABLE three_a; CREATE TABLE three_a(n INTEGER NOT NULL PRIMARY KEY, steps INTEGER NOT NULL GENERATED ALWAYS AS (thr
The syntax is very similar to that of an IDENTITY column (also introduced in DB2 7,1) , except that instead of the identity properties an expression is provided.
Since the generated column acts as a check constraint there are several rules that apply to the expression to ensure the constraint is always true:
Identity columns by contrast do allow override.
INSERT INTO three_a(n) WITH rec(n) AS (VALUES (1) UNION ALL SELECT n + 1 FROM rec WHERE n < 100000) SELECT * FROM rec;
There was no need to specify the function anywhere in the insert.
In fact we did not even need to use the column at all.
DB2 automatically will inject the equivalent of a before trigger to produce the required value.
In contrast to the handcrafted schema we first used there is no need to check consistency of the database since DB2 is fully in control.
INSERT INTO three_a(n) VALUES (300000); Optimized Statement: ------------------- INSERT INTO SERGE.THREE_A AS Q3 SELECT 300000, "SERGE "."T
DB2 has injected the SQL function as if it were user specified with no additional overhead.
In the above two INSERT statements the column has been omitted from the INSERT column list altogether.
If a column is omitted from the column list DB2 will default that column.
For a generated column defaulting implies generation of the computed value.
Therefore you can also specify the DEFAULT keyword as a placeholder for the value:
INSERT INTO three_a VALUES (300003, DEFAULT);
You may not, however provide an explicit value, even if that value is correct:
INSERT INTO three_a VALUES (300005, thre
Very similar to inserting DB2 will maintain the generated column value on updates.
If the a value the generated column depends on is changed the generated column will be adjusted accordingly.
SELECT steps FROM three_a WHERE n = 300003; STEPS ----------- 158 UPDATE three_a SET n = 251111 WHERE n = 300003; SELECT steps FROM three_a WHERE n = 251111; STEPS ----------- 256
Despite not having defined an explicit check constraint DB2 not only knows of its existence.
It can also use the knowledge to rewrite the query:
SELECT COUNT(*) FROM three_a WHERE three_a_plus_one(n) = 100; 1 ----------- 36 Optimized Statement: ------------------- SELECT Q3.$C0 FROM (SELECT COUNT(*) FROM (SELECT $RID$ FROM SERGE.THREE_A AS Q1 WHERE (Q1.STEPS = 100) ) AS Q2 ) AS Q3
Note how the optimized SQL returned by db2exfmt shows that the expression has been replaced by the column STEPS.
This opens up an interesting possibility.
If the generated column itself is indexed then the optimizer can choose that index after DB2 has exploited the generated column reference.
That is, de-facto, a generated column provides similar function to an index on an expression.
CREATE INDEX steps ON three_a(steps); SELECT COUNT(*) FROM three_a WHERE three_a_plus_one(n) = 100; 1 ----------- 36 Rows RETURN ( 1) Cost I/O | 1 GRPBY ( 2) 6.81703 1 | 38.8879 IXSCAN ( 3) 6.81315 1 | 10002 INDEX: SERGE STEPS Q1
The difference to an index on an expression, of course, is that here the expression is actually materialized in the table.
This costs a bit more space. It does however also speed up queries that do table scans, such as a hash join over the expression.
Altering generated columns
So what do you do if you have a functionally dependent column today without using a generated column syntax?
If you want to exploit the functionality you want to make that column a generated column.
Another scenario may be where you want to replace the generated column with another expression, perhaps to fix a logic error.
To achieve that you can use the regular ALTER TABLE statement, but with a twist.
Since the table currently has data you cannot simply replace the expression.
When adding a true check constraint DB2 will never change the data in the table.
In this case however this may be necessary.
DB2 not only needs to validate rows, it also needs to fix them.
Therefore whenever you add a generated column or make a non generated column a generated column, you need to place the table into check pending mode and use SET INTEGRITY to recover from that once the ALTER TABLE is complete.
Altering columns to add or remove the generated column properties (along with identity and default) was introduced in DB2 8.2.
In the following example will replace the three_a_plus_one() function with a faster inlined version:
CREATE OR REPLACE FUNCTION three_a_plus_one(IN n INTEGER) RETURNS INTEGER SPECIFIC three_a_plus_one CONTAINS SQL DETERMINISTIC NO EXTERNAL ACTION BEGIN ATOMIC DECLARE a INTEGER; DECLARE steps INTEGER DEFAULT 0; SET a = n; WHILE a <> 1 DO SET a = CASE WHEN (a / 2) * 2 = a THEN a / 2 ELSE 3 * a + 1 END, steps = steps + 1; END WHILE; RETURN steps; END / SQL0478N DROP, ALTER, TRANSFER OWNERSHIP or REVOKE on object type "FUNCTION" cannot be processed because there is an object "SERGE.THREE_A", of type "TABLE", which depends on it.
The function cannot be replaced while the generated column is in place.
So we need to drop the generated column property, replace the function and then add it back.
ALTER TABLE three_a ALTER COLUMN steps DROP EXPRESSION;
DB2 10: Hiding generated columns
Remember how I stated above that one of the benefits of generated columns is that they can be exploited without changing the applications?
That was true for queries which use the expression. But any queries of the form SELECT * or INSERT INTO T VALUES will break when a new column is being added.
This has often been brought forward as a reason not to use generated columns.
In DB2 9.5 ROW CHANGE TOKEN, and ROW CHANGE TIMESTAMP columns were added in support of optimistic locking strategies.
These new kinds of columns can be hidden from applications using an IMPLICITLY HIDDEN property.
In DB2 10, with temporal query support this new feature is available to all columns.
This includes generated columns!
You can hide them from select-star syntax as well as from the INSERT-column list.
This is very convenient!
To show the effect let's replay the example from the beginning.
But this time we will add the generated column after the fact.
DROP TABLE three_a;
Now we add the generated column with the goal of speeding up the select query.
We ultimately want to use an index on the expression.
SET INTEGRITY FOR three_a OFF; ALTER TABLE three_a
Note that implicitly hidden columns can still be referenced explicitly:
SELECT steps FROM three_a WHERE n = 5678; STEPS ----------- 129