Some practical tips on using OLAP functions
Online analytical processing (OLAP) functions are very flexible and powerful. Using them, you may come up with simple solutions to some problems that would otherwise require either iteration through one or several cursors or recursion. In some other cases it is much easier to write a query using OLAP functions or auxiliary tables than to write an equivalent query not using them. There is an excellent article, written by Bob Lyle, which provides a very good description of OLAP functions.
This article does not describe OLAP functions, instead it describes several real life situations when using OLAP functions allows a simple solution.
Generating auxiliary tables with consecutive numbers or dates
Auxiliary tables do not contain users' data. Instead they are used to simplify writing queries. We shall be using two auxiliary tables:
- sequence table, a list of consecutive integer numbers starting with one, and ending with some maximum number
- calendar table, a list of consecutive dates between the begin date and the end date
In this chapter we shall learn how to populate auxiliary tables using OLAP functions. In the next three chapters we shall discuss some examples when auxiliary tables are very useful.
Note: Auxiliary tables are discussed in detail in Joe Celko's book "SQL for Smarties".
Let us create and populate a sequence table. An OLAP function ROW_NUMBER() provides consecutive numbers to rows in a result set, like this:
SELECT ROW_NUMBER() OVER(), TABNAME FROM SYSCAT.TABLES WHERE TABSCHEMA='DB2ADMIN'
11 record(s) selected.
Note that consecutive numbers are provided even if no columns are selected. For example, let us populate a table with consecutive numbers starting with 1. Here is how we do it:
CREATE TABLE NUMBER_SEQ(NUM INT NOT NULL);
INSERT INTO NUMBER_SEQ SELECT ROW_NUMBER() OVER() FROM
Consecutive numbers starting with 1 will be inserted into table NUMBER_SEQ, one number per each row in the system view SYSCAT.COLUMNS. (I have chosen SYSCAT.COLUMNS because it is always present in every database and always has more than 1,000 rows. You can use any table as long as it has enough (but not too many) records.) There will be as many rows in NUMBER_SEQ as there were rows in SYSCAT.COLUMNS:
SELECT MIN(NUM) AS MIN_NUM, MAX(NUM) AS MAX_NUM, COUNT(*) AS NUM_REC FROM NUMBER_SEQ MIN_NUM MAX_NUM NUM_REC ----------- ----------- ----------- 1 3197 3197 1 record(s) selected.
Populating a calendar table with consecutive dates is just as easy:
CREATE TABLE DATE_SEQ(SOME_DATE DATE NOT NULL);
INSERT INTO DATE_SEQ SELECT DATE('01/01/2003') + (ROW_NUMBER()
OVER() - 1) DAYS FROM SYSCAT.COLUMNS;
We have seen how to create and populate auxiliary tables using an OLAP function ROW_NUMBER(). There definitely are other ways to accomplish this task. For example, one could use recursion or insert records in a loop. The biggest advantage of using ROW_NUMBER in this situation is simplicity of coding.
In the next three chapters we shall use these auxiliary tables NUMBER_SEQ and DATE_SEQ to simplify some rather complex queries.
Printing a checkout receipt using a sequence table
Let us suppose that we need to store all the sales for a checkout line in a grocery store. We also need to be able to print a checkout receipt for any sale, like this:
ITEM_NAME PRICE_PER_ITEM ------------------------------ ------------------------ NESCAFE CLASSIC +6.49000000000000E+000 ENGLISH BAGELS 6-PACK +1.49000000000000E+000 ENGLISH BAGELS 6-PACK +1.49000000000000E+000
Due to several reasons, we have chosen not to store duplicate records in the database. That is, we wish to store just one record
('ENGLISH BAGELS 6-PACK', 1.49, 2) rather than two records
('ENGLISH BAGELS 6-PACK', 1.49).
Not storing duplicate rows is usually much more convenient. So here is the table we shall be working with:
CREATE TABLE SALE_ITEM( SALE_ID INT NOT NULL, ITEM_NAME VARCHAR(30) NOT NULL, ITEM_QUANTITY SMALLINT NOT NULL, PRICE_PER_ITEM FLOAT NOT NULL); INSERT INTO SALE_ITEM VALUES (1, 'NESCAFE CLASSIC', 1, 6.49), (1, 'ENGLISH BAGELS 6-PACK', 2, 1.49), (1, 'BABY CARROTS', 3, 0.99);
We need to develop a query generating output like this (in fact, printing a checkout receipt):
ITEM_NAME PRICE_PER_ITEM ------------------------------ ------------------------ NESCAFE CLASSIC +6.49000000000000E+000 ENGLISH BAGELS 6-PACK +1.49000000000000E+000 ENGLISH BAGELS 6-PACK +1.49000000000000E+000 BABY CARROTS +9.90000000000000E-001 BABY CARROTS +9.90000000000000E-001 BABY CARROTS +9.90000000000000E-001
Here is a very simple query that accomplishes it:
SELECT ITEM_NAME, PRICE_PER_ITEM FROM SALE_ITEM JOIN NUMBER_SEQ
ON NUMBER_SEQ.NUM <= SALE_ITEM.ITEM_QUANTITY ORDER BY
Note: it is perfectly legal to join tables on conditions other than equality, as in this example.
As in the previous chapter, one could also use recursion to get the needed results. Recursion might even be more efficient in some cases. However, the code utilizing a sequence table looks much simpler and easier to understand, and you do not have to worry about avoiding an infinite loop a recursion could result in. So, in this situation using ROW_NUMBER is a simple alternative to using recursion.
If creating an auxiliary table is not an option, it is easy to use a table expression and generate consecutive numbers along the way, like this:
SELECT ITEM_NAME, PRICE_PER_ITEM FROM SALE_ITEM JOIN (SELECT ROW_NUMBER() OVER() AS NUM FROM SYSCAT.TABLES) AS NUMBER_SEQ ON NUMBER_SEQ.NUM <= SALE_ITEM.ITEM_QUANTITY ORDER BY PRICE_PER_ITEM DESC
The result will be exactly the same, although the performance will be somewhat slower.
How to simplify a query using a calendar table
I have taken this example from [article on Pivot Tables]. First, let us create a table and insert some data into it:
CREATE TABLE BUSINESS_TRIP(EMPLOYEE_ID INT NOT NULL, DATE_FROM DATE NOT NULL, DATE_TO DATE NOT NULL); INSERT INTO BUSINESS_TRIP VALUES (1, DATE('01/06/2003'), DATE('01/10/2003')), (1, DATE('01/13/2003'), DATE('01/17/2003')), (1, DATE('01/20/2003'), DATE('01/24/2003')), (1, DATE('01/27/2003'), DATE('01/31/2003')), (2, DATE('01/07/2003'), DATE('01/08/2003')), (3, DATE('01/08/2003'), DATE('01/09/2003'));
Given a simple task "select all the days in January 2003 when there were no employees out on business trips", the calendar table DATE_SEQ comes very handy.
SELECT SOME_DATE AS NOBODY_ON_TRIP FROM DATE_SEQ WHERE SOME_DATE BETWEEN DATE('01/01/2003') AND DATE('01/31/2003') AND NOT EXISTS(SELECT * FROM BUSINESS_TRIP WHERE SOME_DATE BETWEEN DATE_FROM AND DATE_TO); NOBODY_ON_TRIP -------------- 01/01/2003 01/02/2003 01/03/2003 01/04/2003 01/05/2003 01/11/2003 01/12/2003 01/18/2003 01/19/2003 01/25/2003 01/26/2003 11 record(s) selected.
The query is very simple. Some definitely more complex alternative solutions are discussed in [article on Pivot Tables].
Given a similar task "select all the days in January 2003 when there were more than 2 employees out on business trips", the calendar table DATE_SEQ also provides a very easy way out:
SELECT SOME_DATE AS THREE_OR_MORE_ON_TRIP FROM DATE_SEQ WHERE SOME_DATE BETWEEN DATE('01/01/2003') AND DATE('01/31/2003') AND (SELECT COUNT(*) FROM BUSINESS_TRIP WHERE SOME_DATE BETWEEN DATE_FROM AND DATE_TO) > 2; THREE_OR_MORE_ON_TRIP ------------------- 01/08/2003
Again, if creating an auxiliary table is not an option, we can use a table expression instead:
SELECT SOME_DATE AS THREE_OR_MORE_ON_TRIP FROM (SELECT DATE('01/01/2003') + ROW_NUMBER() OVER() DAYS AS SOME_DATE FROM SYSCAT.TABLES) AS DATE_SEQ WHERE SOME_DATE BETWEEN DATE('01/01/2003') AND DATE('01/31/2003') AND (SELECT COUNT(*) FROM BUSINESS_TRIP WHERE SOME_DATE BETWEEN DATE_FROM AND DATE_TO) > 2;
Laying out two records on a line using a sequence table
Given this table structure and data:
CREATE TABLE VEHICLE_ACCIDENT( ACCIDENT_ID INT NOT NULL, TAG_NUMBER CHAR(10) , TAG_STATE CHAR(2) ); INSERT INTO VEHICLE_ACCIDENT VALUES(1,'123456','IL'),(1,'234567','IL'),(1,'34567TT','WI');
(other columns omitted to keep things simple). Note that there may be more than 2 vehicles involved in an accident. There is a requirement to lay out two records on one line, like this (when three vehicles were involved):
TAG_NUMBER_1 TAG_STATE_1 TAG_NUMBER_2 TAG_STATE_2 ------------ ----------- ------------ ----------- 123456 IL 234567 IL 3456TT WI
ROW_NUMBER(), this can be done very easily:
WITH VEHICLE_ACCIDENT_RN(ACCIDENT_ID, ROWNUM, TAG_NUMBER, TAG_STATE) AS (SELECT ACCIDENT_ID, ROW_NUMBER() OVER() AS ROWNUM, TAG_NUMBER, TAG_STATE FROM VEHICLE_ACCIDENT) SELECT LEFT_SIDE.TAG_NUMBER AS TAG_NUMBER_1, LEFT_SIDE.TAG_STATE AS TAG_STATE_1, RIGHT_SIDE.TAG_NUMBER AS TAG_NUMBER_2, RIGHT_SIDE.TAG_STATE AS TAG_STATE_2 FROM (SELECT L.*, (L.ROWNUM+1)/2 AS PAGENUM FROM VEHICLE_ACCIDENT_RN L WHERE MOD(ROWNUM,2)=1)AS LEFT_SIDE LEFT OUTER JOIN (SELECT R.*, (R.ROWNUM+1)/2 AS PAGENUM FROM VEHICLE_ACCIDENT_RN R WHERE MOD(ROWNUM,2)=0)AS RIGHT_SIDE ON LEFT_SIDE.PAGENUM = RIGHT_SIDE.PAGENUM WHERE LEFT_SIDE.ACCIDENT_ID=1 AND (RIGHT_SIDE.ACCIDENT_ID=1 OR RIGHT_SIDE.ACCIDENT_ID IS NULL)
This query gives correct results for both odd and even number of involved vehicles. Feel free to add records and verify it. Again, as in the previous chapters, there are several other ways to solve this problem. Using
ROW_NUMBER() allows for a solution that is very simple, quick to develop and easy to understand.
ROW_NUMBER() to assign guests to rooms
Let us consider a simple problem. There are two tables: a list of available hotel rooms and a list of arriving hotel guests. It is necessary to assign available rooms to arriving guests, so that every guest gets one room, if available. To solve such a problem, one usually needs to open two cursors, one for rooms and another for guests, and iterate through both cursors until one or both cursors end. Here are the tables' definitions and some sample data:
CREATE TABLE ROOM( ROOM_ID INT NOT NULL PRIMARY KEY, SMOKING CHAR(1) NOT NULL CHECK(SMOKING IN ('N','Y'))); INSERT INTO ROOM VALUES (121,'Y'),(139,'N'),(142,'N'),(201,'Y'),(202,'N'); CREATE TABLE GUEST( GUEST_ID INT NOT NULL PRIMARY KEY, SMOKER CHAR(1) NOT NULL CHECK(SMOKER IN ('N','Y'))); INSERT INTO GUEST VALUES(321, 'N'),(17,'Y'),(57,'Y'),(91,'Y'),(2,'N'),(444,'N'); CREATE TABLE GUEST_ASSIGNMENT( GUEST_ID INT NOT NULL PRIMARY KEY, ROOM_ID INT NOT NULL UNIQUE, FOREIGN KEY(GUEST_ID) REFERENCES GUEST(GUEST_ID), FOREIGN KEY(ROOM_ID) REFERENCES ROOM(ROOM_ID));
Using an OLAP function
ROW_NUMBER(), there is no need to open two cursors and iterate through rooms and guests. Instead, we can simply join these two tables and get one-to-one correspondence between rooms and guests that we need. To understand how it works, let us first have a look at this select query and its output:
SELECT ROOM_NUMBER, GUEST_NUMBER, ROOM_ID, GUEST_ID FROM (SELECT ROW_NUMBER() OVER(ORDER BY ROOM_ID) AS ROOM_NUMBER, ROOM_ID FROM ROOM) AS R JOIN (SELECT ROW_NUMBER() OVER(ORDER BY GUEST_ID) AS GUEST_NUMBER, GUEST_ID FROM GUEST) AS G ON ROOM_NUMBER=GUEST_NUMBER ROOM_NUMBER GUEST_NUMBER ROOM_ID GUEST_ID -------------------- -------------------- ----------- ----------- 1 1 121 2 2 2 139 17 3 3 142 57 4 4 201 91 5 5 202 321
(For more details on table expressions refer to Sheryl Larsen's article.)
This simple query joins at most 1 record in ROOM table with at most 1 record in GUEST table. Note that instead of specifying any order, like I did (OVER(ORDER BY GUEST_ID)) one could say that order does not matter (OVER()) or even request random order (OVER(ORDER BY RAND()) ). In this case one guest did not get a room, because there weren't enough available. Feel free to add records to ROOM table to verify that the query works in other situations (when there are exactly as many rooms as guests, and when there are fewer rooms than guests).
Having understood how the join works, populating
GUEST_ASSIGNMENT table is easy:
INSERT INTO GUEST_ASSIGNMENT SELECT GUEST_ID, ROOM_ID FROM (SELECT ROW_NUMBER() OVER(ORDER BY ROOM_ID) AS ROOM_NUMBER, ROOM_ID FROM ROOM) AS R JOIN (SELECT ROW_NUMBER() OVER(ORDER BY GUEST_ID) AS GUEST_NUMBER, GUEST_ID FROM GUEST) AS G ON ROOM_NUMBER=GUEST_NUMBER
As we have seen, in this case using
ROW_NUMBER() function saved us a lot of development effort. We did not have to open two cursors and iterate through them.
ROW_NUMBER() OVER(PARTITION ... ) to assign non-smoker guests to "no smoking" rooms
The sample from the previous chapter was very simplistic. Let us make one more step towards real life complexity. Let us make sure that smokers occupy only smoking rooms and non-smokers occupy only non-smoking rooms. Here is the query that accomplishes just that:
Again, having understood how records in GUEST and ROOM are joined, you could populate the
GUEST_ASSIGNMENT table in one simple
INSERT statement. This approach is definitely much easier than iterating through one cursor on guests and another one on rooms.
Distributing cargo boxes to trucks using cumulative sums
The problem that we shall discuss in this chapter is quite similar to the previous two problems. This time I shall demonstrate how using cumulative sums simplifies queries development.
Let us suppose several boxes of one and the same size need to be loaded on several trucks of different capacity. This is in fact a very common problem in resource allocation that is usually solved using cursors.
Here are the tables' definitions and some sample data:
The problem is to populate
BOX_IN_TRUCK table properly, which means to assign boxes to trucks so that no truck is loaded over its capacity. One usually needs to use cursors to accomplish such a task. Using OLAP functions, it can be done without cursors.
INSERT INTO BOX_IN_TRUCK SELECT TRUCK_CUMULATIVE.TRUCK_ID, CARGO_BOX.CARGO_BOX_ID FROM (SELECT SUM(CAPACITY) OVER(ORDER BY TRUCK_ID) - CAPACITY + 1 AS BOX_FROM, SUM(CAPACITY) OVER(ORDER BY TRUCK_ID) AS BOX_TO, CAPACITY, TRUCK_ID FROM TRUCK) AS TRUCK_CUMULATIVE JOIN (SELECT ROW_NUMBER() OVER() AS ROW_NUMBER, CARGO_BOX_ID FROM CARGO_BOX) AS CARGO_BOX ON ROW_NUMBER BETWEEN BOX_FROM AND BOX_TO;
Let us verify that this query has produced the required results:
SELECT * FROM BOX_IN_TRUCK TRUCK_ID CARGO_BOX_ID ----------- ------------ 11 101 11 102 11 103 22 104 22 105 33 106 33 107
To understand how it works, let us retrieve all the columns involved in the query:
SELECT TRUCK_CUMULATIVE.TRUCK_ID, BOX_FROM, BOX_TO, CARGO_BOX.CARGO_BOX_ID, ROW_NUMBER FROM (SELECT SUM(CAPACITY) OVER(ORDER BY TRUCK_ID) - CAPACITY + 1 AS BOX_FROM, SUM(CAPACITY) OVER(ORDER BY TRUCK_ID) AS BOX_TO, CAPACITY, TRUCK_ID FROM TRUCK) AS TRUCK_CUMULATIVE JOIN (SELECT ROW_NUMBER() OVER() AS ROW_NUMBER, CARGO_BOX_ID FROM CARGO_BOX) AS CARGO_BOX ON ROW_NUMBER BETWEEN BOX_FROM AND BOX_TO; TRUCK_ID BOX_FROM BOX_TO CARGO_BOX_ID ROW_NUMBER ----------- ----------- ----------- ------------ -------------------- 11 1 3 101 1 11 1 3 102 2 11 1 3 103 3 22 4 5 104 4 22 4 5 105 5 33 6 8 106 6 33 6 8 107 7 7 record(s) selected.
For more details on cumulative sums, refer to Bob Lyle's article.
- Bob Lyle Talks About OLAP Functions in DB2 for UNIX, Windows, and OS/2, Bob Lyle
- Meet the Experts: Sheryl Larsen on the Power of Table Expressions, Sheryl Larsen
- SQL for Smarties: Advance SQL Programming, Second Edition ISBN: 1558605762; Format: Paperback, 450pp; Pub. Date: October 1999; Publisher: Elsevier Science & Technology Books, Joe Celko
- Turning On Pivot Tables, Oracle Magazine, Jonathan Gennick
We have seen several examples where using OLAP function significantly simplified development of queries. This is just the tip of an iceberg. Be aware of the many advantages OLAP functions provide, and use them, because they make your life much easier.