Using DB2 UDB OLAP functions

Some practical tips on using OLAP functions

Online analytical processing (OLAP) functions are very flexible and powerful. Using them, you may come up with simple solutions to some problems that would otherwise require either iteration through one or several cursors or recursion. In some other cases it is much easier to write a query using OLAP functions or auxiliary tables than to write an equivalent query not using them. There is an excellent article, written by Bob Lyle, which provides a very good description of OLAP functions.

This article does not describe OLAP functions, instead it describes several real life situations when using OLAP functions allows a simple solution.

Generating auxiliary tables with consecutive numbers or dates

Auxiliary tables do not contain users' data. Instead they are used to simplify writing queries. We shall be using two auxiliary tables:

• sequence table, a list of consecutive integer numbers starting with one, and ending with some maximum number
• calendar table, a list of consecutive dates between the begin date and the end date

In this chapter we shall learn how to populate auxiliary tables using OLAP functions. In the next three chapters we shall discuss some examples when auxiliary tables are very useful.
Note: Auxiliary tables are discussed in detail in Joe Celko's book "SQL for Smarties".
Let us create and populate a sequence table. An OLAP function ROW_NUMBER() provides consecutive numbers to rows in a result set, like this:

```SELECT ROW_NUMBER() OVER(), TABNAME FROM SYSCAT.TABLES WHERE TABSCHEMA='DB2ADMIN'
1		TABNAME -------------------------------------------		1 CONSECUTIVE_NUMS		2 DATES		3 EXPLAIN_ARGUMENT		4 EXPLAIN_INSTANCE		5 EXPLAIN_OBJECT		6 EXPLAIN_OPERATOR		7 EXPLAIN_PREDICATE		8 EXPLAIN_STATEMENT		9 EXPLAIN_STREAM		10 SALES_DETAIL		11 SALES_DETAIL_CLS11 record(s) selected.```

Note that consecutive numbers are provided even if no columns are selected. For example, let us populate a table with consecutive numbers starting with 1. Here is how we do it:

```CREATE TABLE NUMBER_SEQ(NUM INT NOT NULL);
INSERT INTO NUMBER_SEQ SELECT ROW_NUMBER() OVER() FROM
SYSCAT.COLUMNS;```

Consecutive numbers starting with 1 will be inserted into table NUMBER_SEQ, one number per each row in the system view SYSCAT.COLUMNS. (I have chosen SYSCAT.COLUMNS because it is always present in every database and always has more than 1,000 rows. You can use any table as long as it has enough (but not too many) records.) There will be as many rows in NUMBER_SEQ as there were rows in SYSCAT.COLUMNS:

```SELECT MIN(NUM) AS MIN_NUM, MAX(NUM) AS MAX_NUM, COUNT(*) AS NUM_REC FROM NUMBER_SEQ
MIN_NUM     MAX_NUM     NUM_REC
----------- ----------- -----------
1        3197        3197
1 record(s) selected.```

Populating a calendar table with consecutive dates is just as easy:

```CREATE TABLE DATE_SEQ(SOME_DATE DATE NOT NULL);
INSERT INTO DATE_SEQ SELECT DATE('01/01/2003') + (ROW_NUMBER()
OVER() - 1) DAYS FROM SYSCAT.COLUMNS;```

We have seen how to create and populate auxiliary tables using an OLAP function ROW_NUMBER(). There definitely are other ways to accomplish this task. For example, one could use recursion or insert records in a loop. The biggest advantage of using ROW_NUMBER in this situation is simplicity of coding.

In the next three chapters we shall use these auxiliary tables NUMBER_SEQ and DATE_SEQ to simplify some rather complex queries.

Printing a checkout receipt using a sequence table

Let us suppose that we need to store all the sales for a checkout line in a grocery store. We also need to be able to print a checkout receipt for any sale, like this:

```ITEM_NAME                      PRICE_PER_ITEM
------------------------------ ------------------------
NESCAFE CLASSIC                  +6.49000000000000E+000
ENGLISH BAGELS 6-PACK            +1.49000000000000E+000
ENGLISH BAGELS 6-PACK            +1.49000000000000E+000```

Due to several reasons, we have chosen not to store duplicate records in the database. That is, we wish to store just one record `('ENGLISH BAGELS 6-PACK', 1.49, 2)` rather than two records `('ENGLISH BAGELS 6-PACK', 1.49)`.

Not storing duplicate rows is usually much more convenient. So here is the table we shall be working with:

```CREATE TABLE SALE_ITEM(
SALE_ID INT NOT NULL,
ITEM_NAME VARCHAR(30) NOT NULL,
ITEM_QUANTITY SMALLINT NOT NULL,
PRICE_PER_ITEM FLOAT NOT NULL);
INSERT INTO SALE_ITEM VALUES
(1, 'NESCAFE CLASSIC', 1, 6.49),
(1, 'ENGLISH BAGELS 6-PACK', 2, 1.49),
(1, 'BABY CARROTS', 3, 0.99);```

We need to develop a query generating output like this (in fact, printing a checkout receipt):

```ITEM_NAME                      PRICE_PER_ITEM
------------------------------ ------------------------
NESCAFE CLASSIC                  +6.49000000000000E+000
ENGLISH BAGELS 6-PACK            +1.49000000000000E+000
ENGLISH BAGELS 6-PACK            +1.49000000000000E+000
BABY CARROTS                     +9.90000000000000E-001
BABY CARROTS                     +9.90000000000000E-001
BABY CARROTS                     +9.90000000000000E-001```

Here is a very simple query that accomplishes it:

```SELECT ITEM_NAME, PRICE_PER_ITEM FROM SALE_ITEM JOIN NUMBER_SEQ
ON NUMBER_SEQ.NUM <= SALE_ITEM.ITEM_QUANTITY ORDER BY
PRICE_PER_ITEM DESC```

Note: it is perfectly legal to join tables on conditions other than equality, as in this example.

As in the previous chapter, one could also use recursion to get the needed results. Recursion might even be more efficient in some cases. However, the code utilizing a sequence table looks much simpler and easier to understand, and you do not have to worry about avoiding an infinite loop a recursion could result in. So, in this situation using ROW_NUMBER is a simple alternative to using recursion.
If creating an auxiliary table is not an option, it is easy to use a table expression and generate consecutive numbers along the way, like this:

```SELECT ITEM_NAME, PRICE_PER_ITEM FROM SALE_ITEM
JOIN
(SELECT ROW_NUMBER() OVER() AS NUM FROM SYSCAT.TABLES) AS NUMBER_SEQ
ON NUMBER_SEQ.NUM <= SALE_ITEM.ITEM_QUANTITY
ORDER BY PRICE_PER_ITEM DESC```

The result will be exactly the same, although the performance will be somewhat slower.

How to simplify a query using a calendar table

I have taken this example from [article on Pivot Tables]. First, let us create a table and insert some data into it:

```CREATE TABLE BUSINESS_TRIP(EMPLOYEE_ID INT NOT NULL, DATE_FROM DATE NOT NULL, DATE_TO DATE NOT NULL);
(1, DATE('01/06/2003'), DATE('01/10/2003')),
(1, DATE('01/13/2003'), DATE('01/17/2003')),
(1, DATE('01/20/2003'), DATE('01/24/2003')),
(1, DATE('01/27/2003'), DATE('01/31/2003')),
(2, DATE('01/07/2003'), DATE('01/08/2003')),
(3, DATE('01/08/2003'), DATE('01/09/2003'));```

Given a simple task "select all the days in January 2003 when there were no employees out on business trips", the calendar table DATE_SEQ comes very handy.

```SELECT SOME_DATE AS NOBODY_ON_TRIP FROM DATE_SEQ
WHERE SOME_DATE BETWEEN DATE('01/01/2003') AND DATE('01/31/2003')
AND NOT EXISTS(SELECT * FROM BUSINESS_TRIP WHERE SOME_DATE BETWEEN DATE_FROM AND DATE_TO);
NOBODY_ON_TRIP
--------------
01/01/2003
01/02/2003
01/03/2003
01/04/2003
01/05/2003
01/11/2003
01/12/2003
01/18/2003
01/19/2003
01/25/2003
01/26/2003
11 record(s) selected.```

The query is very simple. Some definitely more complex alternative solutions are discussed in [article on Pivot Tables].

Given a similar task "select all the days in January 2003 when there were more than 2 employees out on business trips", the calendar table DATE_SEQ also provides a very easy way out:

```SELECT SOME_DATE AS THREE_OR_MORE_ON_TRIP FROM DATE_SEQ
WHERE SOME_DATE BETWEEN DATE('01/01/2003') AND DATE('01/31/2003')
AND (SELECT COUNT(*) FROM BUSINESS_TRIP WHERE SOME_DATE BETWEEN DATE_FROM AND DATE_TO) > 2;

THREE_OR_MORE_ON_TRIP
-------------------
01/08/2003```

Again, if creating an auxiliary table is not an option, we can use a table expression instead:

```SELECT SOME_DATE AS THREE_OR_MORE_ON_TRIP
FROM
(SELECT DATE('01/01/2003') + ROW_NUMBER() OVER() DAYS AS SOME_DATE FROM SYSCAT.TABLES) AS DATE_SEQ
WHERE SOME_DATE BETWEEN DATE('01/01/2003') AND DATE('01/31/2003')
AND (SELECT COUNT(*) FROM BUSINESS_TRIP WHERE SOME_DATE BETWEEN DATE_FROM AND DATE_TO) > 2;```

Laying out two records on a line using a sequence table

Given this table structure and data:

```CREATE TABLE VEHICLE_ACCIDENT(
ACCIDENT_ID INT NOT NULL,
TAG_NUMBER CHAR(10) ,
TAG_STATE CHAR(2) );
INSERT INTO VEHICLE_ACCIDENT VALUES(1,'123456','IL'),(1,'234567','IL'),(1,'34567TT','WI');```

(other columns omitted to keep things simple). Note that there may be more than 2 vehicles involved in an accident. There is a requirement to lay out two records on one line, like this (when three vehicles were involved):

```TAG_NUMBER_1 TAG_STATE_1 TAG_NUMBER_2 TAG_STATE_2
------------ ----------- ------------ -----------
123456       IL          234567       IL
3456TT       WI```

Using `ROW_NUMBER()`, this can be done very easily:

```WITH VEHICLE_ACCIDENT_RN(ACCIDENT_ID, ROWNUM, TAG_NUMBER, TAG_STATE) AS
(SELECT ACCIDENT_ID, ROW_NUMBER() OVER() AS ROWNUM, TAG_NUMBER, TAG_STATE
FROM VEHICLE_ACCIDENT)
SELECT
LEFT_SIDE.TAG_NUMBER AS TAG_NUMBER_1,
LEFT_SIDE.TAG_STATE AS TAG_STATE_1,
RIGHT_SIDE.TAG_NUMBER AS TAG_NUMBER_2,
RIGHT_SIDE.TAG_STATE AS TAG_STATE_2
FROM
(SELECT L.*, (L.ROWNUM+1)/2 AS PAGENUM FROM VEHICLE_ACCIDENT_RN L WHERE MOD(ROWNUM,2)=1)AS LEFT_SIDE
LEFT OUTER JOIN
(SELECT R.*, (R.ROWNUM+1)/2 AS PAGENUM FROM VEHICLE_ACCIDENT_RN R WHERE MOD(ROWNUM,2)=0)AS RIGHT_SIDE
ON LEFT_SIDE.PAGENUM = RIGHT_SIDE.PAGENUM
WHERE LEFT_SIDE.ACCIDENT_ID=1 AND (RIGHT_SIDE.ACCIDENT_ID=1 OR RIGHT_SIDE.ACCIDENT_ID IS NULL)```

This query gives correct results for both odd and even number of involved vehicles. Feel free to add records and verify it. Again, as in the previous chapters, there are several other ways to solve this problem. Using `ROW_NUMBER()` allows for a solution that is very simple, quick to develop and easy to understand.

Using `ROW_NUMBER()` to assign guests to rooms

Let us consider a simple problem. There are two tables: a list of available hotel rooms and a list of arriving hotel guests. It is necessary to assign available rooms to arriving guests, so that every guest gets one room, if available. To solve such a problem, one usually needs to open two cursors, one for rooms and another for guests, and iterate through both cursors until one or both cursors end. Here are the tables' definitions and some sample data:

```CREATE TABLE ROOM(
ROOM_ID INT NOT NULL PRIMARY KEY,
SMOKING CHAR(1) NOT NULL CHECK(SMOKING IN ('N','Y')));

INSERT INTO ROOM VALUES (121,'Y'),(139,'N'),(142,'N'),(201,'Y'),(202,'N');

CREATE TABLE GUEST(
GUEST_ID INT NOT NULL PRIMARY KEY,
SMOKER CHAR(1) NOT NULL CHECK(SMOKER IN ('N','Y')));
INSERT INTO GUEST VALUES(321, 'N'),(17,'Y'),(57,'Y'),(91,'Y'),(2,'N'),(444,'N');
CREATE TABLE GUEST_ASSIGNMENT(
GUEST_ID INT NOT NULL PRIMARY KEY,
ROOM_ID INT NOT NULL UNIQUE,
FOREIGN KEY(GUEST_ID) REFERENCES GUEST(GUEST_ID),
FOREIGN KEY(ROOM_ID) REFERENCES ROOM(ROOM_ID));```

Using an OLAP function `ROW_NUMBER()`, there is no need to open two cursors and iterate through rooms and guests. Instead, we can simply join these two tables and get one-to-one correspondence between rooms and guests that we need. To understand how it works, let us first have a look at this select query and its output:

```SELECT ROOM_NUMBER, GUEST_NUMBER, ROOM_ID, GUEST_ID
FROM
(SELECT ROW_NUMBER() OVER(ORDER BY ROOM_ID) AS ROOM_NUMBER, ROOM_ID FROM ROOM) AS R
JOIN
(SELECT ROW_NUMBER() OVER(ORDER BY GUEST_ID) AS GUEST_NUMBER, GUEST_ID FROM GUEST) AS G
ON ROOM_NUMBER=GUEST_NUMBER

ROOM_NUMBER          GUEST_NUMBER         ROOM_ID     GUEST_ID
-------------------- -------------------- ----------- -----------
1                    1         121           2
2                    2         139          17
3                    3         142          57
4                    4         201          91
5                    5         202         321```

(For more details on table expressions refer to Sheryl Larsen's article.)

This simple query joins at most 1 record in ROOM table with at most 1 record in GUEST table. Note that instead of specifying any order, like I did (OVER(ORDER BY GUEST_ID)) one could say that order does not matter (OVER()) or even request random order (OVER(ORDER BY RAND()) ). In this case one guest did not get a room, because there weren't enough available. Feel free to add records to ROOM table to verify that the query works in other situations (when there are exactly as many rooms as guests, and when there are fewer rooms than guests).

Having understood how the join works, populating `GUEST_ASSIGNMENT` table is easy:

```INSERT INTO GUEST_ASSIGNMENT
SELECT GUEST_ID, ROOM_ID
FROM
(SELECT ROW_NUMBER() OVER(ORDER BY ROOM_ID) AS ROOM_NUMBER, ROOM_ID FROM ROOM) AS R
JOIN
(SELECT ROW_NUMBER() OVER(ORDER BY GUEST_ID) AS GUEST_NUMBER, GUEST_ID FROM GUEST) AS G
ON ROOM_NUMBER=GUEST_NUMBER```

As we have seen, in this case using `ROW_NUMBER()` function saved us a lot of development effort. We did not have to open two cursors and iterate through them.

Using `ROW_NUMBER() OVER(PARTITION ... )` to assign non-smoker guests to "no smoking" rooms

The sample from the previous chapter was very simplistic. Let us make one more step towards real life complexity. Let us make sure that smokers occupy only smoking rooms and non-smokers occupy only non-smoking rooms. Here is the query that accomplishes just that:

```SELECT ROOM_NUMBER, GUEST_NUMBER, SMOKER, SMOKING, ROOM_ID, GUEST_ID
FROM
(SELECT ROW_NUMBER() OVER(PARTITION BY SMOKING ORDER BY ROOM_ID) AS ROOM_NUMBER, ROOM_ID, SMOKING FROM ROOM) AS R
JOIN
(SELECT ROW_NUMBER() OVER(PARTITION BY SMOKER ORDER BY GUEST_ID) AS GUEST_NUMBER, GUEST_ID, SMOKER FROM GUEST) AS G
ON ROOM_NUMBER=GUEST_NUMBER AND SMOKER=SMOKING
ROOM_NUMBER          GUEST_NUMBER         SMOKER SMOKING ROOM_ID     GUEST_ID
-------------------- -------------------- ------ ------- ----------- -----------
1                    1 N      N               139           2
2                    2 N      N               142         321
3                    3 N      N               202         444
1                    1 Y      Y               121          17
2                    2 Y      Y               201          57```

Again, having understood how records in GUEST and ROOM are joined, you could populate the `GUEST_ASSIGNMENT` table in one simple `INSERT` statement. This approach is definitely much easier than iterating through one cursor on guests and another one on rooms.

Distributing cargo boxes to trucks using cumulative sums

The problem that we shall discuss in this chapter is quite similar to the previous two problems. This time I shall demonstrate how using cumulative sums simplifies queries development.

Let us suppose several boxes of one and the same size need to be loaded on several trucks of different capacity. This is in fact a very common problem in resource allocation that is usually solved using cursors.

Here are the tables' definitions and some sample data:

```CREATE TABLE TRUCK(
TRUCK_ID INT NOT NULL PRIMARY KEY,
CAPACITY SMALLINT NOT NULL);

INSERT INTO TRUCK VALUES(11,3), (22, 2), (33,3);

CREATE TABLE CARGO_BOX(
CARGO_BOX_ID INT NOT NULL PRIMARY KEY,
DESCRIPTION VARCHAR(40));

INSERT INTO CARGO_BOX VALUES(101,'PEACHES'),(102,'POTATOES'),(103,'TOMATOES'),(104,'TOMATOES'),(105,'TOMATOES'),
(106,'PINEAPPLES'),(107,'PINEAPPLES');

CREATE TABLE BOX_IN_TRUCK(
TRUCK_ID INT NOT NULL,
CARGO_BOX_ID INT NOT NULL,
FOREIGN KEY(TRUCK_ID) REFERENCES TRUCK(TRUCK_ID),
FOREIGN KEY(CARGO_BOX_ID) REFERENCES CARGO_BOX(CARGO_BOX_ID));```

The problem is to populate `BOX_IN_TRUCK` table properly, which means to assign boxes to trucks so that no truck is loaded over its capacity. One usually needs to use cursors to accomplish such a task. Using OLAP functions, it can be done without cursors.

```INSERT INTO BOX_IN_TRUCK
SELECT
TRUCK_CUMULATIVE.TRUCK_ID,
CARGO_BOX.CARGO_BOX_ID
FROM
(SELECT SUM(CAPACITY) OVER(ORDER BY TRUCK_ID) - CAPACITY + 1 AS BOX_FROM,
SUM(CAPACITY) OVER(ORDER BY TRUCK_ID) AS BOX_TO, CAPACITY, TRUCK_ID FROM TRUCK) AS TRUCK_CUMULATIVE
JOIN
(SELECT ROW_NUMBER() OVER() AS ROW_NUMBER, CARGO_BOX_ID FROM CARGO_BOX) AS CARGO_BOX
ON ROW_NUMBER BETWEEN BOX_FROM AND BOX_TO;```

Let us verify that this query has produced the required results:

```SELECT * FROM BOX_IN_TRUCK
TRUCK_ID    CARGO_BOX_ID
----------- ------------
11          101
11          102
11          103
22          104
22          105
33          106
33          107```

To understand how it works, let us retrieve all the columns involved in the query:

```SELECT
TRUCK_CUMULATIVE.TRUCK_ID, BOX_FROM, BOX_TO,
CARGO_BOX.CARGO_BOX_ID, ROW_NUMBER
FROM
(SELECT SUM(CAPACITY) OVER(ORDER BY TRUCK_ID) - CAPACITY + 1 AS BOX_FROM,
SUM(CAPACITY) OVER(ORDER BY TRUCK_ID) AS BOX_TO, CAPACITY, TRUCK_ID FROM TRUCK) AS TRUCK_CUMULATIVE
JOIN
(SELECT ROW_NUMBER() OVER() AS ROW_NUMBER, CARGO_BOX_ID FROM CARGO_BOX) AS CARGO_BOX
ON ROW_NUMBER BETWEEN BOX_FROM AND BOX_TO;

TRUCK_ID    BOX_FROM    BOX_TO      CARGO_BOX_ID ROW_NUMBER
----------- ----------- ----------- ------------ --------------------
11           1           3          101                    1
11           1           3          102                    2
11           1           3          103                    3
22           4           5          104                    4
22           4           5          105                    5
33           6           8          106                    6
33           6           8          107                    7
7 record(s) selected.```

For more details on cumulative sums, refer to Bob Lyle's article.

Conclusion

We have seen several examples where using OLAP function significantly simplified development of queries. This is just the tip of an iceberg. Be aware of the many advantages OLAP functions provide, and use them, because they make your life much easier.
Good luck!