Meet the Experts

Bob Lyle Talks About OLAP Functions in DB2 for UNIX, Windows, and OS/2

Bob Lyle

While presenting at IDUG EMEA in Florence this year, I was surprised to find that very few users know much about the online analytical processing (OLAP) functions that were introduced into DB2® Universal Database Versions 6 and 7. While on the surface they sound like simple new functions, they are actually extremely powerful. This power comes from how these functions extend the relational model to understand ordering within sets of rows. Previously, if you wanted to perform any analysis with respect to an ordering (be that a date or some other ordering), you had to write very complex queries. Now, with the OLAP functions, this complexity is greatly reduced. In this article, I hope to give you a high-level overview of some of the functions we've implemented, along with some simple examples of how they're used. All of the functionality documented in this article has been standardized in the OLAP Amendment to the SQL99 standard.

Ranking Functions

The first OLAP functions introduced into DB2 were the ranking functions in DB2 Version 6. These ranking functions provide the ability to define a set (using the PARTITION clause), and then rank the elements of the set with respect to an ordering. For example, suppose we have an employee table and would like to rank the employees' salaries within each department. To do this, we need a function invocation that does the following:

1. Defines the partitions (sets) to be the individual departments
2. Defines the ordering within the set to be on salary.

Traditionally, we'd rank high salaries first, so we will specify a descending ordering for the salary. The example below shows the query and its output.

```
select empnum, dept, salary,
       rank() over (partition by dept order by salary desc nulls last) as rank,
       dense_rank() over (partition by dept order by salary desc nulls last) as denserank,
       row_number() over (partition by dept order by salary desc nulls last) as rownumber
  from emptab;

EMPNUM  DEPT SALARY  RANK  DENSERANK  ROWNUMBER
------  ---- ------  ----  ---------  ---------
6       1    78000   1     1          1
2       1    75000   2     2          2
7       1    75000   2     2          3
11      1    53000   4     3          4
5       1    52000   5     4          5
1       1    50000   6     5          6
--------------------------------------------------
9       2    51000   1     1          1
4       2       -    2     2          2
```

Examples showing ranking

First, let's look at the function invocation. Notice that there are no arguments to the rank function itself. This is because the rank function doesn't perform any kind of calculations on any arguments. Instead, the rank function merely looks at sets of rows--and the position of each row within the set--as defined by the ordering. So how do we define the set and ordering for the function? Both are defined using the OVER clause. In this case, because we want the ranks within each department, we define the sets by partitioning by department. This has the effect of ensuring that only rows with equivalent values in the dept column are ranked against one another. For the purposes of the ranking functions, the terms partition and set are equivalent. Following the PARTITION clause, we have an ORDER BY clause, which defines the ordering within the partition. In this case, we'd like to rank high salaries first, so we define the ordering to be on descending salary. In addition to the descending specification, we also specify NULLS LAST. In SQL, nulls collate high, meaning they appear to be greater than all other non-null values. This introduces a problem for ranking, because we probably don't want null salaries ranked first. Thus we use the NULLS LAST clause to change the default ordering, so that nulls are ranked last. (Note that the NULLS LAST clause was introduced in DB2 V7; however, it is possible to force the ordering in V6 using a CASE expression.)
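For V6 users, here is a sketch of that CASE-expression workaround: an extra sort key maps null salaries behind all non-null ones before sorting on salary itself. The exact expression shown is just one way to write it:

```
-- One possible V6 workaround (sketch): the CASE sort key forces
-- null salaries to the end, then non-null salaries sort descending
select empnum, dept, salary,
       rank() over (partition by dept
                    order by case when salary is null then 1 else 0 end,
                             salary desc) as rank
  from emptab;
```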

Now, let's look at the output. The first six rows are all for employees of Department 1, and the rank is assigned to each row in descending order of salary. Notice that there are two employees in Department 1 with a salary of 75000, and that both rows are assigned a rank of two. This is because the rank function provides an "Olympic"-style ranking, in which two equal values receive an equal rank. Because there are two rows "tied for second," there is no row ranked third. Instead, the following row is ranked fourth, because three rows strictly preceded it with respect to the ordering.

For Department 2, notice that one of the rows has a null salary. Because we specified NULLS LAST, this row is ranked after the non-null row. Had we not specified NULLS LAST, the rankings of the two rows for Department 2 would have been reversed.

By now you may be asking yourself: what are the two other output columns, denserank and rownumber, in the example above? DB2 actually has three different ranking functions. The first, rank, provides Olympic-style rankings as already described. The second, dense_rank, is much like rank, in that "ties" are ranked equally. The only difference is that the value following a set of ties is ranked just 1 higher (instead of being incremented by the number of tied rows, as it is for rank). Thus, there are no gaps in the ranking (hence the name "dense"). Whereas Employee 11's salary was given a rank of 4 by rank, dense_rank returned a value of 3.

The last column gives the output of the row_number function. Row_number also performs a ranking, but in the event of a tie, the ties are arbitrarily (that is, non-deterministically) ordered. This is very useful when performing analytics on data with duplicate values. One interesting aspect of the row_number function is that it is the only ranking function that does not require an ordering. If row_number is invoked without a specified ordering, then all rows are treated as if they are ties and are arbitrarily numbered. This can be useful for numbering rows on output.
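As a minimal sketch of this last use, a query that simply numbers its output rows, with no ordering specified:

```
-- With an empty OVER clause, all rows are "ties"
-- and are numbered arbitrarily
select row_number() over () as rownumber,
       empnum, dept, salary
  from emptab;
```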

Other OLAP Functions

Many other OLAP functions were introduced in DB2 Version 7. Before these functions were introduced, DB2 supported two basic classes of functions, scalar functions and aggregate functions. Scalar functions are those that operate on values within a single row, and return a single result per row. Examples of scalar functions are arithmetic and string functions. For example, the query below uses the DIGITS scalar function to format the salary field. The calculation of the result is done per row, and only the salary value within the current row is used in the computation.

```
select empnum, salary, digits(salary) as digits
  from emptab
 where dept = 1;

EMPNUM      SALARY      DIGITS
----------- ----------- ----------
          1       50000 0000050000
          2       75000 0000075000
          5       52000 0000052000
...
```

Example showing the DIGITS scalar function

Aggregate functions (also referred to as column or set functions) behave differently. Aggregate functions operate on a set of rows, and aggregate (or combine) them into a single row in the output. An example of an aggregate function is the sum function, where a set of values is summed and placed into an individual result row. For example, the following query computes the sum of the salaries of all employees in each department. The GROUP BY clause is used to indicate that the groups (or partitions) to aggregate are the sets of all rows within departments. A single row is returned for each department, giving the total of all salaries within that department.

```
select dept, sum(salary) as sum
  from emptab
 group by dept;

DEPT        SUM
----------- -----------
          1      383000
          2       51000
          3      209000
          -       84000
```

Example showing the SUM aggregate function

The OLAP functions introduced in DB2 V7 introduce a new class of functions, which we'll call scalar-aggregate functions. These functions are like scalar functions, because they return a single value per row, but they're also like aggregate functions, because the calculation is performed on values from multiple rows within a set to compute the result. The scalar-aggregate sum function below does the same calculation as the aggregate sum function, but returns the results without combining the rows:

```
select dept, salary,
       sum(salary) over (partition by dept) as deptsum,
       avg(salary) over (partition by dept) as avgsal,
       count(*) over (partition by dept) as deptcount,
       max(salary) over (partition by dept) as maxsal
  from emptab;

DEPT  SALARY  DEPTSUM  AVGSAL  DEPTCOUNT  MAXSAL
----- ------- -------- ------- ---------  ------
    1   50000   383000   63833         6   78000
    1   75000   383000   63833         6   78000
    1   52000   383000   63833         6   78000
    1   78000   383000   63833         6   78000
    1   75000   383000   63833         6   78000
    1   53000   383000   63833         6   78000
    2       -    51000   51000         2   51000
    2   51000    51000   51000         2   51000
    3   79000   209000   69666         3   79000
    3   55000   209000   69666         3   79000
    3   75000   209000   69666         3   79000
    -       -    84000   84000         2   84000
    -   84000    84000   84000         2   84000
```

Example showing the SUM reporting function

Notice that the query contains no GROUP BY clause. Instead, the OVER clause is used to partition the data so that the sum function is computed over rows in the same department, and the sum of all the salaries in each department is returned for each row within the department. Traditionally, a join was necessary to include such aggregated results per row, but now the OLAP functions provide this much simpler formulation. We refer to this type of function as a reporting function, because the sum is calculated over the set and reported once per row. I've used SUM for the above example and later examples, but most existing aggregate functions (for example, AVG, MIN, MAX, and STDDEV) work with the OVER clause. The additional columns to the right of the DEPTSUM column show the average salary, count of employees in the department, and maximum salary within the department. The only aggregate functions not supported as scalar-aggregates are the linear regression functions.
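For comparison, here is a sketch of the traditional join formulation that the reporting function replaces: the table is joined back to a grouped subquery that computes the per-department sum.

```
-- Pre-OLAP formulation (sketch): join emptab back to a grouped subquery
select e.dept, e.salary, d.deptsum
  from emptab e,
       (select dept, sum(salary) as deptsum
          from emptab
         group by dept) as d
 where e.dept = d.dept;
```

Besides being more verbose, this join quietly drops rows whose dept is null (the join predicate never matches them), a subtlety the OVER clause formulation avoids.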

One powerful use of these reporting functions is in calculating ratios and percentages. To calculate the percentage of one employee's salary versus the total of the entire department's salaries, simply divide the employee's salary by the reporting sum of the salaries.

```
select empnum, dept, salary,
       sum(salary) over (partition by dept) as deptsum,
       decimal(salary,10,2) /
       sum(salary) over (partition by dept) as percentage
  from emptab;

EMPNUM DEPT  SALARY   DEPTSUM     PERCENTAGE
------ ----- -------- ----------- ----------
     1     1    50000      383000     0.1305
     2     1    75000      383000     0.1958
     5     1    52000      383000     0.1357
     6     1    78000      383000     0.2036
     7     1    75000      383000     0.1958
    11     1    53000      383000     0.1383
     4     2        -       51000          -
     9     2    51000       51000     1.0000
     8     3    79000      209000     0.3779
    10     3    55000      209000     0.2631
    12     3    75000      209000     0.3588
     0     -        -       84000          -
     3     -    84000       84000     1.0000
```

Percentage Example

Employee salary as a percentage of total salary expenditure.

What happens if we introduce an ordering to the sets used for aggregations? The answer is that, instead of computing a reporting function, we compute a cumulative function. A cumulative function is a scalar-aggregate function that operates on the current row, and all rows in the set that precede it with respect to the ordering. Let's use a different table for this example. Suppose we have a table with the monthly sales results for the current calendar year. How do we compute the current year-to-date sales figures for each month? In this case, we want to compute a cumulative sum of the monthly sales figures. This is done as follows:

```
select date, sales,
       sum(sales) over (order by date) as cume_sum,
       count(*) over (order by date) as setcount
  from sales
 where year(date) = 2000;

DATE       SALES        CUME_SUM     SETCOUNT
---------- ------------ ------------ ---------
01/01/2000    968871.12    968871.12         1
02/01/2000     80050.05   1048921.17         2
03/01/2000    757866.14   1806787.31         3
04/01/2000     58748.13   1865535.44         4
05/01/2000     40711.69   1906247.13         5
06/01/2000    241187.78   2147434.91         6
07/01/2000    954924.16   3102359.07         7
08/01/2000    502822.96   3605182.03         8
09/01/2000     97201.45   3702383.48         9
10/01/2000    853999.45   4556382.93        10
11/01/2000    358775.59   4915158.52        11
12/01/2000    437513.35   5352671.87        12
```

Example of calculating cumulative sums

Let's look at the results. For the first row, the cumulative sum is equal to the sales amount for that row. This is because there were no rows that preceded January. For the second row, the cumulative sum is equal to the sum of January and February's sales amounts (968871.12 + 80050.05 = 1048921.17). Similarly, the result for the third row is the sum of the sales for January, February, and March. The column to the right of the CUME_SUM column shows a cumulative count: the number of rows within the set so far. For example, the first row had only one row summed (that is, itself), the second row had two rows summed (itself and the preceding row), and so on. The graph above gives a graphical representation of the sales figures as well as the cumulative sum calculated in the preceding query.

If we had multiple years' worth of data and wanted to compute the cumulative sum by month within each year, we could also use the PARTITION BY clause as follows:

```
select date, sales,
       sum(sales) over (partition by year(date)
                        order by month(date)) as cume_sum
  from sales
 where year(date) >= 2000;

DATE       SALES        CUME_SUM
---------- ------------ -----------
01/01/2000    968871.12   968871.12
02/01/2000     80050.05  1048921.17
03/01/2000    757866.14  1806787.31
04/01/2000     58748.13  1865535.44
05/01/2000     40711.69  1906247.13
06/01/2000    241187.78  2147434.91
07/01/2000    954924.16  3102359.07
08/01/2000    502822.96  3605182.03
09/01/2000     97201.45  3702383.48
10/01/2000    853999.45  4556382.93
11/01/2000    358775.59  4915158.52
12/01/2000    437513.35  5352671.87
01/01/2001    476851.71   476851.71
02/01/2001    593768.12  1070619.83
03/01/2001    818597.97  1889217.80
...
```

Calculating cumulative sums with the PARTITION BY clause

Now, notice how the cumulative sum gets reset for the January 2001 row. Its cume_sum is equal to its sales amount because the data is partitioned by year, and no rows precede January within the year 2001. The other interesting thing demonstrated by this example is that the arguments used within the OVER clause may be expressions, not just column values. In more complicated examples, it is even possible to nest other aggregate functions within the scalar-aggregate function invocation. This is useful, because it's very common to perform some kind of aggregation (for example, aggregating sales to the month level) before performing the analysis. This raises the following question: When are scalar-aggregate functions computed? The answer is that these functions are computed at the time the rest of the select list is computed. In general, the order of evaluation of a query is as follows:

1. From Clause
2. Where Clause
3. Group By Clause
4. Having Clause
5. Select List

As you can see, the select list is computed after all other parts of the query. This means that if you have predicates (in the WHERE or HAVING clause), or if you have any aggregations as a result of a GROUP BY clause, these will all be applied before the functions are evaluated. For example, let's look at the following query:

```
select year(date) as year, sum(sales) as sum,
       sum(sum(sales)) over (order by year(date)) as cume_sum
  from sales
 where year(date) >= 1995
 group by year(date);

YEAR        SUM           CUME_SUM
----------- ------------- ------------
       1995    7731162.39   7731162.39
       1996    4127017.98  11858180.37
       1997    7211584.76  19069765.13
       1998    4149296.50  23219061.63
       1999    6278023.54  29497085.17
       2000    5352671.87  34849757.04
       2001    5736777.81  40586534.85
```

Cumulative sum over an aggregation

In this case, we access the table (named in the FROM clause) and apply the WHERE clause, then we do the GROUP BY and compute the sum of sales per year. Finally, we compute the select list, including all scalar-aggregate functions.

There is one other point to be made here. Because the scalar-aggregate functions are computed after the WHERE clause, it is not possible to reference scalar-aggregate functions in a predicate. Instead, if you wish to do this, you must nest the scalar-aggregate function invocation, either within a common table expression, or within a nested query. This becomes useful when performing queries that return the top n results. An example would be writing a query to select the 3 years that have the highest sales totals. We can do this by ranking the yearly sales and then selecting the rows where this ranking is 3 or less.

```
with ranked_years (year, sum, rank) as
 (select year(date) as year, sum(sales) as sum,
         rank() over (order by sum(sales) desc) as rank
    from sales
   group by year(date)
 )
select year, sum, rank
  from ranked_years
 where rank <= 3;

YEAR        SUM           RANK
----------- ------------- -------
       1995    7731162.39       1
       1997    7211584.76       2
       1999    6278023.54       3
```

Top n query example

As you can see in this example, we have a common table expression that performs the aggregation to sum the sales per year, and then ranks the sales totals. Then, the outer select uses this result table and adds a predicate so only rows whose rank is <=3 (i.e. the top 3 sales totals) are returned. Similar types of queries can be done to compute medians, percentiles, and other distribution statistics.
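To give a flavor of this, here is a sketch of a median-salary query in the same style (assuming the emptab table from the earlier examples): number the non-null salaries in order, then average the middle row, or the middle two rows when the count is even.

```
-- Sketch: median salary via row_number and a reporting count
with numbered (salary, rn, cnt) as
 (select salary,
         row_number() over (order by salary) as rn,
         count(*) over () as cnt
    from emptab
   where salary is not null
 )
select avg(decimal(salary,10,2)) as median
  from numbered
 where rn in ((cnt + 1) / 2, (cnt + 2) / 2);
```

With integer division, an odd count selects the single middle row twice (e.g., cnt = 11 gives rn 6 and 6), while an even count selects the two middle rows (cnt = 12 gives rn 6 and 7), which are then averaged.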

I hope that by now I've managed to convey a rough idea of what these new OLAP functions are and how they're used. There is actually much, much more to these functions than what I've described here. So keep your eyes open for another article that describes these functions in more detail.

The point I'd like to leave you with is that with the implementation of these OLAP functions, DB2 has extended the relational model so that it can now understand ordering with respect to sets of data. If you've ever tried to formulate queries where ordering matters, you know how difficult and complex these queries can get (even simple queries like median). The OLAP functions provide you with tools to efficiently and concisely formulate such queries. With the increasing demands being placed upon the DBMS, it is clear that the traditional relational model must be extended to handle such increasingly complex analyses, and these functions are just one example of where DB2 is pushing the limits.

Bob Lyle has been a DB2 developer for 13 years. He began his career in DB2 for OS/390, working on index management. During his tenure there, he was the architect of the Type 2 Index Manager and large object (LOB) Manager components of DB2. Three years ago, Bob moved to the Almaden Research Center, developing and standardizing the OLAP functions described in this column. Currently, Bob is back in the indexing world, exploring index extensions for DB2.

Bob can be reached at blyle@us.ibm.com

This information represents IBM's current intentions, goals and objectives, and is subject to change or withdrawal without additional or prior notice.

References in this publication to IBM products, programs, or services do not imply that IBM intends to make these available in all countries in which IBM operates.

The following terms are trademarks or registered trademarks of the IBM Corporation in the United States and/or other countries: DB2, SQL, WebSphere, IBM.

Java and all Java-based trademarks and logos are trademarks or registered trademarks of Sun Microsystems, Inc. in the United States and other countries.

Other company, product, and service names may be trademarks or service marks of others.