Meet the Experts

# Bob Lyle Talks About OLAP Functions in DB2 for UNIX, Windows, and OS/2

© International Business Machines Corporation 2001. All rights reserved.

##### Bob Lyle

While presenting at IDUG EMEA in Florence this year, I was surprised to find that very few users know much about the online analytical processing (OLAP) functions that were introduced into DB2® Universal Database Versions 6 and 7. While on the surface they sound like simple new functions, they are actually extremely powerful. This power comes from how these functions extend the relational model to understand ordering within sets of rows. Previously, if you wanted to perform any analysis with respect to an ordering (be that a date or other ordering), you had to write very complex queries. Now, with the OLAP functions, this complexity is greatly reduced. In this article, I hope to give you a high-level overview of some of the functions we've implemented, and some simple examples of how they're used. All of the functionality documented in this article has been standardized in the OLAP Amendment to the SQL99 standard.

## Ranking Functions

The first OLAP functions introduced into DB2 were the
*ranking* functions in DB2 Version 6. These ranking functions
provide the ability to define a set (using the PARTITION clause),
and then rank the elements of the set with respect to an ordering.
For example, suppose we have an employee table and would like to
rank the employees' salaries within each department. To do this, we
need a function invocation that does the following:

- Defines the partitions (sets) to be the individual departments
- Defines the ordering within the set to be on salary

Traditionally, we'd rank high salaries first, so we will specify a descending ordering for the salary. The example below shows the query and output for this query.

```sql
select empnum, dept, salary,
       rank() over (partition by dept
                    order by salary desc nulls last) as rank,
       dense_rank() over (partition by dept
                    order by salary desc nulls last) as denserank,
       row_number() over (partition by dept
                    order by salary desc nulls last) as rownumber
from emptab;
```

```
EMPNUM DEPT SALARY RANK DENSERANK ROWNUMBER
------ ---- ------ ---- --------- ---------
     6    1  78000    1         1         1
     2    1  75000    2         2         2
     7    1  75000    2         2         3
    11    1  53000    4         3         4
     5    1  52000    5         4         5
     1    1  50000    6         5         6
--------------------------------------------------
     9    2  51000    1         1         1
     4    2      -    2         2         2
```

*Examples showing ranking*

First, let's look at the function invocation. Notice
that there are no arguments to the rank function itself. This is
because the rank function doesn't perform any kind of calculations
on any arguments. Instead, the rank function merely looks at sets
of rows--and the position of each row within the set--as defined by
the ordering. So how do we define the set and ordering for the
function? Both are defined using the OVER clause. In this case,
because we want the ranks within each department, we define the
sets by partitioning by department. This has the effect of ensuring
that only rows with equivalent values in the dept column are ranked
against one another. For the purposes of the ranking functions,
the terms *partition* and *set* are equivalent. Following
the PARTITION clause, we have an ORDER BY clause, which defines the
ordering within the partition. In this case, we'd like to rank high
salaries first, so we define the ordering to be on descending
salary. In addition to the descending specification, we also
specify NULLS LAST. In SQL, nulls collate high, meaning they appear
to be greater than all other non-null values. This introduces a
problem for ranking, because we probably don't want null salaries
ranked first. Thus we use the NULLS LAST clause to change the
default ordering, so that nulls are ranked last. (Note that the
NULLS LAST clause was introduced in DB2 V7; however, it is possible
to force the ordering in V6 using a CASE expression.)

Now, let's look at the output. The first six rows are all for employees of Department 1, and the rank is assigned to each row in descending order of salary. Notice that there are two employees in Department 1 with a salary of 75000, and that both rows are assigned a rank of two. This is because the rank function provides an "Olympic"-style ranking, in which two equal values receive an equal rank. Because there are two rows "tied for second," there is no row ranked third. Instead, the following row is ranked fourth, because three rows strictly preceded it with respect to the ordering.

For Department 2, notice that one of the rows has a null salary. Because we specified NULLS LAST, this row is ranked after the non-null row. Had we not specified NULLS LAST, the rankings of the two rows for Department 2 would have been reversed.
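SQLite's default null ordering happens to differ from DB2's (SQLite sorts nulls low rather than high), but the CASE-expression workaround mentioned above is portable to either engine. Here is a minimal sketch run through Python's sqlite3 module (window functions require SQLite 3.25 or later); the table and values mirror the article's Department 2, but the setup itself is hypothetical:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table emptab (empnum int, dept int, salary int)")
# Department 2 from the article: one real salary and one null.
con.executemany("insert into emptab values (?, ?, ?)",
                [(9, 2, 51000), (4, 2, None)])

# Order first on a CASE expression that tags null salaries with 1,
# so they sort after every non-null salary in the descending ranking,
# regardless of how the engine collates nulls by default.
rows = con.execute("""
    select empnum, salary,
           rank() over (order by case when salary is null then 1 else 0 end,
                        salary desc) as rnk
    from emptab
    order by rnk
""").fetchall()
print(rows)
```

The non-null salary is ranked first and the null salary last, which is the same effect NULLS LAST produces in DB2 V7.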

By now you may be asking yourself, what are these two
other output columns, denserank and rownumber, in the above
example? DB2 actually has three different ranking functions. The
first, the *rank* function, provides Olympic-style rankings as
already described. Two other functions, *dense_rank* and
*row_number*, are provided. Dense_rank is much like rank, in
that "ties" are ranked equally. The only difference is that for the
value following the ties, the rank is incremented by 1 (instead of
being incremented by the number of ties, as it is for rank). Thus,
there are no gaps in the ranking (hence the name "dense"). Whereas
Employee 11's salary was given a rank of 4, denserank returned a
value of 3.

The last column gives the output of the row_number function. Row_number also performs a ranking, but in the event of a tie, the ties are arbitrarily (that is, non-deterministically) ordered. This is very useful when performing analytics on data with duplicate values. One interesting aspect of the row_number function is that it is the only ranking function that does not require an ordering. If row_number is invoked without a specified ordering, then all rows are treated as if they are ties and are arbitrarily numbered. This can be useful for numbering rows on output.
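The differences among the three ranking functions are easy to experiment with in any engine that implements the SQL standard's window functions. Below is a sketch using Python's bundled sqlite3 module (SQLite 3.25 or later), loaded with the hypothetical Department 1 salaries from the example above:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table emptab (empnum int, dept int, salary int)")
con.executemany(
    "insert into emptab values (?, ?, ?)",
    [(6, 1, 78000), (2, 1, 75000), (7, 1, 75000),
     (11, 1, 53000), (5, 1, 52000), (1, 1, 50000)],
)

# rank() leaves gaps after ties, dense_rank() does not, and
# row_number() breaks ties arbitrarily.
rows = con.execute("""
    select empnum, salary,
           rank()       over (order by salary desc) as rnk,
           dense_rank() over (order by salary desc) as drnk,
           row_number() over (order by salary desc) as rn
    from emptab
    order by salary desc, empnum
""").fetchall()
for r in rows:
    print(r)
```

Note that the row numbers assigned to the two 75000 salaries may come out in either order from run to run; only the set of row numbers is deterministic.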

## Other OLAP Functions

Many other OLAP functions were introduced in DB2
Version 7. Before these functions were introduced, DB2 supported
two basic classes of functions, *scalar* functions and
*aggregate* functions. Scalar functions are those that operate
on values within a single row, and return a single result per row.
Examples of scalar functions are arithmetic and string functions.
For example, the query below uses the DIGITS scalar function to
format the salary field. The calculation of the result is done per
row, and only the salary value within the current row is used in
the computation.

```sql
select empnum, salary,
       digits(salary) as digits
from emptab
where dept = 1;
```

```
EMPNUM      SALARY      DIGITS
----------- ----------- ----------
          1       50000 0000050000
          2       75000 0000075000
          5       52000 0000052000
...
```

*Example showing the DIGITS scalar function*

*Aggregate functions* (also referred to as
*column* or *set functions*) behave differently.
Aggregate functions operate on a set of rows, and aggregate (or
combine) them into a single row in the output. An example of an
aggregate function is the sum function, where a set of values is
summed and placed into an individual result row. For example, the
following query computes the sum of all of the employees in each
department. The GROUP BY clause is used to indicate that the groups
(or partitions) to aggregate are the sets of all rows within
departments. A single row is returned for each department, giving
the total of all salaries within that department.

```sql
select dept, sum(salary) as sum
from emptab
group by dept;
```

```
DEPT        SUM
----------- -----------
          1      383000
          2       51000
          3      209000
          -       84000
```

*Example showing the SUM aggregate function*

The OLAP functions introduced in DB2 V7
introduce a new class of functions, which we'll call
*scalar-aggregate* functions. These functions are like scalar
functions, because they return a single value per row, but they're
also like aggregate functions, because the calculation is performed
on values from multiple rows within a set to compute the result.
The scalar-aggregate sum function below does the same calculation
as the aggregate sum function, but returns the results without
combining the rows:

```sql
select dept, salary,
       sum(salary) over (partition by dept) as deptsum,
       avg(salary) over (partition by dept) as avgsal,
       count(*) over (partition by dept) as deptcount,
       max(salary) over (partition by dept) as maxsal
from emptab;
```

```
DEPT  SALARY  DEPTSUM  AVGSAL  DEPTCOUNT  MAXSAL
----- ------- -------- ------- --------- --------
    1   50000   383000   63833         6   78000
    1   75000   383000   63833         6   78000
    1   52000   383000   63833         6   78000
    1   78000   383000   63833         6   78000
    1   75000   383000   63833         6   78000
    1   53000   383000   63833         6   78000
    2       -    51000   51000         2   51000
    2   51000    51000   51000         2   51000
    3   79000   209000   69666         3   79000
    3   55000   209000   69666         3   79000
    3   75000   209000   69666         3   79000
    -       -    84000   84000         2   84000
    -   84000    84000   84000         2   84000
```

*Example showing the SUM reporting function*

Notice that the query contains no GROUP BY clause.
Instead, the OVER clause is used to partition the data so that the
sum function is computed over rows in the same department, and the
sum of all the salaries in each department is returned for each row
within the department. Traditionally, a join was necessary to
include such aggregated results per row, but now the OLAP functions
provide this much simpler formulation. We refer to this type of
function as a *reporting* function, because the sum is
calculated over the set and reported once per row. I've used SUM
for the above example and later examples, *but most existing
aggregate functions (e.g., AVG, MIN, MAX, STDDEV) work with
the OVER clause.* The additional columns to the right of the
DEPTSUM column show the average salary, count of employees in the
department, and maximum salary within the department. The only
aggregate functions not supported as scalar-aggregates are the
linear regression functions.

One powerful use of these reporting functions is in calculating ratios and percentages. To calculate the percentage of one employee's salary versus the total of the entire department's salaries, simply divide the employee's salary by the reporting sum of the salaries.

```sql
select empnum, dept, salary,
       sum(salary) over (partition by dept) as deptsum,
       decimal(salary,10,2) /
       sum(salary) over (partition by dept) as percentage
from emptab;
```

```
EMPNUM DEPT  SALARY   DEPTSUM     PERCENTAGE
------ ----- -------- ----------- ----------
     1     1    50000      383000     0.1305
     2     1    75000      383000     0.1958
     5     1    52000      383000     0.1357
     6     1    78000      383000     0.2036
     7     1    75000      383000     0.1958
    11     1    53000      383000     0.1383
     4     2        -       51000          -
     9     2    51000       51000     1.0000
     8     3    79000      209000     0.3779
    10     3    55000      209000     0.2631
    12     3    75000      209000     0.3588
     0     -        -       84000          -
     3     -    84000       84000     1.0000
```

*Percentage Example*

Employee salary as a percentage of total salary expenditure.
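The same ratio calculation can be tried in SQLite through Python's sqlite3 module. The DECIMAL cast in the DB2 query above is DB2-specific; in this sketch, multiplying by 1.0 forces floating-point division instead. The table setup is hypothetical, loaded with the article's Department 1 salaries:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table emptab (empnum int, dept int, salary int)")
con.executemany("insert into emptab values (?, ?, ?)",
                [(1, 1, 50000), (2, 1, 75000), (5, 1, 52000),
                 (6, 1, 78000), (7, 1, 75000), (11, 1, 53000)])

# Divide each salary by the reporting sum of its department's
# salaries; 1.0 * forces floating-point rather than integer division.
rows = con.execute("""
    select empnum, salary,
           1.0 * salary / sum(salary) over (partition by dept) as pct
    from emptab
    order by empnum
""").fetchall()
for r in rows:
    print(r)
```

Because every row divides by the same department total, the percentages within a department sum to 1.0.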

What happens if we introduce an ordering to the
sets used for aggregations? The answer is that, instead of
computing a *reporting* function, we compute a
*cumulative* function. A cumulative function is a
scalar-aggregate function that operates on the current row, *and
all rows in the set that precede it* with respect to the
ordering. Let's use a different table for this example. Suppose we
have a table with the monthly sales results for the current
calendar year. How do we compute the current *year-to-date*
sales figures for each month? In this case, we want to compute a
cumulative sum of the monthly sales figures. This is done as
follows:

```sql
select date, sales,
       sum(sales) over (order by date) as cume_sum,
       count(*) over (order by date) as setcount
from sales
where year(date) = 2000;
```

```
DATE       SALES        CUME_SUM     SETCOUNT
---------- ------------ ------------ ---------
01/01/2000    968871.12    968871.12         1
02/01/2000     80050.05   1048921.17         2
03/01/2000    757866.14   1806787.31         3
04/01/2000     58748.13   1865535.44         4
05/01/2000     40711.69   1906247.13         5
06/01/2000    241187.78   2147434.91         6
07/01/2000    954924.16   3102359.07         7
08/01/2000    502822.96   3605182.03         8
09/01/2000     97201.45   3702383.48         9
10/01/2000    853999.45   4556382.93        10
11/01/2000    358775.59   4915158.52        11
12/01/2000    437513.35   5352671.87        12
```

*Example of calculating cumulative sums*

Let's look at the results. For the first row, the cumulative sum is equal to the sales amount for that row. This is because there were no rows that preceded January. For the second row, the cumulative sum is equal to the sum of January and February's sales amounts (968871.12 + 80050.05 = 1048921.17). Similarly, the result for the third row is the sum of the sales for January, February, and March. The column to the right of the CUME_SUM column performs a cumulative count that gives the count of rows that were within the set. For example, the first row had only one row summed (that is, itself), the second row had two rows summed (itself and the preceding row), and so on. The graph above gives a graphical representation of the sales figures as well as the cumulative sum calculated in the preceding query.
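The cumulative behavior follows from the default window frame: when an OVER clause contains an ORDER BY, the frame runs from the start of the partition up to the current row. A small sketch in Python's sqlite3 module (SQLite 3.25 or later), using a hypothetical three-month sales table:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table sales (month int, amount real)")
con.executemany("insert into sales values (?, ?)",
                [(1, 100.0), (2, 250.0), (3, 50.0)])

# With an ORDER BY inside OVER, the default frame covers the start
# of the partition through the current row, so SUM and COUNT
# become cumulative rather than reporting functions.
rows = con.execute("""
    select month, amount,
           sum(amount) over (order by month) as cume_sum,
           count(*)    over (order by month) as setcount
    from sales
    order by month
""").fetchall()
for r in rows:
    print(r)
```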

If we had multiple years worth of
data and wanted to compute the cumulative sum by month *within
each year*, we could also use the PARTITION BY clause as
follows:

```sql
select date, sales,
       sum(sales) over (partition by year(date)
                        order by month(date)) as cume_sum
from sales
where year(date) >= 2000;
```

```
DATE       SALES        CUME_SUM
---------- ------------ -----------
01/01/2000    968871.12    968871.12
02/01/2000     80050.05   1048921.17
03/01/2000    757866.14   1806787.31
04/01/2000     58748.13   1865535.44
05/01/2000     40711.69   1906247.13
06/01/2000    241187.78   2147434.91
07/01/2000    954924.16   3102359.07
08/01/2000    502822.96   3605182.03
09/01/2000     97201.45   3702383.48
10/01/2000    853999.45   4556382.93
11/01/2000    358775.59   4915158.52
12/01/2000    437513.35   5352671.87
01/01/2001    476851.71    476851.71
02/01/2001    593768.12   1070619.83
03/01/2001    818597.97   1889217.80
...
```

*Calculating cumulative sums with the PARTITION BY clause*

Now, notice how the cumulative sum is reset for the
January 2001 row. The cume_sum equals that month's sales amount
because the data is partitioned by year, and no rows precede
January *within the year 2001*. The other
interesting thing demonstrated by this example is that the
arguments used within the OVER clause may be expressions, not just
column values. In more complicated examples, it is even possible to
nest other aggregate functions within the scalar-aggregate function
invocation. This is useful, because it's very common to perform
some kind of aggregation (for example, aggregating sales to the
month level) before performing the analysis. This raises the
following question: When are scalar-aggregate functions computed?
The answer is that these functions are computed at the time the
rest of the select list is computed. In general, the order of
evaluation of a query is as follows:

- From Clause
- Where Clause
- Group By Clause
- Having Clause
- Select List

As you can see, the select list is computed after all other parts of the query. This means that if you have predicates (in the WHERE or HAVING clause), or if you have any aggregations as a result of a GROUP BY clause, these will all be applied before the functions are evaluated. For example, let's look at the following query:

```sql
select year(date) as year, sum(sales) as sum,
       sum(sum(sales)) over (order by year(date)) as cume_sum
from sales
where year(date) >= 1995
group by year(date);
```

```
YEAR        SUM           CUME_SUM
----------- ------------- ------------
       1995    7731162.39   7731162.39
       1996    4127017.98  11858180.37
       1997    7211584.76  19069765.13
       1998    4149296.50  23219061.63
       1999    6278023.54  29497085.17
       2000    5352671.87  34849757.04
       2001    5736777.81  40586534.85
```

*Cumulative sum over an aggregation*

In this case, we access the table (named in the FROM clause) and apply the WHERE clause, then we do the GROUP BY and compute the sum of sales per year. Finally, we compute the select list, including all scalar-aggregate functions.
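This evaluation order can be made visible by writing the grouping step explicitly. The sketch below, run in Python's sqlite3 module against a hypothetical yearly sales table, uses a subquery for the GROUP BY aggregation rather than DB2's nested `sum(sum(sales)) over (...)` form, but the effect is the same: aggregate first, then apply the window sum to the grouped rows:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table sales (year int, amount int)")
con.executemany("insert into sales values (?, ?)",
                [(1995, 10), (1995, 20), (1996, 5), (1997, 7)])

# The inner query performs the GROUP BY aggregation; the outer
# window sum then runs over the already-grouped rows, mirroring
# DB2's sum(sum(amount)) over (order by year).
rows = con.execute("""
    select year, yr_sum,
           sum(yr_sum) over (order by year) as cume_sum
    from (select year, sum(amount) as yr_sum
          from sales
          group by year)
    order by year
""").fetchall()
for r in rows:
    print(r)
```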

There is one other point to be made here.
Because the scalar-aggregate functions are computed *after*
the WHERE clause, it is not possible to reference scalar-aggregate
functions in a predicate. Instead, if you wish to do this, you must
*nest* the scalar-aggregate function invocation, either within
a common table expression, or within a nested query. This becomes
useful when performing queries that return the top *n* result.
An example would be writing a query to select the 3 years that have
the highest sales totals. We can do this by ranking the yearly
sales and then selecting the rows where this ranking is 3 or
less.

```sql
with ranked_years (year, sum, rank) as
  (select year(date) as year, sum(sales) as sum,
          rank() over (order by sum(sales) desc) as rank
   from sales
   group by year(date)
  )
select year, sum, rank
from ranked_years
where rank <= 3;
```

```
YEAR        SUM           RANK
----------- ------------- -------
       1995    7731162.39       1
       1997    7211584.76       2
       1999    6278023.54       3
```

*Top n query example*

As you can see in this example, we have a common table expression that performs the aggregation to sum the sales per year, and then ranks the sales totals. Then, the outer select uses this result table and adds a predicate so only rows whose rank is <=3 (i.e. the top 3 sales totals) are returned. Similar types of queries can be done to compute medians, percentiles, and other distribution statistics.
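The same top-n pattern can be sketched in Python's sqlite3 module. This version splits the aggregation and the ranking into two common table expressions, since a window function may not appear in a WHERE clause; the table and figures are hypothetical:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table sales (year int, amount int)")
con.executemany("insert into sales values (?, ?)",
                [(1995, 300), (1995, 400), (1996, 400),
                 (1997, 650), (1998, 500)])

# Rank the yearly totals inside common table expressions, then
# filter on the rank in the outer query; window functions cannot
# be referenced directly in a predicate.
rows = con.execute("""
    with yearly (year, total) as (
        select year, sum(amount) from sales group by year
    ),
    ranked (year, total, rnk) as (
        select year, total,
               rank() over (order by total desc)
        from yearly
    )
    select year, total, rnk
    from ranked
    where rnk <= 2
    order by rnk
""").fetchall()
for r in rows:
    print(r)
```

Changing the predicate to `rnk <= 3` (or any n) generalizes this to an arbitrary top-n query.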

I hope that by now I've managed to convey a rough idea of what these new OLAP functions are and how they're used. There is actually much, much more to these functions than what I've described here. So keep your eyes open for another article that describes these functions in more detail.

The point I'd like to leave you with is that with the
implementation of these OLAP functions, DB2 has extended the
relational model so that it can now understand *ordering* with
respect to sets of data. If you've ever tried to formulate queries
where ordering matters, you know how difficult and complex these
queries can get (even simple queries like median). The OLAP
functions provide you with tools to efficiently and concisely
formulate such queries. With the increasing demands being placed
upon the DBMS, it is clear that the traditional relational model
must be extended to handle such increasingly complex analyses, and
these functions are just one example of where DB2 is pushing the
limits.

About the Author

Bob Lyle has been a DB2 developer for 13 years. He began his career in DB2 for OS/390, working on index management. During his tenure there, he was the architect of the Type 2 Index Manager and large object (LOB) Manager components of DB2. Three years ago, Bob moved to the Almaden Research Center, developing and standardizing the OLAP functions described in this column. Currently, Bob is back in the indexing world, exploring index extensions for DB2.

Bob can be reached at blyle@us.ibm.com

DISCLAIMERS AND TRADEMARKS

This information represents IBM's current intentions, goals and objectives, and is subject to change or withdrawal without additional or prior notice.

References in this publication to IBM products, programs, or services do not imply that IBM intends to make these available in all countries in which IBM operates.

TRADEMARKS

The following terms are trademarks or registered trademarks of the IBM Corporation in the United States and/or other countries: DB2, SQL, WebSphere, IBM.

Java and all Java-based trademarks and logos are trademarks or registered trademarks of Sun Microsystems, Inc. in the United States and other countries.

Other company, product, and service names may be trademarks or service marks of others.

IBM copyright and trademark information