Meet the Experts
Bob Lyle Talks About OLAP Functions in DB2 for UNIX, Windows, and OS/2
Â©International Business Machines Corporation 2001. All rights reserved.
While presenting at IDUG EMEA in Florence this year, I was surprised to find that very few users know much about the online analytical processing (OLAP) functions that were introduced into DB2Â® Universal Database Versions 6 and 7. While on the surface they sound like some simple new functions, they are actually extremely powerful. This power is based on how these functions extend the relational model to understand ordering within sets of rows. Previously, if you wanted to perform any analysis with respect to an ordering (be that a date or other ordering) you had to write very complex queries. Now, with the OLAP functions, this complexity is greatly reduced. In this article, I hope to give you a very high level overview of some of the functions we've implemented, and some simple examples of how they're used. All of the functionality documented in this article has been standardized in the OLAP Amendment to the SQL99 standard.
The first OLAP functions introduced into DB2 were the ranking functions in DB2 Version 6. These ranking functions provide the ability to define a set (using the PARTITION clause), and then rank the elements of the set with respect to an ordering. For example, suppose we have an employee table and would like to rank the employees' salaries within each department. To do this, we need a function invocation that does the following:
- Defines the partitions (sets) to be the individual departments
- Defines the ordering within the set to be on salary.
Traditionally, we'd rank high salaries first, so we will specify a descending ordering for the salary. The example below shows the query and output for this query.
select empnum, dept, salary,
rank() over (partition by dept
order by salary desc nulls last) as rank,
dense_rank() over (partition by dept order by salary desc nulls last)as denserank,
row_number() over (partition by dept order by salary desc nulls last)as rownumber
EMPNUM DEPT SALARY RANK DENSERANK ROWNUMBER
------ ---- ------ ---- --------- ---------
6 1 78000 1 1 1
2 1 75000 2 2 2
7 1 75000 2 2 3
11 1 53000 4 3 4
5 1 52000 5 4 5
1 1 50000 6 5 6
9 2 51000 1 1 1
4 2 - 2 2 2
Examples showing ranking
First, let's look at the function invocation. Notice that there are no arguments to the rank function itself. This is because the rank function doesn't perform any kind of calculations on any arguments. Instead, the rank function merely looks at sets of rows--and the position of each row within the set--as defined by the ordering. So how do we define the set and ordering for the function? Both are defined using the OVER clause. In this case, because we want the ranks within each department, we define the sets by partitioning by department. This has the effect of ensuring that only rows with equivalent values in the dept column are ranked against one each other. For the purposes of the ranking functions, the terms partition and set are equivalent. Following the PARTITION clause, we have an ORDER BY clause, which defines the ordering within the partition. In this case, we'd like to rank high salaries first, so we define the ordering to be on descending salary. In addition to the descending specification, we also specify NULLS LAST. In SQL, nulls collate high, meaning they appear to be greater than all other non-null values. This introduces a problem for ranking, because we probably don't want null salaries ranked first. Thus we use the NULLS LAST clause to change the default ordering, so that nulls are ranked last. (Note that the NULLS LAST clause was introduced in DB2 V7; however, it is possible to force the ordering in V6 using a CASE expression.)
Now, let's look at the output. The first six rows are all for employees of Department 1, and the rank is assigned to each row in descending order of salary. Notice that there are two employees in Department 1 with a salary of 75000, and that both rows are assigned a rank of two. This is because the rank function provides an "Olympic"-style ranking, in which two equal values receive an equal rank. Because there are two rows "tied for second," there is no row ranked third. Instead, the follwing row is ranked fourth, because three rows strictly preceded it with respect to the ordering.
For Department 2, notice that one of the rows has a null salary. Because we specified NULLS LAST, this row is ranked after the non-null row. Had we not specified NULLS LAST, the rankings of the two rows for Department 2 would have been reversed.
By now you may be asking yourself, what are these two other output columns, denserank and rownumber, in the above example? DB2 actually has three different ranking functions. The first, the rank function, provides Olympic-style rankings as already described. Two other functions, dense_rank and row_number, are provided. Dense_rank is much like rank, in that "ties" are ranked equally. The only difference is that for the value following the ties, the rank is incremented by 1 (instead of being incremented by the number of ties, as it is for rank). Thus, there are no gaps in the ranking (hence the name "dense"). Whereas Employee 11's salary was given a rank of 4, denserank returned a value of 3.
The last column gives the output of the row_number function. Row_number also performs a ranking, but in the event of a tie, the ties are arbitrarily (that is, non-deterministically) ordered. This is very useful when performing analytics on data with duplicate values. One interesting aspect of the row_number function is that it is the only ranking function that does not require an ordering. If row_number is invoked without a specified ordering, then all rows are treated as if they are ties and are arbitrarily numbered. This can be useful for numbering rows on output.
Other OLAP Functions
Many other OLAP functions were introduced in DB2 Version 7. Before these functions were introduced, DB2 supported two basic classes of functions, scalar functions and aggregate functions. Scalar functions are those that operate on values within a single row, and return a single result per row. Examples of scalar functions are arithmetic and string functions. For example, the query below uses the DIGITS scalar function to format the salary field. The calculation of the result is done per row, and only the salary value within the current row is used in the computation.
select empnum, salary,
digits(salary) as digits
where dept = 1;
EMPNUM SALARY DIGITS
----------- ----------- ----------
1 50000 0000050000
2 75000 0000075000
5 52000 0000052000
Example showing the DIGITS scalar functions
Aggregate functions (also referred to as column or set functions) behave differently. Aggregate functions operate on a set of rows, and aggregate (or combine) them into a single row in the output. An example of an aggregate function is the sum function, where a set of values is summed and placed into an individual result row. For example, the following query computes the sum of all of the employees in each department. The GROUP BY clause is used to indicate that the groups (or partitions) to aggregate are the sets of all rows within departments. A single row is returned for each department, giving the total of all salaries within that department.
select dept, sum(salary) as sum
group by dept;
Example showing the SUM aggregate function
The OLAP functions introduced in DB2 V7 introduce a new class of functions, which we'll call scalar-aggregate functions. These functions are like scalar functions, because they return a single value per row, but they're also like aggregate functions, because the calculation is performed on values from multiple rows within a set to compute the result. The scalar-aggregate sum function below does the same calculation as the aggregate sum function, but returns the results without combining the rows:
select dept, salary,
sum(salary) over (partition by dept) as deptsum,
avg(salary) over (partition by dept) as avgsal,
count(*) over (partition by dept) as deptcount,
max(salary) over (partition by dept) as maxsal
DEPT SALARY DEPTSUM AVGSAL DEPTCOUNT MAXSAL
----- ------- -------- ------- --------- --------
1 50000 383000 63833 6 78000
1 75000 383000 63833 6 78000
1 52000 383000 63833 6 78000
1 78000 383000 63833 6 78000
1 75000 383000 63833 6 78000
1 53000 383000 63833 6 78000
2 - 51000 51000 2 51000
2 51000 51000 51000 2 51000
3 79000 209000 69666 3 79000
3 55000 209000 69666 3 79000
3 75000 209000 69666 3 79000
- - 84000 84000 2 84000
- 84000 84000 84000 2 84000
Example showing the SUM reporting function
Notice that the query contains no GROUP BY clause. Instead, the OVER clause is used to partition the data so that the sum function is computed over rows in the same department, and the sum of all the salaries in each department is returned for each row within the department. Traditionally, a join was necessary to include such aggregated results per row, but now the OLAP functions provide this much simpler formulation. We refer to this type of function as a reporting function, because the sum is calculated over the set and reported once per row. I've used SUM for the above example and later examples, but most existent aggregate functions (i.e. AVG, MIN, MAX, STDEV, etc...)Â work with the OVER clause. The additional columns to the right of the DEPTSUM column show the average salary, count of employees in the department, and maximum salary within the department. The only aggregate functions not supported as scalar-aggregates are the linear regression functions.
One powerful use of these reporting functions is in calculating ratios and percentages. To calculate the percentage of one employee's salary versus the total of the entire department's salaries, simply divide the employee's salary by the reporting sum of the salaries.
select empnum, dept, salary,
sum(salary) over (partition by dept) as deptsum,
sum(salary) over(partition by dept)as percentage
EMPNUM DEPT SALARY DEPTSUM PERCENTAGE
------ ----- -------- ----------- ----------
1 1 50000 383000 0.1305
2 1 75000 383000 0.1958
5 1 52000 383000 0.1357
6 1 78000 383000 0.2036
7 1 75000 383000 0.1958
11 1 53000 383000 0.1383
4 2 - 51000
9 2 51000 51000 1.0000
8 3 79000 209000 0.3779
10 3 55000 209000 0.2631
12 3 75000 209000 0.3588
0 - - 84000
3 - 84000 84000 1.0000
Employee salary as a percentage of total salary expenditure.
What happens if we introduce an ordering to the sets used for aggregations? The answer is that, instead of computing a reporting function, we compute a cumulative function. A cumulative function is a scalar-aggregate function that operates on the current row, and all rows in the set that precede it with respect to the ordering. Let's use a different table for this example. Suppose we have a table with the monthly sales results for the current calendar year. How do we compute the current year-to-date sales figures for each month? In this case, we want to compute a cumulative sum of the monthly sales figures. This is done as follows:
select date, sales,
sum(sales) over (order by date) as cume_sum,
count(*) over (order by date) as setcount
where year(date) = 2000;
DATE SALES CUME_SUM SETCOUNT
---------- ------------ ------------ ---------
01/01/2000 968871.12 968871.12 1
02/01/2000 80050.05 1048921.17 2
03/01/2000 757866.14 1806787.31 3
04/01/2000 58748.13 1865535.44 4
05/01/2000 40711.69 1906247.13 5
06/01/2000 241187.78 2147434.91 6
07/01/2000 954924.16 3102359.07 7
08/01/2000 502822.96 3605182.03 8
09/01/2000 97201.45 3702383.48 9
10/01/2000 853999.45 4556382.93 10
11/01/2000 358775.59 4915158.52 11
12/01/2000 437513.35 5352671.87 12
Example of calculating cumulative sums
Let's look at the results. For the first row, the cumulative sum is equal to the sales amount for that row. This is because there were no rows that preceded January. For the second row, the cumulative sum is equal to the sum of January and February's sales amounts (968871.12 + 80050.05 = 1048921.17). Similarly, the result for the third row is the sum of the sales for January, February, and March. The column to the right of the CUME_SUM column performs a cumulative count that gives the count of rows that were within the set. For example, the first row had only one row summed (that is, itself), the second row had two rows summed (itself and the preceding row), and so on. The graph above gives a graphical representation of the sales figures as well as the cumulative sum calculated in the preceding query.
If we had multiple years worth of data and wanted to compute the cumulative sum by month within each year, we could also use the PARTITION BY clause as follows:
select date, sales,
sum(sales) over (partition by year(date)
order by month(date)) as cume_sum
where year(date) >= 2000;
DATE SALES CUME_SUM
---------- ------------ -----------
01/01/2000 968871.12 968871.12
02/01/2000 80050.05 1048921.17
03/01/2000 757866.14 1806787.31
04/01/2000 58748.13 1865535.44
05/01/2000 40711.69 1906247.13
06/01/2000 241187.78 2147434.91
07/01/2000 954924.16 3102359.07
08/01/2000 502822.96 3605182.03
09/01/2000 97201.45 3702383.48
10/01/2000 853999.45 4556382.93
11/01/2000 358775.59 4915158.52
12/01/2000 437513.35 5352671.87
01/01/2001 476851.71 476851.71
02/01/2001 593768.12 1070619.83
03/01/2001 818597.97 1889217.80
Calculating cumulative sums with the PARTITION BY clause
Now, notice how the cumulative sum gets reset for the January 2001 row. This is because the cume_sum is equal to the sales amount because the data is partitioned by the year, and no rows precede January within the year 2001. The other interesting thing demonstrated by this example is that the arguments used within the OVER clause may be expressions, not just column values. In more complicated examples, it is even possible to nest other aggregate functions within the scalar-aggregate function invocation. This is useful, because it's very common to perform some kind of aggregation (for example, aggregating sales to the month level) before performing the analysis. This raises the following question: When are scalar-aggregate functions computed? The answer is that these functions are computed at the time the rest of the select list is computed. In general, the order of evaluation of a query is as follows:
- From Clause
- Where Clause
- Group By Clause
- Having Clause
- Select List
As you can see, the select list is computed after all other parts of the query. This means that if you have predicates (in the WHERE or HAVING clause), or if you have any aggregations as a result of a GROUP BY clause, these will all be applied before the functions are evaluated. For example, let's look at the following query:
select year(date) as year, sum(sales) as sum,
sum(sum(sales)) over (order by year(date)) as cume_sum
where year(date) >= 1995
group by year(date);
YEAR SUM CUME_SUM
----------- ------------- ------------
1995 7731162.39 7731162.39
1996 4127017.98 11858180.37
1997 7211584.76 19069765.13
1998 4149296.50 23219061.63
1999 6278023.54 29497085.17
2000 5352671.87 34849757.04
2001 5736777.81 40586534.85
Cumulative sum over an aggregation
In this case, we access the table (named in the FROM clause) and apply the WHERE clause, then we do the GROUP BY and compute the sum of sales per year. Finally, we compute the select list, including all scalar-aggregate functions.
There is one other point to be made here. Because the scalar-aggregate functions are computed after the WHERE clause, it is not possible to reference scalar-aggregate functions in a predicate. Instead, if you wish to do this, you must nest the scalar-aggregate function invocation, either within a common table expression, or within a nested query. This becomes useful when performing queries that return the top n result. An example would be writing a query to select the 3 years that have the highest sales totals. We can do this by ranking the yearly sales and then selecting the rows where this ranking is 3 or less.
with ranked_years (year, sum, rank) as
(select year(date) as year, sum(sales) as sum,
rank() over (order by sum(sales) desc) as rank
group by year(date)
select year, sum, rank
where rank <= 3;
YEAR SUM RANK
----------- ------------- -------
1995 7731162.39 1
1997 7211584.76 2
1999 6278023.54 3
Top n query example
As you can see in this example, we have a common table expression that performs the aggregation to sum the sales per year, and then ranks the sales totals. Then, the outer select uses this result table and adds a predicate so only rows whose rank is <=3 (i.e. the top 3 sales totals) are returned. Similar types of queries can be done to compute medians, percentiles, and other distribution statistics.
I hope that by now I've managed to convey a rough idea of what these new OLAP functions are and how they're used. There is actually much, much more to these functions than what I've described here. So keep your eyes open for another article that describes these functions in more detail.
The point I'd like to leave you with is that with the implementation of these OLAP functions, DB2 has extended the relational model so that it can now understand ordering with respect to sets of data. If you've ever tried to formulate queries where ordering matters, you know how difficult and complex these queries can get (even simple queries like median). The OLAP functions provide you with tools to efficiently and concisely formulate such queries. With the increasing demands being placed upon the DBMS, it is clear that the traditional relational model must be extended to handle such increasingly complex analyses, and these functions are just one example of where DB2 is pushing the limits.
About the Author
Bob Lyle has been a DB2 developer for 13 years. He began his career in DB2 for OS/390, working on index management. During his tenure there, he was the architect of the Type 2 Index Manager and large object (LOB) Manager components of DB2. Three years ago, Bob moved to the Almaden Research Center, developing and standardizing the OLAP functions described in this column. Currently, Bob is back in the indexing world, exploring index extensions for DB2.
Bob can be reached at email@example.com
DISCLAIMERS AND TRADEMARKS
This information represents IBMâs current intentions, goals and objectives, and is subject to change or withdrawal without additional or prior notice.
References in this publication to IBM products, programs, or services do not imply that IBM intends to make these available in all countries in which IBM operates.
The following terms are trademarks or registered trademarks of the IBM Corporation in the United States and/or other countries: DB2, SQL, WebSphere, IBM.
Java and all Java-based trademarks and logos are trademarks or registered trademarks of Sun Microsystems, Inc. in the United States and other countries.
Other company, product, and service names may be trademarks or service marks of others.