 | Level: Introductory Bob LyleIBM Corporation
23 Oct 2001 Bob Lyle gives an overview of powerful online analytical processing (OLAP) functions that are available in DB2, including ranking, and reporting functions. These functions are part of the SQL-99 standard and provide the ability to control ordering of result sets, an important extension to the traditional relational model.
©International Business Machines Corporation 2001.
All rights reserved.
Bob Lyle
While presenting at IDUG EMEA in Florence this year,
I was surprised to find that very few users know much about the
online analytical processing (OLAP) functions that were introduced
into DB2® Universal Database Versions 6 and 7. While on the surface
they sound like some simple new functions, they are actually
extremely powerful. This power is based on how these functions
extend the relational model to understand ordering within sets of
rows. Previously, if you wanted to perform any analysis with
respect to an ordering (be that a date or other ordering) you had
to write very complex queries. Now, with the OLAP functions, this
complexity is greatly reduced. In this article, I hope to give you
a very high level overview of some of the functions we've
implemented, and some simple examples of how they're used. All of
the functionality documented in this article has been standardized
in the OLAP Amendment to the SQL99 standard.
Ranking Functions
The first OLAP functions introduced into DB2 were the
ranking functions in DB2 Version 6. These ranking functions
provide the ability to define a set (using the PARTITION clause),
and then rank the elements of the set with respect to an ordering.
For example, suppose we have an employee table and would like to
rank the employees' salaries within each department. To do this, we
need a function invocation that does the following:
- Defines the partitions (sets) to be the individual
departments
- Defines the ordering within the set to be on salary.
Traditionally, we'd rank high salaries first, so we will specify
a descending ordering for the salary. The example below shows the
query and output for this query.
|
select empnum, dept, salary,
rank() over (partition by dept
order by salary desc nulls last) as
rank,
dense_rank() over (partition by dept order by salary desc
nulls last)as denserank,
row_number() over (partition by dept order by salary desc
nulls last)as rownumber
from emptab;
EMPNUM DEPT SALARY RANK DENSERANK
ROWNUMBER
------ ---- ------ ---- ---------
---------
6 1 78000 1 1 1
2 1 75000 2 2 2
7 1 75000 2 2 3
11 1 53000 4 3 4
5 1 52000 5 4 5
1 1 50000 6 5 6
--------------------------------------------------
9 2 51000 1 1 1
4 2 - 2 2 2
|
Examples showing ranking
First, let's look at the function invocation. Notice
that there are no arguments to the rank function itself. This is
because the rank function doesn't perform any kind of calculations
on any arguments. Instead, the rank function merely looks at sets
of rows--and the position of each row within the set--as defined by
the ordering. So how do we define the set and ordering for the
function? Both are defined using the OVER clause. In this case,
because we want the ranks within each department, we define the
sets by partitioning by department. This has the effect of ensuring
that only rows with equivalent values in the dept column are ranked
against one each other. For the purposes of the ranking functions,
the terms partition and set are equivalent. Following
the PARTITION clause, we have an ORDER BY clause, which defines the
ordering within the partition. In this case, we'd like to rank high
salaries first, so we define the ordering to be on descending
salary. In addition to the descending specification, we also
specify NULLS LAST. In SQL, nulls collate high, meaning they appear
to be greater than all other non-null values. This introduces a
problem for ranking, because we probably don't want null salaries
ranked first. Thus we use the NULLS LAST clause to change the
default ordering, so that nulls are ranked last. (Note that the
NULLS LAST clause was introduced in DB2 V7; however, it is possible
to force the ordering in V6 using a CASE expression.)
Now, let's look at the output. The first six rows are
all for employees of Department 1, and the rank is assigned to each
row in descending order of salary. Notice that there are two
employees in Department 1 with a salary of 75000, and that both
rows are assigned a rank of two. This is because the rank function
provides an "Olympic"-style ranking, in which two equal values
receive an equal rank. Because there are two rows "tied for
second," there is no row ranked third. Instead, the follwing row is
ranked fourth, because three rows strictly preceded it with respect
to the ordering.
For Department 2, notice that one of the rows has a
null salary. Because we specified NULLS LAST, this row is ranked
after the non-null row. Had we not specified NULLS LAST, the
rankings of the two rows for Department 2 would have been
reversed.
By now you may be asking yourself, what are these two
other output columns, denserank and rownumber, in the above
example? DB2 actually has three different ranking functions. The
first, the rank function, provides Olympic-style rankings as
already described. Two other functions, dense_rank and
row_number, are provided. Dense_rank is much like rank, in
that "ties" are ranked equally. The only difference is that for the
value following the ties, the rank is incremented by 1 (instead of
being incremented by the number of ties, as it is for rank). Thus,
there are no gaps in the ranking (hence the name "dense"). Whereas
Employee 11's salary was given a rank of 4, denserank returned a
value of 3.
The last column gives the output of the row_number
function. Row_number also performs a ranking, but in the event of a
tie, the ties are arbitrarily (that is, non-deterministically)
ordered. This is very useful when performing analytics on data with
duplicate values. One interesting aspect of the row_number function
is that it is the only ranking function that does not require an
ordering. If row_number is invoked without a specified ordering,
then all rows are treated as if they are ties and are arbitrarily
numbered. This can be useful for numbering rows on output.
Other OLAP Functions
Many other OLAP functions were introduced in DB2
Version 7. Before these functions were introduced, DB2 supported
two basic classes of functions, scalar functions and
aggregate functions. Scalar functions are those that operate
on values within a single row, and return a single result per row.
Examples of scalar functions are arithmetic and string functions.
For example, the query below uses the DIGITS scalar function to
format the salary field. The calculation of the result is done per
row, and only the salary value within the current row is used in
the computation.
select empnum, salary,
digits(salary) as digits
from emptab
where dept = 1;
EMPNUM SALARY DIGITS
----------- ----------- ----------
1 50000 0000050000
2 75000 0000075000
5 52000 0000052000
...
|
Example showing the DIGITS scalar functions
Aggregate functions (also referred to as
column or set functions) behave differently.
Aggregate functions operate on a set of rows, and aggregate (or
combine) them into a single row in the output. An example of an
aggregate function is the sum function, where a set of values is
summed and placed into an individual result row. For example, the
following query computes the sum of all of the employees in each
department. The GROUP BY clause is used to indicate that the groups
(or partitions) to aggregate are the sets of all rows within
departments. A single row is returned for each department, giving
the total of all salaries within that department.
|
select dept, sum(salary) as sum
from emptab
group by dept;
DEPT SUM
----------- -----------
1 383000
2 51000
3 209000
- 84000
|
Example showing the SUM
aggregate function
The OLAP functions introduced in DB2 V7
introduce a new class of functions, which we'll call
scalar-aggregate functions. These functions are like scalar
functions, because they return a single value per row, but they're
also like aggregate functions, because the calculation is performed
on values from multiple rows within a set to compute the result.
The scalar-aggregate sum function below does the same calculation
as the aggregate sum function, but returns the results without
combining the rows:
select dept, salary,
sum(salary) over (partition by dept) as
deptsum,
avg(salary) over (partition by dept) as
avgsal,
count(*) over (partition by dept) as
deptcount,
max(salary) over (partition by dept) as
maxsal
from emptab;
DEPT SALARY DEPTSUM AVGSAL DEPTCOUNT
MAXSAL
----- ------- -------- ------- ---------
--------
1 50000 383000 63833 6
78000
1 75000 383000 63833 6
78000
1 52000 383000 63833 6
78000
1 78000 383000 63833 6
78000
1 75000 383000 63833 6
78000
1 53000 383000 63833 6
78000
2 - 51000 51000 2
51000
2 51000 51000 51000 2
51000
3 79000 209000 69666 3
79000
3 55000 209000 69666 3
79000
3 75000 209000 69666 3
79000
- - 84000 84000 2
84000
- 84000 84000 84000 2
84000
|
Example showing the SUM
reporting function
Notice that the query contains no GROUP BY clause.
Instead, the OVER clause is used to partition the data so that the
sum function is computed over rows in the same department, and the
sum of all the salaries in each department is returned for each row
within the department. Traditionally, a join was necessary to
include such aggregated results per row, but now the OLAP functions
provide this much simpler formulation. We refer to this type of
function as a reporting function, because the sum is
calculated over the set and reported once per row. I've used SUM
for the above example and later examples, but most existent
aggregate functions (i.e. AVG, MIN, MAX, STDEV, etc...) work with
the OVER clause. The additional columns to the right of the
DEPTSUM column show the average salary, count of employees in the
department, and maximum salary within the department. The only
aggregate functions not supported as scalar-aggregates are the
linear regression functions.
One powerful use of these reporting functions is in
calculating ratios and percentages. To calculate the percentage of
one employee's salary versus the total of the entire department's
salaries, simply divide the employee's salary by the reporting sum
of the salaries.
select empnum, dept, salary,
sum(salary) over (partition by dept) as
deptsum,
decimal(salary,10,2) /
sum(salary) over(partition by dept)as
percentage
from emptab;
EMPNUM DEPT SALARY DEPTSUM
PERCENTAGE
------ ----- -------- ----------- ----------
1 1 50000 383000 0.1305
2 1 75000 383000 0.1958
5 1 52000 383000 0.1357
6 1 78000 383000 0.2036
7 1 75000 383000 0.1958
11 1 53000 383000 0.1383
4 2 - 51000
9 2 51000 51000 1.0000
8 3 79000 209000 0.3779
10 3 55000 209000 0.2631
12 3 75000 209000 0.3588
0 - - 84000
3 - 84000 84000 1.0000
|
Percentage Example
Employee salary as a percentage of total salary expenditure.
What happens if we introduce an ordering to the
sets used for aggregations? The answer is that, instead of
computing a reporting function, we compute a
cumulative function. A cumulative function is a
scalar-aggregate function that operates on the current row, and
all rows in the set that precede it with respect to the
ordering. Let's use a different table for this example. Suppose we
have a table with the monthly sales results for the current
calendar year. How do we compute the current year-to-date
sales figures for each month? In this case, we want to compute a
cumulative sum of the monthly sales figures. This is done as
follows:
select date, sales,
sum(sales) over (order by date) as
cume_sum,
count(*) over (order by date) as
setcount
from sales
where year(date) = 2000;
DATE SALES CUME_SUM
SETCOUNT
---------- ------------ ------------
---------
01/01/2000 968871.12 968871.12
1
02/01/2000 80050.05 1048921.17
2
03/01/2000 757866.14 1806787.31
3
04/01/2000 58748.13 1865535.44
4
05/01/2000 40711.69 1906247.13
5
06/01/2000 241187.78 2147434.91
6
07/01/2000 954924.16 3102359.07
7
08/01/2000 502822.96 3605182.03
8
09/01/2000 97201.45 3702383.48
9
10/01/2000 853999.45 4556382.93
10
11/01/2000 358775.59 4915158.52
11
12/01/2000 437513.35 5352671.87 12
|
Example of calculating
cumulative sums
Let's look at the results. For the first row, the
cumulative sum is equal to the sales amount for that row. This is
because there were no rows that preceded January. For the second
row, the cumulative sum is equal to the sum of January and
February's sales amounts (968871.12 + 80050.05 =
1048921.17). Similarly, the result for the third row is the
sum of the sales for January, February, and March. The column to
the right of the CUME_SUM column performs a cumulative count that
gives the count of rows that were within the set. For example, the
first row had only one row summed (that is, itself), the second row
had two rows summed (itself and the preceding row), and so on. The
graph above gives a graphical representation of the sales figures
as well as the cumulative sum calculated in the preceding
query.
If we had multiple years worth of
data and wanted to compute the cumulative sum by month within
each year, we could also use the PARTITION BY clause as
follows:
select date, sales,
sum(sales) over (partition by
year(date)
order by month(date)) as
cume_sum
from sales
where year(date) >= 2000;
DATE SALES CUME_SUM
---------- ------------ -----------
01/01/2000 968871.12 968871.12
02/01/2000 80050.05 1048921.17
03/01/2000 757866.14 1806787.31
04/01/2000 58748.13 1865535.44
05/01/2000 40711.69 1906247.13
06/01/2000 241187.78 2147434.91
07/01/2000 954924.16 3102359.07
08/01/2000 502822.96 3605182.03
09/01/2000 97201.45 3702383.48
10/01/2000 853999.45 4556382.93
11/01/2000 358775.59 4915158.52
12/01/2000 437513.35 5352671.87
01/01/2001 476851.71 476851.71
02/01/2001 593768.12 1070619.83
03/01/2001 818597.97 1889217.80
... |
Calculating cumulative sums with
the PARTITION BY clause
Now, notice how the cumulative sum gets reset for the
January 2001 row. This is because the cume_sum is equal to the
sales amount because the data is partitioned by the year, and no
rows precede January within the year 2001. The other
interesting thing demonstrated by this example is that the
arguments used within the OVER clause may be expressions, not just
column values. In more complicated examples, it is even possible to
nest other aggregate functions within the scalar-aggregate function
invocation. This is useful, because it's very common to perform
some kind of aggregation (for example, aggregating sales to the
month level) before performing the analysis. This raises the
following question: When are scalar-aggregate functions computed?
The answer is that these functions are computed at the time the
rest of the select list is computed. In general, the order of
evaluation of a query is as follows:
- From Clause
- Where Clause
- Group By Clause
- Having Clause
- Select List
As you can see, the select list is computed after all
other parts of the query. This means that if you have predicates
(in the WHERE or HAVING clause), or if you have any aggregations as
a result of a GROUP BY clause, these will all be applied before the
functions are evaluated. For example, let's look at the following
query:
select year(date) as year, sum(sales) as
sum,
sum(sum(sales)) over (order by year(date)) as
cume_sum
from sales
where year(date) >= 1995
group by year(date);
YEAR SUM CUME_SUM
----------- ------------- ------------
1995 7731162.39 7731162.39
1996 4127017.98 11858180.37
1997 7211584.76 19069765.13
1998 4149296.50 23219061.63
1999 6278023.54 29497085.17
2000 5352671.87 34849757.04
2001 5736777.81 40586534.85
|
Cumulative sum over an
aggregation
In this case, we access the table (named in the FROM
clause) and apply the WHERE clause, then we do the GROUP BY and
compute the sum of sales per year. Finally, we compute the select
list, including all scalar-aggregate functions.
There is one other point to be made here.
Because the scalar-aggregate functions are computed after
the WHERE clause, it is not possible to reference scalar-aggregate
functions in a predicate. Instead, if you wish to do this, you must
nest the scalar-aggregate function invocation, either within
a common table expression, or within a nested query. This becomes
useful when performing queries that return the top n result.
An example would be writing a query to select the 3 years that have
the highest sales totals. We can do this by ranking the yearly
sales and then selecting the rows where this ranking is 3 or
less.
with ranked_years (year, sum, rank) as
(select year(date) as year, sum(sales) as
sum,
rank() over (order by sum(sales) desc) as
rank
from sales
group by year(date)
)
select year, sum, rank
from ranked_years
where rank <= 3;
YEAR SUM RANK
----------- ------------- -------
1995 7731162.39 1
1997 7211584.76 2
1999 6278023.54 3
|
Top n query example
As you can see in this example, we have a common
table expression that performs the aggregation to sum the sales per
year, and then ranks the sales totals. Then, the outer select uses
this result table and adds a predicate so only rows whose rank is
<=3 (i.e. the top 3 sales totals) are returned. Similar types of
queries can be done to compute medians, percentiles, and other
distribution statistics.
I hope that by now I've managed to convey a rough
idea of what these new OLAP functions are and how they're used.
There is actually much, much more to these functions than what I've
described here. So keep your eyes open for another article that
describes these functions in more detail.
The point I'd like to leave you with is that with the
implementation of these OLAP functions, DB2 has extended the
relational model so that it can now understand ordering with
respect to sets of data. If you've ever tried to formulate queries
where ordering matters, you know how difficult and complex these
queries can get (even simple queries like median). The OLAP
functions provide you with tools to efficiently and concisely
formulate such queries. With the increasing demands being placed
upon the DBMS, it is clear that the traditional relational model
must be extended to handle such increasingly complex analyses, and
these functions are just one example of where DB2 is pushing the
limits.
About the Author
Bob Lyle has been a DB2 developer for 13 years. He
began his career in DB2 for OS/390, working on index management.
During his tenure there, he was the architect of the Type 2 Index
Manager and large object (LOB) Manager components of DB2. Three
years ago, Bob moved to the Almaden Research Center, developing and
standardizing the OLAP functions described in this column.
Currently, Bob is back in the indexing world, exploring index
extensions for DB2.
Bob can be reached at blyle@us.ibm.com
DISCLAIMERS AND TRADEMARKS
This information represents IBM’s current intentions, goals and
objectives, and is subject to change or withdrawal without
additional or prior notice.
References in this publication to IBM products, programs, or
services do not imply that IBM intends to make these available in
all countries in which IBM operates.
TRADEMARKS
The following terms are trademarks or registered trademarks of
the IBM Corporation in the United States and/or other countries:
DB2, SQL, WebSphere, IBM.
Java and all Java-based trademarks and logos are trademarks or
registered trademarks of Sun Microsystems, Inc. in the United
States and other countries.
Other company, product, and service names may be trademarks or
service marks of others.
IBM copyright
and trademark information
About the author  | |  | Bob Lyle is a developerWorks contributing author. |
Rate this page
|  |