SQL OLAP Functions: A Primer
Informix is adding a set of SQL OLAP functions in an upcoming release. The following write-up is to introduce the concept of such functions which may be new to Informix users. It is not meant to be a reference manual that lists the complete syntax and optional parameters. These analytic functions are valuable to users because they simplify the expression of complex but commonly used business questions and are efficient without resorting to subqueries or other calculations.
OLAP is OnLine Analytical Processing (as opposed to OnLine Transaction Processing – OLTP). OLAP is generally described as part of Business Intelligence that allows users to perform business reporting for sales, marketing, inventory control, business processes, etc. It usually works with multi-dimensional databases requiring frequent roll-up, roll-down and slicing and dicing.
While OLAP is the overall encompassing term for the type of processing to be done, there are specific functions known as SQL OLAP functions that are supported by different database systems and/or BI tools such as Cognos, Business Object, etc.
SQL OLAP functions are not new. In fact, they adhere to and extend the ANSI SQL-99 standards, known as the “Elementary OLAP” package. Many other databases and BI tools have provided these functions for some number of years, but SQL OLAP functions have not been supported by the Informix Dynamic Server (IDS) until now.
The OLAP functions to be supported by Informix are as follows:
· Ranking RANK(), DENSE_RANK(), DENSERANK(), CUME_DIST(), PERCENT_RANK(), NTILE() · Numbering ROW_NUMBER(), ROWNUMBER() · Aggregate RATIO_TO_REPORT, RATIOTOREPORT · First/Last FIRST, LAST
Note that if the underlying database does not support these OLAP functions, as in the case with Informix currently, then the user must rely on the BI tool to perform these calculations. If the underlying database system supports such OLAP functions, users can choose to submit SQL directly to the database or to submit it through the BI tool. In order to take advantage of the underlying data base support of OLAP functions, the tool must recognize this fact and generates the appropriate SQL to take advantage of it. The difference in performance can be enormous, as in orders of magnitude differences. A new version of Cognos Enterprise BI will generate SQL to take advantage of the OLAP functions implemented in IDS in an upcoming release.
SQL OLAP Functions: Concepts
A window partition is a subset of rows returned by a query, as defined by one or more columns in a special OVER() clause. SQL OLAP functions are applied to these partitioned result sets within the scope of a single query expression.
Olap_function () over (partition by col1, col2…)
Non-analytic functions compute their calculations with respect to every row in the result set. However, an OLAP function in a query is applied with respect to the contents of its window partitions. An example with the RANK function is as follows:
Select date, store_name, sum(dollars) as sales_dols,
rank () over (partition by date order by sales_dols desc) as date_rank
from period, store, sales
where period.perkey = sales.perkey
and store.storekey = sales.storekey
and state = “CA”
group by date, store_name
order by date;
DATE STORE_NAME SALES_DOL DATE_RANK
2012-01-02 Beaches Brew 785.55 1
2012-01-02 Roasters,
2012-01-02
2012-01-02
2012-01-02 Instant Coffee 457.75 5
2012-01-03 Instant Coffee 713.75 1
2012-01-03
… …
2012-01-04 Instant Coffee 1031.50 1
2012-01-04
… …
The query returns ordinal rankings for sales figures. The stores are ranked according to their sales totals for each value; as the date changes, the ranking values are reset. Multiple sets of ranked values (1 to n) are computed within a single query expression.
In practice, OLAP functions allow application developers to compose analytic business queries more easily and more efficiently. For example, moving averages and moving sums can be calculated over various intervals, aggregations and ranks can be reset as selected column values change; and complex ratios can be expressed in simple terms. Within the scope of a single query expression, several different OLAP functions can be defined, each with its own partitioning rules.
Types of OLAP Functions
OLAP functions supported by Informix fall into four categories:
- Ranking
- RANK
- DENSE_RANK and DENSERANK (synonyms)
- NTILE
- Numbering
- ROW_NUMBER and ROWNUMBER (synonyms)
- Aggregation
- AVG
- COUNT
- MIN
- MAX
- SUM
- Ration-to-report
- RATIO_TO_REPORT AND RATIOTOREPORT (synonyms)
The OVER Clause
The OLAP OVER clause differentiates OLAP functions from other analytic or reporting functions. The OVER clause has three distinct capabilities:
- Defining window partitions (PARTITION BY clause)
- Ordering rows within partitions (ORDER BY clause)
- Defining window frames (ROWS/RANGE specification)
Depending on the type of OLAP function in question (ranking, aggregation, etc.), the OVER clause is subject to different usage rules. The following discussion explains the concepts of window partitions, ordering partitioned rows, and window frames in general terms.
OLAP Window Partitions
A window partition is a set of rows that are grouped together for the purpose of applying an OLAP function. The OLAP function is applied to every row, but is calculated with respect to the rows in the partition. If no partition is specified, the OLAP function is computed over the complete intermediate result set.
Ordering Rows within Partitions
The OLAP ORDER BY clause is separate and distinct from the ORDER BY clause that can be appended to any query expression. Typically, the ORDER BY clause for an OLAP function defines the expressions for sorting rows within window partitions; however, the ORDER BY clause can be used without a preceding PARTITION BY clause, in which case the sort specification ensures that the OLAP function is applied to a meaningful (and intended) ordering of the intermediate result set.
This specification is a prerequisite for the ranking family of OLAP functions; it is the ORDER BY clause, not an argument to the function itself, that identifies the measures(s) for the ranking values. In the case of OLAP aggregations, the ORDER BY clause is not required in general, but it is a prerequisite to defining a window frame. The partitioned rows must be sorted before the appropriate aggregate values can be computed for each frame.
In the following example, the window partition is defined by the “Dimension” column. The five rows that contain the value “A” comprise a window, as do the five rows for “B” and “C”. The Meaure column is the input data for an OLAP RANK function; the rows within each partition are ordered by the Measure values. When the RANK function is applied, it is calculated over each partition. The Measure values are ranked 1 through 5 within each partition.
The OLAP ROW_NUMBER function is also calculated in this example, without a PARRITION BY clause, to produce consecutive row numbers for the entire result set.
Row | Dimension | Measure | OLAP_RANK |
1 | A | 10 | 1 |
2 | A | 50 | 2 |
3 | A | 100 | 3 |
4 | A | 120 | 4 |
5 | A | 500 | 5 |
6 | B | 50 | 1 |
7 | B | 60 | 2 |
8 | B | 200 | 3 |
9 | B | 220 | 4 |
10 | B | 600 | 5 |
11 | C | 40 | 1 |
12 | C | 300 | 2 |
13 | C | 500 | 3 |
14 | C | 1000 | 4 |
15 | C | 2000 | 5 |
The SQL for this query would look like this:
Select row_number() over () as row,
dimension,
measure,
rank () over (partition by dimension order by measure)
as Olap_rank
from …;
You can define window partitions according to the values in a single dimension or you can specify multiple dimensions. For example, you could partition the rows based on city and state values or month, quarter, and year values.
Window Frames
For non-ranking OLAP functions, such as aggregations, you can define a window frame. A window frame defines a set of rows within a window partition. When a window frame is defined, the OLAP function is computed with respect to the contents of this moving frame rather than the fixed contents of the whole partition.
The definition of a window frame can be row-based (ROWS specification) or value-based (RANGE specification).
Row-Based Window Frames
The reference point for all window frames in the current row. The SQL OLAP syntax provides mechanisms for defining a row-based window frame as any number of rows preceding and/or following the current row.
In the following example, rows 1 through 5 represent a partition; each row becomes the current row as the OLAP window frame slides forward. The frame in this case is defined as between current row and 2 following, so each frame includes a maximum of three rows and a minimum of one row. When the frame reaches the end of the partition, only the current row is included. The shaded areas indicate which rows are excluded from the frame at each step.
In other words, the window frame in this case imposes the following rules:
- When row 1 is the current row, rows 4 and 5 are excluded
- When row 2 is the current row, rows 5 and 1 are excluded
- When row 3 is the current row, rows 1 and 2 are excluded
- When row 4 is the current row, rows 1 to 3 are excluded
- When row 5 is the current row, rows 1 to 4 are excluded
The following example applies these rules to a specific set of values, showing the OLAP AVG function that would be calculated for each row. The sliding calculations produce a moving average with an interval of three rows or fewer, depending on which row is the current one.
Row | Dimension | Measure | OLAP_AVG |
1 A 10 53.3
2 A 50 90.0
3 A 100 240
4 A 120 310
5 A 500 500
The SQL for this query would look like this:
Select row_number () over () as row,
dimension,
measure,
avg (measure) over (partition by dimension)
order by measure
rows between current row and 2 following) as OLAP_AVG
from ….
The averages are computed as follows:
- Row 1 = (10+50+100)/3 = 53.3
- Row 2 = (50+100+120)/3 = 90.0
- Row 3 = (100+120+500)/3 = 240
- Row 4 = (120+500)/2 = 310
- Row 5 = (500)/1 = 500
Similar calculations would be implemented for all subsequent partitions (B, C, etc.) in the result set.
Value-Based Window Frames
The previous example demonstrates a row-based window frame definition. The SQL OLAP syntax also supports another kind of window frame, whose limits are defined in terms of a value-based or range-based set of rows rather than a specific sequence of rows.
For example, a frame could be defined as the set of rows with Year values some number of years preceding or following the current row’s year. A specific example might be:
Order by year asc range between current row and 1 preceding
where 1 preceding means the current row’s Year value minus 1
This kind of range specification is inclusive. If the current row’s Year value is 2000, all rows in the window partition with Year values 2000 and 1999 qualify for the frame, regardless of the physical position of those rows in the partition. The rules for including and excluding value-based rows are quite different from the rules applied to row-based frames, which depend entirely on the physical sequence of rows.
Put in the context of an OLAP AVG calculation, the following partial result set further demonstrate the concept of a value-based window frame. The frame consists of rows that:
- Have the same year as the current row
- Have the same year as the current row minus 1:
Row | Dimension | Year | Measure | Olap_Avg |
1 | A | 1999 | 10000 | 10000 |
2 | A | 2001 | 5000 | 3000 |
3 | A | 2001 | 1000 | 3000 |
4 | A | 2002 | 12000 | 5250 |
5 | A | 2002 | 3000 | 5250 |
The SQL for this query would look like this:
select row_number() over () as row,
dimension,
year,
measure,
avg (measure) over (partition by dimension
order by year asc
range between current row and 1 preceding ) as olap_avg
from …
The averages are computed as follows:
- Row 1 = 1999; rows 2 to 5 are excluded; AVG = 10000/1 = 10000
- Row 2 = 2001; rows 1, 4, 5 are excluded; AVG = (5000+1000)/2 = 3000
- Row 3 = 2001; rows 1, 4, 5 are excluded; AVG = (5000+1)000/2 = 3000
- Row 4 = 2002; row 1 is excluded; AVG = (5000+1000+12000+3000)/4 = 5250
- Row 5 = 2002; row 1 is excluded; AVG = (5000+1000+12000+3000)/4 = 5250
Reference:
Most of the information provided here can be found in various manuals of database systems/BI tools that have implemented SQL OLAP standard. The examples used here are taken from the Red Brick Warehouse SQL Reference Guide with slight modifications.