Welcome to the Metrics Maven series: Database features, tips, tricks, and code
In our Metrics Maven series, IBM Cloud Database’s data scientist shares database features, tips, tricks, and code that you can use to get the metrics you need from your data. In this first article, we’ll look at how to use window functions in PostgreSQL.
PostgreSQL window functions
If you use PostgreSQL, you’re probably already familiar with many of the common aggregate functions, such as
AVG(). However, you may not be familiar with window functions since they’re touted as an advanced feature. The good news is that window functions aren’t nearly as esoteric as they may seem.
As the name implies, window functions provide a “window” into your data, letting you perform aggregations against a set of data rows according to specified criteria that match the current row. While they are similar to standard aggregations, there are also additional functions that can only be used through window functions (such as the
RANK() function we’ll demonstrate below). In some situations, window functions can minimize the complexity of your query or even speed up the performance.
It is important to note that window functions always use the
OVER() clause, so if you see
OVER(), you’re looking at a window function. Once you get used to how the
OVER() clause is formatted, where it fits in your queries, and the kind of results you can get, you’ll soon start to see lots of ways to apply it. Let’s dive in!
Depending on the purpose and complexity of the window function you want to run, you can use
OVER() all by itself or with a handful of conditional clauses. Let’s start by looking at using
OVER() all by itself.
If the aggregation you want to run is to be performed across all the rows returned by the query and you don’t need to specify any other conditions, you can use the
OVER() clause by itself. Here’s an example of a simple window function querying a table in our IBM Cloud Databases for PostgreSQL database containing the United States Census data on estimated population:
Notice that we’re using a window function to sum the state populations over all the result rows (that’s the
OVER() you see in our query—yep, just that one little addition to an otherwise standard query). Returned, we get result rows for each state and their populations with also the population sum for the nation—that’s the aggregation we performed with our window function:
Consider how this compares to standard aggregation functions. Without the window function, the simplest thing we could do is return the national population by itself, like this, by summing the state populations:
The problem is, we don’t get any of the state level information this way. To get the same results as our window function, we’d have to do a sub-select as a derived table:
Looks ugly in comparison, doesn’t it? Using window functions, our query is much less complex and easier to understand.
In the above example, we looked at a simple window function without any additional conditions, but in many cases, you’ll want to apply some conditions in the form of additional clauses to your
OVER() clause. One is
PARTITION BY, which acts as the grouping mechanism for aggregations. The other one is
ORDER BY, which orders the results in the window frame (the set of applicable rows).
So, besides the format of the returned rows as we reviewed above, the other obvious difference with window functions is how the syntax works in your queries. Use the
OVER() clause with an aggregate function (like
AVG()) and/or with a specialized window function (like
ROW_NUMBER()) in your
SELECTlist to indicate you’re creating a window and apply additional conditions as necessary to the
OVER() clause, such as using
PARTITION BY (instead of the
GROUP BY you may be used to for aggregation).
Let’s look at some specific examples.
PARTITION BY allows us to group aggregations according to the values of the specified fields.
In our census data for estimated population, each state is categorized according to the division and region it belongs to. Let’s partition first by region:
Now we can see the population sum by region but still get the state level data:
Let’s add division:
Now we’re looking at state-level data, broken out by region and division, with a population summary at the division level:
As you’ve probably noticed in the previous queries, we’re using
ORDER BY in the usual way to order the results by the state name, but we can also use
ORDER BY in our
OVER() clause to impact the window function calculation. For example, we’d want to use
ORDER BY as a condition for the
RANK() window function since ranking requires an order to be established. Let’s rank the states according to highest population:
In this case, we’ve added
ORDER BY popestimate2015 desc as a condition of our
OVER() clause in order to describe how the ranking should be performed. Because we still have our
ORDER BY name clause for our result set, though, our results will continue to be in state name order, but we’ll see the populations ranked accordingly, with California coming in at number one based on its population:
Let’s combine our
PARTITION BY and our
ORDER BY window function clauses now to see the ranking of the states by population within each region. For this, we’ll change our result-level
ORDER BY name clause at the end to order by region instead so that it’ll be clear how our window function works:
Here we can see that Illinois is the top-ranking state by population in the Midwest region and New York is number one in the Northeast region.
So, we combined some conditions here, but what if we need more than one window function?
Named window functions
In queries where you are using the same window function logic for more than one returned field or where you need to use more than one window function definition, you can name them to make your query more readable.
Here’s an example where we’ve defined two windows functions. One, named “rw,” partitions by region and the other, named “dw,” partitions by division. We’re using each one twice—once to calculate the population sum and again to calculate the population average. Our windows functions are defined and named using the
WINDOW clause which comes after the
WHERE clause in our query:
Since we didn’t do any manipulation on the averages values yet, the numbers look a little crazy; that can be easily cleaned up using
CAST(), if need be. Our purpose here is to demonstrate how to use multiple window functions and the results you’ll get. Check it out:
Now that’s an informative report of population metrics, and window functions made it easy!
This article has given you a glimpse of the power of PostgreSQL window functions. We touched on the benefits of using window functions, looked at how they are different (and similar) to standard aggregation functions, and learned how to use them with various conditional clauses, walking through examples along the way. Now that you can see how window functions work, start trying them out by replacing standard aggregations with window functions in your queries. Once you get the hang of them, you’ll be hooked.
In our next article, we’ll look at window framing options in PostgreSQL to give you even more control over how your window functions behave.