The trouble of combining DISTINCT with ORDER BY
Comments (4) Visits (42814)
Yesterday one of my coworkers from India asked for my help in migrating the following SQL statement from Sybase to DB2.
SELECT DISTINCT c1, c2 FROM t ORDER BY c3
When this statement is run in DB2 (assuming the appropriate definition of "T") the following error is being returned:
SQL0214N An expression in the ORDER BY clause in the following position, or starting with "C3" in the "ORDER BY" clause is not valid. Reason code = "2".
Why does DB2 raise this error?
The reason is simple: It's not self evident what the answer should be!
To look into this issue a bit close and come up with a possible interpretation and implementation of the request we look at an example
My pals and I like to go out for lunch to a chine mall next to IBM. Most of the time we go the same vendor, so you might say we are regulars. The ladies taking the orders know us, and if one of us doesn't join we get a jovial: "Where is your friend today?"
As a business it pays to keep track of your regular guests. If they don't come for a while the reasons could be innocuous (vacation, sickness, travel etc) or they could be relevant to you as the owner. Maybe the food isn't as good anymore, or the menu has gotten boring. So a reasonable query may be:
Who are the guests I haven't seen for the longest time?
Here is the table and some data to illustrate this example:
CREATE TABLE guests(name VARCHAR(20), firstname VARCHAR(20), visit DATE); INSERT INTO guests VALUES ('Jones' , 'Jeremy' , '2012-04-13'), ('Beaver' , 'Boris' , '2012-04-14'), ('Gervais' , 'Gerald' , '2012-04-13'), ('Jones' , 'Jeremy' , '2012-04-15'), ('Beaver' , 'Boris' , '2012-04-15'), ('Chambers', 'Charlene', '2012-04-15'), ('Dreyfus' , 'Don' , '2012-04-14'), ('Jones' , 'Jeremy' , '2012-04-14'), ('Eakes' , 'Edwin' , '2012-04-14'), ('Fox' , 'Frauke' , '2012-04-14'), ('Dreyfus' , 'Don' , '2012-04-13');
It is easy enough to provide a sorted list of customers with the oldest visits showing first:
SELECT name, firstname FROM guests ORDER BY visit; NAME FIRSTNAME ----
Well, we now know that Jeremy visited a long time ago, but he actually keeps coming back.
Gerald, on the other hand came only once and never returned.
What happens when we add a DISTINCT to eliminate the duplicates?
SELECT DISTINCT name, firstname FROM guests ORDER BY visit; SQL0214N An expression in the ORDER BY clause in the following position, or starting with "VISIT" in the "ORDER BY" clause is not valid. Reason code = "2".
Right... we knew this would happen.
To get this query to compile we need to separate the ORDER BY from the DISTINCT.
SELECT DISTINCT name, firstname FROM (SELECT name, firstname FROM guests ORDER BY visit); NAME FIRSTNAME ----
The ORDER BY got lost in the outer query. This is to be expected and according to the rules of SQL.
The output is actually sorted by name which is a side effect of the DISTINCT processing.
We could flip the order around and first do DISTINCT, but then we would have to include the "VISIT" column which would only cause us to eliminate duplicate same-day visits.
So, are we stuck? Not at all!
There are two ways to attack this problem. One without and one with OLAP:
Going old school
The classic approach requires us to do a join. We will look up the most recent visit per customer and eliminate all other rows of that same customer:
SELECT name, firstname FROM guests g WHERE visit IN (SELECT MAX(visit) FROM guests WHERE g.name = guests.name AND g.firstname = guests.firstname) ORDER BY visit; NAME FIRSTNAME ----
That did the job, but it requires a nested loop join.
Although an index on (name, firstname, visit DESC) might to wonders to speed up this query.
OLAP to the rescue
Another approach to solve the problem uses some very basic OLAP function.
Essentially we will number each persons visits from the newest to the oldest starting with "1" and then eliminate everything but the first entry per person.
If that sounds familiar you are right. In this blog I did use the technique before for de-duping.
In the end this is precisely what we are doing here.
SELECT name, firstname FROM(SELECT ROW_NUMBER() OVER(PARTITION BY name, firstname ORDER BY visit DESC) AS rn, name, firstname, visit FROM guests) WHERE rn = 1 ORDER BY visit; NAME FIRSTNAME ----
So how does this relate to DISTINCT and ORDER BY again?
Note how in the previous examples I used MAX(visit) and ORDER BY visit DESC?
That was appropriate for the business problem.
We picked the most recent version of the (name, firstname) tuple for the distinct processing.
But we could have taken the oldest (MIN(visit) and ORDER BY visit ASC respectively).
In fact a smart DBMS could process these semantics much faster since the same ordering used for ROW_NUMBER can be used to produce the result set.
We could also legally have taken the "first" or "last" row touched.
The SQL Standard does not tell which semantics would be right, so it's safer to return an error, then what might be perceived as a wrong result.