Who's the most popular kid on the block?
Comments (8) Visits (8023)
Roberto from Italy sent me the following question:
Is there something like the STATS_MODE() function in DB2?
He says STATS_MODE() is supposed to return the most frequently occurring value from a multi-set of values.
A bit of rifling through the internet, steering clear of zones forbidden to me, reveals that STATS_MODE is an aggregate function similar to SUM() or AVG().
The function returns the most frequent value within a group.
If there are two equally frequent values one of them will be picked with no specific rule.
DB2 does not have a function by that name (It's a non intuitive name to me I must say).
Neither does DB2 have a function under another name that does the same thing.
But how hard can it be to solve the same problem using common SQL structures? Let's take a look.
I couldn't resist the subject line, so my scenario will, for once, not use employees.
Imagine a bunch of kids and suburbs.
We are social workers (or social marketing experts staring at Facebook...) trying to find the key influencers in each neighborhood.
CREATE TABLE kids(name VARCHAR(20), suburb VARCHAR(20), likes VARCHAR(20)); INSERT INTO kids VALUES ('Anton', 'Scarborough', 'Mona'), ('Brenda', 'Cab
Step by Step
To find the most frequently occurring names under "LIKES" per "SUBURB" we need to first count each occurrence:
SELECT suburb, likes, count(*)
SELECT suburb, likes, count(*) as cnt
Things are clearing up a bit.
Looking at the result we know we want: Mona, Charley, Quentin, Anton and Olga.
How can we filter them out?
We need to get the first row of this result set for each suburb.
For such tasks I always like the ROW_NUMBER() OVER() function.
SELECT suburb, likes, cnt, ROW_NUMBER() OVER(PARTITION BY suburb ORDER BY cnt DESC) AS rn FROM (SELECT suburb, likes, count(*) as cnt
This is pretty close. Now we just need to filter out all but the first rows using "RN" in a predicate.
SELECT suburb, likes FROM (SELECT suburb, likes, ROW_NUMBER() OVER(PARTITION BY suburb ORDER BY cnt DESC) AS rn FROM (SELECT suburb, likes, count(*) as cnt
Granted this is longer than:
SELECT suburb, STATS_MODE(likes)
So STATS_MODE() seems like a handy function if you do this sort of thing a lot.
Now, is the solution we have above a workaround, an emulation perhaps which is slower than the real deal would be?
For that we should look at the explain and see what DB2 does with this query
Rows RETURN ( 1) Cost I/O | 5 FILTER ( 2) 6.85026 1 | 25 TBSCAN ( 3) 6.83917 1 | 25 SORT ( 4) 6.83629 1 | 25 GRPBY ( 5) 6.81861 1 | 25 TBSCAN ( 6) 6.81617 1 | 25 SORT ( 7) 6.81329 1 | 25 TBSCAN ( 8) 6.80028 1 | 25 TABLE: ADMINISTRATOR KIDS Q1
Well, I can't say for certain that this is the best possible plan.
Of course having the right index could eliminate SORT(7).
But there may be a smart way to avoid SORT(4).
All we want is to preserve the row of the maximum "CNT" per partition.And the partitions are already sorted.