# Background

Roberto from Italy sent me the following question:

Is there something like the STATS_MODE() function in DB2?

He says STATS_MODE() is supposed to return the most frequently occurring value from a multi-set of values.

Is there something like the STATS_MODE() function in DB2?

He says STATS_MODE() is supposed to return the most frequently occurring value from a multi-set of values.

A bit of rifling through the internet, steering clear of zones forbidden to me, reveals that STATS_MODE is an aggregate function similar to SUM() or AVG().

The function returns the most frequent value within a group.

The function returns the most frequent value within a group.

If there are two equally frequent values one of them will be picked with no specific rule.

DB2 does not have a function by that name (It's a non intuitive name to me I must say).

Neither does DB2 have a function under another name that does the same thing.

Neither does DB2 have a function under another name that does the same thing.

But how hard can it be to solve the same problem using common SQL structures? Let's take a look.

# Scenario

I couldn't resist the subject line, so my scenario will, for once, not use employees.

Imagine a bunch of kids and suburbs.

We are social workers (or social marketing experts staring at Facebook...) trying to find the key influencers in each neighborhood.

We are social workers (or social marketing experts staring at Facebook...) trying to find the key influencers in each neighborhood.

CREATE TABLE kids(name VARCHAR(20), suburb VARCHAR(20), likes VARCHAR(20)); INSERT INTO kids VALUES ('Anton', 'Scarborough', 'Mona'), ('Brenda', 'Cabbagetown','Charley'), ('Charley', 'Danforth', 'Norbert'), ('Dora', 'Uptown', 'Olga'), ('Edwin', 'Scarborough', 'Laura'), ('Fleur', 'Regent Park', 'Kyle'), ('George', 'Scarborough', 'Anton'), ('Hong', 'Cabbagetown', 'Mona'), ('Ida', 'Danforth', 'Norbert'), ('Jay', 'Uptown', 'Olga'), ('Kyle', 'Regent Park', 'Quentin'), ('Laura', 'Scarborough', 'Piotr'), ('Mona', 'Cabbagetown', 'Stacy'), ('Norbert', 'Danforth', 'Charley'), ('Olga', 'Uptown', 'Vera'), ('Piotr', 'Scarborough', 'Anton'), ('Quentin', 'Regent Park', 'Olga'), ('Rita', 'Scarborough', 'Anton'), ('Stacy', 'Cabbagetown', 'Mona'), ('Ulf', 'Danforth', 'Charley'), ('Vera', 'Uptown', 'Olga'), ('Wendy', 'Regent Park', 'Quentin'), ('Xavier', 'Scarborough', 'Mona'), ('Yvette', 'Cabbagetown', 'Stacy'), ('Zoe', 'Danforth', 'Charley');

# Step by Step

To find the most frequently occurring names under "LIKES" per "SUBURB" we need to first count each occurrence:

SELECT suburb, likes, count(*)

FROM kids

GROUP BY suburb, likes;

SUBURB LIKES 3 -------------------- -------------------- ----------- Scarborough Anton 3 Cabbagetown Charley 1 Danforth Charley 3 Regent Park Kyle 1 Scarborough Laura 1 Cabbagetown Mona 2 Scarborough Mona 2 Danforth Norbert 2 Regent Park Olga 1 Uptown Olga 3 Scarborough Piotr 1 Regent Park Quentin 2 Cabbagetown Stacy 2 Uptown Vera 1 14 rows were retrieved.

Interesting...Anyone who ever thought GROUP BY implies ORDER BY.. think again...anyway, lets sort by suburb and then the count:

SELECT suburb, likes, count(*) as cnt

FROM kids GROUP BY suburb, likes

ORDER BY suburb, cnt desc;

SUBURB LIKES CNT -------------------- -------------------- ----------- Cabbagetown Mona 2 Cabbagetown Stacy 2 Cabbagetown Charley 1 Danforth Charley 3 Danforth Norbert 2 Regent Park Quentin 2 Regent Park Kyle 1 Regent Park Olga 1 Scarborough Anton 3 Scarborough Mona 2 Scarborough Laura 1 Scarborough Piotr 1 Uptown Olga 3 Uptown Vera 1 14 rows were retrieved.

Things are clearing up a bit.

Looking at the result we know we want: Mona, Charley, Quentin, Anton and Olga.

How can we filter them out?

We need to get the first row of this result set for each suburb.

For such tasks I always like the ROW_NUMBER() OVER() function.

Looking at the result we know we want: Mona, Charley, Quentin, Anton and Olga.

How can we filter them out?

We need to get the first row of this result set for each suburb.

For such tasks I always like the ROW_NUMBER() OVER() function.

SELECT suburb, likes, cnt, ROW_NUMBER() OVER(PARTITION BY suburb ORDER BY cnt DESC) AS rn FROM (SELECT suburb, likes, count(*) as cnt

FROM kids

GROUP BY suburb, likes);

SUBURB LIKES CNT RN -------------------- -------------------- ----------- -------------------- Cabbagetown Mona 2 1 Cabbagetown Stacy 2 2 Cabbagetown Charley 1 3 Danforth Charley 3 1 Danforth Norbert 2 2 Regent Park Quentin 2 1 Regent Park Kyle 1 2 Regent Park Olga 1 3 Scarborough Anton 3 1 Scarborough Mona 2 2 Scarborough Laura 1 3 Scarborough Piotr 1 4 Uptown Olga 3 1 Uptown Vera 1 2 14 rows were retrieved.

This is pretty close. Now we just need to filter out all but the first rows using "RN" in a predicate.

SELECT suburb, likes FROM (SELECT suburb, likes, ROW_NUMBER() OVER(PARTITION BY suburb ORDER BY cnt DESC) AS rn FROM (SELECT suburb, likes, count(*) as cnt

FROM kids

GROUP BY suburb, likes)) WHERE rn = 1;

SUBURB LIKES -------------------- -------------------- Cabbagetown Mona Danforth Charley Regent Park Quentin Scarborough Anton Uptown Olga

Tadah!

Granted this is longer than:

Granted this is longer than:

SELECT suburb, STATS_MODE(likes)

FROM kids

GROUP BY suburb;

So STATS_MODE() seems like a handy function if you do this sort of thing a lot.

Now, is the solution we have above a workaround, an emulation perhaps which is slower than the real deal would be?

For that we should look at the explain and see what DB2 does with this query

Now, is the solution we have above a workaround, an emulation perhaps which is slower than the real deal would be?

For that we should look at the explain and see what DB2 does with this query

Rows RETURN ( 1) Cost I/O | 5 FILTER ( 2) 6.85026 1 | 25 TBSCAN ( 3) 6.83917 1 | 25 SORT ( 4) 6.83629 1 | 25 GRPBY ( 5) 6.81861 1 | 25 TBSCAN ( 6) 6.81617 1 | 25 SORT ( 7) 6.81329 1 | 25 TBSCAN ( 8) 6.80028 1 | 25 TABLE: ADMINISTRATOR KIDS Q1

Well, I can't say for certain that this is the best possible plan.

Of course having the right index could eliminate SORT(7).

But there may be a smart way to avoid SORT(4).

All we want is to preserve the row of the maximum "CNT" per partition.And the partitions are already sorted.

Of course having the right index could eliminate SORT(7).

But there may be a smart way to avoid SORT(4).

All we want is to preserve the row of the maximum "CNT" per partition.And the partitions are already sorted.

Ideas welcome.