Background
Roberto from Italy sent me the following question:
Is there something like the STATS_MODE() function in DB2?
He says STATS_MODE() is supposed to return the most frequently occurring value from a multiset of values.
A bit of rifling through the internet, steering clear of zones forbidden to me, reveals that STATS_MODE is an aggregate function similar to SUM() or AVG().
The function returns the most frequent value within a group.
If there are two equally frequent values one of them will be picked with no specific rule.
DB2 does not have a function by that name (It's a non intuitive name to me I must say).
Neither does DB2 have a function under another name that does the same thing.
But how hard can it be to solve the same problem using common SQL structures? Let's take a look.
Scenario
I couldn't resist the subject line, so my scenario will, for once, not use employees.
Imagine a bunch of kids and suburbs.
We are social workers (or social marketing experts staring at Facebook...) trying to find the key influencers in each neighborhood.
CREATE TABLE kids(name VARCHAR(20), suburb VARCHAR(20), likes VARCHAR(20));
INSERT INTO kids VALUES
('Anton', 'Scarborough', 'Mona'),
('Brenda', 'Cabbagetown','Charley'),
('Charley', 'Danforth', 'Norbert'),
('Dora', 'Uptown', 'Olga'),
('Edwin', 'Scarborough', 'Laura'),
('Fleur', 'Regent Park', 'Kyle'),
('George', 'Scarborough', 'Anton'),
('Hong', 'Cabbagetown', 'Mona'),
('Ida', 'Danforth', 'Norbert'),
('Jay', 'Uptown', 'Olga'),
('Kyle', 'Regent Park', 'Quentin'),
('Laura', 'Scarborough', 'Piotr'),
('Mona', 'Cabbagetown', 'Stacy'),
('Norbert', 'Danforth', 'Charley'),
('Olga', 'Uptown', 'Vera'),
('Piotr', 'Scarborough', 'Anton'),
('Quentin', 'Regent Park', 'Olga'),
('Rita', 'Scarborough', 'Anton'),
('Stacy', 'Cabbagetown', 'Mona'),
('Ulf', 'Danforth', 'Charley'),
('Vera', 'Uptown', 'Olga'),
('Wendy', 'Regent Park', 'Quentin'),
('Xavier', 'Scarborough', 'Mona'),
('Yvette', 'Cabbagetown', 'Stacy'),
('Zoe', 'Danforth', 'Charley');
Step by Step
To find the most frequently occurring names under "LIKES" per "SUBURB" we need to first count each occurrence:
SELECT suburb, likes, count(*)
FROM kids
GROUP BY suburb, likes;
SUBURB LIKES 3
  
Scarborough Anton 3
Cabbagetown Charley 1
Danforth Charley 3
Regent Park Kyle 1
Scarborough Laura 1
Cabbagetown Mona 2
Scarborough Mona 2
Danforth Norbert 2
Regent Park Olga 1
Uptown Olga 3
Scarborough Piotr 1
Regent Park Quentin 2
Cabbagetown Stacy 2
Uptown Vera 1
14 rows were retrieved.
Interesting...Anyone who ever thought GROUP BY implies ORDER BY.. think again...anyway, lets sort by suburb and then the count:
SELECT suburb, likes, count(*) as cnt
FROM kids GROUP BY suburb, likes
ORDER BY suburb, cnt desc;
SUBURB LIKES CNT
  
Cabbagetown Mona 2
Cabbagetown Stacy 2
Cabbagetown Charley 1
Danforth Charley 3
Danforth Norbert 2
Regent Park Quentin 2
Regent Park Kyle 1
Regent Park Olga 1
Scarborough Anton 3
Scarborough Mona 2
Scarborough Laura 1
Scarborough Piotr 1
Uptown Olga 3
Uptown Vera 1
14 rows were retrieved.
Things are clearing up a bit.
Looking at the result we know we want: Mona, Charley, Quentin, Anton and Olga.
How can we filter them out?
We need to get the first row of this result set for each suburb.
For such tasks I always like the ROW_NUMBER() OVER() function.
SELECT suburb, likes, cnt, ROW_NUMBER() OVER(PARTITION BY suburb ORDER BY cnt DESC) AS rn
FROM (SELECT suburb, likes, count(*) as cnt
FROM kids
GROUP BY suburb, likes);
SUBURB LIKES CNT RN
   
Cabbagetown Mona 2 1
Cabbagetown Stacy 2 2
Cabbagetown Charley 1 3
Danforth Charley 3 1
Danforth Norbert 2 2
Regent Park Quentin 2 1
Regent Park Kyle 1 2
Regent Park Olga 1 3
Scarborough Anton 3 1
Scarborough Mona 2 2
Scarborough Laura 1 3
Scarborough Piotr 1 4
Uptown Olga 3 1
Uptown Vera 1 2
14 rows were retrieved.
This is pretty close. Now we just need to filter out all but the first rows using "RN" in a predicate.
SELECT suburb, likes
FROM (SELECT suburb, likes, ROW_NUMBER() OVER(PARTITION BY suburb ORDER BY cnt DESC) AS rn
FROM (SELECT suburb, likes, count(*) as cnt
FROM kids
GROUP BY suburb, likes))
WHERE rn = 1;
SUBURB LIKES
 
Cabbagetown Mona
Danforth Charley
Regent Park Quentin
Scarborough Anton
Uptown Olga
Tadah!
Granted this is longer than:
SELECT suburb, STATS_MODE(likes)
FROM kids
GROUP BY suburb;
So STATS_MODE() seems like a handy function if you do this sort of thing a lot.
Now, is the solution we have above a workaround, an emulation perhaps which is slower than the real deal would be?
For that we should look at the explain and see what DB2 does with this query
Rows
RETURN
( 1)
Cost
I/O

5
FILTER
( 2)
6.85026
1

25
TBSCAN
( 3)
6.83917
1

25
SORT
( 4)
6.83629
1

25
GRPBY
( 5)
6.81861
1

25
TBSCAN
( 6)
6.81617
1

25
SORT
( 7)
6.81329
1

25
TBSCAN
( 8)
6.80028
1

25
TABLE: ADMINISTRATOR
KIDS
Q1
Well, I can't say for certain that this is the best possible plan.
Of course having the right index could eliminate SORT(7).
But there may be a smart way to avoid SORT(4).
All we want is to preserve the row of the maximum "CNT" per partition.And the partitions are already sorted.
Ideas welcome.
Tags:
by
frequent
group
olap
values
stats_mode
The tricky part seems to be to eliminate "duplicates", i.e choose either Mona or Stacy for Cabbagetown. row_number() is ideal for that purpose, but introduces a sort no matter what (well in my extensive test the last 5 minutes anyhow:). If we can ignore the duplicate problem
allow reverse scans
collect detailed statistics;
from (
SELECT suburb, likes, cnt
, max(cnt) over (PARTITION BY suburb
RANGE BETWEEN UNBOUNDED PRECEDING
AND UNBOUNDED FOLLOWING) max_cnt
FROM (
SELECT suburb, likes, count(*) as cnt
FROM kids
GROUP BY suburb, likes
)
)
WHERE max_cnt = cnt
ORDER BY suburb, likes;
RETURN
( 1)
Cost
I/O

0.56
FILTER
( 2)
0.00696143
0

14
GRPBY
( 3)
0.00493112
0

25
IXSCAN
( 4)
0.00457293
0

25
INDEX: LTJN
X1
Q1
Lennart,
Since we don't care for the order of the counts anymore this should not inject another sort.
The only way I could prevent the sort was via another nesting level:
SELECT suburb, likes
FROM (
SELECT suburb, likes, cnt, max_cnt
, ROW_NUMBER() over (PARTITION by suburb) as rn
FROM (
SELECT suburb, likes, cnt
, MAX(cnt) over (PARTITION BY suburb
RANGE BETWEEN UNBOUNDED PRECEDING
AND UNBOUNDED FOLLOWING) max_cnt
FROM (
SELECT suburb, likes, COUNT(*) as cnt
FROM kids
GROUP BY suburb, likes
)
)
WHERE cnt = max_cnt
)
WHERE rn = 1
;
[/code]
RETURN
( 1)
Cost
I/O

0.56
FILTER
( 2)
0.0671975
0

0.56
FILTER
( 3)
0.0647155
0

14
GRPBY
( 4)
0.0458627
0

25
IXSCAN
( 5)
0.0424629
0

25
INDEX: LELLE
X1
Q1
[/code]
Hi ,Rielau
let me explain it with a simple example:
There are two tables :
T1 ( c1 int primay key , c2 int )
T2 ( c1 int not null, c2 int)
query 1:
select sum(b.s)
from t1 a,
( select c1,sum(c2) s
from t2
group by c1 ) b
where a.c1 = b.c1;
select sum(b.c2)
from t1 a, t2 b
where a.c1 = b.c1;
but optimizer does not seem to think so ?
when i run query 1, db2 did not covert it to query 2
and still chose sum twice which is a more expensive
way to finish the query.....
mdkii,
However the space of query rewrite rules than can be applied is quite huge.
This particular rule is rather far down the priority list at this point I'm afraid.
Serge
Other examples to choose one for every suburb from Lennart's first answer might be
SELECT suburb, likes
FROM (SELECT suburb, likes, cnt
, MAX(cnt) over (PARTITION BY suburb) max_cnt
FROM (SELECT suburb, likes
, COUNT(*) + RAND() as cnt
FROM kids
GROUP BY suburb, likes
)
)
WHERE cnt = max_cnt
ORDER BY suburb
;
(2)
SELECT suburb, likes
FROM (SELECT suburb, likes, cnt
, MAX(cnt) over (PARTITION BY suburb) max_cnt
FROM (SELECT suburb, likes
, COUNT(*) * 1000 + INT( RAND() * 1000 ) as cnt
FROM kids
GROUP BY suburb, likes
)
)
WHERE cnt = max_cnt
ORDER BY suburb
;
Lennart second: two OLAP specifications and two filters(WHERE clause).
Tonkuma(1): a SYSFUN schema function(i.e. RAND) was used. CNT would be converted to doubleprecision floating point.
Tonkuma(2): a SYSFUN schema function(i.e. RAND) was used.
Though performance may be worse,
just for fun...
, stats_mode( LISTAGG(likes , ',') ) AS likes
FROM kids
GROUP BY suburb
ORDER BY suburb
;
( in_string VARCHAR(4000) )
RETURNS VARCHAR(100)
READS SQL DATA
NO EXTERNAL ACTION
DETERMINISTIC
RETURN WITH
find_delemeters
( k , p/* [position of delemeter] + 1 */ ) AS (
VALUES ( 0 , 1 )
UNION ALL
SELECT k + 1
, LOCATE(',' , in_string , p) + 1
FROM find_delemeters
WHERE k < 4000
AND ( p > 1 OR k = 0 )
)
SELECT element
FROM (SELECT SUBSTR(
in_string
, p
, LEAD( p  1 , 1 , LENGTH(in_string) + 1 )
OVER(ORDER BY k)
 p
) AS element
FROM find_delemeters
WHERE p > 1 OR k = 0
)
GROUP BY element
ORDER BY COUNT(*) DESC
FETCH FIRST 1 ROW ONLY
;
Tonkuma,
Serge