• Edit
• More Actions v
• Quarantine this Entry

1 commented

The tricky part seems to be to eliminate "duplicates", i.e choose either Mona or Stacy for Cabbagetown. row_number() is ideal for that purpose, but introduces a sort no matter what (well in my extensive test the last 5 minutes anyhow:-). If we can ignore the duplicate problem

create index x1 on ltjn.kids (suburb, likes)
allow reverse scans
collect detailed statistics;

select suburb, likes
from (
SELECT suburb, likes, cnt
, max(cnt) over (PARTITION BY suburb
RANGE BETWEEN UNBOUNDED PRECEDING
AND UNBOUNDED FOLLOWING) max_cnt
FROM (
SELECT suburb, likes, count(*) as cnt
FROM kids
GROUP BY suburb, likes
)
)
WHERE max_cnt = cnt
ORDER BY suburb, likes;

Rows
RETURN
( 1)
Cost
I/O
|
0.56
FILTER
( 2)
0.00696143
0
|
14
GRPBY
( 3)
0.00493112
0
|
25
IXSCAN
( 4)
0.00457293
0
|
25
INDEX: LTJN
X1
Q1

But then we get 6 rows instead of 5 in the result

2 commented

Lennart,

Windowing.. very nice. I didn't even think in that direction.

Now, I think we can re-introduce the rownumber to squeeze out the dupe.
Since we don't care for the order of the counts anymore this should not inject another sort.

3 commented

The only way I could prevent the sort was via another nesting level:

[code]
SELECT suburb, likes
FROM (
SELECT suburb, likes, cnt, max_cnt
, ROW_NUMBER() over (PARTITION by suburb) as rn
FROM (
SELECT suburb, likes, cnt
, MAX(cnt) over (PARTITION BY suburb
RANGE BETWEEN UNBOUNDED PRECEDING
AND UNBOUNDED FOLLOWING) max_cnt
FROM (
SELECT suburb, likes, COUNT(*) as cnt
FROM kids
GROUP BY suburb, likes
)
)
WHERE cnt = max_cnt
)
WHERE rn = 1
;
[/code]

The plan then contains two filter clauses:

[code]

Rows
RETURN
( 1)
Cost
I/O
|
0.56
FILTER
( 2)
0.0671975
0
|
0.56
FILTER
( 3)
0.0647155
0
|
14
GRPBY
( 4)
0.0458627
0
|
25
IXSCAN
( 5)
0.0424629
0
|
25
INDEX: LELLE
X1
Q1
[/code]

Was that similar to what you had in mind regarding row_number()?

4 commented

Hi ,Rielau

Can i ask another question ? it bothered me for a quite while...
let me explain it with a simple example:
There are two tables :
T1 ( c1 int primay key , c2 int )
T2 ( c1 int not null, c2 int)
query 1:
select sum(b.s)
from t1 a,
( select c1,sum(c2) s
from t2
group by c1 ) b
where a.c1 = b.c1;

query 2:
select sum(b.c2)
from t1 a, t2 b
where a.c1 = b.c1;

I think they are the same query.
but optimizer does not seem to think so ?
when i run query 1, db2 did not covert it to query 2
and still chose sum twice which is a more expensive
way to finish the query.....

thx...

5 commented

mdkii,

I see what you are suggesting.
However the space of query rewrite rules than can be applied is quite huge.
This particular rule is rather far down the priority list at this point I'm afraid.

Of course priorities can change quickly based on customer pressures.

Cheers
Serge

6 commented

Other examples to choose one for every suburb from Lennart's first answer might be

Note: "RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING" is default, then removed.

(1)
SELECT suburb, likes
FROM (SELECT suburb, likes, cnt
, MAX(cnt) over (PARTITION BY suburb) max_cnt
FROM (SELECT suburb, likes
, COUNT(*) + RAND() as cnt
FROM kids
GROUP BY suburb, likes
)
)
WHERE cnt = max_cnt
ORDER BY suburb
;

or
(2)
SELECT suburb, likes
FROM (SELECT suburb, likes, cnt
, MAX(cnt) over (PARTITION BY suburb) max_cnt
FROM (SELECT suburb, likes
, COUNT(*) * 1000 + INT( RAND() * 1000 ) as cnt
FROM kids
GROUP BY suburb, likes
)
)
WHERE cnt = max_cnt
ORDER BY suburb
;

Comparison:
Lennart second: two OLAP specifications and two filters(WHERE clause).
Tonkuma(1): a SYSFUN schema function(i.e. RAND) was used. CNT would be converted to double-precision floating point.
Tonkuma(2): a SYSFUN schema function(i.e. RAND) was used.

7 commented

Though performance may be worse,
just for fun...

SELECT suburb
, stats_mode( LISTAGG(likes , ',') ) AS likes
FROM kids
GROUP BY suburb
ORDER BY suburb
;

Here is the definition of stats_mode

CREATE FUNCTION stats_mode
( in_string VARCHAR(4000) )
RETURNS VARCHAR(100)
NO EXTERNAL ACTION
DETERMINISTIC
RETURN WITH
find_delemeters
( k , p/* [position of delemeter] + 1 */ ) AS (
VALUES ( 0 , 1 )
UNION ALL
SELECT k + 1
, LOCATE(',' , in_string , p) + 1
FROM find_delemeters
WHERE k < 4000
AND ( p > 1 OR k = 0 )
)
SELECT element
FROM (SELECT SUBSTR(
in_string
, p
, LEAD( p - 1 , 1 , LENGTH(in_string) + 1 )
OVER(ORDER BY k)
- p
) AS element
FROM find_delemeters
WHERE p > 1 OR k = 0
)
GROUP BY element
ORDER BY COUNT(*) DESC
FETCH FIRST 1 ROW ONLY
;

8 commented

Tonkuma,

I am looking for co-authors for this blog. Any interest?

Cheers
Serge