SQL ranking query to compute ranks and median in sub groups
Asked Answered
C

2

9

I want to compute the Median of y in sub groups of this simple xy_table:

  x | y --groups--> gid |   x | y --medians-->  gid |   x | y
-------             -------------               -------------
0.1 | 4             0.0 | 0.1 | 4               0.0 | 0.1 | 4
0.2 | 3             0.0 | 0.2 | 3                   |     |
0.7 | 5             1.0 | 0.7 | 5               1.0 | 0.7 | 5
1.5 | 1             2.0 | 1.5 | 1                   |     |
1.9 | 6             2.0 | 1.9 | 6                   |     |
2.1 | 5             2.0 | 2.1 | 5               2.0 | 2.1 | 5
2.7 | 1             3.0 | 2.7 | 1               3.0 | 2.7 | 1

In this example every x is unique and the table is already sorted by x. I now want to GROUP BY round(x) and get the tuple that holds the median of y in each group.

I can already compute the median for the whole table with this ranking query:

SELECT a.x, a.y FROM xy_table a,xy_table b
WHERE a.y >= b.y
GROUP BY a.x, a.y
HAVING count(*) = (SELECT round((count(*)+1)/2) FROM xy_table)

Output: 0.1, 4.0

But I did not yet succeed writing a query to compute the median for sub groups.

Attention: I do not have a median() aggregation function available. Please also do not propose solutions with special PARTITION, RANK, or QUANTILE statements (as found in similar but too vendor specific SO questions). I need plain SQL (i.e., compatible to SQLite without median() function)

Edit: I was actually looking for the Medoid and not the Median.

Carreno answered 11/4, 2013 at 10:35 Comment(4)
what is the relation between 2nd and 3rd table in your example? 1st gid changed to 0.1 and neither x nor y values are medians for the 0.0 groupCharterhouse
Sorry I had a typo there. The 3rd table should present the medians of the gid (group ID) groups. And I assume the median of [4,3] is 4 (the bigger value) in such cases.Carreno
median of [4,3] is usually 3.5 as suggested in the first paragraph of your wikipedia link, do you explicitly want to get the bigger value?Charterhouse
Yes, I want the bigger one. The query should only select existing records and not introduce new ones. Sorry for calling this a "median". :) I am currently playing around with your query, which looks good so far. I just removed the left_row/right_row stuff and the avg, since I do not need that.Carreno
C
4

I suggest doing the computing in your programming language:

for each group:
  for each record_in_group:
    append y to array
  median of array

But if you are stuck with SQLite, you can order each group by y and select the records in the middle like this http://sqlfiddle.com/#!5/d4c68/55/0:

UPDATE: only bigger "median" value is importand for even nr. of rows, so no avg() is needed:

select groups.gid,
  ids.y median
from (
  -- get middle row number in each group (bigger number if even nr. of rows)
  -- note the integer divisions and modulo operator
  select round(x) gid,
    count(*) / 2 + 1 mid_row_right
  from xy_table
  group by round(x)
) groups
join (
  -- for each record get equivalent of
  -- row_number() over(partition by gid order by y)
  select round(a.x) gid,
    a.x,
    a.y,
    count(*) rownr_by_y
  from xy_table a
  left join xy_table b
    on round(a.x) = round (b.x)
    and a.y >= b.y
  group by a.x
) ids on ids.gid = groups.gid
where ids.rownr_by_y = groups.mid_row_right
Charterhouse answered 11/4, 2013 at 12:36 Comment(2)
Thx, your answer provides the basis for my solution. Why not just use SELECT round(x) gid, 1+(count(*))/2 mid_row to get the "middle" row? It works for me at least.Carreno
yep it was leftover from the left-side middle point computation :)Charterhouse
B
0

OK, this relies on a temporary table:

create temporary table tmp (x float, y float);

insert into tmp
  select * from xy_table order by round(x), y

But you could potentially create this for a range of data you were interested in. Another way would be to ensure the xy_table had this sort order, instead of just ordering on x. The reason for this is SQLite's lack of row numbering capability.

Then:

select tmp4.x as gid, t.* from (
  select tmp1.x, 
         round((tmp2.y + coalesce(tmp3.y, tmp2.y)) / 2) as y -- <- for larger of the two, change to: (case when tmp2.y > coalesce(tmp3.y, 0) then tmp2.y else tmp3.y end)
  from (
    select round(x) as x, min(rowid) + (count(*) / 2) as id1, 
           (case when count(*) % 2 = 0 then min(rowid) + (count(*) / 2) - 1 
                 else 0 end) as id2
    from (  
      select *, rowid from tmp
    ) t
    group by round(x)
  ) tmp1
  join tmp tmp2 on tmp1.id1 = tmp2.rowid
  left join tmp tmp3 on tmp1.id2 = tmp3.rowid
) tmp4
join xy_table t on tmp4.x = round(t.x) and tmp4.y = t.y

If you wanted to treat the median as the larger of the two middle values, which doesn't fit the definition as @Aprillion already pointed out, then you would simply take the larger of the two y values, instead of their average, on the third line of the query.

Bulbul answered 11/4, 2013 at 13:22 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.