How to find the SQL medians for a grouping
Asked Answered
A

3

12

I am working with SQL Server 2008

If I have a Table as such:

Code   Value
-----------------------
4      240
4      299
4      210
2      NULL
2      3
6      30
6      80
6      10
4      240
2      30

How can I find the median AND group by the Code column please? To get a resultset like this:

Code   Median
-----------------------
4      240
2      16.5
6      30

I really like this solution for median, but unfortunately it doesn't include Group By: https://mcmap.net/q/107898/-function-to-calculate-median-in-sql-server

Arman answered 13/12, 2013 at 12:26 Comment(4)
Possible duplicate of #1343398Uprear
I don't think that link actually deals with grouping by a second column.Arman
In that link the OP states "What would be the best way (if possible) to do this - allow for the calculation of a median value (assuming a numeric data type) in an aggregate query?" It usually means he ill using a group by (AGGREGATE).Uprear
@StuHarper: "it doesn't include Group By"...No it does!!Command
W
14

The solution using rank works nicely when you have an odd number of members in each group, i.e. the median exists within the sample, where you have an even number of members the rank method will fall down, e.g.

1
2
3
4

The median here is 2.5 (i.e. half the group is smaller, and half the group is larger) but the rank method will return 3. To get around this you essentially need to take the top value from the bottom half of the group, and the bottom value of the top half of the group, and take an average of the two values.

WITH CTE AS
(   SELECT  Code,
            Value, 
            [half1] = NTILE(2) OVER(PARTITION BY Code ORDER BY Value), 
            [half2] = NTILE(2) OVER(PARTITION BY Code ORDER BY Value DESC)
    FROM    T
    WHERE   Value IS NOT NULL
)
SELECT  Code,
        (MAX(CASE WHEN Half1 = 1 THEN Value END) + 
        MIN(CASE WHEN Half2 = 1 THEN Value END)) / 2.0
FROM    CTE
GROUP BY Code;

Example on SQL Fiddle


In SQL Server 2012 you can use PERCENTILE_CONT

SELECT  DISTINCT
        Code,
        Median = PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY Value) OVER(PARTITION BY Code)
FROM    T;

Example on SQL Fiddle

Whereon answered 13/12, 2013 at 12:53 Comment(1)
PERCENTILE_CONT is pretty slow, so you can either use the first solution or something like this for SQL server: https://mcmap.net/q/107898/-function-to-calculate-median-in-sql-serverWhitehall
W
2

SQL Server does not have a function to calculate medians, but you could use the ROW_NUMBER function like this:

WITH RankedTable AS (
    SELECT Code, Value, 
        ROW_NUMBER() OVER (PARTITION BY Code ORDER BY VALUE) AS Rnk,
        COUNT(*) OVER (PARTITION BY Code) AS Cnt
    FROM MyTable
)
SELECT Code, Value
FROM RankedTable
WHERE Rnk = Cnt / 2 + 1

To elaborate a bit on this solution, consider the output of the RankedTable CTE:

Code   Value   Rnk    Cnt
---------------------------
4      240     2      3   -- Median
4      299     3      3
4      210     1      3
2      NULL    1      2
2      3       2      2   -- Median
6      30      2      3   -- Median
6      80      3      3
6      10      1      3

Now from this result set, if you only return those rows where Rnk equals Cnt / 2 + 1 (integer division), you get only the rows with the median value for each group.

Wilcher answered 13/12, 2013 at 12:30 Comment(6)
Thanks Dan. This nearly works, but not quite. I added another row (4, 240). This results in your query missing out Code 4 completely from the resultset.Arman
Ah yeah, you'll get in trouble when you have multiple identical values, as these are assigned the same RANK. I'm editing my solution to use the ROW_NUMBER function instead, as this will assign a unique value to each row.Wilcher
Again, thanks Dan but it still doesn't quite work. I added another row (2,30) and your query then results in 2,30 rather than 2,16.5. GarethD's solution below handles that OK.Arman
How can 16,5 be the median, if the number doesn't even exist in the original dataset?Wilcher
@Wilcher The median doesn't have to be in the original dataset. According to Wikipedia: "If there is an even number of observations, then there is no single middle value; the median is then usually defined to be the mean of the two middle values"Whereon
Thanks @GarethD. I should have paid more attention in statistics class :-)Wilcher
M
-2

SELECT Code, Value FROM (SELECT Code, Value, ROW_NUMBER() OVER(PARTITION BY Code, ORDER BY Value ASC) as r1, ROW_NUMBER() OVER(PARTITION BY Code, ORDER BY Value DESC) as r2 FROM TABLE)temp WHERE (r1=r2) or (abs(r1-r2)=1);

Matronna answered 3/7, 2023 at 11:13 Comment(1)
Welcome to Stack Overflow! This question is almost 10 years old, and already has an accepted answer with upvotes. Posting a code-only solution is typically not helpful. You should explain why you feel a different solution is needed and what makes yours better than the other answers. Otherwise, is there really any point in posting it?Kolnos

© 2022 - 2024 — McMap. All rights reserved.