Get top n records for each group of grouped results [duplicate]
Asked Answered
P

12

192

The following is the simplest possible example, though any solution should be able to scale to however many n top results are needed:

Given a table like that below, with person, group, and age columns, how would you get the 2 oldest people in each group? (Ties within groups should not yield more results, but give the first 2 in alphabetical order)

+--------+-------+-----+
| Person | Group | Age |
+--------+-------+-----+
| Bob    | 1     | 32  |
| Jill   | 1     | 34  |
| Shawn  | 1     | 42  |
| Jake   | 2     | 29  |
| Paul   | 2     | 36  |
| Laura  | 2     | 39  |
+--------+-------+-----+

Desired result set:

+--------+-------+-----+
| Shawn  | 1     | 42  |
| Jill   | 1     | 34  |
| Laura  | 2     | 39  |
| Paul   | 2     | 36  |
+--------+-------+-----+

NOTE: This question builds on a previous one- Get records with max value for each group of grouped SQL results - for getting a single top row from each group, and which received a great MySQL-specific answer from @Bohemian:

select * 
from (select * from mytable order by `Group`, Age desc, Person) x
group by `Group`

Would love to be able to build off this, though I don't see how.

Pulsar answered 24/8, 2012 at 17:0 Comment(3)
https://mcmap.net/q/28677/-get-n-per-group-mysql and #2130193 might help youDoodlebug
Check this example. It is pretty much close to what you ask: #1538106Cassondra
Using LIMIT within GROUP BY to get N results per group? #2130193Oreopithecus
M
107

Here is one way to do this, using UNION ALL (See SQL Fiddle with Demo). This works with two groups, if you have more than two groups, then you would need to specify the group number and add queries for each group:

(
  select *
  from mytable 
  where `group` = 1
  order by age desc
  LIMIT 2
)
UNION ALL
(
  select *
  from mytable 
  where `group` = 2
  order by age desc
  LIMIT 2
)

There are a variety of ways to do this, see this article to determine the best route for your situation:

http://www.xaprb.com/blog/2006/12/07/how-to-select-the-firstleastmax-row-per-group-in-sql/

Edit:

This might work for you too, it generates a row number for each record. Using an example from the link above this will return only those records with a row number of less than or equal to 2:

select person, `group`, age
from 
(
   select person, `group`, age,
      (@num:=if(@group = `group`, @num +1, if(@group := `group`, 1, 1))) row_number 
  from test t
  CROSS JOIN (select @num:=0, @group:=null) c
  order by `Group`, Age desc, person
) as x 
where x.row_number <= 2;

See Demo

Mendez answered 24/8, 2012 at 17:25 Comment(12)
if he have 1 000+ groups, wouldn't that make this a bit scary?Doodlebug
@CharlesForest yes, it would and that Is why I stated that you would have to specify it for more than two groups. It would become ugly.Mendez
isn't there some kind of FOR in sql, he could just send the amount of loops to add in the query (mostly working with php on these situations, he asked in SQL so i'm not sure)Doodlebug
@CharlesForest I think I found a better solution, see my editMendez
@Pulsar there are many ways to do this you will have to decide based on your needs and your situation.Mendez
This wins because it handles tied results according to the requirements. See demo. Also, according to the article this will be a more performant option. Thanks-Pulsar
@Yarin: won't this not order by names when ties occur?Carilla
@Ludo if the requirement is to order by name in the event of tie, then it just need to be added to the order by clause.Mendez
@bluefeet: yes for sure, I was just curious to know why the OP concluded that the so-called requirements were respected before you made the edit.Carilla
A note for anyone reading this: The version is the variables is close to being correct. However, MySQL does not guarantee the order of evaluation of expressions in the SELECT (and, in fact, sometimes evaluates them out-of-order). The key to the solution is to put all the variable assignments in a single expression; here is an example: #38535520.Indoors
@GordonLinoff Updated my answer, thanks for pointing it out. It also took far too long for me to update it.Mendez
Thanks for the solution. Its just perfect for so many use cases. cheersNelda
S
84

In other databases you can do this using ROW_NUMBER. MySQL doesn't support ROW_NUMBER but you can use variables to emulate it:

SELECT
    person,
    groupname,
    age
FROM
(
    SELECT
        person,
        groupname,
        age,
        @rn := IF(@prev = groupname, @rn + 1, 1) AS rn,
        @prev := groupname
    FROM mytable
    JOIN (SELECT @prev := NULL, @rn := 0) AS vars
    ORDER BY groupname, age DESC, person
) AS T1
WHERE rn <= 2

See it working online: sqlfiddle


Edit I just noticed that bluefeet posted a very similar answer: +1 to him. However this answer has two small advantages:

  1. It it is a single query. The variables are initialized inside the SELECT statement.
  2. It handles ties as described in the question (alphabetical order by name).

So I'll leave it here in case it can help someone.

Salpingitis answered 24/8, 2012 at 22:55 Comment(9)
Mark- This is working well for us. Thanks for providing another good alternative to compliment @bluefeet's- much appreciated.Pulsar
+1. This worked for me. Really clean and to the point answer. Can you please explain how exactly this works? Whats the logic behind this?Mithridate
Nice solution but it seems it is not working in my environment (MySQL 5.6) because the order by clause is applied after select so it doesn't return the top result, see my alternative solution to fix this issueSeumas
While running this I was able to delete JOIN (SELECT @prev := NULL, @rn := 0) AS vars. I get the idea is to declare empty variables, but it seems extraneous for MySql.Infernal
This works great for me in MySQL 5.7, but it would be awesome if someone could explain how it worksSidney
MySQL 8 and MariaDB 10.2 have ROW_NUMBER() and related Windowing functions.Gilbart
@George B Sort first, then, according to the grouped fields, add one to the line number for the same word, and start again if you see a different oneTrogon
I see a warning Warning: #1287 Setting user variables within expressions is deprecated and will be removed in a future release. Consider alternatives: 'SET variable=expression, ...', or 'SELECT expression(s) INTO variables(s)'. I understand it is something related to @rn and @prev, how to stop this? Adding SET @rn = 0, @prev =''; at first line didn't helped. MySQL v8Utley
This should be the accepted answerTormoria
R
57

Try this:

SELECT a.person, a.group, a.age FROM person AS a WHERE 
(SELECT COUNT(*) FROM person AS b 
WHERE b.group = a.group AND b.age >= a.age) <= 2 
ORDER BY a.group ASC, a.age DESC

DEMO

Roark answered 24/8, 2012 at 17:40 Comment(6)
snuffin coming out of nowhere with the simplest solution! Is this more elegant than Ludo's/Bill Karwin's? Can I get some commentaryPulsar
Hm, not sure if it is more elegant. But judging from the votes, I guess bluefeet might have the better solution.Roark
There's a problem with this. If there's a tie for second place within the group, only one top result is returned. See modified demoPulsar
It's not a problem if it's desired. You can set the order of a.person.Gib
no, it is not working in my case, neither does the DEMO workTillage
This algorithm dies not scale well -- it is O(n*n).Gilbart
C
35

How about using self-joining:

CREATE TABLE mytable (person, groupname, age);
INSERT INTO mytable VALUES('Bob',1,32);
INSERT INTO mytable VALUES('Jill',1,34);
INSERT INTO mytable VALUES('Shawn',1,42);
INSERT INTO mytable VALUES('Jake',2,29);
INSERT INTO mytable VALUES('Paul',2,36);
INSERT INTO mytable VALUES('Laura',2,39);

SELECT a.* FROM mytable AS a
  LEFT JOIN mytable AS a2 
    ON a.groupname = a2.groupname AND a.age <= a2.age
GROUP BY a.person
HAVING COUNT(*) <= 2
ORDER BY a.groupname, a.age DESC;

gives me:

a.person    a.groupname  a.age     
----------  -----------  ----------
Shawn       1            42        
Jill        1            34        
Laura       2            39        
Paul        2            36      

I was strongly inspired by the answer from Bill Karwin to Select top 10 records for each category

Also, I'm using SQLite, but this should work on MySQL.

Another thing: in the above, I replaced the group column with a groupname column for convenience.

Edit:

Following-up on the OP's comment regarding missing tie results, I incremented on snuffin's answer to show all the ties. This means that if the last ones are ties, more than 2 rows can be returned, as shown below:

.headers on
.mode column

CREATE TABLE foo (person, groupname, age);
INSERT INTO foo VALUES('Paul',2,36);
INSERT INTO foo VALUES('Laura',2,39);
INSERT INTO foo VALUES('Joe',2,36);
INSERT INTO foo VALUES('Bob',1,32);
INSERT INTO foo VALUES('Jill',1,34);
INSERT INTO foo VALUES('Shawn',1,42);
INSERT INTO foo VALUES('Jake',2,29);
INSERT INTO foo VALUES('James',2,15);
INSERT INTO foo VALUES('Fred',1,12);
INSERT INTO foo VALUES('Chuck',3,112);


SELECT a.person, a.groupname, a.age 
FROM foo AS a 
WHERE a.age >= (SELECT MIN(b.age)
                FROM foo AS b 
                WHERE (SELECT COUNT(*)
                       FROM foo AS c
                       WHERE c.groupname = b.groupname AND c.age >= b.age) <= 2
                GROUP BY b.groupname)
ORDER BY a.groupname ASC, a.age DESC;

gives me:

person      groupname   age       
----------  ----------  ----------
Shawn       1           42        
Jill        1           34        
Laura       2           39        
Paul        2           36        
Joe         2           36        
Chuck       3           112      
Carilla answered 24/8, 2012 at 17:37 Comment(6)
@Ludo- Just saw that answer from Bill Karwin - thanks for applying it herePulsar
What do you think of Snuffin's answer? I'm trying to compare the twoPulsar
There's a problem with this. If there's a tie for second place within the group, only one top result is returned- See demoPulsar
@Ludo- the original requirement was that each group return the exact n results, with any ties being resolved alphabeticallyPulsar
The edit to include the ties does not work for me. I get ERROR 1242 (21000): Subquery returns more than 1 row, presumably because of the GROUP BY. When I execute the SELECT MIN subquery alone, it generates three rows: 34, 39, 112 and there it appears the second value should be 36, not 39.Peshawar
similar to the previous answer, nice but not working in my case: missing records or second order item is wrongly ordered.Tillage
S
19

Snuffin solution seems quite slow to execute when you've got plenty of rows and Mark Byers/Rick James and Bluefeet solutions doesn't work on my environnement (MySQL 5.6) because order by is applied after execution of select, so here is a variant of Marc Byers/Rick James solutions to fix this issue (with an extra imbricated select):

select person, groupname, age
from
(
    select person, groupname, age,
    (@rn:=if(@prev = groupname, @rn +1, 1)) as rownumb,
    @prev:= groupname 
    from 
    (
        select person, groupname, age
        from persons 
        order by groupname ,  age desc, person
    )   as sortedlist
    JOIN (select @prev:=NULL, @rn :=0) as vars
) as groupedlist 
where rownumb<=2
order by groupname ,  age desc, person;

I tried similar query on a table having 5 millions rows and it returns result in less than 3 seconds

Seumas answered 7/12, 2017 at 13:57 Comment(5)
This is the only query that has been working in my environment. Thanks!Schoof
Add LIMIT 9999999 to any derived table with an ORDER BY. This may prevent the ORDER BY from being ignored.Gilbart
I ran a a similar query on a table containing a few thousand rows, and it took 60 seconds to return one result, so... thanks for the post, it's a start for me. (ETA: down to 5 seconds. Good!)Auxesis
This is the query that works perfect with the Order. The answers below dont work well. ThanksInterview
Wow, really nice, I was able to improve the performance significantly using this approach! Btw, had to escape all variable assignments using double backslash \\:= in spring jpa repository code for nativeQuery.Levine
G
12

If the other answers are not fast enough Give this code a try:

SELECT
        province, n, city, population
    FROM
      ( SELECT  @prev := '', @n := 0 ) init
    JOIN
      ( SELECT  @n := if(province != @prev, 1, @n + 1) AS n,
                @prev := province,
                province, city, population
            FROM  Canada
            ORDER BY
                province   ASC,
                population DESC
      ) x
    WHERE  n <= 3
    ORDER BY  province, n;

Output:

+---------------------------+------+------------------+------------+
| province                  | n    | city             | population |
+---------------------------+------+------------------+------------+
| Alberta                   |    1 | Calgary          |     968475 |
| Alberta                   |    2 | Edmonton         |     822319 |
| Alberta                   |    3 | Red Deer         |      73595 |
| British Columbia          |    1 | Vancouver        |    1837970 |
| British Columbia          |    2 | Victoria         |     289625 |
| British Columbia          |    3 | Abbotsford       |     151685 |
| Manitoba                  |    1 | ...
Gilbart answered 8/4, 2016 at 3:20 Comment(4)
Looked at your site - where would I get the data source for the cities' populations? TIA and rgs.Theurgy
maxmind.com/en/worldcities -- I find it handy for experimenting with lat/lng searches, queries, partitioning, etc. It is big enough to be interesting, yet readable enough to recognize the answers. The Canadian subset is handy for this kind of question. (Fewer provinces than US cities.)Gilbart
@RickJames thank you, this worked first time, I have spent 3 hours trying to do this and failed miserably.Tiros
@Tiros - It's tricky code. I have seen a lot of postings that fail to perform well. So I set out to find (or create) the best solutions.Gilbart
S
10

Check this out:

SELECT
  p.Person,
  p.`Group`,
  p.Age
FROM
  people p
  INNER JOIN
  (
    SELECT MAX(Age) AS Age, `Group` FROM people GROUP BY `Group`
    UNION
    SELECT MAX(p3.Age) AS Age, p3.`Group` FROM people p3 INNER JOIN (SELECT MAX(Age) AS Age, `Group` FROM people GROUP BY `Group`) p4 ON p3.Age < p4.Age AND p3.`Group` = p4.`Group` GROUP BY `Group`
  ) p2 ON p.Age = p2.Age AND p.`Group` = p2.`Group`
ORDER BY
  `Group`,
  Age DESC,
  Person;

SQL Fiddle: http://sqlfiddle.com/#!2/cdbb6/15

Stallard answered 24/8, 2012 at 17:48 Comment(4)
Man, others found much simpler solutions...I just spent like 15 minutes on this and was incredibly proud of myself for coming up with such a complicated solution, too. That sucks.Stallard
I had to find an internal version number that was 1 less than the current - this gave me the answer to do this: max(internal_version - 1) - so stress less :)Enteron
This version can return three items in a group assuming the second 2 items have the same age value. How can I resolve such??Zinciferous
@NINSIIMAWILBER You'll probably need to look at other possible solutions using ROW_NUMBER or something similar. Looks like the original answer from Bill Karwin has had a more recent (5 years ago) update that may work for you.Stallard
C
9
WITH cte_window AS (
SELECT movie_name,director_id,release_date,
ROW_NUMBER() OVER( PARTITION BY director_id ORDER BY release_date DESC) r
FROM movies
)   
SELECT * FROM cte_window WHERE r <= <n>;

Above query will returns latest n movies for each directors.

Cabbageworm answered 14/10, 2021 at 9:14 Comment(1)
Works better than the accepted answer, thank you Hiren!Gram
S
2

I wanted to share this because I spent a long time searching for an easy way to implement this in a java program I'm working on. This doesn't quite give the output you're looking for but its close. The function in mysql called GROUP_CONCAT() worked really well for specifying how many results to return in each group. Using LIMIT or any of the other fancy ways of trying to do this with COUNT didn't work for me. So if you're willing to accept a modified output, its a great solution. Lets say I have a table called 'student' with student ids, their gender, and gpa. Lets say I want to top 5 gpas for each gender. Then I can write the query like this

SELECT sex, SUBSTRING_INDEX(GROUP_CONCAT(cast(gpa AS char ) ORDER BY gpa desc), ',',5) 
AS subcategories FROM student GROUP BY sex;

Note that the parameter '5' tells it how many entries to concatenate into each row

And the output would look something like

+--------+----------------+
| Male   | 4,4,4,4,3.9    |
| Female | 4,4,3.9,3.9,3.8|
+--------+----------------+

You can also change the ORDER BY variable and order them a different way. So if I had the student's age I could replace the 'gpa desc' with 'age desc' and it will work! You can also add variables to the group by statement to get more columns in the output. So this is just a way I found that is pretty flexible and works good if you are ok with just listing results.

Singhalese answered 10/3, 2018 at 3:27 Comment(0)
M
1

In SQL Server row_number() is a powerful function that can get result easily as below

select Person,[group],age
from
(
select * ,row_number() over(partition by [group] order by age desc) rn
from mytable
) t
where rn <= 2
Monjo answered 5/12, 2016 at 14:25 Comment(3)
With 8.0 and 10.2 being GA, this answer is becoming reasonable.Gilbart
@RickJames what does 'being GA' means? Window functions (dev.mysql.com/doc/refman/8.0/en/window-functions.html) solved my problem very well.Vitriform
@Vitriform - "GA" means "Generally Available". It is tech-speak for "ready for prime time", or "released". They are through developing the version and will be focusing on bug that they missed. That link discusses MySQL 8.0's implementation, which might be different than MariaDB 10.2's implementation.Gilbart
R
1

There is a really nice answer to this problem at MySQL - How To Get Top N Rows per Each Group

Based on the solution in the referenced link, your query would be like:

SELECT Person, Group, Age
   FROM
     (SELECT Person, Group, Age, 
                  @group_rank := IF(@group = Group, @group_rank + 1, 1) AS group_rank,
                  @current_group := Group 
       FROM `your_table`
       ORDER BY Group, Age DESC
     ) ranked
   WHERE group_rank <= `n`
   ORDER BY Group, Age DESC;

where n is the top n and your_table is the name of your table.

I think the explanation in the reference is really clear. For quick reference I will copy and paste it here:

Currently MySQL does not support ROW_NUMBER() function that can assign a sequence number within a group, but as a workaround we can use MySQL session variables.

These variables do not require declaration, and can be used in a query to do calculations and to store intermediate results.

@current_country := country This code is executed for each row and stores the value of country column to @current_country variable.

@country_rank := IF(@current_country = country, @country_rank + 1, 1) In this code, if @current_country is the same we increment rank, otherwise set it to 1. For the first row @current_country is NULL, so rank is also set to 1.

For correct ranking, we need to have ORDER BY country, population DESC

Rebbeccarebe answered 28/3, 2018 at 9:4 Comment(3)
Well, it is the principle used by solutions of Marc Byers, Rick James and mine.Seumas
Difficult to say which post (Stack Overflow or SQLlines) was the firstSeumas
@LaurentPELE - Mine was posted Feb, 2015. I see no timestamp or name on SQLlines. MySQL blogs have been around for long enough that some of them are out of date, and should be removed -- people are quoting mis-information.Gilbart
P
0
SELECT
p1.Person,
p1.`GROUP`,
p1.Age  
   FROM
person AS p1 
 WHERE
(
SELECT
    COUNT( DISTINCT ( p2.age ) ) 
FROM
    person AS p2 
WHERE
    p2.`GROUP` = p1.`GROUP` 
    AND p2.Age >= p1.Age 
) < 2 
ORDER BY
p1.`GROUP` ASC,
p1.age DESC

reference leetcode

Pearson answered 29/9, 2020 at 7:13 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.