Fetch the rows which have the Max value for a column for each distinct value of another column

Table:

UserId, Value, Date.

I want to get the UserId, Value for the max(Date) for each UserId. That is, the Value for each UserId that has the latest date.

How do I do this in SQL? (Preferably Oracle.)

I need to get ALL the UserIds. But for each UserId, only that row where that user has the latest date.

Tarantella answered 23/9, 2008 at 14:34 Comment(9)

What if there are multiple rows having the maximum date value for a particular userid? – Wore 23/9, 2008 at 18:29

What are the key fields of the table? – Hughie 20/6, 2013 at 9:53

some solutions below compared: sqlfiddle.com/#!4/6d4e81/1 – Scuppernong 7/8, 2014 at 7:27

@DavidAldridge, That column is likely unique. – Lopes 3/2, 2015 at 3:38

#2854757 – Callus 11/10, 2015 at 10:29

Possible duplicate of How can I SELECT rows with MAX(Column value), DISTINCT by another column in SQL? – Neuromuscular 12/6, 2016 at 22:19

Postgres users probably want to look at #3801051 – Newsom 3/5, 2017 at 6:42

I am surprised that all the solutions indicated here are too verbose and there is no easier and more direct way to solve such a common issue. – Airdrop 30/12, 2020 at 11:37

select userid, my_date, ... from ( select userid, my_date, ... max(my_date) over (partition by userid) max_my_date from users ) where my_date = max_my_date – Subteen 6/8, 2021 at 16:29

This will retrieve all rows for which the my_date column value is equal to the maximum value of my_date for that userid. If several rows for a userid share that maximum date, all of them are returned.

select userid,
       my_date,
       ...
from
(
select userid,
       my_date,
       ...
       max(my_date) over (partition by userid) max_my_date
from   users
)
where my_date = max_my_date
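For anyone who wants to run the query above, here is a minimal sketch with the elided columns filled in. The sample rows, the val column (standing in for the question's Value) and the inline-view alias latest are assumptions made for illustration; the alias is optional in Oracle but required by some other databases.

```sql
-- Hypothetical sample data; column names and rows are assumptions.
CREATE TABLE users (
  userid  INTEGER,
  val     VARCHAR(10),
  my_date DATE
);

INSERT INTO users (userid, val, my_date) VALUES (1, 'a', DATE '2008-09-01');
INSERT INTO users (userid, val, my_date) VALUES (1, 'b', DATE '2008-09-20');
INSERT INTO users (userid, val, my_date) VALUES (2, 'c', DATE '2008-09-05');
INSERT INTO users (userid, val, my_date) VALUES (2, 'd', DATE '2008-09-05');

-- Keep only the rows whose date equals the per-user maximum.
SELECT userid, val, my_date
FROM (
  SELECT userid, val, my_date,
         MAX(my_date) OVER (PARTITION BY userid) AS max_my_date
  FROM   users
) latest
WHERE my_date = max_my_date
ORDER BY userid, val;

-- Rows returned: (1, 'b', 2008-09-20), (2, 'c', 2008-09-05), (2, 'd', 2008-09-05)
-- User 2 appears twice because two of its rows share the maximum date.
```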

"Analytic functions rock"

Edit: With regard to the first comment ...

"using analytic queries and a self-join defeats the purpose of analytic queries"

There is no self-join in this code. There is instead a predicate placed on the result of the inline view that contains the analytic function -- a very different matter, and completely standard practice.

"The default window in Oracle is from the first row in the partition to the current one"

The windowing clause is only applicable in the presence of the order by clause. With no order by clause, no windowing clause is applied by default and none can be explicitly specified.
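The difference can be seen by computing the aggregate both ways side by side. This is only a sketch on assumed sample data; the column aliases partition_max and running_max are made up for the example.

```sql
-- Hypothetical sample data for a single user.
CREATE TABLE users (
  userid  INTEGER,
  my_date DATE
);

INSERT INTO users (userid, my_date) VALUES (1, DATE '2008-09-01');
INSERT INTO users (userid, my_date) VALUES (1, DATE '2008-09-10');
INSERT INTO users (userid, my_date) VALUES (1, DATE '2008-09-20');

SELECT userid,
       my_date,
       -- No ORDER BY: the aggregate covers the whole partition.
       MAX(my_date) OVER (PARTITION BY userid) AS partition_max,
       -- With ORDER BY: the default window runs from the start of the
       -- partition to the current row, so this is a running maximum.
       MAX(my_date) OVER (PARTITION BY userid ORDER BY my_date) AS running_max
FROM   users
ORDER BY userid, my_date;

-- partition_max is 2008-09-20 on every row;
-- running_max equals my_date on every row (09-01, 09-10, 09-20).
```

Only the first form is used in the query above, which is why the predicate my_date = max_my_date picks out the latest rows.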

The code works.
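If exactly one row per userid is wanted even when the maximum date is tied, the same inline-view pattern can filter on ROW_NUMBER() instead of MAX(), as suggested in the comments. The following is a sketch under assumptions: the sample data, the val column and the aliases rn and ranked are illustrative, and ordering ties by val is an arbitrary choice made only to keep the result deterministic.

```sql
-- Hypothetical sample data; user 2 has two rows on its latest date.
CREATE TABLE users (
  userid  INTEGER,
  val     VARCHAR(10),
  my_date DATE
);

INSERT INTO users (userid, val, my_date) VALUES (1, 'a', DATE '2008-09-01');
INSERT INTO users (userid, val, my_date) VALUES (1, 'b', DATE '2008-09-20');
INSERT INTO users (userid, val, my_date) VALUES (2, 'c', DATE '2008-09-05');
INSERT INTO users (userid, val, my_date) VALUES (2, 'd', DATE '2008-09-05');

-- Number each user's rows from newest to oldest and keep the first one.
SELECT userid, val, my_date
FROM (
  SELECT userid, val, my_date,
         ROW_NUMBER() OVER (PARTITION BY userid
                            ORDER BY my_date DESC, val) AS rn
  FROM   users
) ranked
WHERE rn = 1
ORDER BY userid;

-- Rows returned: (1, 'b', 2008-09-20), (2, 'c', 2008-09-05)
```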

Wore answered 23/9, 2008 at 14:34 Comment(15)

When applied to a table having 8.8 million rows, this query took half the time of the queries in some of the other highly voted answers. – Boardman 15/4, 2011 at 23:59

What indexes should I use to make this query (specifically this query) go faster? I'm a bit over my head and using Oracle right now. In a table with 5.5 million rows, this call isn't returning over 30 seconds, and I was hoping for ~100ms or less on this call. – Apollinaire 18/3, 2012 at 0:38

I think you'd have to use a combined index over userid and my_date, so the database can completely use the index to get you the results fast and only read the relevant rows – Holmgren 6/5, 2014 at 14:14

Anyone care to post a link to the MySQL equivalent of this, if there is one? – Malaria 10/1, 2015 at 2:35

Couldn't this return duplicates? Eg. if two rows have the same user_id and the same date (which happens to be the max). – Subsidize 15/6, 2016 at 19:30

@Subsidize I think that was acknowledged in the question – Wore 17/6, 2016 at 15:47

@DavidAldridge Are you referring to "That column is likely unique"? – Subsidize 20/6, 2016 at 17:21

Instead of MAX(...) OVER (...) you can also use ROW_NUMBER() OVER (...) (for the top-n-per-group) or RANK() OVER (...) (for the greatest-n-per-group). – Hydrophobic 27/6, 2016 at 8:13

Is there a way to run the inner query without having to display the max(value) ? I am in the case where I don't have a where clause (I want all matching rows and no duplicates can be), but I would prefer to not display the max value. – Winonawinonah 14/2, 2018 at 13:56

@Hydrophobic : isn't "top-n-per-group" or "greatest-n-per-group" the same as "max per group"?? And could you please provide a short example?? Thanks in advance. – Gregory 26/10, 2020 at 13:35

@Gregory If you filter on ROW_NUMBER() OVER ( ... ), you get exactly n rows for each partition. If you filter on RANK() OVER ( ... ), you get the rows for the n top values, which may be more than n rows if there are ties. Using DENSE_RANK gives the rows for the n top unique values, which will be more than n rows if there are ties. Filtering on MAX( ... ) OVER (...) is the same as filtering on RANK (or DENSE_RANK) and restricting to the first rank only; it may or may not be the same as filtering on ROW_NUMBER() OVER (...), depending on whether the values in the ORDER BY clause are unique. – Hydrophobic 26/10, 2020 at 13:44

@Gregory An example of the differences is db<>fiddle. – Hydrophobic 26/10, 2020 at 14:14

The question mentions "... for each UserId". This will only return userId/value pairs for the users that have an entry where my_date = max_my_date, but not the rest. Right? – Unbreathed 30/10, 2020 at 15:33

@MT0: Wow. Very enlightening! Thanks! It'll take me some time to wrap my head around it. – Gregory 8/2, 2021 at 16:6

@Hydrophobic Thanks for this example, I changed to MySQL: dbfiddle.uk/Ig2x5eN0. – Dressel 21/9, 2022 at 7:14

I see many people use subqueries or else window functions to do this, but I often do this kind of query without subqueries in the following way. It uses plain, standard SQL so it should work in any brand of RDBMS.

SELECT t1.*
FROM mytable t1
  LEFT OUTER JOIN mytable t2
    ON (t1.UserId = t2.UserId AND t1."Date" < t2."Date")
WHERE t2.UserId IS NULL;

In other words: fetch the row from t1 where no other row exists with the same UserId and a greater Date.

(I put the identifier "Date" in delimiters because it's an SQL reserved word.)

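If two rows for the same UserId share the maximum date (t1."Date" = t2."Date"), both of them are returned. Tables usually have an auto-increment (or sequence-generated) key, e.g. id, which can serve as a tie-breaker to avoid the duplicates: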

SELECT t1.*
FROM mytable t1
  LEFT OUTER JOIN mytable t2
    ON t1.UserId = t2.UserId AND ((t1."Date" < t2."Date") 
         OR (t1."Date" = t2."Date" AND t1.id < t2.id))
WHERE t2.UserId IS NULL;
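To try the tie-breaking version, a small test table is enough. This is a sketch on assumed data: the rows and the Val column are made up, and the double-quoted "Date" identifier assumes a database that treats double quotes as identifier delimiters (Oracle and PostgreSQL do; MySQL only in ANSI_QUOTES mode).

```sql
-- Hypothetical sample data; user 2 has two rows on the same latest date.
CREATE TABLE mytable (
  id     INTEGER PRIMARY KEY,
  UserId INTEGER,
  Val    VARCHAR(10),
  "Date" DATE
);

INSERT INTO mytable (id, UserId, Val, "Date") VALUES (1, 1, 'a', DATE '2008-09-01');
INSERT INTO mytable (id, UserId, Val, "Date") VALUES (2, 1, 'b', DATE '2008-09-20');
INSERT INTO mytable (id, UserId, Val, "Date") VALUES (3, 2, 'c', DATE '2008-09-05');
INSERT INTO mytable (id, UserId, Val, "Date") VALUES (4, 2, 'd', DATE '2008-09-05');

-- A row survives only if no other row of the same user has a later date,
-- or the same date and a higher id.
SELECT t1.*
FROM mytable t1
  LEFT OUTER JOIN mytable t2
    ON t1.UserId = t2.UserId
   AND (t1."Date" < t2."Date"
        OR (t1."Date" = t2."Date" AND t1.id < t2.id))
WHERE t2.UserId IS NULL
ORDER BY t1.UserId;

-- Rows returned: id 2 (user 1, 'b') and id 4 (user 2, 'd').
-- Without the id comparison, ids 3 and 4 would both be returned for user 2.
```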

Re comment from @Farhan:

Here's a more detailed explanation:

An outer join attempts to join t1 with t2. By default, all results of t1 are returned, and if there is a match in t2, it is also returned. If there is no match in t2 for a given row of t1, then the query still returns the row of t1, and uses NULL as a placeholder for all of t2's columns. That's just how outer joins work in general.

The trick in this query is to design the join's matching condition such that t2 must match the same userid, and a greater date. The idea being if a row exists in t2 that has a greater date, then the row in t1 it's compared against can't be the greatest date for that userid. But if there is no match -- i.e. if no row exists in t2 with a greater date than the row in t1 -- we know that the row in t1 was the row with the greatest date for the given userid.

In those cases (when there's no match), the columns of t2 will be NULL -- even the columns specified in the join condition. So that's why we use WHERE t2.UserId IS NULL, because we're searching for the cases where no row was found with a greater date for the given userid.
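It can help to look at the outer join before the WHERE clause is applied. The sketch below uses assumed sample data (one user, three dates) and made-up column aliases; it leaves the filter off so the NULL placeholders are visible.

```sql
-- Hypothetical sample data: one user with three dates.
CREATE TABLE mytable (
  id     INTEGER PRIMARY KEY,
  UserId INTEGER,
  "Date" DATE
);

INSERT INTO mytable (id, UserId, "Date") VALUES (1, 1, DATE '2015-01-01');
INSERT INTO mytable (id, UserId, "Date") VALUES (2, 1, DATE '2015-02-01');
INSERT INTO mytable (id, UserId, "Date") VALUES (3, 1, DATE '2015-03-01');

-- The same join as above, without the WHERE filter.
SELECT t1.id     AS t1_id,
       t1."Date" AS t1_date,
       t2.id     AS t2_id,
       t2."Date" AS t2_date
FROM mytable t1
  LEFT OUTER JOIN mytable t2
    ON t1.UserId = t2.UserId AND t1."Date" < t2."Date"
ORDER BY t1.id, t2.id;

-- t1_id | t1_date    | t2_id | t2_date
--   1   | 2015-01-01 |   2   | 2015-02-01
--   1   | 2015-01-01 |   3   | 2015-03-01
--   2   | 2015-02-01 |   3   | 2015-03-01
--   3   | 2015-03-01 | NULL  | NULL
```

Only the March 1 row has no later partner, so it is the only row that WHERE t2.UserId IS NULL keeps.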

Rochellrochella answered 23/9, 2008 at 20:1 Comment(36)

Wow Bill. This is the most creative solution to this problem I've seen. It is pretty performant too on my fairly large data set. This sure beats many of the other solutions I've seen or my own attempts at solving this quandary. – Scrupulous 13/1, 2011 at 2:7

When applied to a table having 8.8 million rows, this query took almost twice as long as that in the accepted answer. – Boardman 15/4, 2011 at 23:11

@Derek: Optimizations depend on the brand and version of RDBMS, as well as presence of appropriate indexes, data types, etc. – Rochellrochella 19/4, 2011 at 17:30

Bill, I ran my test on an Oracle 10 database server (tag on question assumes Oracle) with index on a column analogous to UserId and a compound index that includes a column analogous to Date. Perhaps the query would take less time with an index that includes only Date. – Boardman 19/4, 2011 at 17:56

On MySQL, this kind of query appears to actually cause it to loop over the result of a Cartesian join between the tables, resulting in O(n^2) time. Using the subquery method instead reduced the query time from 2.0s to 0.003s. YMMV. – Argyres 22/2, 2012 at 6:22

@Jesse: on MySQL, all joins are nested-loop joins. If you have an index on (UserId,Date) in this case, you should be able to achieve an index-only join and speed it up a great deal. – Rochellrochella 28/2, 2012 at 17:36

Is there a way to adapt this to match rows where date is the greatest date less than or equal to a user given date? For example if the user gives the date "23-OCT-2011", and the table includes rows for "24-OCT-2011", "22-OCT-2011", "20-OCT-2011", then I want to get "22-OCT-2011". Been scratching my head and reading this snippet for a while now... – Apollinaire 17/3, 2012 at 8:18

@CoryKendall, add conditions for both t1 and t2 to the join condition: AND t1.Date <= '2011-10-23' AND t2.Date <= '2011-10-23' in addition to the other join conditions I have shown above. – Rochellrochella 17/3, 2012 at 16:59

Replace table AS t1 by table t1 to make it work on all DBMSs, including Oracle (fails with AS). – Doughman 15/1, 2013 at 13:51

@BillKarwin "add conditions for both t1 and t2 to the join condition" -- This doesn't seem to work (incorrect results)! What I did instead was use the subquery modularization:

WITH subq AS (SELECT * FROM mytable WHERE "Date" <= '2011-10-23') SELECT t1.* FROM subq t1 LEFT OUTER JOIN subq t2 ON ( [...]

This works because only filtered data is provided as input to the left outer join. It also has the added advantage of providing the condition only once. – Scipio 16/1, 2014 at 6:46

@ADTC, good solution! I work with MySQL more frequently, and MySQL doesn't support WITH expressions yet. – Rochellrochella 16/1, 2014 at 16:44

That's really sad because the main problem with SQL is the lack of modularization, but the WITH construct somehow eases the pain by providing a basic layer of modularization. It should really be a standard SQL (if it's not already). Btw, your original proposal did not seem to give the correct results in Postgres. Does it give the correct results in MySQL? – Scipio 16/1, 2014 at 17:45

@ADTC, yes, the WITH construct is part of SQL:2003. For the last ~5 years, MySQL development has focused on improving performance and scalability by changing code deep in the storage engines, but they have done less work adding SQL features. – Rochellrochella 16/1, 2014 at 17:50

@DavidMann, it's frequently called an exclusion join. – Rochellrochella 25/6, 2014 at 17:48

@BillKarwin Ah sure, the outer join is an exclusion join. I guess I meant to ask if there was a name for the approach of using an exclusion join with some condition that lets one solve a 'greatest-n-per-group' problem – Lina 25/6, 2014 at 18:57

@DavidMann, oh, I don't know if this has a particular pattern name. – Rochellrochella 25/6, 2014 at 18:57

I'm sorry but why doesn't this return NULL for cases where t1.date > t2.date? – Ostyak 21/2, 2015 at 18:2

@dani-h, if t1.date > t2.date, and there are only two rows, then yes of course t2.* would return NULL. But t2 could be any row with the same userid. If t2 matches even one row with a greater date, then t2.* will return non-NULL. Only if t1 has a greater date than all rows matched by t2, does t2.* return NULL. Does that help? – Rochellrochella 21/2, 2015 at 18:35

@BillKarwin Thanks for attempting to explain this, but I think you've confused me even more :]. A left join is similar to a cartesian join, yes? Meaning that all rows in t1 are mixed with all rows in t2, where the id matches. If t2.date > t1.date it returns the row in t1 joined by the row in t2. If t1.date > t2.date then there is no match on the right hand side, shouldn't it return NULL for these values as well? – Ostyak 21/2, 2015 at 21:14

@dani-h, Suppose you have three rows: January 1, February 1, and March 1. Suppose t1 points to February 1. You join t1 to the set of rows with a greater date, and call it t2. The first row (January 1) is not greater, so it is not in that set. Does the join therefore return NULL? No -- because the third row (March 1) is greater than t1 and is in the set of t2. Therefore t1 referencing February 1 is not the row with the greatest date. Only when t1 references March 1, and no row is found that is greater, does t2 return NULLs, and t1 is the greatest. – Rochellrochella 22/2, 2015 at 9:29

@BillKarwin. I am a newbie to SQL, trying to understand the solution. I was wondering why we need a WHERE clause. Can't we put the WHERE condition directly in the ON clause, i.e. ON (t1.UserId = t2.UserId AND t1."Date" < t2."Date" AND t2.UserId IS NULL)? Can you please explain? – Hitchcock 7/9, 2015 at 16:6

@frank, because t2.UserId is not null until after the outer join has been evaluated. Please study about outer joins. – Rochellrochella 7/9, 2015 at 18:1

This performs terribly on some RDBMSs, but I upvoted it anyway because it's a fresh and awesome way to think about the problem! – Scilla 6/6, 2016 at 1:53

@JonKloske since answering this question in 2008, I have found the performance has a lot to do with the data. I.e. how many rows per distinct UserId. Anyway, it's almost always a better solution than correlated subqueries. – Rochellrochella 6/6, 2016 at 3:37

Yep, it very much depends on how easy it is to join with an index, too. If for example you have datetime log data and you're grouping by date(datetime), in MySQL at least it's not indexable so it's O(n^2), which is worse than some subquery approaches, but as they're all terrible for large row counts anyway it doesn't matter much practically. And obviously that's not Oracle, though I haven't tested that; maybe that case is bad there too. – Scilla 7/6, 2016 at 21:53

(I found a very quick O(n) solution for that case in mysql that I haven't seen anywhere on SO for those type of questions that also works generally for any type of 'select max or min row' query that also makes it easy to pluck out both in the same row at no extra cost, but er, to paraphrase Fermat, the details are too big to fit in this margin!!!) – Scilla 7/6, 2016 at 21:57

"t uses plain, standard SQL" - window functions are standard SQL and are not "vendor specific". They have been part of the SQL standard since 2003 – Suisse 30/8, 2016 at 8:43

@a_horse_with_no_name - perhaps the sentence should say widely supported standard SQL, since MySQL did not support window functions until 8.0.2 in 2018. (And sadly, some of us are stuck on legacy implementations that haven't upgraded to 8...) – Median 8/7, 2020 at 0:42

I edited the answer to say "window functions" instead of "vendor specific features". – Rochellrochella 8/7, 2020 at 1:15

Yes, @BillKarwin, it works as expected. But how do I query it randomly? – Basophil 11/8, 2020 at 8:31

@LeangSocheat That sounds like a new question. – Rochellrochella 11/8, 2020 at 13:44

This performs much faster than the accepted answer, provided indexes can be used. – Endogen 5/5, 2021 at 14:45

the query is cool but how do you avoid doubling if the field Date is not unique and you don't have a secondary ID field in the table? – Splash 2/3, 2023 at 16:58

@AndreaMauro Use a window function solution. – Rochellrochella 2/3, 2023 at 17:9

Another explanation: All rows of t1 must be considered. For every id in t1, there is an id in t2 with a greater date, except one. – Crock 15/5, 2023 at 19:45

These mysql docs explain this approach, and also others: dev.mysql.com/doc/mysql-tutorial-excerpt/8.0/en/… – Shiff 18/10, 2023 at 12:25
