Efficient latest record query with Postgresql
Asked Answered
G

6

75

I need to do a big query, but I only want the latest records.

For a single entry I would probably do something like

SELECT * FROM table WHERE id = ? ORDER BY date DESC LIMIT 1;

But I need to pull the latest records for a large (thousands of entries) number of records, but only the latest entry.

Here's what I have. It's not very efficient. I was wondering if there's a better way.

SELECT * FROM table a WHERE ID IN $LIST AND date = (SELECT max(date) FROM table b WHERE b.id = a.id);
Gambeson answered 5/11, 2009 at 22:56 Comment(3)
So, does my SELECT DISTINCT query help you in any way? It should be faster than correlated subselects, but I am not sure by how much.Fronia
Use this one https://mcmap.net/q/79916/-sql-join-selecting-the-last-records-in-a-one-to-many-relationshipAlive
Using MAX produces duplicate results if more than one record for each ID has the same max date. This might not be what you want.Headlock
F
68

If you don't want to change your data model, you can use DISTINCT ON to fetch the newest record from table "b" for each entry in "a":

SELECT DISTINCT ON (a.id) *
FROM a
INNER JOIN b ON a.id=b.id
ORDER BY a.id, b.date DESC

If you want to avoid a "sort" in the query, adding an index like this might help you, but I am not sure:

CREATE INDEX b_id_date ON b (id, date DESC)

SELECT DISTINCT ON (b.id) *
FROM a
INNER JOIN b ON a.id=b.id
ORDER BY b.id, b.date DESC

Alternatively, if you want to sort records from table "a" some way:

SELECT DISTINCT ON (sort_column, a.id) *
FROM a
INNER JOIN b ON a.id=b.id
ORDER BY sort_column, a.id, b.date DESC

Alternative approaches

However, all of the above queries still need to read all referenced rows from table "b", so if you have lots of data, it might still just be too slow.

You could create a new table, which only holds the newest "b" record for each a.id -- or even move those columns into the "a" table itself.

Fronia answered 14/11, 2009 at 16:52 Comment(2)
If you're looking for a more efficient solution try Manji's answer below. It benchmarked about 3x faster than using the DISTINCT ON solution presented here.Barnet
this guy went all out on this: https://mcmap.net/q/28543/-select-first-row-in-each-group-by-groupImpetigo
Y
59

this could be more eficient. Difference: query for table b is executed only 1 time, your correlated subquery is executed for every row:

SELECT * 
FROM table a 
JOIN (SELECT ID, max(date) maxDate
        FROM table
      GROUP BY ID) b
ON a.ID = b.ID AND a.date = b.maxDate
WHERE ID IN $LIST 
Yachting answered 5/11, 2009 at 23:5 Comment(7)
Thought that looked promising but the join is really inefficient.Gambeson
Why do you think that join is inefficient, especially given that it joins with just one row?Microclimatology
After testing both methods, max(date) was about 3x faster for me than the DISTINCT ON.Barnet
It's really a great solution! Thanks a lot! My query time reduced from 470ms to 95ms. I have used max(id) as a last row identifier. So it is can be more efficient than datetime comparison.Franklyn
In my case a user table contains all users of the system, an invitation table contains all invitations sent to become users. Multiple invitations can be sent per email address, each invitation has a valid_until date, a invitation_accepted flag, and a created_date. We only care about the most recent invitation. If you accept the use of aggregate functions on all columns, the performance makes this the best answer.Moleskin
FROM user u LEFT JOIN (SELECT lower(i2.email) AS email, bool_or(i2.used) AS used, max(i2.created_date) AS max_created_date, max(i2.valid_until) AS max_valid_until FROM invitation i2 GROUP BY lower(i2.email)) i ON i.email = lower(u.email)Moleskin
IIUC this would not allow pulling another field from b? E.g. I want to show the latest comment on a task.Lashelllasher
Z
13

what do you think about this?

select * from (
   SELECT a.*, row_number() over (partition by a.id order by date desc) r 
   FROM table a where ID IN $LIST 
)
WHERE r=1

i used it a lot on the past

Zwiebel answered 9/10, 2015 at 10:50 Comment(0)
O
4

On method - create a small derivative table containing the most recent update / insertion times on table a - call this table a_latest. Table a_latest will need sufficient granularity to meet your specific query requirements. In your case it should be sufficient to use

CREATE TABLE 
a_latest 
( id INTEGER NOT NULL, 
  date TSTAMP NOT NULL, 
  PRIMARY KEY (id, max_time) );

Then use a query similar to that suggested by najmeddine :

SELECT a.* 
FROM TABLE a, TABLE a_latest 
USING ( id, date );

The trick then is keeping a_latest up to date. Do this using a trigger on insertions and updates. A trigger written in plppgsql is fairly easy to write. I am happy to provide an example if you wish.

The point here is that computation of the latest update time is taken care of during the updates themselves. This shifts more of the load away from the query.

Olympiaolympiad answered 12/11, 2009 at 11:10 Comment(0)
H
2

If you have many rows per id's you definitely want a correlated subquery. It will make 1 index lookup per id, but this is faster than sorting the whole table.

Something like :

SELECT a.id,
(SELECT max(t.date) FROM table t WHERE t.id = a.id) AS lastdate
FROM table2;

The 'table2' you will use is not the table you mention in your query above, because here you need a list of distinct id's for good performance. Since your ids are probably FKs into another table, use this one.

Horrendous answered 7/11, 2009 at 16:36 Comment(0)
H
0

You can use a NOT EXISTS subquery to answer this also. Essentially you're saying "SELECT record... WHERE NOT EXISTS(SELECT newer record)":

SELECT t.id FROM table t
WHERE NOT EXISTS
    (SELECT * FROM table n WHERE t.id = n.id AND n.date > t.date)
Headlock answered 4/8, 2022 at 1:26 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.