select *
from records
where id in ( select max(id) from records group by option_id )
This query works fine even on millions of rows. However as you can see from the result of explain statement:
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------
Nested Loop (cost=30218.84..31781.62 rows=620158 width=44) (actual time=1439.251..1443.458 rows=1057 loops=1)
-> HashAggregate (cost=30218.41..30220.41 rows=200 width=4) (actual time=1439.203..1439.503 rows=1057 loops=1)
-> HashAggregate (cost=30196.72..30206.36 rows=964 width=8) (actual time=1438.523..1438.807 rows=1057 loops=1)
-> Seq Scan on records records_1 (cost=0.00..23995.15 rows=1240315 width=8) (actual time=0.103..527.914 rows=1240315 loops=1)
-> Index Scan using records_pkey on records (cost=0.43..7.80 rows=1 width=44) (actual time=0.002..0.003 rows=1 loops=1057)
Index Cond: (id = (max(records_1.id)))
Total runtime: 1443.752 ms
(cost=0.00..23995.15 rows=1240315 width=8)
<- Here it says it is scanning all rows and that is obviously inefficient.
I also tried reordering the query:
select r.* from records r
inner join (select max(id) id from records group by option_id) r2 on r2.id= r.id;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
Nested Loop (cost=30197.15..37741.04 rows=964 width=44) (actual time=835.519..840.452 rows=1057 loops=1)
-> HashAggregate (cost=30196.72..30206.36 rows=964 width=8) (actual time=835.471..835.836 rows=1057 loops=1)
-> Seq Scan on records (cost=0.00..23995.15 rows=1240315 width=8) (actual time=0.336..348.495 rows=1240315 loops=1)
-> Index Scan using records_pkey on records r (cost=0.43..7.80 rows=1 width=44) (actual time=0.003..0.003 rows=1 loops=1057)
Index Cond: (id = (max(records.id)))
Total runtime: 840.809 ms
(cost=0.00..23995.15 rows=1240315 width=8)
<- Still scanning all rows.
I tried with and without index on (option_id)
, (option_id, id)
, (option_id, id desc)
, none of them had any effect on the query plan.
Is there a way of executing a groupwise maximum query in Postgres without scanning all rows?
What I am looking for, programmatically, is an index which stores the maximum id for each option_id
as they are inserted into the records table. That way, when I query for maximums of option_ids, I should only need to scan index records as many times as there are different option_ids.
I've seen select distinct on
answers all over SO from high ranking users (thanks to @Clodoaldo Neto for giving me keywords to search for). Here's why it doesn't work:
create index index_name on records(option_id, id desc)
select distinct on (option_id) *
from records
order by option_id, id desc
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------
Unique (cost=0.43..76053.10 rows=964 width=44) (actual time=0.049..1668.545 rows=1056 loops=1)
-> Index Scan using records_option_id_id_idx on records (cost=0.43..73337.25 rows=1086342 width=44) (actual time=0.046..1368.300 rows=1086342 loops=1)
Total runtime: 1668.817 ms
That's great, it's using an index. However using an index to scan all ids doesn't really make much sense. According to my executions, it is in fact slower than a simple sequential scan.
Interesting enough, MySQL 5.5 is able to optimize the query simply using an index on records(option_id, id)
mysql> select count(1) from records;
+----------+
| count(1) |
+----------+
| 1086342 |
+----------+
1 row in set (0.00 sec)
mysql> explain extended select * from records
inner join ( select max(id) max_id from records group by option_id ) mr
on mr.max_id= records.id;
+------+----------+--------------------------+
| rows | filtered | Extra |
+------+----------+--------------------------+
| 1056 | 100.00 | |
| 1 | 100.00 | |
| 201 | 100.00 | Using index for group-by |
+------+----------+--------------------------+
3 rows in set, 1 warning (0.02 sec)
(option_id, id desc)
index and runANALYZE
against the given table? Btw, what posgtresql version are you running? – Birkleoption_id
will unlikely affect it in any way since you still need to retrieveMAX(id)
hence iterate over all rows. – BirkleIndexes are smaller than the whole dataset and it's more chance they are in a cache.
<- @Birkle I don't see how that is possible since the "whole dataset" here is just a bunch of ids I'm trying to get into a derived table so they can be queried against. Sequential scan is faster than index scan in case you are going through the whole table. – Afieldoption_id
do you have? – Varsityoptions
where all the options are listed that can be referenced in tablerecords
. Plus a foreign key constraint onrecords.option_id
. Do you? – Clapboard