PostgreSQL query very slow with limit 1

My queries get very slow when I add a limit 1.

I have a table object_values with timestamped values for objects:

 timestamp |  objectID |  value
--------------------------------
 2014-01-27|       234 | ksghdf

Per object I want to get the latest value:

SELECT * FROM object_values WHERE (objectID = 53708) ORDER BY timestamp DESC LIMIT 1;

(I cancelled the query after more than 10 minutes)

This query is very slow when there are no values for a given objectID (it is fast if there are results). If I remove the limit, it tells me nearly instantaneously that there are no results:

SELECT * FROM object_values WHERE (objectID = 53708) ORDER BY timestamp DESC;  
...  
Time: 0.463 ms

An explain shows me that the query without the limit uses the index, whereas the query with limit 1 does not make use of the index:

Slow query:

explain SELECT * FROM object_values WHERE (objectID = 53708) ORDER BY timestamp DESC limit 1;
                                                         QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..2350.44 rows=1 width=126)
   ->  Index Scan Backward using object_values_timestamp on object_values  (cost=0.00..3995743.59 rows=1700 width=126)
         Filter: (objectID = 53708)

Fast query:

explain SELECT * FROM object_values WHERE (objectID = 53708) ORDER BY timestamp DESC;
                                                  QUERY PLAN
--------------------------------------------------------------------------------------------------------------
 Sort  (cost=6540.86..6545.11 rows=1700 width=126)
   Sort Key: timestamp
   ->  Index Scan using object_values_objectID on working_hours_t  (cost=0.00..6449.65 rows=1700 width=126)
         Index Cond: (objectID = 53708)

The table contains 44,884,559 rows and 66,762 distinct objectIDs.
I have separate indexes on both fields: timestamp and objectID.
I have done a vacuum analyze on the table and I have reindexed the table.
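
For reference, the table and indexes look roughly like this (the index names are taken from the query plans; the column types are simplified here):

CREATE TABLE object_values (
    "timestamp" timestamp NOT NULL,  -- simplified/assumed type
    objectID    bigint    NOT NULL,  -- simplified/assumed type
    value       text
);
CREATE INDEX object_values_timestamp ON object_values ("timestamp");
CREATE INDEX object_values_objectID  ON object_values (objectID);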

Additionally, the slow query becomes fast when I set the limit to 3 or higher:

explain SELECT * FROM object_values WHERE (objectID = 53708) ORDER BY timestamp DESC limit 3;
                                                     QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
 Limit  (cost=6471.62..6471.63 rows=3 width=126)
   ->  Sort  (cost=6471.62..6475.87 rows=1700 width=126)
         Sort Key: timestamp
         ->  Index Scan using object_values_objectID on object_values  (cost=0.00..6449.65 rows=1700 width=126)
               Index Cond: (objectID = 53708)

In general I assume it has to do with the planner making wrong assumptions about the execution costs and therefore choosing a slower execution plan.

Is this the real reason? Is there a solution for this?

Laurenelaurens answered 27/1, 2014 at 16:1 Comment(2)
Has this issue been raised on pg-bugs and been resolved in the latest version of postgres?Belicia
@ShiwanginiShishulkar - I'm askingBelicia
M
55

You're running into an issue which relates, I think, to the lack of statistics on row correlations. Consider reporting it to pg-bugs for reference if this is happening on the latest version of Postgres.

The interpretation I'd suggest for your plans is:

  • limit 1 makes Postgres look for a single row, and in doing so it assumes that your object_id is common enough that it'll show up reasonably quickly in an index scan.

    Based on the stats you gave, it probably estimates that it'll need to read ~70 rows on average to find one row that fits; it just doesn't realize that object_id and timestamp correlate to the point where it's actually going to read a large portion of the table.

  • limit 3, in contrast, makes it realize that it's uncommon enough, so it seriously considers (and ends up) doing a top-N sort of the expected 1700 rows with the object_id you want, on the grounds that doing so is likely cheaper.

    For instance, it might know from the statistics that these rows are all packed in the same area of the disk.

  • no limit clause means it'll fetch all 1700 rows anyway, so it goes straight for the index on object_id.

Solution, btw:

add an index on (object_id, timestamp) or (object_id, timestamp desc).
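
For example (the exact index name doesn't matter, it's just illustrative):

CREATE INDEX object_values_objectid_ts_idx
    ON object_values (objectID, timestamp DESC);

With that index, Postgres can scan just the entries for objectID = 53708, already ordered by timestamp, and stop after the first row.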

Muricate answered 27/1, 2014 at 16:34 Comment(10)
For the 'limit 1' case did you mean table scan? You wrote index scanMeasly
@harmic: OP has an index scan there… not necessarily of the whole table, but certainly of a lot more of it than PG thought.Muricate
You're right! I only read OP's text where he said it wasn't using the index. But it chooses to scan the timestamp index; weird choiceMeasly
@harmic: Not weird, it entirely depends on OP's stats. If PG thinks (as it probably does) that it'll find a row early enough, it'll index scan the entire table with a query like that. I've seen it do that many times… :-(Muricate
@Denis: thanks for your reply, I already thought that the explanation would be something like this. The combined index solved it indeed, and your reply made me realise a lot about indexes, sorting and combined indexes. Thanks for that. As the issue is based on the stats, it could be that it only emerges as the table fills?!Laurenelaurens
@Measly sorry for being a bit unclear about using the index or not. I am inexperienced in reading the explain output. So what is the difference between index scan ... filter and index scan ... index cond?Laurenelaurens
@pat: the issue is due to correlations in the table. Postgres collects little to no stats or data on them. As such, any plan Postgres will come up with will make the assumption that the data is entirely non-correlated. It would not know, for instance, that an auto incrementing ID might correlate very strongly with an auto-populated date_created field. :-)Muricate
@denis: but why is correlation an issue here? objectID and timestamp are not related. Think of storing measurements for an object at a certain time.Laurenelaurens
I think what Denis means is that both are increasing as you add rows to the table. If it is a created_on timestamp, and not an updated_on, then that means that they are strictly correlated--larger IDs will always be paired with larger timestamps. If it's changed on update, there is still at least a "default" correlation that may degrade over time (as rows are updated).Bridgeboard
If you have an index on the filtered column, you should LIMIT at least 5 rows, or it won't use the index.Nutmeg

You can avoid this issue by adding an extra, unneeded column to the ORDER BY clause of the query.

SELECT * FROM object_values 
WHERE (objectID = 53708) 
ORDER BY timestamp DESC, objectID 
limit 1;
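
The likely reason this works: the compound sort key (timestamp DESC, objectID) can no longer be produced by walking the single-column timestamp index backwards, so the planner stops considering that plan and falls back to the objectID index plus a small top-1 sort. You can check which plan gets chosen on your own data with:

EXPLAIN SELECT * FROM object_values
WHERE (objectID = 53708)
ORDER BY timestamp DESC, objectID
LIMIT 1;
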
Cough answered 1/12, 2014 at 21:31 Comment(8)
HA! That is awesome! Completely fixes it!Placeman
This answer actually works, unlike the answer and all the comments above.Wolverhampton
That's amazing! Just boost my query and can use it in runtime. Thanks!Ningsia
This worked wonders where an empty result set was taking 150s, reducing it to 2ms. But for the majority, non-empty case, it went up from 2ms to 38s, so I'm kind of back where I started :-(Hedrick
Good one. Would it be possible to get an explanation of why this is so?Biotite
Discussion of this bug on pg list: postgresql.org/message-id/flat/…Skinnydip
In my experience, the unneeded ORDER BY trick no longer works for Postgres 13. Instead the trick becomes to rewrite the queries (with a CTE or a subquery) so as to move the LIMIT, as in this example (a sketch of that rewrite follows these comments).Helga
In our case - this fails - because your suggestion moves the DESC to object id -- which makes this return a limited set of ASCENDING qualified items - which ARE THE WRONG ITEMS. If you change the above to include the DESC on the timestamp - PG 14 FAILS UTTERLY, just like in the OP.Peggie
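
A sketch of the rewrite described in the Postgres 13 comment above (moving the LIMIT outside a materialized CTE so the inner query is planned without it); this is one interpretation of that suggestion, not the linked example:

WITH latest AS MATERIALIZED (
    SELECT * FROM object_values
    WHERE (objectID = 53708)
    ORDER BY timestamp DESC
)
SELECT * FROM latest LIMIT 1;

MATERIALIZED (Postgres 12+) keeps the CTE from being inlined, so the LIMIT never influences the plan for the inner query.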

I started having similar symptoms on an update-heavy table, and what was needed in my case was

analyze $table_name;

In this case the statistics needed to be refreshed, which then fixed the slow query plans that were occurring.
Supporting docs: https://www.postgresql.org/docs/current/sql-analyze.html
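
Applied to the table from this question, that is simply:

ANALYZE object_values;

If the table is update-heavy, making sure autovacuum keeps up (or running vacuumdb -z on a schedule) keeps those statistics from going stale again.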

Travers answered 20/3, 2019 at 20:44 Comment(5)
Wow, this is a game changer when you create indexes or try to find the best query in local development! Thanks a lot!Duppy
This didn't help in our case. Great idea if your DB is changing a lot though! :)Topside
Query went from 2 minutes to 250ms. Thank you for this black magic!Diplococcus
This immediately fixed my issue where a query suddenly became slow after an minor version update of my RDS aurora-postgresql cluster. Thank you!Lamphere
I agree with this answer, it also solved my issue. To add to this, ensure that autovacuum is set to on in the postgresql.conf file or ensure you regularly run vacuumdb -zDiscordant

Not a fix, but sure enough, switching from limit 1 to limit 50 (for me) and returning the first result row is way faster... Postgres 9.x in this instance. Just thought I'd mention it as a workaround similar to the one the OP mentioned.

Topside answered 23/2, 2021 at 16:54 Comment(1)
I got a similar problem with LIMIT 50 actually (it works fine without any LIMIT for a query returning around 2000 rows). So this probably depends on many variables, and when PG chooses a different plan it's often out of our control, even after ANALYSE.Porty

I can show you my custom method to handle that. This method may not be the right way, but it works fast. Here is my PostgreSQL function:

CREATE OR REPLACE FUNCTION get_limited_object_values(_object_id BIGINT, _limit INTEGER)
    RETURNS SETOF object_values
    LANGUAGE plpgsql AS
$$
DECLARE
    row_object_value object_values;
    index INT := 0;
BEGIN
    -- Loop over the rows for this object in descending timestamp order and
    -- stop manually after _limit rows, so the query itself never contains a
    -- LIMIT clause that could steer the planner towards the bad plan.
    FOR row_object_value IN (SELECT * FROM object_values WHERE (objectID = _object_id) ORDER BY timestamp DESC) LOOP
        IF index >= _limit THEN
            EXIT;
        END IF;
        RETURN NEXT row_object_value;
        index := index + 1;
    END LOOP;
    RETURN;
END;
$$;

You can use it like: SELECT * FROM get_limited_object_values(53708, 1);

Shiva answered 20/9, 2023 at 4:4 Comment(0)
