Why does a higher LIMIT offset slow the query down in MySQL?

Scenario in short: a table with more than 16 million records [2 GB in size]. The higher the LIMIT offset in a SELECT, the slower the query becomes when using ORDER BY *primary_key*.

So

SELECT * FROM large ORDER BY `id`  LIMIT 0, 30 

takes far less than

SELECT * FROM large ORDER BY `id` LIMIT 10000, 30 

Both queries return just 30 ordered records, the same either way, so the overhead is not coming from ORDER BY.
Right now, fetching the latest 30 rows takes around 180 seconds. How can I optimize this simple query?

Erinerina answered 19/12, 2010 at 3:1 Comment(3)
NOTE: I'm the author. MySQL doesn't refer to the index (PRIMARY) in the above cases. See the answer below by user "Quassnoi" for an explanation.Erinerina
possible duplicate of How can I speed up a MySQL query with a large offset in the LIMIT clause?Loper
A related link: We need tool support for keyset pagination. If you’d like to know what happens inside the database when using offset or keyset pagination, have a look at those slides.Lockwood

It's normal that higher offsets slow the query down, since the query needs to count off the first OFFSET + LIMIT records (and take only LIMIT of them). The higher this value is, the longer the query runs.

The query cannot go right to OFFSET because, first, the records can be of different lengths and, second, there can be gaps from deleted records. It needs to check and count each record on its way.

Assuming that id is the primary key of a MyISAM table, or a unique non-primary key field on an InnoDB table, you can speed it up by using this trick:

SELECT  t.* 
FROM    (
        SELECT  id
        FROM    mytable
        ORDER BY
                id
        LIMIT 10000, 30
        ) q
JOIN    mytable t
ON      t.id = q.id

See this article:

Pharmacopoeia answered 21/12, 2010 at 18:6 Comment(18)
MySQL "early row lookup" behavior was the answer why it's talking so long. By the trick you provided, only matched ids (by the index directly) are bound, saving unneeded row lookups of too many records. That did the trick, hooray!Erinerina
Awesome ... are there any limitations, where this trick will not work?Obaza
@harald: what exactly do you mean by "not work"? This is a pure performance improvement. If there is no index usable by ORDER BY or the index covers all fields you need, you don't need this workaround.Pharmacopoeia
@Quassnoi: I don't know. I played around with this on some of my own tables with millions of rows, where I had performance problems before, and the solution you provided works like a charm for me. I guess I have to delve a little deeper into what's going on here to fully understand this solution. Thanks!Obaza
From my tests this solution has its limits. For a table with 16M rows, this method gets slow once the offset is around half the table, i.e. LIMIT 7100000,100000Shum
@f055: the answer says "speed up", not "make instant". Have you read the very first sentence of the answer?Pharmacopoeia
Is this approach applicable to PostgreSQL? How does it compare to using a server-side cursor in terms of performance? I'm talking about a data volume of around 2M records per table, for some tables. Inspiration for this comment, if you're curious, is my question on SO: #24265317.Ihs
Is it possible to run something like this for InnoDB?Latrice
@NeverEndingQueue: in InnoDB, tables are clustered on the primary key, so it's useless if you order by the primary key. If you order by a secondary index key, then yes, it would make sense.Pharmacopoeia
@Obaza - The trick will slow down once you have to scan so much of the index on id that the 'trick' becomes slow. The real 'fix' is to "remember where you left off".Inger
So, does MySQL re-use the position it gained on a previous call to LIMIT, if the calls go sequentially?Taneshatang
@Dims: no it doesn't.Pharmacopoeia
How can I implement this with multiple WHERE conditions, like WHERE categories &> '{news}' AND title ILIKE ALL (ARRAY['%hello%'])? It seems this "hack" doesn't work inside a sub-query; the execution time is as long as without the sub-query on the primary key column. Of course, all of those columns are indexed. This is needed for category listing and full-text search.Caras
@Lanti: please post it as a separate question and don't forget to tag it with postgresql. This is a MySQL-specific answer.Pharmacopoeia
So this does what you would do manually if you only fetched the columns covered by a single key first (here: IDs only) and then used a second query to fetch both the IDs and some other data for these records.Malachy
Hi! I made a comparison (with a plot) here: https://mcmap.net/q/117595/-why-does-mysql-higher-limit-offset-slow-the-query-down :)Lightness
I'm using InnoDB's primary key id field as the ORDER BY condition, but I'm still seeing a significant performance increase (780ms -> 13.7ms). AFAIK, InnoDB uses a clustered index, so why did the performance still increase?Mylander
@ospider: post your setup and some sample data in a separate question and leave the link in a comment herePharmacopoeia

I had the exact same problem myself. Given that you want to collect a large amount of this data and not a specific set of 30, you'll probably be running a loop and incrementing the offset by 30.

So what you can do instead is:

  1. Hold the last id of the previous set of 30 rows (e.g. lastId = 530)
  2. Add the condition WHERE id > lastId LIMIT 0, 30

So you can always have a ZERO offset; see the sketch below. You will be amazed by the performance improvement.
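
A minimal sketch of that pattern, assuming the large table and id primary key from the question, and the lastId = 530 example above:

-- First batch: no lastId yet, so start from the beginning.
SELECT * FROM large ORDER BY id LIMIT 30;

-- Every following batch: seek past the last id you saw (530 in the
-- example above). The offset stays zero, so MySQL never has to
-- count off skipped rows.
SELECT * FROM large WHERE id > 530 ORDER BY id LIMIT 30;

Because id is indexed, the WHERE id > 530 condition lets MySQL seek straight to the right spot in the index instead of counting off the skipped rows one by one.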

Journal answered 5/6, 2013 at 8:44 Comment(11)
Does this work if there are gaps? What if you don't have a single unique key (a composite key for example)?Sobel
It may not be obvious to all that this only works if your result set is sorted by that key, in ascending order (for descending order the same idea works, but change > lastid to < lastid.) It doesn't matter if it's the primary key, or another field (or group of fields.)Uneasy
Well done that man! A very simple solution that has solved my problem :-)Termor
Just a note that limit/offset is often used in paginated results, and holding lastId is simply not possible because the user can jump to any page, not always the next page. In other words, offset often needs to be calculated dynamically based on page and limit, instead of following a continuous pattern.Interdisciplinary
If it is just pagination and sorting by id ONLY, with a simple where clause, you can use this: select id,name from large_table where id>=(SELECT id FROM large_table where name like '%the%' limit 1 offset 2000001) and name like '%the%' limit 50; It is fast.Homy
I talk at more length about "remembering where you left off" in mysql.rjweb.org/doc.php/paginationInger
Although completely unprofessional, yet I have to admit it: Love ya!Erbes
What if there's no lastId?Tarragon
How do you remember your last position when the user visits Page 563 from a bookmark? Seems like there's no alternative to limit/offset in this case, which is sad. I have 6 million rows of sample data and the pagination is a pain if you cannot implement unlimited scrolling.Caras
Man, you are a life saver. I have 5 million rows that needed around 90 minutes to process with offset and limit; when I tried your answer, it only needed 9 minutes. Thank you man. THANK YOU!!Cheboksary
@Caras Let's assume that Page 563 begins at offset 563 * 30 = 16890, since in the OP's example 30 is the page size and assume page numbering starts from 0. Further assume that column id is unique and is indexed. Then execute select id from large order by id limit 16889, 1 to read the id of the last row of Page 562. This should be reasonably efficient since only the index is involved. Now you have the "lastId" to proceed with selecting the next page.Spectator

MySQL cannot go directly to the 10000th record (or to the 80000th byte, as you're suggesting) because it cannot assume that the records are packed/ordered like that (or that the ids have continuous values from 1 to 10000). Although it might be that way in reality, MySQL cannot assume that there are no holes/gaps/deleted ids.

So, as bobs noted, MySQL will have to fetch 10000 rows (or traverse the first 10000 entries of the index on id) before finding the 30 to return.

EDIT: To illustrate my point

Note that although

SELECT * FROM large ORDER BY id LIMIT 10000, 30 

would be slow(er),

SELECT * FROM large WHERE id >  10000 ORDER BY id LIMIT 30 

would be fast(er), and would return the same results provided that there are no missing ids (i.e. gaps).

Countertype answered 21/12, 2010 at 18:0 Comment(5)
This is correct. But since it's limited by "id", why does it take so long when that id is within an index (the primary key)? The optimizer should refer to that index directly and then fetch the rows with matching ids (which came from that index).Erinerina
If you used a WHERE clause on id, it could go right to that mark. However, if you put a limit on it, ordered by id, it's just a relative counter to the beginning, so it has to traverse the whole way.Countertype
Very good article eversql.com/…Recompense
Worked for me @Countertype Thanks.Antivenin
This answer is wrong if there are other conditions in the WHERE clause.Weatherman

I found an interesting example of how to optimize SELECT queries with ORDER BY id LIMIT X,Y. I have 35 million rows, so it took about 2 minutes to find a range of rows.

Here is the trick:

SELECT id, name, address, phone
FROM customers
WHERE id > 990
ORDER BY id LIMIT 1000;

Just adding a WHERE clause with the last id you got increases the performance a lot. For me it went from 2 minutes to 1 second :)

Other interesting tricks here : http://www.iheavy.com/2013/06/19/3-ways-to-optimize-for-paging-in-mysql/

It works with strings too.
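
For the string case, a hedged sketch along the same lines, reusing the customers table above and assuming an index on name (the value 'Miller' is made up for illustration):

-- Seek pagination on a string column: 'Miller' is the last name value
-- seen on the previous page. An index on name is assumed.
SELECT id, name, address, phone
FROM customers
WHERE name > 'Miller'
ORDER BY name
LIMIT 1000;

If name is not unique, you would also need a tie-breaker (such as id) so that rows sharing the same name are neither skipped nor repeated across pages.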

Areaway answered 1/10, 2015 at 17:4 Comment(2)
This works only for tables where no data is deleted.Sidedress
@Sidedress That's only true if you are working under the assumption that your query can do lookups at random pages, which I don't believe this poster is assuming. While I don't like this method for most real world cases, this will work with gaps as long as you are always basing it off the last id obtained.Blanketyblank

The time-consuming part of the two queries is retrieving the rows from the table. Logically speaking, in the LIMIT 0, 30 version, only 30 rows need to be retrieved. In the LIMIT 10000, 30 version, 10000 rows are evaluated and 30 rows are returned. Some optimization can be done in the data-reading process, but consider the following:

What if you had a WHERE clause in the queries? The engine must return all rows that qualify, and then sort the data, and finally get the 30 rows.

Also consider the case where rows are not processed in the ORDER BY sequence. All qualifying rows must be sorted to determine which rows to return.
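
To illustrate that point, a sketch only; the status column is made up and not part of the question's table:

-- With a filter on a column the id index does not cover, MySQL may have to
-- collect the qualifying rows and order them by id before it can skip the
-- first 10000 matches and return the next 30.
SELECT *
FROM   large
WHERE  status = 'active'   -- hypothetical, non-indexed column
ORDER BY id
LIMIT  10000, 30;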

Disremember answered 19/12, 2010 at 3:28 Comment(4)
Just wondering why it takes time to fetch those 10000 rows. The index on that field (id, which is the primary key) should make retrieving those rows as fast as seeking the PK index to record no. 10000, which in turn should be as fast as seeking the file to that offset multiplied by the index record length (i.e. seeking to 10000*8 = byte no. 80000, given that 8 is the index record length).Erinerina
@Erinerina - The only way to count past the 10000 rows is to step over them one by one. This may involve only an index, but index rows still take time to step through. There is no MyISAM or InnoDB structure that can correctly (in all cases) "seek" to record 10000. The 10000*8 suggestion assumes (1) MyISAM, (2) a FIXED length record, and (3) never any deletes from the table. Anyway, MyISAM indexes are BTrees, so it would not work.Inger
As this answer stated, I believe, the really slow part is the row lookup, not traversing the indexes (which of course will add up as well, but nowhere near as much as the row lookups on disk). Based on the workaround queries provided for this issue, I believe the row lookups tend to happen if you are selecting columns outside of the index -- even if they are not part of the order by or where clause. I haven't found a reason why this is necessary, but it appears to be why some of the workarounds help.Blanketyblank
I believe the delay is caused by counting the entries in the index tree, as opposed to finding the starting index (for which the SQL index tree is optimized and it gets pointed close to the target row without going through particular rows). The next part, reading the number of rows, is equally "slow" when using WHERE ID > x. But the latter is useless in most real-world applications anyway.Virgilio

For those who are interested in a comparison and figures :)

Experiment 1: The dataset contains about 100 million rows. Each row contains several BIGINT and TINYINT columns, as well as two TEXT fields (deliberately) containing about 1k chars each.

  • Blue := SELECT * FROM post ORDER BY id LIMIT {offset}, 5
  • Orange := @Quassnoi's method. SELECT t.* FROM (SELECT id FROM post ORDER BY id LIMIT {offset}, 5) AS q JOIN post t ON t.id = q.id
  • Of course, the third method, ... WHERE id>xxx LIMIT 0,5, does not appear here since it should be constant time.

Experiment 2: The same thing, except that each row has only 3 BIGINTs.

  • green := the blue before
  • red := the orange before

[Plot: query time vs. offset for the query variants above, in both experiments]
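
For anyone who wants to reproduce something similar, a hypothetical table definition along the lines of Experiment 1 (the non-id column names are invented; the original schema was not posted):

-- Hypothetical schema for Experiment 1: a few BIGINT/TINYINT columns plus
-- two TEXT columns holding roughly 1k characters each.
CREATE TABLE post (
    id        BIGINT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    author_id BIGINT NOT NULL,
    flags     TINYINT NOT NULL,
    title     TEXT,
    body      TEXT
);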

Lightness answered 1/3, 2020 at 7:23 Comment(2)
Is your id a primary key or a non-primary key field?Mylander
@Mylander primary imhoLightness
