How to fetch a continuous list with PostgreSQL on the web

I am making an HTTP API that fetches many rows from PostgreSQL with pagination. Ordinarily I would implement such pagination with a naive OFFSET/LIMIT clause. However, there are some special requirements in this case:

  • There are so many rows that I believe users will never reach the end (imagine a Twitter timeline).
  • Pages do not have to be randomly accessible, only sequentially.
  • The API should return a URL containing a cursor token that points to the next page of the continuous result set.
  • Cursor tokens do not have to exist permanently, only for some period of time.
  • The ordering fluctuates frequently (like Reddit rankings), but cursors that have already been issued should keep a consistent ordering.

How can I achieve this? I am ready to change my whole database schema for it!
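
For reference, the naive approach I mention above looks roughly like this (the posts table, score column, 50-row page size, and :page placeholder are just illustrative):

    -- Naive pagination: page :page of 50 rows, ordered by a fluctuating score.
    -- The problem: the ordering can change between two page requests, so rows shift around.
    SELECT *
    FROM posts
    ORDER BY score DESC
    OFFSET (:page - 1) * 50
    LIMIT 50;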

Allottee answered 12/10, 2011 at 21:37 Comment(2)
Just to be sure what you're asking: are you saying lots of rows, or really wide rows, or both?Rohrer
Does this have to be done with cursors only? Because there are other methods to do it that do not involve managing cursors at all.Applecart

Assuming it's only the ordering of the results that fluctuates and not the data in the rows, Fredrik's answer makes sense. However, I'd suggest the following additions:

  • Store the id list in a PostgreSQL table using the array type rather than in memory. Doing it in memory, unless you carefully use something like Redis with auto-expiry and memory limits, is setting yourself up for a memory-consumption DoS attack. I imagine it would look something like this (a paging query against this table is sketched after the list):

    create table foo_paging_cursor (
      cursor_token uuid PRIMARY KEY, -- a uuid is probably best; a timestamp also works (see below)
      result_ids integer[],          -- or text[] if you have non-integer ids
      expiry_time TIMESTAMP
    );
    
  • You need to decide whether the cursor_token and result_ids can be shared between users to reduce your storage needs and the time needed to run the initial query per user. If they can be shared, choose a cache window, say 1 or 5 minutes, and then upon a new request create the cursor_token for that time period and check whether the result ids have already been calculated for that token. If not, add a new row for that token. You should probably add a lock around the check/insert code to handle concurrent requests for a new token (see the locking sketch after this list).

  • Have a scheduled background job that purges old tokens/results, and make sure your client code can handle any errors related to expired/invalid tokens (a one-line purge is sketched below).
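
To make the first point concrete, a page can be served straight from the stored array using PostgreSQL array slicing. This is only a sketch: the foo table with an integer id, the 50-row page size, the $1 token parameter, and the use of unnest ... WITH ORDINALITY (PostgreSQL 9.4+) are my assumptions, not part of the question.

    -- Serve page 2 (positions 51..100 of the cached ordering) for the token in $1.
    SELECT f.*
    FROM foo_paging_cursor c
    CROSS JOIN LATERAL unnest(c.result_ids[51:100]) WITH ORDINALITY AS t(id, ord)
    JOIN foo f ON f.id = t.id
    WHERE c.cursor_token = $1
      AND c.expiry_time > now()     -- reject expired tokens
    ORDER BY t.ord;                 -- keep the ordering that was frozen into the array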
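
For the shared-token variant in the second point, one way to serialize the check/insert is a transaction-level advisory lock. Again just a sketch: the 5-minute window, the 10000-id cap, the foo table ordered by score, and gen_random_uuid() (pgcrypto, or built in from PostgreSQL 13) are all assumptions on my part.

    BEGIN;

    -- Serialize concurrent "first page" requests so only one of them runs the big query.
    SELECT pg_advisory_xact_lock(42);   -- any constant key reserved for this purpose

    -- Only create a new shared cursor if no unexpired one exists.
    INSERT INTO foo_paging_cursor (cursor_token, result_ids, expiry_time)
    SELECT gen_random_uuid(),
           ARRAY(SELECT id FROM foo ORDER BY score DESC LIMIT 10000),  -- assumed cap
           now() + interval '5 minutes'
    WHERE NOT EXISTS (SELECT 1 FROM foo_paging_cursor WHERE expiry_time > now());

    COMMIT;

The request handler would then select the newest unexpired cursor_token and put it in the page URL it returns.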
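
And the purge job from the last point can be a single statement run from cron or whatever scheduler you already have:

    -- Run periodically to drop expired cursors.
    DELETE FROM foo_paging_cursor WHERE expiry_time <= now();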

Don't even consider using real db cursors for this.

Keeping the result ids in Redis lists is another way to handle this (see the LRANGE command), but be careful with expiry and memory usage if you go down that path. Your Redis key would be the cursor_token and the ids would be the members of the list.

Dilettante answered 19/10, 2011 at 21:44 Comment(3)
Better yet, make that a temporary table. Faster, less disc load. No need to worry about a DoS attack; temp tables can only use limited RAM (read about temp_buffers in the manual) and are written to disc when RAM is not enough.Kevin
Temporary tables are session-local and are deleted upon session termination. Thus, that wouldn't work with db connection pooling, or where the http api endpoint is distributed across multiple nodes using different connections. It would also lead to problems when an appserver is restarted and has to reconnect to the db. That said, you get the same benefits by putting the table into a memory-backed (via tmpfs) tablespace. See magazine.redhat.com/2007/12/12/…Dilettante
Thanks for your advice. I decided to use memcached and store comma-separated ids in keys (which are cursor tokens) with an expiration time. Thank you!Allottee

I know absolutely nothing about PostgreSQL, but I'm a pretty decent SQL Server developer, so I'd like to take a shot at this anyway :)

How many rows/pages do you expect a user to browse through per session, at most? For instance, if you expect a user to page through a maximum of 10 pages per session [each page containing 50 rows], you could take that max and set up the webservice so that when the user requests the first page, you cache 10*50 rows (or just the ids of the rows, depending on how much memory you have and how many simultaneous users you get).

This would certainly help speed up your webservice, in more ways than one. And it's quite easy to implement, too. So:

  • When a user requests data for page #1: run the query (complete with order by, joins, etc.), and store all the ids in an array (but a maximum of 500 ids). Return the data rows that correspond to the ids at positions 0-49 in the array.
  • When the user requests pages #2-10: return the data rows that correspond to the ids at positions (page-1)*50 through page*50-1 in the array (see the sketch below).
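
In SQL it could look roughly like this; treat it as a sketch, since the posts table, score column, and the $1 array parameter are names I made up for illustration:

    -- Step 1: on the first request, run the full query once and cache up to
    -- 500 ids (10 pages * 50 rows) in the webservice, keyed by the user's session.
    SELECT id
    FROM posts
    ORDER BY score DESC
    LIMIT 500;

    -- Step 2: for page N, bind the 50 cached ids at positions (N-1)*50 .. N*50-1
    -- as the $1 array and fetch just those rows; re-order them in application
    -- code to match the cached order.
    SELECT *
    FROM posts
    WHERE id = ANY ($1);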

You could also bump up the numbers; an array of 500 ints only occupies about 2 KB of memory, but it also depends on how fast you want your initial query/response to be.

I've used a similar technique on a live website, and when the user continued past page 10, I just switched back to plain queries. I guess another solution would be to keep expanding/filling the array (running the query again, but excluding already-included ids).

Anyway, hope this helps!

Arabeila answered 15/10, 2011 at 4:47 Comment(0)
