Paging of frequently changing data

Asked 2/8, 2014 at 21:52 Answered 11/10, 2018 at 19:26

I'm developing a web application which display a list of let's say "threads". The list can be sorted by the amount of likes a thread has. There can be thousands of threads in one list.

The application needs to work in a scenario where the likes of a thread can change more than 10x in a second. The application furthermore is distributed over multiple servers.

I can't figure out an efficient way to enable paging for this sort of list. And I can't transmit the whole sorted list by likes to a user at once.

As soon as an user would go to page 2 of this list, it likely changed and may contain threads already listed from page one

Solutions which don't work:

Storing the seen threads on the client side (could be too many on mobile)
Storing the seen threads on the Server side (too many users and threads)
Snapshot the list in temp database table (it's too frequent changing data and it need to be actual)

(If it matters I'm using MongoDB+c#)

How would you solve this kind of problem?

Teshatesla answered 2/8, 2014 at 21:52 Comment(4)

What is the result you're hoping to achieve? Are you trying to cache a snapshot of the results so users don't see changes as quickly as they occur? It seems like you may want to use a different popularity metric for sorting which changes less frequently than "likes" (or have both a real-time "like" score and a saved "hotness" metric for ordering). For example, an interesting approach is using a hotness value with decay (eg Drupal radioactivity module). If you calculate this periodically the ordering can still be relevant but not overreactive. – Gerontocracy 4/8, 2014 at 6:7

Have you found a solution yet? – Unhouse 23/11, 2016 at 14:47

Sadly I never figured a proper way out – Teshatesla 8/12, 2016 at 17:48

I believe it's hardly possible to name it pagination. Pagination - dividing into discrete pages. What is a page in this example? Try to define it. I guess what you want to achieve is not a page but something like topN popular topics (they could be sequential: top10 then top20, etc.). A user can expect that top10 threads could change but she will hardly expect that page 1 will be updated with new records too often. – Gaily 11/1, 2021 at 20:11

Interesting question. Unless I'm misunderstanding you, and by all means let me know if I am, it sounds like the best solution would be to implement a system that, instead of page numbers, uses timestamps. It would be similar to what many of the main APIs already do. I know Tumblr even does this on the dashboard, where this is, of course, not an unreasonable case: there can be tons of posts added in a small amount of time at peak hours, depending on how many people the user follows.

So basically, your "next page" button could just link to /threads/threadindex/1407051000, which could translate to "all the threads that were created before 2014-08-02 17:30. That makes your query super easy to implement. Then, when you pull down all the next elements, you just look for anything that occurred before the last element on the page.

The downfall of this, of course, is that it's hard to know how many new elements have been added since the user started browsing, but you could always log the start time and know anything since then would be new. And it's also difficult for users to type in their own pages, but that's not a problem in most applications. You also need to store the timestamps for every record in your thread, but that's probably already being done, and if it's not then it's certainly not hard to implement. You'll be paying the cost of something like eight bytes extra per record, but that's better than having to store anything about "seen" posts.

It's also nice because, and again this might not apply to you, but a user could bookmark a page in the list, and it would last unchanged forever since it's not relative to anything else.

Seedtime answered 3/8, 2014 at 0:32 Comment(2)

Sorting by date and implementing paging is not the problem, the solution would be a kind of timestamp paging. (I edited my main question to clarify). My Problem is implement paging when you have no "constant" sort key, as the number of likes which are frequently changing. – Teshatesla 3/8, 2014 at 8:53

@Teshatesla Ah, I see. So date isn't always an option. That does complicate things a bit more. Well, do you think it could work if you used a hybrid of page numbers of start timestamp? So you'd grab the timestamp when the person first loads the page, then put that into your query for subsequent calls? Then you're not affected by the fact that they're changing so much. And presumably you have records in a UserLikes table to which you could add timestamps, then you could build your sorted like count off of that, filtered for only the ones that had occurred before the user started browsing? – Seedtime 3/8, 2014 at 9:15

This is typically handled using an OLAP cube. The idea here is that you add a natural time dimension. They may be too heavy for this application, but here's a summary in case someone else needs it.

OLAP cubes start with the fundamental concept of time. You have to know what time you care about to be able to make sense of the data.

You start off with a "Time" table:

Time {
  timestamp     long      (PK)
  created       datetime
  last_queried  datetime
}

This basically tracks snapshots of your data. I've included a last_queried field. This should be updated with the current time any time a user asks for data based on this specific timestamp.

Now we can start talking about "Threads":

Threads {
  id             long      (PK)
  identifier     long
  last_modified  datetime
  title          string
  body           string
  score          int
}

The id field is an auto-incrementing key; this is never exposed. identifier is the "unique" id for your thread. I say "unique" because there's no unique-ness constraint, and as far as the database is concerned it is not unique. Everything else in there is pretty standard... except... when you do writes you do not update this entry. In OLAP cubes you almost never modify data. Updates and inserts are explained at the end.

Now, how do we query this? You can't just directly query Threads. You need to include a star table:

ThreadStar {
  timestamp          long  (FK -> Time.timestamp)
  thread_id          long  (FK -> Threads.id)
  thread_identifier  long  (matches Threads[thread_id].identifier)
    (timestamp, thread_identifier should be unique)
}

This table gives you a mapping from what time it is to what the state of all of the threads are. Given a specific timestamp you can get the state of a Thread by doing:

SELECT Thread.*
FROM   Thread
JOIN   ThreadStar ON Thread.id = ThreadStar.thread_id
WHERE  ThreadStar.timestamp = {timestamp}
   AND Thread.identifier = {thread_identifier}

That's not too bad. How do we get a stream of threads? First we need to know what time it is. Basically you want to get the largest timestamp from Time and update Time.last_queried to the current time. You can throw a cache up in front of that that only updates every few seconds, or whatever you want. Once you have that you can get all threads:

SELECT   Thread.*
FROM     Thread
JOIN     ThreadStar ON Thread.id = ThreadStar.thread_id
WHERE    ThreadStar.timestamp = {timestamp}
ORDER BY Thread.score DESC

Nice. We've got a list of threads and the ordering is stable as the actual scores change. You can page through this at your leisure... kind of. Eventually data will be cleaned up and you'll lose your snapshot.

So this is great and all, but now you need to create or update a Thread. Creation and modification are almost identical. Both are handled with an INSERT, the only difference is whether you use an existing identifier or create a new one.

So now you've inserted a new Thread. You need to update ThreadStar. This is the crazy expensive part. Basically you make a copy of all of the ThreadStar entries with the most recent timestamp, except you update the thread_id for the Thread you just modified. That's a crazy amount of duplication. Fortunately it's pretty much only foreign keys, but still.

You also don't do DELETEs either; mark a row as deleted or just exclude it when you update ThreadStar.

Now you're humming along, but you've got crazy amounts of data growing. You'll probably want to clean it out, unless you've got a lot of storage budge, but even then things will start slowing down (aside: this will actually perform shockingly well, even with crazy amounts of data).

Cleanup is pretty straightforward. It's just a matter of some cascading deletes and scrubbing for orphaned data. Delete entries from Time whenever you want (e.g. it's not the latest entry and last_queried is null or older than whatever cutoff). Cascade those deletes to ThreadStar. Then find any Threads with an id that isn't in ThreadStar and scrub those.

This general mechanism also works if you have more nested data, but your queries get harder.

Final note: you'll find that your inserts get really slow because of the sheer amounts of data. Most places build this with appropriate constraints in development and testing environments, but then disable constraints in production!

Yeah. Make sure your tests are solid.

But at least you aren't sensitive to re-ordered data mid-paging.

Popular answered 11/10, 2018 at 19:26 Comment(1)

OLAP is great for data warehouses but for an application server too much imo – Piny 3/8, 2020 at 8:24

For constantly changing data such as likes I would use a two stage appraoch. For the frequently changing data I would use an in memory DB to keep up with the change rates and flush this peridically to the "real" db. Once you have that the query for constantly chaning data is easy.

Query the db.
Query the in memory db.
Merge the frequently changed data from the in memory db with the "slow" db data .
Remember which results you already have displayed so pressing the next button will not display an already dispalyed value twice because on different pages because its rank has changed.

If many people look at the same data it might help to cache the results of 3 in itself to reduce the load on the real db even further.

Your current architecture has no caching layers (the bigger the site the more things are cached). You will not get away with a simple DB and efficient queries against the db if things become too massive.

Lefler answered 3/8, 2014 at 9:19 Comment(4)

My problem is kinda only your 4th point. Assuming I have 1000 unique user list views a second and a user can possibly page a result set with more than 10000 pages (x 50 threads). Therefore I'd need to store at least 50000 "already seen threads" a second. Not knowing how long the user stays on the page I need to store the last seen threads for a long period. – Teshatesla 3/8, 2014 at 12:38

Part of requirements engineering is to deal with realistic numbers. Are you with your current design even able to store 1K changes/s into your db? How much page views/s are we talking about? One physical machine with IIS can serve ca. 2000 pages/s if it is heavily tuned like Stackoverflow (blog.cellfish.se/2014/07/…). – Lefler 3/8, 2014 at 12:44

I have a MongoDB cluster setup which is able to handle around 5000updates/s and 4 frontend servers which can handle ~3000-4000request/s. The thing I'm struggeling with is to avoid duplicates threads in the client view when sorted by likes which are frequently updated. – Teshatesla 3/8, 2014 at 13:4

What is your goal? If you want to display up to date data then it can happen that one new thread shown on page 4 moves to page 1 because it has become hugely popular in between. Or do you want to show a snapshot in time when moving between pages? – Lefler 3/8, 2014 at 17:43

I would cache all 'thread' results on the server when the user first time hits the database. Then return the first page of data to the user and for each subsequent next page calls I'd return cached results.

To minimize memory usage you can cache only records ids and fetch whole data when user requests it.

Cache can be evicted each time user exits current page. If it isn't a ton of data I would stick to this solution because user won't get annoyed of data constantly changing.

Hindmost answered 4/9, 2018 at 8:48 Comment(0)

Recommended topics

Hot tags