When to use the disruptor pattern and when local storage with work stealing?

Is the following correct?

  • The disruptor pattern has better parallel performance and scalability if each entry has to be processed in multiple ways (I/O operations or annotations), since that work can be parallelized across multiple consumers without contention.
  • Conversely, work stealing (i.e. storing entries locally and stealing entries from other threads) has better parallel performance and scalability if each entry has to be processed in a single way only, since disjointly distributing the entries onto multiple threads in the disruptor pattern causes contention.

(And is the disruptor pattern still so much faster than other lockless multi-producer multi-consumer queues (e.g. from Boost) when multiple producers, and therefore CAS operations, are involved?)


My situation in detail:

Processing an entry can produce several new entries, which must eventually be processed, too. Performance has the highest priority; processing entries in FIFO order is the second priority.

In the current implementation, each thread uses a local FIFO to which it adds its new entries. Idle threads steal work from other threads' local FIFOs. Dependencies between the threads' processing are resolved using a lockless, mechanically sympathetic hash table (CAS on write, with bucket granularity). This results in fairly low contention, but FIFO order is sometimes broken.
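The setup described above might be sketched roughly as follows. All names here are hypothetical, and `ConcurrentLinkedDeque` stands in for a proper work-stealing deque (real implementations, such as Java's `ForkJoinPool`, use a Chase-Lev-style deque instead); the lockless hash table is omitted.

```java
import java.util.concurrent.ConcurrentLinkedDeque;
import java.util.concurrent.atomic.AtomicLong;

// Sketch of local FIFOs with work stealing: each worker polls its own
// deque from the head (local FIFO order) and, when idle, steals from the
// tail of another worker's deque. Processing an entry may produce a new
// entry, which is added to the local deque.
public class StealingSketch {
    static int run(int workers, int seed) throws InterruptedException {
        @SuppressWarnings("unchecked")
        ConcurrentLinkedDeque<Integer>[] local = new ConcurrentLinkedDeque[workers];
        for (int i = 0; i < workers; i++) local[i] = new ConcurrentLinkedDeque<>();
        AtomicLong pending = new AtomicLong(workers);   // entries not yet fully processed
        AtomicLong processed = new AtomicLong();
        for (int i = 0; i < workers; i++) local[i].addLast(seed); // seed each worker

        Thread[] ts = new Thread[workers];
        for (int w = 0; w < workers; w++) {
            final int me = w;
            ts[w] = new Thread(() -> {
                while (pending.get() > 0) {
                    Integer task = local[me].pollFirst();        // own queue: FIFO order
                    if (task == null) {                          // idle: try to steal
                        for (int o = 0; o < workers && task == null; o++)
                            if (o != me) task = local[o].pollLast(); // steal from the tail
                    }
                    if (task == null) { Thread.onSpinWait(); continue; }
                    processed.incrementAndGet();
                    if (task > 0) {                              // processing may create a
                        pending.incrementAndGet();               // new entry, which is
                        local[me].addLast(task - 1);             // added locally
                    }
                    pending.decrementAndGet();
                }
            });
            ts[w].start();
        }
        for (Thread t : ts) t.join();
        return (int) processed.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(run(4, 5)); // each seed of 5 yields a chain of 6 entries: 4 * 6 = 24
    }
}
```

Note that stolen entries are taken from the tail, i.e. out of the owner's FIFO order — which is exactly where the global FIFO guarantee breaks, as described above.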

Using the disruptor pattern would guarantee FIFO order. But wouldn't distributing the entries across the threads cause much higher contention (e.g. CAS operations on a shared read cursor) than local FIFOs with work stealing, given that each thread's throughput is about the same?


References I've found

The performance tests in the standard technical paper on the disruptor (Chapters 5 and 6) do not cover disjoint work distribution.

https://groups.google.com/forum/?fromgroups=#!topic/lmax-disruptor/tt3wQthBYd0 is the only reference I've found on combining the disruptor with work stealing. It states that a queue per thread is dramatically slower if there is any shared state, but does not go into detail or explain why. I doubt that this statement applies to my situation, with:

  • shared state being resolved with a lockless hash table;
  • having to disjointly distribute entries amongst consumers;
  • except for work stealing, each thread reads and writes only in its local queue.
Cupid answered 11/3, 2013 at 14:28 Comment(0)

Update - Bottom line up front: for maximum performance, you need to write both implementations idiomatically - one using the Disruptor, one using work stealing - and then benchmark them.

To me, the distinction is primarily between a message focus and a task focus, and therefore in the way you want to think about the problem. If your problem is task-focused, the Disruptor is a good fit; if it is message-focused, you may be better suited to another technique such as work stealing.

  • Use work stealing when your implementation is message-focused. Each thread can pick up a message and run it through to completion. For example, in an HTTP server, each inbound HTTP request is allocated a thread. That thread is focused on handling the request start to finish: logging the request, checking security controls, doing the vhost lookup, fetching the file, sending the response, and closing the connection.

  • Use the Disruptor when your implementation is task-focused. Each thread works on a particular stage of the processing. In the same HTTP-server example, the processing would be split into stages: one thread does the logging, one handles security controls, one does the vhost lookup, and so on; each thread focuses on its task and passes the request to the next thread in the pipeline. Stages may be parallelized, but the overall structure is a thread focused on a specific task, handing the message along between threads.
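The task-focused pipeline can be illustrated with a minimal sketch of the Disruptor idea: a preallocated ring of event slots, a publishing cursor per stage, and consumers that spin-wait on the cursor of the stage before them. This is not the LMAX API - all names are invented - and the ring is sized to hold every event, so wrap-around and backpressure handling are omitted for brevity.

```java
import java.util.concurrent.atomic.AtomicLong;

// Single producer, two consumer stages over a preallocated ring.
// Each stage advances its own cursor; the next stage gates on it.
public class PipelineSketch {
    static long run(int n) throws InterruptedException {
        long[] ring = new long[n];                    // preallocated event slots
        AtomicLong published = new AtomicLong(-1);    // producer cursor
        AtomicLong stage1 = new AtomicLong(-1);       // stage-1 cursor
        long[] total = new long[1];

        Thread producer = new Thread(() -> {
            for (int i = 0; i < n; i++) {
                ring[i] = i;            // write the event into its slot...
                published.lazySet(i);   // ...then publish its sequence (ordered store)
            }
        });
        Thread stageOne = new Thread(() -> {          // stage 1: transform in place
            for (int i = 0; i < n; i++) {
                while (published.get() < i) Thread.onSpinWait(); // wait for producer
                ring[i] *= 2;
                stage1.lazySet(i);
            }
        });
        Thread stageTwo = new Thread(() -> {          // stage 2: aggregate, gated on stage 1
            for (int i = 0; i < n; i++) {
                while (stage1.get() < i) Thread.onSpinWait();
                total[0] += ring[i];
            }
        });
        producer.start(); stageOne.start(); stageTwo.start();
        producer.join(); stageOne.join(); stageTwo.join();
        return total[0];
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(run(1024)); // sum of 2*i for i in [0, 1024) = 1023 * 1024
    }
}
```

Note that the consumers never contend on a shared cursor: each stage only reads the previous stage's cursor and writes its own, which is the single-writer property that makes the pattern cheap.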

Of course, you can change your implementation to suit each approach better.

In your case, I would structure the problem differently if you wanted to use the Disruptor. Generally you would eliminate shared state by having a single thread own the state and routing all work through that thread - look up SEDA for lots of diagrams of this style. This can have many benefits, but again, it really comes down to your implementation.
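A minimal sketch of that single-writer idea, under stated assumptions: a plain `BlockingQueue` stands in for a Disruptor ring buffer, the names are hypothetical, and a sentinel entry is used to signal shutdown. Because only the owner thread ever touches the hash table, the table itself needs no synchronization at all.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Single-writer sketch: one thread owns the (otherwise shared) hash
// table; all updates arrive as messages on its inbox queue.
public class SingleWriterSketch {
    static final Map.Entry<String, Integer> POISON = Map.entry("__stop__", -1); // shutdown sentinel

    // Runs `producers` threads that each send `perProducer` updates
    // to the single owner thread; returns the sum of all counts.
    static int run(int producers, int perProducer) throws InterruptedException {
        BlockingQueue<Map.Entry<String, Integer>> inbox = new ArrayBlockingQueue<>(1024);
        Map<String, Integer> table = new HashMap<>(); // unsynchronized: only the owner touches it

        Thread owner = new Thread(() -> {
            int stops = 0;
            try {
                while (stops < producers) {
                    Map.Entry<String, Integer> e = inbox.take();
                    if (e == POISON) { stops++; continue; }
                    table.merge(e.getKey(), e.getValue(), Integer::sum);
                }
            } catch (InterruptedException ignored) {}
        });
        owner.start();

        Thread[] ps = new Thread[producers];
        for (int p = 0; p < producers; p++) {
            ps[p] = new Thread(() -> {
                try {
                    for (int k = 0; k < perProducer; k++)
                        inbox.put(Map.entry("key-" + (k % 8), 1)); // update message
                    inbox.put(POISON);                             // done producing
                } catch (InterruptedException ignored) {}
            });
            ps[p].start();
        }
        for (Thread t : ps) t.join();
        owner.join();
        return table.values().stream().mapToInt(Integer::intValue).sum();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(run(4, 1000)); // 4 producers * 1000 increments = 4000
    }
}
```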

Some more verbosity:

  • Disruptor - very useful when strict ordering of stages is required, with additional benefits when all tasks are of a consistent length, e.g. no blocking on an external system and a very similar amount of processing per task. In this scenario, you can assume that all threads will work evenly through the system, and therefore arrange N threads to process every N messages. I like to think of the Disruptor as an efficient way to implement SEDA-like systems where threads process stages. You can of course have an application with a single stage and multiple parallel units performing the same work at that stage, but this is not really the point in my view. This structure totally avoids the cost of shared state.
  • Work stealing - use this when tasks are of varied duration and the order of message processing is not important, as it allows threads that are free and have exhausted their own queues to make progress on another thread's queue. For example, if you have 10 threads and 1 is blocked on I/O, the remainder will still complete their processing.
Silkweed answered 12/3, 2013 at 7:45 Comment(6)
Thanks for your answer. Making a distinction based on a paradigm sounds quite good; unfortunately, I do not know what the difference between message focus and task focus is. - Cupid
I did design a solution to my problem with disruptors: one consumer annotates entries in (multiple) disruptor(s) with their hash values, and another consumer puts them into the local hash table and annotates each entry with a flag indicating whether it was already in the table. This solution might be faster, but it does not scale linearly in the number of threads, as the lockless hash table does. A further disruptor, implementing the FIFO for all the newly created items, would have multiple producers, so I'm not sure whether a disruptor improves that part of the implementation either. - Cupid
Btw, my tasks are definitely of varied duration, but the order of message processing is somewhat important... - Cupid
FYI, CAS on write will definitely not scale linearly with the number of threads. - Silkweed
Finding it a little hard to follow your explanation of hash tables and locks and how this relates to your problem domain. I added the BLUF comment because I think it's important - we can theorize "till the cows come home", but nothing beats a benchmark on your environment for your problem. Avoiding multiple producers (i.e. writers to a single queue) is preferable; it is better to have each producer write to its own queue and one reader read from multiple queues. - Silkweed
Re message focus vs task focus, an example for an HTTP web server: Message focus - each inbound HTTP request is allocated a thread. That thread is focused on handling the request start to finish: logging the request, checking security controls, doing the vhost lookup, fetching the file, sending the response, and closing the connection. Alternatively, for a task focus, the processing would be split into stages, so you would have a thread that does logging, a thread for security controls, a thread for the vhost lookup, etc.; each thread focused on its task and passing the request to the next thread in the pipeline. - Silkweed
