Is multiple-producer, single-consumer possible in a lock-free setting?

I have a bunch of threads that are doing lots of communication with each other. I would prefer this to be lock-free.

For each thread, I want to have a mailbox where other threads can send it messages (but only the owner can remove messages). This is a multiple-producer single-consumer situation. Is it possible for me to do this in a lock-free / high-performance manner? (This is in the inner loop of a gigantic simulation.)

Grangerize answered 28/1, 2010 at 1:53 Comment(0)

Sure, if you have an atomic CompareAndSwap instruction:

/* CompareAndSwap(addr, newValue, comparand) atomically stores newValue
   into *addr if *addr equals comparand, and returns the previous value.
   A return of false therefore means the slot was unowned and we have just
   claimed it.  The loop spins until a slot frees up. */
for (i = 0; ; i = (i + 1) % MAILBOX_SIZE)
{
    if ((mailbox[i].owned == false) &&
        (CompareAndSwap(&mailbox[i].owned, true, false) == false))
        break;
}

mailbox[i].message = message;   /* write the payload first...                 */
mailbox[i].ready = true;        /* ...then mark the slot ready, in that order */

After reading a message, the consuming thread just sets mailbox[i].ready = false; mailbox[i].owned = false; (in that order).
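
For reference, here is a minimal sketch of the same slot-claiming idea using C++11 std::atomic with explicit acquire/release ordering (which is what the memory-barrier discussion in the comments below is about). The names Slot, post, try_receive, MAILBOX_SIZE, and the int payload are illustrative assumptions, not part of the original answer:

#include <atomic>
#include <cstddef>

struct Slot {
    std::atomic<bool> owned{false};   // claimed by a producer
    std::atomic<bool> ready{false};   // set once the message is fully written
    int               message{0};     // payload (placeholder type)
};

static const std::size_t MAILBOX_SIZE = 64;   // illustrative size
static Slot mailbox[MAILBOX_SIZE];

// Producer: claim a free slot with a CAS, write the payload, then publish it.
// Like the loop above, this spins until some slot frees up.
void post(int message) {
    for (std::size_t i = 0; ; i = (i + 1) % MAILBOX_SIZE) {
        bool expected = false;
        if (!mailbox[i].owned.load(std::memory_order_relaxed) &&
            mailbox[i].owned.compare_exchange_strong(
                expected, true, std::memory_order_acquire)) {
            mailbox[i].message = message;                             // plain write: the slot is ours
            mailbox[i].ready.store(true, std::memory_order_release);  // publish to the consumer
            return;
        }
    }
}

// Consumer: if slot i is published, read it, then release it (ready before owned).
// The single consumer would typically scan all of its slots this way.
bool try_receive(std::size_t i, int* out) {
    if (!mailbox[i].ready.load(std::memory_order_acquire))
        return false;
    *out = mailbox[i].message;
    mailbox[i].ready.store(false, std::memory_order_relaxed);
    mailbox[i].owned.store(false, std::memory_order_release);
    return true;
}

The acquire on the successful CAS pairs with the consumer's release store of owned, and the release store of ready pairs with the consumer's acquire load, so no full barriers are needed.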

Nobby answered 28/1, 2010 at 2:12 Comment(3)
You would need acquire/release memory barriers on the ready = true/false and owned = false stores, as well as within the CompareAndSwap, but usually that is just part of the function.Tennies
That depends entirely upon your architecture's memory ordering rules. If writes are not reordered with respect to other writes issuing from the same processor (which is common), then the memory barrier with CompareAndSwap itself is sufficient for correctness.Nobby
An RMW atomic operation isn't necessarily going to be faster than something that uses mutexes. If you want lock-free for speed, then at the very least you'll need an algorithm that doesn't use RMW or sequentially consistent atomic operations (i.e., only release and acquire).Waggle

A lock-free multiple-producer single-consumer (MPSC) queue is one of the easiest lock-free algorithms to implement.

The most basic implementation requires a simple lock-free singly-linked list (SList) with only push() and flush(). The functions are available in the Windows API as InterlockedFlushSList() and InterlockedPushEntrySList(), but they are easy to roll on your own.

Multiple producers push() items onto the SList using a CAS (interlocked compare-and-swap).

The single consumer does a flush(), which swaps the head of the SList with NULL using an XCHG (interlocked exchange). The consumer then holds the list of items in reverse order.

To process the items in order, simply reverse the list returned from flush() before processing it (see the sketch after the notes below). If you do not care about order, you can simply walk the list immediately.

Two notes if you roll your own functions:

1) If you are on a system with weak memory ordering (e.g., PowerPC), you need to put a release memory barrier at the beginning of the push() function and an acquire memory barrier at the end of the flush() function.

2) You can simplify and optimize the functions considerably, because the ABA issue with SLists occurs in the pop() function; you cannot have ABA issues with an SList if you use only push() and flush(). This means the head can be a single pointer, very similar to the non-lock-free code, with no need for an ABA-prevention sequence counter.
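
To make the two notes concrete, here is a minimal C++11 sketch of such a push()/flush()-only SList using std::atomic (the Node, MpscSList, and reverse names and the int payload are illustrative, not part of any real API):

#include <atomic>

struct Node {
    Node* next;
    int   value;   // payload (placeholder)
};

// Intrusive lock-free SList supporting only push() and flush(),
// in the spirit of InterlockedPushEntrySList / InterlockedFlushSList.
class MpscSList {
    std::atomic<Node*> head{nullptr};
public:
    // Producers: CAS loop. The release ordering on success plays the role of
    // the "release barrier at the beginning of push()" from note 1.
    void push(Node* n) {
        Node* old = head.load(std::memory_order_relaxed);
        do {
            n->next = old;
        } while (!head.compare_exchange_weak(old, n,
                                             std::memory_order_release,
                                             std::memory_order_relaxed));
    }

    // Single consumer: XCHG the head with NULL. The acquire ordering plays the
    // role of the "acquire barrier at the end of flush()" from note 1.
    // The returned chain is in LIFO order.
    Node* flush() {
        return head.exchange(nullptr, std::memory_order_acquire);
    }
};

// Reverse the flushed chain so items are processed in the order they were pushed.
Node* reverse(Node* n) {
    Node* prev = nullptr;
    while (n) {
        Node* next = n->next;
        n->next = prev;
        prev = n;
        n = next;
    }
    return prev;
}

Because only the consumer ever removes nodes, and only by taking the whole list at once, the head is a single pointer with no ABA counter, exactly as note 2 describes.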

Oxcart answered 30/9, 2010 at 2:53 Comment(2)
You always need acquire/release ordering (or C++11 acquire/release operations rather than full barriers), but on x86 they compile to zero extra instructions and only prevent compile-time reordering. preshing.com/20120625/memory-ordering-at-compile-time.Kayceekaye
Yeah.... the lightweight sync / compiler barrier / acquire / release is something to be aware of any time you're writing lockfree code.Oxcart

Here's a paper from the University of Rochester describing a non-blocking concurrent queue. The algorithm described in the paper is one technique for building a lock-free queue.

Leo answered 28/1, 2010 at 2:1 Comment(3)
Java's ConcurrentLinkedQueue implements the algorithm in the paper, FYI.Leviticus
I believe it's also, roughly, the basis for the ConcurrentQueue<T> implementation in .NET: blogs.msdn.com/pfxteam/archive/2010/01/26/9953725.aspxLeo
Looks like the link is broken! Here is an updated link to the paper from the University of Rochester (pdf can be downloaded at the top).Mockingbird

You may want to look at Intel Threading Building Blocks; I recall a lecture by an Intel developer that mentioned something along those lines.

Sheritasherj answered 28/1, 2010 at 1:57 Comment(0)
