Circular lock-free buffer

I'm in the process of designing a system that connects to one or more streams of data feeds, does some analysis on the data, and then triggers events based on the result. In a typical multi-threaded producer/consumer setup, I will have multiple producer threads putting data into a queue and multiple consumer threads reading the data, and the consumers are only interested in the latest data point plus the previous n points. The producer threads will have to block if a slow consumer can not keep up, and of course the consumer threads will block when there are no unprocessed updates. A typical concurrent queue with a reader/writer lock would work nicely, but the rate of incoming data could be huge, so I want to reduce my locking overhead, especially the writer locks taken by the producers. I think a circular lock-free buffer is what I need.

Now two questions:

  1. Is a circular lock-free buffer the answer?

  2. If so, before I roll my own, do you know of any public implementation that will fit my needs?

Any pointers on implementing a circular lock-free buffer are always welcome.

BTW, doing this in C++ on Linux.

Some additional info:

Response time is critical for my system. Ideally the consumer threads will want to see any update as soon as it comes in, because an extra millisecond of delay could make the system worthless, or worth a lot less.

The design idea I'm leaning toward is a semi-lock-free circular buffer where the producer thread puts data into the buffer as fast as it can (call the head of the buffer A), without blocking unless the buffer is full, that is, when A meets the end of the buffer, Z. Each consumer thread holds two pointers into the circular buffer, P and Pn, where P is the thread's local buffer head and Pn is the nth item after P. Each consumer thread advances its P and Pn once it finishes processing its current P, and the end-of-buffer pointer Z advances with the slowest Pn. When P catches up to A, which means there are no more new updates to process, the consumer spins and busy-waits for A to advance again. If a consumer thread spins for too long, it can be put to sleep and wait on a condition variable, but I'm okay with consumers burning CPU cycles waiting for updates, because that does not increase my latency (I'll have more CPU cores than threads).

Imagine a circular track where the producer runs in front of a bunch of consumers; the key is to tune the system so that the producer usually stays just a few steps ahead of the consumers, and most of these operations can be done with lock-free techniques. I understand that getting the details of the implementation right is not easy... okay, very hard, which is why I want to learn from others' mistakes before making a few of my own.
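
Here is a rough sketch of the layout I'm describing (purely illustrative: a single producer, a fixed number of consumers, monotonically increasing sequence numbers instead of wrapping pointers, and with the Pn lookahead and the sleep-after-spinning fallback omitted):

#include <algorithm>
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <thread>

constexpr std::size_t BUFFER_SIZE = 1024;     // power of two
constexpr std::size_t MASK = BUFFER_SIZE - 1;
constexpr int NUM_CONSUMERS = 4;

struct Ring {
    double slots[BUFFER_SIZE];
    std::atomic<std::uint64_t> head{0};                   // "A": next sequence to publish
    std::atomic<std::uint64_t> cursor[NUM_CONSUMERS] {};  // each consumer's "P"
};

// Producer: wait (spin) while the slowest consumer ("Z") is a full buffer behind.
void produce(Ring& r, double value) {
    const std::uint64_t seq = r.head.load(std::memory_order_relaxed);
    for (;;) {
        std::uint64_t slowest = UINT64_MAX;
        for (auto& c : r.cursor)
            slowest = std::min(slowest, c.load(std::memory_order_acquire));
        if (seq - slowest < BUFFER_SIZE) break;            // room available
        std::this_thread::yield();                         // buffer full: A has met Z
    }
    r.slots[seq & MASK] = value;
    r.head.store(seq + 1, std::memory_order_release);      // publish
}

// Consumer i: spin until the producer has published past our cursor, then read.
double consume(Ring& r, int i) {
    const std::uint64_t seq = r.cursor[i].load(std::memory_order_relaxed);
    while (r.head.load(std::memory_order_acquire) <= seq)
        ;                                                  // busy-wait, as described above
    double v = r.slots[seq & MASK];
    r.cursor[i].store(seq + 1, std::memory_order_release);
    return v;
}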

Exhilarate answered 15/5, 2009 at 23:11 Comment(10)
I think it would be helpful if you sketch the API that you want this data structure to implement.Sechrist
Something I learned is to take big chunks of work. I don't know the size of your work items, but you can increase efficiency if you can produce bigger chunks and consume bigger chunks. You can also increase it by consuming variable-size chunks so the consumers don't all finish at once and contend on the data queue.Haggai
Another thing to think of is if you need one buffer or a series of buffers. You could have producer/consumer pairs sharing one buffer, and when a buffer is full, the producer or consumer temporarily switches to another open buffer. It's a form of work stealing.Haggai
@Sechrist the api is presumably just one enqueue method plus one dequeue method ... both of them being synchronous/blocking methods, i.e. dequeue should block if there were nothing in the queue, and enqueue block if the circular buffer were full.Stambaugh
Efficient lock-free algorithms are unique snowflakes whose discovery usually merits a research article. I'm not going to attempt to answer this question until OP distinguishes his actual requirements from what he thinks a solution should look like.Sechrist
Zan Lynx: Yes, bundling up work can reduce lock overhead. I had that in my previous systems. I also had a dynamic bundling size based on workload. It worked pretty well, but this time the data rate is too fast for my old system to handle; that's why I have to rethink the whole thing.Exhilarate
A millisecond is a very fast timing deadline on unmodified Linux. If another process gets to run, then you could easily miss it. You would need to use real-time priorities, and even then I'm not sure that you can reliably meet those deadlines. Are you sure that you need to be that responsive? Can you make just the producers that fast, implement them in e.g. a device driver, and relax the requirements on the consumers?Chun
Doug, what I meant to say was that one extra millisecond will make a big difference. I have not done any profiling to see whether other system processes may be causing latency in my system, but I think that, on average, preemption by system processes isn't going to have any significant impact.Exhilarate
Why don't you just use semaphores? No locking/blocking: the consumer only goes to sleep when the buffer is empty, the producer when the buffer is full. What's wrong with that?Halcyon
If you really need hard-real-time behavior, it's probably worthwhile looking in to running the program as a hard-real-time task on a hard-real-time OS. For a Linux-friendly version, check out Xenomai, it will let you run hard-real-time and regular (soft-real-time/non-real-time) processes simultaneously in the same environment.Congruency

I've made a particular study of lock-free data structures in the last couple of years. I've read most of the papers in the field (there are only about forty or so, although only about ten or fifteen are of any real use :-)

AFAIK, a lock-free circular buffer has not been invented. The problem will be dealing with the complex condition where a reader overtakes a writer or vice versa.

If you have not spent at least six months studying lock-free data structures, do not attempt to write one yourself. You will get it wrong and it may not be obvious to you that errors exist, until your code fails, after deployment, on new platforms.

I believe however there is a solution to your requirement.

You should pair a lock-free queue with a lock-free free-list.

The free-list will give you pre-allocation and so obviate the (fiscally expensive) requirement for a lock-free allocator; when the free-list is empty, you replicate the behaviour of a circular buffer by instantly dequeuing an element from the queue and using that instead.

(Of course, in a lock-based circular buffer, once the lock is obtained, obtaining an element is very quick - basically just a pointer dereference - but you won't get that in any lock-free algorithm; they often have to go well out of their way to do things; the overhead of failing a free-list pop followed by a dequeue is on a par with the amount of work any lock-free algorithm will need to be doing).

Michael and Scott developed a really good lock-free queue back in 1996. Searching for "Michael and Scott, FIFO" will give you enough details to track down the PDF of their paper.

A lock-free free-list is the simplest lock-free algorithm and in fact I don't think I've seen an actual paper for it.
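
As a sketch of what I mean by a free-list (illustrative only; this is the usual Treiber-stack shape of pre-allocated nodes, and for brevity it glosses over the ABA hazard discussed in the comments below, which a real version would handle with a pointer+counter pair or hazard pointers):

#include <atomic>

struct Node {
    Node* next;
    // ... payload ...
};

class FreeList {
    std::atomic<Node*> head{nullptr};
public:
    void push(Node* n) {                       // return a node to the free-list
        Node* old = head.load(std::memory_order_relaxed);
        do {
            n->next = old;
        } while (!head.compare_exchange_weak(old, n,
                     std::memory_order_release, std::memory_order_relaxed));
    }
    Node* pop() {                              // grab a pre-allocated node, or nullptr if empty
        Node* old = head.load(std::memory_order_acquire);
        while (old &&
               !head.compare_exchange_weak(old, old->next,
                     std::memory_order_acquire, std::memory_order_relaxed))
            ;
        return old;
    }
};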

Viguerie answered 20/5, 2009 at 21:9 Comment(16)
@Blank Xavier: The Michael and Scott FIFO looks a lot like one I independently implemented in .net; it didn't seem hard. If the .net run-time and garbage-collector didn't guarantee that objects will never be recycled while references exist, it would have been hard to prevent the ABA problem (the Michael and Scott paper linked above didn't mention this) but the .net garbage collector solves that problem automatically. Out of curiosity, how did you solve the ABA problem?Karlin
@Supercat: the M&S paper explicitly solves the ABA problem, by using pointer-counter pairs; "structure pointer_t {ptr: pointer to node_t, count: unsigned integer}". The counter is incremented when necessary, so that it is almost impossible for ABA to occur.Viguerie
@Blank Xavier: Ah, I somehow missed that. Their CAS seems to assume the existence of an instruction to swap a two-item struct; I'm unaware of any such feature in .net.Karlin
@Supercat: I'm surprised. Double-word CAS is available in the Win32 API. If you implement hazard pointers, I think you can still use their algorithm. I'll be trying that out soon enough.Viguerie
@Blank Xavier: Double-word CAS is exposed for arguments of type 'Long' only. I can think of three reasons: (1) A struct containing two 32-bit items may sit on an odd 32-bit boundary; (2) the most useful scenarios would involve pairing an object reference with something, and for some reason object references grow to 64 bits in x64 (I'd have left them as 32 bits, since 2+ billion objects would have been a lot more than 32-bit Windows can handle); (3) I'm not sure hardware will serialize a DCAS that occurs simultaneously with a CAS on half of it.Karlin
In Win32, there are Interlocked*() API function calls for everything except 64-bit DCAS. 64-bit DCAS (e.g. 128 bits atomically CASed) has a different type of prototype, starting with an underscore, which may be why it's not obviously present. I don't -know- if DCAS serialises correctly with CAS, but I think it does - the basic mechanism is to lock the given cache lines, so DCAS will try to lock two cache lines whereas CAS will lock one - DCAS will notice if someone else has locked the cache line it is trying to obtain.Viguerie
Note that a lock-free fixed-length (array based, in fact) circular buffer, for a single reader and single writer, where the buffer being full leads to failure to write, exists and has existed for a long time.Viguerie
I've used such single-reader/single-writer queues often on embedded systems. No special atomic primitives are required except for "volatile" variables, since every storage location will be "owned" by either the reader or the writer, and neither the reader nor the writer will ever allow the other to see an illegitimate state.Karlin
I wonder if volatile is enough? Could there be a re-ordering issue? Might memory barriers be necessary?Viguerie
Different compilers and implementations do different things with volatile variables. What is necessary here is mainly that if thread 1 writes two volatile variables X and Y, in that order, and if thread 2 reads Y then X, then if thread 2 sees the new value of Y, it must also see the new value of X. On the embedded systems where I've done such things, volatile variables provide such guarantees (not hard when there's only one processor). On multi-core systems, I believe most compilers will generate memory barriers sufficient to guarantee the above semantics with volatile variables...Karlin
...although using volatile variables would likely also generate additional unnecessary (but costly) memory barriers as well. Personally, I suspect that as processors get more and more cores, it will become necessary to have memory-synchronization primitives that can operate at a finer level than memory barriers (whose cost increases with the number of cores).Karlin
As I understand it, volatile won't be needed on a single core system - there are no external sources changing the values in the variables; only threads on the same single core.Viguerie
On multi-core systems, I know recent MSVCs will emit a memory barrier when reading a volatile and another when writing one, but I think that is totally compiler dependent. I'd write explicit memory barriers (hopefully the compiler will remove any duplicates!)Viguerie
Even on a single-core system, at least in C, volatile is needed to ensure that the compiler will not attempt to reorder or eliminate reads and writes. For example, if SPI_* are not volatile, for(i=0;i<16;i++){SPI_DAT = tx[i]; temp=16000; do {} while(!SPI_RDY && --temp); if (!temp) break; rx[i] = SPI_DAT;} could be rewritten (assuming neither temp nor i was used further along in the code) as if (SPI_RDY) {memcpy(rx,tx,16); SPI_DAT = tx[15];} else SPI_DAT = tx[0];. The latter code would be much more "efficient", but wouldn't be very effective at reading or writing SPI data.Karlin
I may be wrong, but I think unless the volatile is performing a memory barrier, it will not have any impact on the re-ordering behaviour of the compiler or CPU. All it will do is ensure its value is read from memory rather than a register or cache.Viguerie
IIRC, volatile memory accesses won't be reordered with other volatile accesses. But in ISO C, that's all you get. In MSVC, volatile goes way beyond that, but these days you should just use std::atomic with memory_order_release or seq_cst or whatever you want.Ripuarian
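
For example, the X-then-Y publication discussed above looks like this with C++11 std::atomic (a minimal sketch; the names are illustrative only). If the reader sees the new Y, it is guaranteed to also see the new X:

#include <atomic>

int x_payload = 0;                 // "X": plain data
std::atomic<bool> y_ready{false};  // "Y": publication flag

void writer() {
    x_payload = 42;                                    // write X first
    y_ready.store(true, std::memory_order_release);    // then publish Y
}

int reader() {
    while (!y_ready.load(std::memory_order_acquire))   // wait until the new Y is visible
        ;
    return x_payload;                                  // guaranteed to see 42
}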

The term of art for what you want is a lock-free queue. There's an excellent set of notes with links to code and papers by Ross Bencina. The guy whose work I trust the most is Maurice Herlihy (for Americans, he pronounces his first name like "Morris").

Anderer answered 16/5, 2009 at 1:9 Comment(6)
A queue is not a circular buffer.Viguerie
@Blank Xavier: No, but a circular buffer is a queue. The problem calls for a queue. And the most efficient implementation of a queue is as... (wait for it) a circular buffer. In any case, if you want to search, you search for 'lock-free queue', not 'lock-free circular buffer'.Anderer
I don't see any reason not to just use semaphores. No locking/blocking: the consumer only goes to sleep when the buffer is empty, the producer when the buffer is full. What's wrong with that? How could some lock-free queue be better than this?Halcyon
@Tomas: the lock-free queue can be better because there's no single lock to act as a performance bottleneck. OP was specifically concerned about reducing lock overhead when contention is very high. Semaphores don't help with contention. A lock-free queue does.Anderer
liblfds.org has a well-regarded multi-producer/multi-consumer circular buffer queue. It's not technically lock-free: a producer that sleeps in the middle of adding an entry can block consumers from seeing anything from other producers. See #45907710 for an analysis of its progress guarantees. It's very low-contention (none between a producer and a consumer if it's not empty or full), and may be what you want in practice..Ripuarian
@Viguerie A circular buffer is an implementation of a queue with bounded capacity, right?Glossectomy

The requirement that producers or consumers block if the buffer is empty or full suggests that you should use a normal locking data structure, with semaphores or condition variables to make the producers and consumers block until data is available. Lock-free code generally doesn't block on such conditions - it spins or abandons operations that can't be done instead of blocking using the OS. (If you can afford to wait until another thread produces or consumes data, then why is waiting on a lock for another thread to finish updating the data structure any worse?)
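
As a minimal sketch of that kind of "normal locking" structure (illustrative only; a fixed capacity and element type are hard-coded here, and condition variables provide the blocking behaviour):

#include <condition_variable>
#include <cstddef>
#include <mutex>

class BoundedBuffer {
    static const std::size_t N = 1024;
    double buf[N];
    std::size_t head = 0, tail = 0, count = 0;
    std::mutex m;
    std::condition_variable not_full, not_empty;
public:
    void put(double v) {
        std::unique_lock<std::mutex> lk(m);
        not_full.wait(lk, [this] { return count < N; });   // producer blocks while full
        buf[head] = v;
        head = (head + 1) % N;
        ++count;
        not_empty.notify_one();
    }
    double get() {
        std::unique_lock<std::mutex> lk(m);
        not_empty.wait(lk, [this] { return count > 0; });  // consumer blocks while empty
        double v = buf[tail];
        tail = (tail + 1) % N;
        --count;
        not_full.notify_one();
        return v;
    }
};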

On (x86/x64) Linux, inter-thread synchronization using mutexes is reasonably cheap if there is no contention. Concentrate on minimizing the time that the producers and consumers need to hold onto their locks. Given that you've said that you only care about the last N recorded data points, I think a circular buffer would do this reasonably well. However, I don't really understand how this fits in with the blocking requirement and the idea of consumers actually consuming (removing) the data they read. (Do you want consumers to only look at the last N data points, and not remove them? Do you want producers to not care if consumers can't keep up, and just overwrite old data?)

Also, as Zan Lynx commented, you can aggregate/buffer up your data into bigger chunks when you've got lots of it coming in. You could buffer up a fixed number of points, or all the data received within a certain amount of time. This means that there will be fewer synchronization operations. It does introduce latency, though, but if you're not using real-time Linux, then you'll have to deal with that to an extent anyway.
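
A rough illustration of the batching idea (purely a sketch; the queue type and batch size here are placeholders):

#include <cstddef>
#include <utility>
#include <vector>

constexpr std::size_t BATCH = 64;

template <class Queue>
void produce_batched(Queue& q, double point) {
    static thread_local std::vector<double> pending;   // thread-local, no locking
    pending.push_back(point);
    if (pending.size() >= BATCH) {
        q.push(std::move(pending));                     // one synchronized operation per batch
        pending.clear();
    }
}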

Chun answered 16/5, 2009 at 0:44 Comment(4)
absolutely agree with the first paragraph. Don't see any reason for not using semaphore here.Halcyon
@TMS, An application that has dedicated producer threads and dedicated consumer threads (e.g., as described by the OP) can never be called "lock-free." In a lock-free application with N threads, you should be able to indefinitely suspend any combination of (N-1) or fewer of the threads, and the application should still continue to make progress. But a dedicated consumer thread can not indefinitely make progress if all of the producers are suspended, and if dropping the "product" on the floor is not allowed, then a producer can not make progress if none of the consumers is allowed to run.Subcontinent
@jameslarge: "lock-free" can describe an algorithm or data structure (like a queue). en.wikipedia.org/wiki/Non-blocking_algorithm It's not usually applied to a whole application. Suspending all the producer threads simply means the queue will empty. But in a lock-free queue, suspending any one or more threads at any time must not stop the other threads from being able to enqueue & dequeue. (Even this is not easy to achieve; efficient implementations will often have a thread "claim" a slot: #45907710)Ripuarian
@Doug: Yes, a consumer/producer should sleep if they find the queue empty/full. But no, it doesn't always mean you should implement the queue with traditional locking, especially not with one big lock for the whole data structure. This queue (same as previous comment) is not "lock-free" in the technical sense, but it does allow producers to be totally independent from consumers (no contention), giving potentially better throughput than you could get locking. Good point about efficient wake on empty->non-empty though.Ripuarian

The implementation in the boost library is worth considering. It's easy to use and fairly high performance. I wrote a test and ran it on a quad-core i7 laptop (8 threads) and got ~4M enqueue/dequeue operations a second. Another implementation not mentioned so far is the MPMC queue at http://moodycamel.com/blog/2014/detailed-design-of-a-lock-free-queue. I have done some simple testing with this implementation on the same laptop with 32 producers and 32 consumers. It is, as advertised, faster than the boost lockless queue.
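
For reference, a minimal usage sketch of the Boost implementation referred to above (assuming boost::lockfree::queue is the one meant; the capacity is fixed at construction):

#include <boost/lockfree/queue.hpp>

int main() {
    boost::lockfree::queue<int> q(1024);   // bounded, multi-producer/multi-consumer
    q.push(42);                            // returns false if the queue is full
    int v;
    while (q.pop(v))                       // returns false once the queue is empty
        ;                                  // ... process v ...
    return 0;
}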

As most of the other answers state, lockless programming is hard. Most implementations will have hard-to-detect corner cases that take a lot of testing and debugging to fix. These are typically fixed with careful placement of memory barriers in the code. You will also find proofs of correctness published in many of the academic articles. I prefer testing these implementations with a brute-force tool. Any lockless algorithm you plan on using in production should be checked for correctness using a tool like http://research.microsoft.com/en-us/um/people/lamport/tla/tla.html.

Millsaps answered 13/9, 2015 at 17:49 Comment(0)

There is a pretty good series of articles about this on DDJ. As a sign of how difficult this stuff can be, it's a correction on an earlier article that got it wrong. Make sure you understand the mistakes before you roll your own )-;

Mattress answered 16/5, 2009 at 0:2 Comment(0)

One useful technique to reduce contention is to hash the items into multiple queues and have each consumer dedicated to a "topic".

As for the most recent N items your consumers are interested in: you don't want to lock the whole queue and iterate over it to find an item to override; just publish items in N-tuples, i.e. all N recent items at once. Bonus points for an implementation where the producer blocks on a full queue (when consumers can't keep up) with a timeout, updating its local tuple cache in the meantime; that way you don't put back-pressure on the data source.
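
A rough sketch of the first point, sharding by topic (the Queue type here is just a placeholder):

#include <cstddef>
#include <functional>
#include <string>

constexpr std::size_t NUM_SHARDS = 8;

template <class Queue, class Item>
void publish(Queue (&shards)[NUM_SHARDS], const std::string& topic, const Item& item) {
    std::size_t shard = std::hash<std::string>{}(topic) % NUM_SHARDS;
    shards[shard].push(item);   // only consumers subscribed to this shard contend here
}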

Halogenate answered 16/5, 2009 at 3:37 Comment(6)
I have also considered the boss/worker threading model where the boss thread multicasts updates to worker threads' private queues. I think this is more or less the direction you're going. I have to give it more thought, but when I was considering it, boss/worker seemed to have too much overhead because all workers must get the same updates.Exhilarate
Not exactly - what I mean in the first point is slice your incoming stream so not all the threads compete for the same lock/queue. The second point is caching on the producer side to accommodate spikes on the input and also allow for slow consumers not to halt the producer.Halogenate
But business logic requires all worker threads to know all the data that is streaming in. There is one and only one type of data coming in, and every data point is equally important, so I can't really slice my incoming stream and have different data in different queues. Caching on the producer side and bundling updates to the data model to prevent hogging has been done, and it wasn't enough to handle the load.Exhilarate
How big is the input domain? If it's something like market data in financial world, you have limited, though big, number of items and only few types of updates. Are workers reactive to the input events or do they do their own processing and only poll for your input when necessary?Halogenate
It is something like market data in the financial world. Workers do their own processing, and they have random access to the last n updates of history when needed (n is a configurable number but will not change during the lifetime of the process). I would like to design a system that works well for both large and small n so I can have one code base.Exhilarate
Well, you can hand off n-tuples to consumers with swap of a pointer to it (you'd need to worry about memory fences, etc. - ordered atomic) but then you'll get into memory management issues like with hazard pointers.Halogenate

Sutter's queue is sub-optimal and he knows it. The Art of Multiprocessor Programming is a great reference, but don't trust the Java guys on memory models, period. Ross's links will get you no definitive answer, because the libraries they point to had these same problems, and so on.

Doing lock-free programming is asking for trouble, unless you want to spend a lot of time on something you are clearly over-engineering before you have solved the problem (judging by the description, it's the common madness of 'looking for perfection' in cache coherency). It takes years, and it leads to not solving the problem first and optimising later, a common disease.

Impermissible answered 16/5, 2009 at 23:31 Comment(6)
Would you post a link to analysis of Sutter's queue?Halogenate
It's all on DDJ, and one of the guys following his blogs profiled it... The point is that hot CAS isn't required for many scenarios, and that you can beat that kind of fine granularity any day, even with simple swaps.Impermissible
That's it, thanks... But I believe it still might have some races. Heck, anything out there as sensitive as expecting implicit barriers or specific or full coherency understanding is a problem waiting to happen in production. I just don't believe that level of detail helps as much as focusing on application-level design rather than low-level plumbing, when/only if it makes sense/is identified to be so. I applaud the effort, the books, all of it; but it's just articles on a tough topic that even MS is struggling to do well for the mass-market PFX crowd.Impermissible
Just an opinion: there is always more important work to do than looking into plumbing. Parallel efforts ripple across the board, not just queues, or indeed the mid-1990s threading DDJ articles keep reinventing; that is, from NT to later Solaris and Unix adopting similar techniques, or the latest work on C++. The latter will probably take ages to complete, and it still fights the fact that there is no clean OO way to program a post-P2-Pro-like out-of-order universe.Impermissible
Dennis' site has moved to landenlabs.com/code/ring/ring.html - It has a lock free ring buffer.Longwood
@user1959190: that lock-free implementation uses volatile size_t m_rIndex, m_wIndx; instead of C++11 std::atomic for the indices, but it looks to me like it depends on acquire-load / release-store behaviour from it (e.g. other threads have to see the m_buffer[m_wIndex] = value store before they see m_wIndex = Next(m_wIndex)). So it happens to work on x86, but will break on ARM/PowerPC/whatever. It's also inefficient, because instead of loading from the volatile into a local variable, it keeps re-referencing the volatile value in the Get() and Put() functions.Ripuarian

I am not an expert on hardware memory models and lock-free data structures, so I tend to avoid using those in my projects; I go with traditional locked data structures.

However, I've recently noticed this video: Lockless SPSC queue based on ring buffer

It is based on an open-source, high-performance Java library called the LMAX Disruptor, used by a trading system: LMAX Disruptor

Based on the presentation above, you make the head and tail pointers atomic and atomically check for the condition where the head catches the tail from behind, or vice versa.

Below is a very basic C++11 implementation of it:

// Uses the default (sequentially consistent) memory ordering throughout
#include<thread>
#include<atomic>
#include <cinttypes>
using namespace std;

#define RING_BUFFER_SIZE 1024  // power of 2 for efficient %
class lockless_ring_buffer_spsc
{
    public :

        lockless_ring_buffer_spsc()
        {
            write.store(0);
            read.store(0);
        }

        bool try_push(int64_t val)
        {
            const auto current_tail = write.load();
            const auto next_tail = increment(current_tail);
            if (next_tail != read.load())
            {
                buffer[current_tail] = val;
                write.store(next_tail);
                return true;
            }

            return false;  
        }

        void push(int64_t val)
        {
            while( ! try_push(val) );
            // TODO: exponential backoff / sleep
        }

        bool try_pop(int64_t* pval)
        {
            auto currentHead = read.load();

            if (currentHead == write.load())
            {
                return false;
            }

            *pval = buffer[currentHead];
            read.store(increment(currentHead));

            return true;
        }

        int64_t pop()
        {
            int64_t ret;
            while( ! try_pop(&ret) );
            // TODO: exponential backoff / sleep
            return ret;
        }

    private :
        std::atomic<int64_t> write;
        std::atomic<int64_t> read;
        static const int64_t size = RING_BUFFER_SIZE;
        int64_t buffer[RING_BUFFER_SIZE];

        int64_t increment(int64_t n)
        {
            return (n + 1) % size;
        }
};

int main (int argc, char** argv)
{
    lockless_ring_buffer_spsc queue;

    std::thread write_thread( [&] () {
             for(int i = 0; i<1000000; i++)
             {
                    queue.push(i);
             }
         }  // End of lambda expression
                                                );
    std::thread read_thread( [&] () {
             for(int i = 0; i<1000000; i++)
             {
                    queue.pop();
             }
         }  // End of lambda expression
                                                );
    write_thread.join();
    read_thread.join();

     return 0;
}
Argybargy answered 17/6, 2015 at 14:30 Comment(7)
Use a power of 2 for your size, so the % (modulo) is just a bitwise AND. Also, storing a sequence number in your slots would reduce contention between producer and consumer. In this, the producer has to read the write position, and vice-versa, so the cache line containing those atomic variables ping-pongs between cores. See #45907710 for a slot sequence-number way. (It's a multi-producer multi-consumer queue, and could be greatly simplified to a single producer/consumer queue like this.)Ripuarian
I'm pretty sure many of the loads/stores only need memory_order_acquire or release, not the default seq_cst. This is a big difference on x86, where seq_cst stores need mfence (or xchg), but release stores are just plain x86 stores. StoreLoad barriers are the most expensive barrier on most other architectures, too. (preshing.com/20120710/…)Ripuarian
It would probably be better to put read after buffer in the class layout, so it's in a different cache line from write. So the two threads will only be reading cache lines written by the other, rather than both writing into the same cache line. Also, they should be size_t: there's no point having 64-bit counters with 32-bit pointers. And an unsigned type makes modulo much more efficient (godbolt.org/g/HMVL5C). Even uint32_t would be reasonable for almost all uses. It would probably be best to template this on size, or dynamically allocate the buffer.Ripuarian
@PeterCordes why is modulo more efficient with a power of 2?Marshall
@BaiyanHuang: Because computers use binary integers, so you just keep the low n bits with an AND. e.g. x % 8 = x & 7, and bitwise AND is much cheaper than div, or even tricks you can do with compile-time-constant divisors.Ripuarian
@PeterCordes Signed integers are generally faster for most operations, if your modulo is a constant power of two, you can use a value & (power - 1) to get the same optimized modulo operation as unsigned integers but still allow the compiler to perform extra optimizations because of the explicitly undefined behavior of edge cases in signed operations.Baird
@Baird - for most operations they're the same speed. Do you have a specific ISA in mind for your performance claim? On x86-64, one of the differences is that zero-extension from 32 to 64-bit is often free in asm, happening implicitly every time you write a 32-bit register. Sign-extension from int to pointer-width requires a special instruction. I know some other ISAs have different standard calling conventions (e.g. MIPS64 keeps values sign-extended to 64-bit on pass/return, so receiving a signed arg can be cheaper for a non-inlined function.) Signed overflow UB allows some optimizations.Ripuarian
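
Putting those suggestions together, here is a sketch of the same single-producer/single-consumer queue with cache-line-separated indices, a mask instead of %, and acquire/release ordering (illustrative only, not a drop-in replacement for the code above):

#include <atomic>
#include <cstdint>

class spsc_ring {
    static constexpr std::uint32_t SIZE = 1024;          // power of two
    static constexpr std::uint32_t MASK = SIZE - 1;
    std::int64_t buffer[SIZE];
    alignas(64) std::atomic<std::uint32_t> write{0};     // written by the producer only
    alignas(64) std::atomic<std::uint32_t> read{0};      // written by the consumer only
public:
    bool try_push(std::int64_t val) {
        const std::uint32_t w = write.load(std::memory_order_relaxed);
        const std::uint32_t next = (w + 1) & MASK;
        if (next == read.load(std::memory_order_acquire))
            return false;                                 // full
        buffer[w] = val;
        write.store(next, std::memory_order_release);     // publish the new element
        return true;
    }
    bool try_pop(std::int64_t* out) {
        const std::uint32_t r = read.load(std::memory_order_relaxed);
        if (r == write.load(std::memory_order_acquire))
            return false;                                 // empty
        *out = buffer[r];
        read.store((r + 1) & MASK, std::memory_order_release);
        return true;
    }
};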

I would agree with this article and recommend against using lock-free data structures. A relatively recent paper on lock-free FIFO queues is this; search for further papers by the same author(s); there's also a PhD thesis from Chalmers on lock-free data structures (I lost the link). However, you did not say how large your elements are -- lock-free data structures work efficiently only with word-sized items, so you'll have to dynamically allocate your elements if they're larger than a machine word (32 or 64 bits). If you dynamically allocate elements, you shift the (supposed, since you haven't profiled your program and you're basically doing premature optimization) bottleneck to the memory allocator, so you need a lock-free memory allocator, e.g., Streamflow, and integrate it with your application.

Gallo answered 16/5, 2009 at 5:29 Comment(1)
If you fully pre-allocate your elements, you do not stress the allocator.Viguerie

This is an old thread, but since it hasn't been mentioned yet: there is a lock-free, circular, single-producer/single-consumer FIFO available in the JUCE C++ framework.

https://www.juce.com/doc/classAbstractFifo#details

Radii answered 19/4, 2016 at 8:54 Comment(0)

Although this is an old question, no one has mentioned DPDK's lockless ring buffer. It's a high-throughput ring buffer that supports multiple producers and multiple consumers. It also provides single-consumer and single-producer modes, and the ring buffer is wait-free in SPSC mode. It's written in C and supports multiple architectures.

In addition, it supports Bulk and Burst modes where items can be enqueued/dequeued in bulk. The design lets multiple consumers or multiple producers write to the queue at the same time by simply reserving space through moving an atomic pointer.

Pilcomayo answered 15/9, 2017 at 19:41 Comment(3)
Is it truly lock-free, or only if a producer / consumer doesn't sleep after claiming a slot but before finishing the enqueue/dequeue? See this analysis of the multi-producer multi-consumer queue in liblfds.org, which probably works pretty similarly. In practice it works great with low contention, but it's not technically lock-free. Anyway, upvoted because bulk/burst mode sounds like a good idea.Ripuarian
I agree, it does not guarantee termination-safety and according to [1024cores] (1024cores.net/home/lock-free-algorithms/introduction), it is a blocking algorithm and system might not make forward progress. But it becomes wait-free in SPSC mode. I edit the answer to reflect this.Pilcomayo
Also take a look at Dmitry's implementation of Bounded MPMC queue: 1024cores.net/home/lock-free-algorithms/queues/…. Again this is not lock-free but is lockless, and is very simple and efficient. However, in terms of performance the DPDK's queue in bulk/burst mode can reach up to hundred millions of operations per second in terms of throughput depending on the batch size. The combination of atomic operations and sequential read/write makes it very efficient.Pilcomayo

Some time ago, I found a nice solution to this problem. I believe it is the smallest found so far.

The repository has an example of how to use it to create N threads (readers and writers) and make them share a single seat.

I made some benchmarks on the test example and got the following results (in millions of ops/sec):

[Chart: throughput by buffer size]

[Chart: throughput by number of threads]

Notice how the number of threads does not change the throughput.

I think this is the ultimate solution to this problem. It works and is incredibly fast and simple, even with hundreds of threads and a queue with a single position. It can be used as a pipeline between threads, allocating space inside the queue.

Can you break it?

Aldridge answered 31/12, 2019 at 3:32 Comment(0)

Check out Disruptor (How to use it), which is a ring buffer that multiple threads can subscribe to.

Aretina answered 21/3, 2012 at 8:53 Comment(0)

Just for completeness: there's a well-tested lock-free circular buffer in OtlContainers, but it is written in Delphi (TOmniBaseBoundedQueue is a circular buffer and TOmniBaseBoundedStack is a bounded stack). There's also an unbounded queue in the same unit (TOmniBaseQueue). The unbounded queue is described in Dynamic lock-free queue – doing it right. The initial implementation of the bounded queue (circular buffer) was described in A lock-free queue, finally! but the code has been updated since then.

Catheryncatheter answered 15/5, 2009 at 23:11 Comment(0)

Here is how I would do it:

  • map the queue into an array
  • keep state with next-read and next-write indexes
  • keep an empty/full bit vector around

Insertion consists of using a CAS with an increment and roll over on the next write. Once you have a slot, add your value and then set the empty/full bit that matches it.

Removals require a check of the bit first to test for underflow, but other than that they are the same as for writes, using the read index and clearing the empty/full bit.
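
A rough sketch of what I'm describing (illustrative only; it uses fetch_add to claim a slot, which behaves like the CAS-with-increment as long as the counter's wrap-around is a multiple of the size, and it assumes no more than SIZE operations are ever in flight at once):

#include <atomic>
#include <cstdint>

class flagged_ring {
    static constexpr std::uint32_t SIZE = 1024;            // power of two, divides the counter's wrap-around
    std::int64_t data[SIZE];
    std::atomic<bool> full[SIZE] {};                        // the "empty/full bit vector"
    std::atomic<std::uint32_t> next_write{0}, next_read{0};
public:
    void put(std::int64_t v) {
        std::uint32_t slot = next_write.fetch_add(1) % SIZE;   // claim a write slot
        while (full[slot].load(std::memory_order_acquire))
            ;                                                   // wait for the slot to be emptied
        data[slot] = v;
        full[slot].store(true, std::memory_order_release);
    }
    std::int64_t take() {
        std::uint32_t slot = next_read.fetch_add(1) % SIZE;    // claim the next slot to read
        while (!full[slot].load(std::memory_order_acquire))
            ;                                                   // wait for the slot to be filled
        std::int64_t v = data[slot];
        full[slot].store(false, std::memory_order_release);
        return v;
    }
};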

Be warned,

  1. I'm no expert in these things
  2. atomic ASM ops seem to be very slow when I've used them, so if you end up with more than a few of them, you might be faster using locks embedded inside the insert/remove functions. The theory is that a single atomic op to grab the lock followed by (very) few non-atomic ASM ops might be faster than the same thing done by several atomic ops. But to make this work would require manual or automatic inlining so it's all one short block of ASM.
Cultivator answered 20/5, 2009 at 21:33 Comment(3)
Atomic operations are indeed slow, in and of themselves. What makes them useful is that they scale.Viguerie
If the operations inside the lock are very small (as in 5-10 lines of ASM) you might still be ahead to use a locking strategy, particularly if you write the lock directly into the critical sections rather than as a function call.Cultivator
I'm confused. A critical section to me is the section of code which must be serially executed. The lock is the mechanism which ensures serial execution. Could you explain to me what you mean?Viguerie

You may try lfqueue

It is simple to use, and it has a circular, lock-free design.

#include <assert.h>
#include <stdlib.h>
#include <lfqueue.h>   /* header from the lfqueue project */

int *ret;
int *int_data;
int i = 0;

lfqueue_t results;

lfqueue_init(&results);

/** Wrap this scope in multithreaded testing **/
int_data = (int*) malloc(sizeof(int));
assert(int_data != NULL);
*int_data = i++;
/*Enqueue*/
while (lfqueue_enq(&results, int_data) != 1) ;

/*Dequeue*/
while ( (ret = lfqueue_deq(&results)) == NULL);

// printf("%d\n", *(int*) ret );
free(ret);
/** End **/

lfqueue_clear(&results);
Umpire answered 24/7, 2018 at 14:32 Comment(0)

There are situations in which you don't need locking to prevent race conditions, especially when you have only one producer and one consumer.

Consider this paragraph from LDD3:

When carefully implemented, a circular buffer requires no locking in the absence of multiple producers or consumers. The producer is the only thread that is allowed to modify the write index and the array location it points to. As long as the writer stores a new value into the buffer before updating the write index, the reader will always see a consistent view. The reader, in turn, is the only thread that can access the read index and the value it points to. With a bit of care to ensure that the two pointers do not overrun each other, the producer and the consumer can access the buffer concurrently with no race conditions.

Solicit answered 29/11, 2018 at 9:39 Comment(1)
LDD3? Linux Device Drivers 3?Resurge

If you take as a prerequisite that the buffer will never become full, consider using this lock-free algorithm:

capacity must be a power of 2
buffer = new T[capacity] ~ on different cache line
mask = capacity - 1
write_index ~ on different cache line
read_index ~ on different cache line

enqueue:
    write_i = write_index.fetch_add(1) & mask
    buffer[write_i] = element ~ release store

dequeue:
    read_i = read_index.fetch_add(1) & mask
    element
    while ((element = buffer[read_i] ~ acquire load) == NULL) {
        spin loop
    }
    buffer[read_i] = NULL ~ relaxed store
    return element
Trust answered 17/7, 2020 at 20:14 Comment(0)
