Lock-free Progress Guarantees in a circular buffer queue

Anecdotally, I've found that a lot of programmers mistakenly believe that "lock-free" simply means "concurrent programming without mutexes". Usually, there's also a correlated misunderstanding that the purpose of writing lock-free code is for better concurrent performance. Of course, the correct definition of lock-free is actually about progress guarantees. A lock-free algorithm guarantees that at least one thread is able to make forward progress regardless of what any other threads are doing.

This means a lock-free algorithm can never have code where one thread depends on another thread in order to proceed. E.g., lock-free code cannot have a situation where Thread A sets a flag, and then Thread B keeps looping while waiting for Thread A to unset the flag. Code like that is basically implementing a lock (or what I would call a mutex in disguise).
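To make that concrete, here's a minimal sketch of that flag-waiting pattern (the ready flag and the two thread functions are hypothetical, purely for illustration):

#include <atomic>

std::atomic<bool> ready{false};

void thread_a()
{
    ready.store(true, std::memory_order_release);
    // ... do some work while the flag is set ...
    ready.store(false, std::memory_order_release);
}

void thread_b()
{
    // Thread B cannot proceed until Thread A clears the flag. If Thread A
    // is descheduled (or dies) with the flag set, Thread B spins forever,
    // which makes this a lock in everything but name.
    while (ready.load(std::memory_order_acquire))
        ; // spin
    // ... proceed ...
}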

However, other cases are more subtle, and for some of them I honestly can't tell whether an algorithm qualifies as lock-free or not, because the notion of "making progress" sometimes appears subjective to me.

One such case is in the (well-regarded, afaik) concurrency library, liblfds. I was studying the implementation of a multi-producer/multi-consumer bounded queue in liblfds - the implementation is very straightforward, but I cannot really tell if it should qualify as lock-free.

The relevant algorithm is in lfds711_queue_bmm_enqueue.c. Liblfds uses custom atomics and memory barriers, but the algorithm is simple enough for me to describe in a paragraph or so.

The queue itself is a bounded contiguous array (ringbuffer). There is a shared read_index and write_index. Each slot in the queue contains a field for user-data, and a sequence_number value, which is basically like an epoch counter. (This avoids ABA issues).
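In C++ terms, the layout might look like this (a sketch of mine; the member names match my translation further below, not the actual liblfds source):

#include <atomic>
#include <vector>

struct slot
{
    std::atomic<int> sequence_number; // per-slot epoch counter (avoids ABA)
    void*            user_data;
};

class mcmp_queue
{
public:
    explicit mcmp_queue(int num_slots)
        : m_slots(num_slots), m_num_slots(num_slots)
    {
        // Each slot starts with sequence_number equal to its own index, so
        // the first PUSH at slot i observes sequence_number - write_index == 0.
        for (int i = 0; i < num_slots; ++i)
            m_slots[i].sequence_number.store(i, std::memory_order_relaxed);
    }

    bool enqueue(void* data);
    bool dequeue(void*& data);

private:
    std::vector<slot> m_slots;
    int               m_num_slots;
    std::atomic<int>  m_write_index{0};
    std::atomic<int>  m_read_index{0};
};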

The PUSH algorithm is as follows:

  1. Atomically LOAD the write_index
  2. Attempt to reserve a slot in the queue at write_index % queue_size using a CompareAndSwap loop that attempts to set write_index to write_index + 1.
  3. If the CompareAndSwap is successful, copy the user data into the reserved slot.
  4. Finally, update the sequence_number on the slot by making it equal to write_index + 1.

The actual source code uses custom atomics and memory barriers, so for clarity I've translated the algorithm into (untested) standard C++ atomics, as follows:

bool mcmp_queue::enqueue(void* data)
{
    int write_index = m_write_index.load(std::memory_order_relaxed);

    for (;;)
    {
        slot& s = m_slots[write_index % m_num_slots];
        int sequence_number = s.sequence_number.load(std::memory_order_acquire);
        int difference = sequence_number - write_index;

        if (difference == 0)
        {
            // The slot is free; try to reserve it by bumping the write index.
            // On failure, compare_exchange_weak reloads write_index for us
            // and we retry with the fresh value.
            if (m_write_index.compare_exchange_weak(
                write_index,
                write_index + 1,
                std::memory_order_acq_rel
            ))
            {
                // Copy the user data, then publish the slot to consumers
                // by updating its sequence number.
                s.user_data = data;
                s.sequence_number.store(write_index + 1, std::memory_order_release);
                return true;
            }
        }
        else if (difference < 0)
        {
            return false; // queue is full
        }
        else
        {
            // Another producer already claimed this slot; catch up and retry.
            write_index = m_write_index.load(std::memory_order_relaxed);
        }
    }
}

Now, a thread that wants to POP an element from the slot at read_index will not be able to do so until it observes that the slot's sequence_number is equal to read_index + 1.
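Translated in the same untested style as the PUSH, the POP would look roughly like this. Note that this is my reconstruction from the prose description; in particular, the slot-release value on the second-to-last store is my assumption about how the ring is recycled, not something taken from the liblfds source:

bool mcmp_queue::dequeue(void*& data)
{
    int read_index = m_read_index.load(std::memory_order_relaxed);

    for (;;)
    {
        slot& s = m_slots[read_index % m_num_slots];
        int sequence_number = s.sequence_number.load(std::memory_order_acquire);
        int difference = sequence_number - (read_index + 1);

        if (difference == 0)
        {
            // The slot has been published; try to claim it for this consumer.
            if (m_read_index.compare_exchange_weak(
                read_index,
                read_index + 1,
                std::memory_order_acq_rel
            ))
            {
                data = s.user_data;
                // Free the slot for the producers' next lap around the ring.
                s.sequence_number.store(read_index + m_num_slots,
                                        std::memory_order_release);
                return true;
            }
        }
        else if (difference < 0)
        {
            return false; // queue looks empty (or a PUSH is still in flight)
        }
        else
        {
            // Another consumer already claimed this slot; catch up and retry.
            read_index = m_read_index.load(std::memory_order_relaxed);
        }
    }
}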

Okay, so there are no mutexes here, and the algorithm likely performs well (it's only a single CAS for PUSH and POP), but is this lock-free? The reason it's unclear to me is that the definition of "making progress" seems murky when a PUSH or POP can always just fail if the queue is observed to be full or empty.

But what's questionable to me is that the PUSH algorithm essentially reserves a slot, meaning that the slot can never be POP'd until the push thread gets around to updating the sequence number. This means that a POP thread that wants to pop a value depends on the PUSH thread having completed the operation. Otherwise, the POP thread will always return false because it thinks the queue is EMPTY. It seems debatable to me whether this actually falls within the definition of "making progress".

Generally, truly lock-free algorithms involve a phase where a pre-empted thread actually tries to ASSIST the other thread in completing an operation. So, in order to be truly lock-free, I would think that a POP thread that observes an in-progress PUSH would actually need to try and complete the PUSH, and then only after that, perform the original POP operation. If the POP thread simply returns that the queue is EMPTY when a PUSH is in progress, the POP thread is basically blocked until the PUSH thread completes the operation. If the PUSH thread dies, or goes to sleep for 1,000 years, or otherwise gets scheduled into oblivion, the POP thread can do nothing except continuously report that the queue is EMPTY.

So does this fit the definition of lock-free? From one perspective, you can argue that the POP thread can always make progress, because it can always report that the queue is EMPTY (which is at least some form of progress, I guess). But to me, this isn't really making progress, since the only reason the queue is observed as empty is because we are blocked by a concurrent PUSH operation.

So, my question is: is this algorithm truly lock-free? Or is the index reservation system basically a mutex in disguise?

Unsearchable answered 27/8, 2017 at 16:50 Comment(10)
Mutexes interact with the OS's scheduler and this algorithm does not; how could they possibly be equivalent? (Maybe you meant a spin-lock?) – Distraint
Working on an answer. I think it's technically not lock-free, because a thread that sleeps for a long time can keep a slot tied up, eventually having the queue wrap around and leaving writers and readers all stuck waiting for one thread to finish with its slot. But I don't think you could get this behaviour this efficiently with locking. Maybe with per-element mutexes, and atomic fetch_add for read_index and write_index. In practice, an algorithm doesn't have to be 100% lock-free to be useful or performant. It just has to not actually block with a large enough queue on real HW + OS. – Surrogate
@Frank - IMO a spin-lock is simply a particular implementation of a mutex. Do you have any reference that indicates that a mutex has to allow graceful blocking or interact with the scheduler? As I see it, any implementation which allows mutual exclusion is fine, and graceful blocking is only a QOI issue (the argument might be different on non-preemptive operating systems). – Marmite
@PeterCordes - you can get this behavior efficiently (broadly speaking) without locking. The usual approach is to flip everything around and stop making the update of m_write_index the point at which the element is inserted, and instead treat it as a "hint" of where to insert. The actual important operation is then a CAS of s.data from NULL (or some other reserved value) to the user data. At that single point the value is considered "in" the structure, and then m_write_index can be updated by the pushing thread; but even if it doesn't get updated, the structure keeps working. – Marmite
I.e., the next writer will end up seeing that m_write_index happens to point to a full slot and will just look for the next empty slot and proceed as usual, leading to the cooperative behavior that's often necessary for something to be lock-free, as mentioned by the OP. Note that I'm not making any claim that this structure will perform as well as the one above in practical cases (but neither is it too terrible), but it should still be relatively efficient. – Marmite
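To illustrate the idea from the two comments above, here's a rough, untested sketch; the name enqueue_hint is made up, user_data has become an atomic pointer, and the sequence numbers are dropped entirely (a matching pop would CAS user_data back to NULL), so this only demonstrates the cooperative-progress point, not a full design:

bool mcmp_queue::enqueue_hint(void* data)
{
    int idx = m_write_index.load(std::memory_order_relaxed);

    // Scan at most one full lap of the ring, starting from the hint.
    for (int attempts = 0; attempts < m_num_slots; ++attempts, ++idx)
    {
        slot& s = m_slots[idx % m_num_slots];
        void* expected = nullptr;

        // The CAS on the slot's data, not the index update, is the point
        // at which the element is "in" the structure.
        if (s.user_data.compare_exchange_strong(
            expected, data, std::memory_order_release))
        {
            // Best-effort hint update: the structure keeps working even if
            // this thread dies before doing this store.
            m_write_index.store(idx + 1, std::memory_order_relaxed);
            return true;
        }
    }

    return false; // a full lap of occupied slots: the queue is full
}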
@Frank: The actual library uses LFDS711_BACKOFF_EXPONENTIAL_BACKOFF( qbmms->dequeue_backoff, backoff_iteration ); IDK what that does, but it may invoke OS support after some tries. That's not really the point, though. The point is: if you find that you could get equivalent behaviour with an actual mutex, should you use a mutex? Or are you still gaining anything by using atomics? (In this case I think you are.) – Surrogate
As a side note, I believe the first load should use at least acquire ordering. – Blither
@LWimsey: I don't think anything bad happens with m_write_index.load(relaxed);. The acq_rel CAS is where order matters. It's typical to use a relaxed load to set up for a CAS. Readers don't touch m_write_index, so it could only sync with other writers. In practice, the compiler will probably generate code where the first s.sequence_number.load(mo_acquire) is dependency-ordered (like what mo_consume guarantees), because the load address has a data dependency on write_index. But I think it's safe even on Alpha AXP (which doesn't enforce ordering for dependent loads). – Surrogate
@LWimsey: And I don't think it's possible for any load reordering to lead to a spurious queue-full, so at worst you get a CAS retry. Remember that the global visibility of the stores to the sequence number and to m_write_index is still ordered, so there are limits to what load reordering can observe. – Surrogate
I know this is from a long time ago, but you said that there's only one atomic operation to enqueue or push something; yet at the very start there's an atomic load, and when entering the loop there's another one, making a minimum of two atomic operations at the very least. – Ingratiating

This queue data structure is not strictly lock-free by what I consider the most reasonable definition. That definition is something like:

A structure is lock-free if and only if any thread can be indefinitely suspended at any point while still leaving the structure usable by the remaining threads.

Of course this implies a suitable definition of usable, but for most structures this is fairly simple: the structure should continue to obey its contracts and allow elements to be inserted and removed as expected.

In this case a thread that has succeeded in incrementing m_write_index, but hasn't yet written s.sequence_number, leaves the container in what will soon be an unusable state. If such a thread is killed, the container will eventually report both "full" and "empty" to push and pop respectively, violating the contract of a fixed-size queue.

There is a hidden mutex here (the combination of m_write_index and the associated s.sequence_number) - but it basically works like a per-element mutex. So the failure only becomes apparent to writers once you've looped around and a new writer tries to get the mutex, but in fact all subsequent writers have effectively failed to insert their element into the queue since no reader will ever see it.
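To make the analogy explicit, the steps of the question's enqueue map onto a lock/unlock pair like this (excerpted from the question's code, with the retry loop omitted):

// "lock": claim the slot by advancing the shared index; once this CAS
// succeeds, no other producer can touch the slot and no consumer can pop it.
m_write_index.compare_exchange_weak(write_index, write_index + 1,
                                    std::memory_order_acq_rel);

// critical section: the slot is exclusively owned while the data is copied.
s.user_data = data;

// "unlock": publish the slot; only now can a consumer pop it, and only on
// a later lap can a producer reuse it.
s.sequence_number.store(write_index + 1, std::memory_order_release);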

Now this doesn't mean this is a bad implementation of a concurrent queue. For some uses it may behave mostly as if it were lock-free. For example, this structure may have most of the useful performance properties of a truly lock-free structure, but at the same time it lacks some of the useful correctness properties. Basically, the term lock-free usually implies a whole bunch of properties, only a subset of which will usually be important for any particular use. Let's look at them one by one and see how this structure does. We'll broadly categorize them into performance and functional categories.

Performance

Uncontended Performance

The uncontended or "best case" performance is important for many structures. While you need a concurrent structure for correctness, you'll usually still try to design your application so that contention is kept to a minimum, so the uncontended cost is often important. Some lock-free structures help here, by reducing the number of expensive atomic operations in the uncontended fast-path, or avoiding a syscall.

This queue implementation does a reasonable job here: there is only a single "definitely expensive" operation: the compare_exchange_weak, and a couple of possibly expensive operations (the memory_order_acquire load and memory_order_release store)1, and little other overhead.

This compares favorably to something like std::mutex, which would imply one atomic operation for lock and another for unlock, and in practice on Linux the pthread calls have non-negligible overhead as well.
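To make the comparison concrete, a minimal mutex-based bounded queue might look like the following sketch (mine, purely as a cost baseline, not tuned):

#include <cstddef>
#include <deque>
#include <mutex>

class locked_queue
{
public:
    explicit locked_queue(std::size_t capacity) : m_capacity(capacity) {}

    bool enqueue(void* data)
    {
        std::lock_guard<std::mutex> guard(m_mutex); // one atomic RMW to lock...
        if (m_items.size() == m_capacity)
            return false; // queue is full
        m_items.push_back(data);
        return true;
    } // ...and at least one more atomic operation to unlock

    bool dequeue(void*& data)
    {
        std::lock_guard<std::mutex> guard(m_mutex);
        if (m_items.empty())
            return false; // queue is empty
        data = m_items.front();
        m_items.pop_front();
        return true;
    }

private:
    std::mutex        m_mutex;
    std::deque<void*> m_items;
    std::size_t       m_capacity;
};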

So I expect this queue to perform reasonably well in the uncontended fast-path.

Contended Performance

One advantage of lock-free structures is that they often allow better scaling when a structure is heavily contended. This isn't necessarily an inherent advantage: some lock-based structures with multiple locks or read-write locks may exhibit scaling that matches or exceeds some lock-free approaches, but it is usually the case that lock-free structures exhibit better scaling than a simple one-lock-to-rule-them-all alternative.

This queue performs reasonably in this respect. The m_write_index variable is atomically updated by all writers and will be a point of contention, but the behavior should be reasonable as long as the underlying hardware CAS implementation is reasonable.

Note that a queue is generally a fairly poor concurrent structure since inserts and removals all happen at the same places (the head and the tail), so contention is inherent in the definition of the structure. Compare this to a concurrent map, where different elements have no particular ordered relationship: such a structure can offer efficient contention-free simultaneous mutation if different elements are being accessed.

Context-switch Immunity

One performance advantage of lock-free structures that is related to the core definition above (and also to the functional guarantees) is that a context switch of a thread which is mutating the structure doesn't delay all the other mutators. In a heavily loaded system (especially when runnable threads >> available cores), a thread may be switched out for hundreds of milliseconds or seconds. During this time, any concurrent mutators will block and incur additional scheduling costs (or they will spin, which may also produce poor behavior). Even though such "unlucky scheduling" may be rare, when it does occur the entire system may incur a serious latency spike.

Lock-free structures avoid this since there is no "critical region" where a thread can be context switched out and subsequently block forward progress by other threads.

This structure offers partial protection in this area — the specifics of which depend on the queue size and application behavior. Even if a thread is switched out in the critical region between the m_write_index update and the sequence number write, other threads can continue to push elements to the queue as long as they don't wrap all the way around to the in-progress element from the stalled thread. Threads can also pop elements, but only up to the in-progress element.

While the push behavior may not be a problem for high-capacity queues, the pop behavior can be: if the queue throughput is high relative to the average time a thread is context-switched out and to the queue's average fullness, the queue will quickly appear empty to all consumer threads, even though there are many elements added beyond the in-progress element. This isn't affected by the queue capacity so much as by the application behavior. It means that the consumer side may completely stall when this occurs. In this respect, the queue doesn't look very lock-free at all!

Functional Aspects

Async Thread Termination

One advantage of lock-free structures is that they are safe for use by threads that may be asynchronously canceled or may otherwise terminate exceptionally in the critical region. Cancelling a thread at any point leaves the structure in a consistent state.

This is not the case for this queue, as described above.

Queue Access from Interrupt or Signal

A related advantage is that lock-free structures can usually be examined or mutated from an interrupt or signal. This is useful in many cases where an interrupt or signal shares a structure with regular process threads.

This queue mostly supports this use case. Even if the signal or interrupt occurs when another thread is in the critical region, the asynchronous code can still push an element onto the queue (which will only be seen later by consuming threads) and can still pop an element off of the queue.

The behavior isn't as complete as with a true lock-free structure: imagine a signal handler with a way to tell the remaining application threads (other than the interrupted one) to quiesce, which then drains all the remaining elements of the queue. With a true lock-free structure this would allow the signal handler to fully drain all the elements, but this queue might fail to do that in the case that a thread was interrupted or switched out in the critical region.


1 In particular, on x86, this will only use an atomic RMW operation for the CAS, as the memory model is strong enough to avoid the need for special atomic instructions or fencing for the other operations. Recent ARM can do acquire and release fairly efficiently as well.

Marmite answered 27/8, 2017 at 23:13 Comment(8)
I was trying to find a way for the reader and writer to use fetch_add (x86 lock xadd) instead of CAS, to avoid the need for a retry when incrementing m_write_index. But the current algo only attempts the CAS with a write_index+1 obtained before reading and checking the sequence_number of that index. fetch_add could result in multiple increments from multiple threads. And if you fetch_add before checking the sequence number, and then have to fetch_sub it to correct an overshoot on queue full, it gets super messy. In theory fetch_add is perfect to get each thread a unique idx :/ – Surrogate
The readers blocking on a stalled writer isn't affected by queue capacity, but it is affected by queue average fullness. If the queue is usually nearly empty, this will be a problem right away. But if it's usually nearly full, you have a longer window before readers get to the stalled write slot. (However, it will only be nearly empty if your system bottlenecks on producer throughput rather than consumers, so stalling consumers for a while may be ok if the queue doesn't fill up before the stalled write finishes; consumers can catch up. Of course that's bad for latency!) – Surrogate
I'm curious: could you implement a queue with equal or better performance characteristics using only traditional locking, with no atomics / fences? (Not even store or load; i.e. no C++11 data-race UB without #include <atomic>.) Pretty clearly not for the shared m_write_index / m_read_index, but what if you do that part with atomics and use locking somehow, e.g. an actual per-element semaphore? Reader tries to down() the semaphore (so it waits if the counter was zero for an unwritten element), writer up()s the semaphore? But queue-full detection is a problem. Maybe 2 locks per slot? – Surrogate
I have made concurrent structures with "unconditional" operations like lock xadd in the past, including with "overshoot correction", but unless things have changed on very recent chips it never really pays off: CAS was always about the same cost as the other lock instructions, and unless contention was off the charts the retry was more of a theoretical problem than a real one. You can efficiently track retries, and it was common to see "contended" structures with billions of inserts and only a handful of retries. Of course, "undoing overshoot" is just another type of retry! @PeterCordes – Marmite
@PeterCordes - yes, the likelihood of the queue becoming spuriously empty is a function of the queue fullness, the pop throughput, and the context-switch "off CPU" time. It doesn't need a nearly empty queue though: if the throughput and "off time" are high then you can stall even if the queue started full (so in that respect it does depend a bit on capacity). I'll update it to mention that fullness is a factor. – Marmite
@PeterCordes - well, you'd have to define "traditional locking", but some ways that locking can outperform lock-free in general are (a) using a scalable rw-lock (one where readers don't all contend for a single shared variable but split their accesses across various cache lines) to lock a single-threaded structure with lots of read access, or (b) cases where the number of atomic ops for the lock-free structure is, say, > 2 and so the locking version does less total work, or more generally when the locking can be amortized across several operations... – Marmite
..., e.g., an iteration which only has to lock once. On x86 it's a bit harder to find examples because the "read path" for many lock-free structures may not involve any atomics at all, so it's harder to beat. On the specific question of the queue, I don't see an obvious way, as the queue is already a pretty crappy structure with "inherent" contention since it orders elements. Fine-grained locking as you suggest sometimes makes structures, or operations on structures, possible that you couldn't even do with a lock-free approach, so you can't always compare apples to apples. – Marmite
Also contrast lock-free and traditional locking with userspace RCU and seqlocks, which are both potent techniques when they apply. – Marmite

I am the author of liblfds.

The OP is correct in his description of this queue.

It is the single data structure in the library which is not lock-free.

This is described in the documentation for the queue:

http://www.liblfds.org/mediawiki/index.php?title=r7.1.1:Queue_%28bounded,_many_producer,_many_consumer%29#Lock-free_Specific_Behaviour

"It must be understood though that this is not actually a lock-free data structure."

This queue is an implementation of an idea from Dmitry Vyukov (1024cores.net) and I only realised it was not lock-free while I was making the test code work.

By then it was working, so I included it.

I have given some thought to removing it, since it is not lock-free.

Suzannasuzanne answered 18/2, 2019 at 21:26 Comment(0)

Most of the time, people use lock-free when they really mean lockless. Lockless means a data structure or algorithm that does not use locks, but with no guarantee of forward progress. Also check this question. So the queue in liblfds is lockless, but as BeeOnRope mentioned it is not lock-free.

Mitrailleuse answered 18/9, 2017 at 19:41 Comment(2)
You are confusing lock-free with wait-free. – Lionel
@MaximEgorushkin nope: wait-free means forward progress for each individual thread is guaranteed, while lock-free means forward progress for the system as a whole is guaranteed, but not for individual threads; check here. Lockless means there is no guarantee of any kind of forward progress if a single thread is interrupted or exits in the middle of spinning. It requires studying some code to fully grasp the concept. – Mitrailleuse

A thread that calls POP before the next update in sequence is complete is NOT "effectively blocked" if the POP call returns FALSE immediately. The thread can go off and do something else. I'd say that this queue qualifies as lock-free.

However, I wouldn't say that it qualifies as a "queue" -- at least not the kind of queue that you could publish as a queue in a library or something -- because it doesn't guarantee a lot of the behaviors that you can normally expect from a queue. In particular, you can PUSH an element and then try, and FAIL, to POP it, because some other thread is still busy pushing an earlier item.

Even so, this queue could still be useful in some lock-free solutions for various problems.

For many applications, however, I would worry about the possibility for consumer threads to be starved while a producer thread is pre-empted. Maybe liblfds does something about that?

Author answered 27/8, 2017 at 17:29 Comment(3)
Some lock libraries have a trylock function which returns failure if the lock wasn't available, but being able to do something else in that thread doesn't mean the queue algorithm is lock-free. That requires that at least one thread in the system can make forward progress on that algorithm at any time. en.wikipedia.org/wiki/Non-blocking_algorithm#Lock-freedom. – Surrogate
The queue in question is lock-free, but that property doesn't automatically extend to algorithms that use it. In the same way you can make a spin lock from lock-free primitives, you can get threads to block each other using this queue if you want. – Author
The push and pop algorithms are separately lock-free, but the queue as a whole is not even obstruction-free. A writer that sleeps indefinitely between the CAS and the release-store eventually leaves both writes and reads blocked (effective queue size of zero), as BeeOnRope explains nicely. At this point it degenerates into a trylock situation. I think in practice it will be vanishingly rare for a process to use up its timeslice between the CAS and the store. You make a good point that a blocked writer can block readers right away if the queue is near-empty. That could be a problem in practice. – Surrogate

"Lock-free" is a property of the algorithm, which implements some functionality. The property doesn't correlate with a way, how given functionality is used by a program.

As for the mcmp_queue::enqueue function, which returns FALSE if the underlying queue is full, its implementation (given in the question post) is lock-free.

However, implementing mcmp_queue::dequeue in a lock-free manner would be difficult. E.g., this pattern is obviously not lock-free, as it spins on a variable changed by another thread:

while (s.sequence_number.load(std::memory_order_acquire) == read_index)
    ; // spin until the producer publishes the slot
data = s.user_data;
...
return data;
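The queue from the question avoids the spin by reporting failure instead of waiting, along these lines:

if (s.sequence_number.load(std::memory_order_acquire) != read_index + 1)
    return false; // the slot at read_index has not been published yet
data = s.user_data;

This variant returns promptly rather than spinning, though whether "returning empty" counts as forward progress is exactly what the question is debating.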
Wanting answered 27/8, 2017 at 21:9 Comment(5)
I don't think this is a valid assessment of the functionality offered by the queue. If a thread is killed after reserving its slot but before writing it, the queue is essentially broken. It will report that it is both simultaneously full and empty, even to a single thread. So unless the functionality of this queue involves the caveat that it may permanently appear to have a fixed length of zero (making it useless), the queue doesn't offer its functionality in a lock-free way. – Marmite
If a thread is killed after reserving its slot but before writing it, the queue is essentially broken. It will report that it is both simultaneously full and empty, even to a single thread. - That would mean the algorithm itself is incorrect, because it reports fullness when it shouldn't (or reports emptiness when it shouldn't). No reason to check an incorrect algorithm for lock-freedom. – Wanting
Yes, I agree we're in a grey area here. One could argue, however, that the authors are perfectly aware of this behavior, and so the queue is behaving as designed. So is it a lock-free queue? It certainly doesn't operate exactly as you'd expect a lock-free queue to, but is that because it is not lock-free, or because it is not a queue, or because it is incorrect? At this point it is mostly semantics, but I think you can still say that a structure that is permanently impacted by indefinite suspension of any thread should not qualify as lock-free. – Marmite
Yes, it is actually an incomplete question, which doesn't show the dequeue function. Without it, it is difficult to say anything even about the correctness of enqueue. But if we talk only about enqueue and assume its correctness, the function by itself seems not to suffer from "indefinite suspension" when run concurrently with itself: every iteration takes a new m_write_index. So the enqueue function itself is lock-free. But I agree that the dequeue function probably cannot be implemented in a lock-free manner, so the whole data structure is unlikely to be lock-free. – Wanting
Yes, I think the implementation of dequeue is implied by the enqueue code and is fairly obvious (in fact, relevant parts of the dequeue operation are described, but only in prose), but even looking only at enqueue you can see, in the case of infinite suspension, that the enqueue method will stop actually adding elements after m_num_slots additional elements have been added. So regardless of how many elements have been popped, the method will "fail" eventually. Finally, my assumption is that the question is about the structure, despite the fact that only the enqueue function was provided. – Marmite

I did formal verification on this same code using Spin a couple years ago for a course in concurrency testing and it is definitely not lock-free.

Just because there is no explicit "locking" doesn't mean it's lock-free. When it comes to reasoning about progress conditions, think of it from an individual thread's perspective:

  • Blocking/locking: if another thread gets descheduled and this can block my progress, then it is blocking.

  • Lock-free/non-blocking: if I am able to eventually make progress in the absence of contention from other threads, then it is at most lock-free.

  • If no other thread can block my progress indefinitely, then it is wait-free.

Wedlock answered 29/3, 2019 at 15:24 Comment(0)

I agree that lock-free is more than just not using mutexes. A spinlock is still a lock.

That said, I think the implementation is lock-free in any practical use case that matters. At the very least it cannot block other writers if the capacity is large enough for the use case. If you choose the capacity too small, that's a fault in system design. If estimating the required capacity is too difficult, perhaps you need a real queue and not a circular buffer.

It also won't block readers from reading available data. You can only say that there can be latency before written (but not yet published) data becomes available to readers. But from the readers' perspective it is the same as if the writes had simply happened at a later time. The worst that can happen is that the circular buffer is empty for a bit longer, so you can't really call that blocking.

The writer needs some way to communicate when the reader is allowed to start reading the written data, and that's what the seqnum.store_release() is for. Before that, you simply have to act as if the data isn't there yet, and if the pop() function is nonblocking and just returns false, the reader thread can decide to do other useful work.

Of course, my assumption is that a thread doesn't go to sleep for an hour between the m_write_index.compare_exchange() and the seqnum.store_release(). This seems plausible enough since the critical section is just straight procedural code with no loops and no potentially blocking calls. If any thread could just randomly go to sleep for an hour, I would expect there to be a lot more complaints that the system's scheduler is broken.

Cephalad answered 25/3 at 0:30 Comment(5)
There are downsides (TLB / cache) to huge buffers, so you'd want to tune for a reasonable size, not the absolute worst case. With just scheduling, a thread could go to sleep for tens of milliseconds. But under high memory pressure and I/O load (close to thrashing), code and/or data could be paged out, leading to several times that much delay on the page fault. I guess probably data, since paging out code would affect all readers or all writers. And it could be on an SSD or USB flash drive that's in the middle of a slow rebalance, or an SMR rotational drive whose buffer got full and blocks. – Surrogate
So there can be plausible use cases where a page fault blocks for multiple seconds. That could be long enough for a high-throughput reasonably-sized buffer to fill up. A truly lock-free queue wouldn't let one half-written entry block all others, so it would degrade more gracefully. This is good enough for most use cases, because as you say it's lock-free when operating in its normal regime, but I can imagine getting to its limits with a reasonably-sized queue and paged virtual memory. – Surrogate
@PeterCordes, I would say if you choose to use a circular buffer, then either a) you know the worst-case capacity needed (not just a reasonable capacity), or b) you accept blocking when full. Blocking when empty is non-negotiable: you can't read what hasn't been published. That said, with your examples of causes of delay, could this slow just a single writer thread, whilst other writers still run at full speed? Ultimately, if you want to fix this, the reserved slot must be usable by other writers. The cure could be worse than the disease, in that the 'improved' version may have worse performance. – Cephalad
Yes, exactly. You accept blocking in pathological cases in exchange for more performance in typical cases. As en.wikipedia.org/wiki/Non-blocking_algorithm#Lock-freedom notes, having other threads be able to abort or even assist another thread's partial operation typically has big performance costs, even when that doesn't actually happen. So yes, you'd need more bookkeeping, and more accesses to cache lines that were recently written by other threads, all very bad for the normal case. – Surrogate
For a nasty page fault to block one thread but not others, note that there's a memcpy or struct assignment from source data into the queue while it's half-written. Consider a case where the data being added to the queue is from a page that got paged out to a slow disk while many big file-copy operations are in progress, so there are long queues for I/O, but memory pressure isn't extreme, so other threads aren't also having their pages paged out. (In the code in the question, s.user_data is just a void*, but I was thinking of an alternate version with a small struct or array.) – Surrogate
