Anecdotally, I've found that a lot of programmers mistakenly believe that "lock-free" simply means "concurrent programming without mutexes". Usually, there's also a correlated misunderstanding that the purpose of writing lock-free code is for better concurrent performance. Of course, the correct definition of lock-free is actually about progress guarantees. A lock-free algorithm guarantees that at least one thread is able to make forward progress regardless of what any other threads are doing.
This means a lock-free algorithm can never have code where one thread is depending on another thread in order to proceed. E.g., lock-free code can not have a situation where Thread A sets a flag, and then Thread B keeps looping while waiting for Thread A to unset the flag. Code like that is basically implementing a lock (or what I would call a mutex in disguise).
However, other cases are more subtle and there are some cases where I honestly can't really tell if an algorithm qualifies as lock-free or not, because the notion of "making progress" sometimes appears subjective to me.
One such case is in the (well-regarded, afaik) concurrency library, liblfds. I was studying the implementation of a multi-producer/multi-consumer bounded queue in liblfds - the implementation is very straightforward, but I cannot really tell if it should qualify as lock-free.
The relevant algorithm is in lfds711_queue_bmm_enqueue.c
. Liblfds uses custom atomics and memory barriers, but the algorithm is simple enough for me to describe in a paragraph or so.
The queue itself is a bounded contiguous array (ringbuffer). There is a shared read_index
and write_index
. Each slot in the queue contains a field for user-data, and a sequence_number
value, which is basically like an epoch counter. (This avoids ABA issues).
The PUSH algorithm is as follows:
- Atomically LOAD the
write_index
- Attempt to reserve a slot in the queue at
write_index % queue_size
using a CompareAndSwap loop that attempts to setwrite_index
towrite_index + 1
. - If the CompareAndSwap is successful, copy the user data into the reserved slot.
- Finally, update the
sequence_index
on the slot by making it equal towrite_index + 1
.
The actual source code uses custom atomics and memory barriers, so for further clarity about this algorithm I've briefly translated it into (untested) standard C++ atomics for better readability, as follows:
bool mcmp_queue::enqueue(void* data)
{
int write_index = m_write_index.load(std::memory_order_relaxed);
for (;;)
{
slot& s = m_slots[write_index % m_num_slots];
int sequence_number = s.sequence_number.load(std::memory_order_acquire);
int difference = sequence_number - write_index;
if (difference == 0)
{
if (m_write_index.compare_exchange_weak(
write_index,
write_index + 1,
std::memory_order_acq_rel
))
{
break;
}
}
if (difference < 0) return false; // queue is full
}
// Copy user-data and update sequence number
//
s.user_data = data;
s.sequence_number.store(write_index + 1, std::memory_order_release);
return true;
}
Now, a thread that wants to POP an element from the slot at read_index
will not be able to do so until it observes that the slot's sequence_number
is equal to read_index + 1
.
Okay, so there are no mutexes here, and the algorithm likely performs well (it's only a single CAS for PUSH and POP), but is this lock-free? The reason it's unclear to me is because the definition of "making progress" seems murky when there is the possibility that a PUSH or POP can always just fail if the queue is observed to be full or empty.
But what's questionable to me is that the PUSH algorithm essentially reserves a slot, meaning that the slot can never be POP'd until the push thread gets around to updating the sequence number. This means that a POP thread that wants to pop a value depends on the PUSH thread having completed the operation. Otherwise, the POP thread will always return false
because it thinks the queue is EMPTY. It seems debatable to me whether this actually falls within the definition of "making progress".
Generally, truly lock-free algorithms involve a phase where a pre-empted thread actually tries to ASSIST the other thread in completing an operation. So, in order to be truly lock-free, I would think that a POP thread that observes an in-progress PUSH would actually need to try and complete the PUSH, and then only after that, perform the original POP operation. If the POP thread simply returns that the queue is EMPTY when a PUSH is in progress, the POP thread is basically blocked until the PUSH thread completes the operation. If the PUSH thread dies, or goes to sleep for 1,000 years, or otherwise gets scheduled into oblivion, the POP thread can do nothing except continuously report that the queue is EMPTY.
So does this fit the defintion of lock-free? From one perspective, you can argue that the POP thread can always make progress, because it can always report that the queue is EMPTY (which is at least some form of progress I guess.) But to me, this isn't really making progress, since the only reason the queue is observed as empty is because we are blocked by a concurrent PUSH operation.
So, my question is: is this algorithm truly lock-free? Or is the index reservation system basically a mutex in disguise?
read_index
andwrite_index
. In practice, an algorithm doesn't have to be 100% lock-free to be useful or performant. It just has to not actually block with a large enough queue on real HW + OS. – Surrogatem_write_index
the point at which the element is inserted, but instead treat it as a "hint" of where to insert. The actual important operation then is a CAS ofs.data
from NULL (or some other reserved value) to the user data. At that single point the value is considered "in" the structure and thenm_write_index
can be updated by the pushing thread, but even if it doesn't get updated, the structure keeps working. – Marmitem_write_index
happens to point to a full slot and just look for the next empty slot and proceed as usual, leading to the cooperative behavior that's often necessary for something to be lock free as mentioned by the OP. Note that I'm not making any claim that this structure will perform as well than the one above in practical cases (but neither is it too terrible), but it should still be relatively efficient. – MarmiteLFDS711_BACKOFF_EXPONENTIAL_BACKOFF( qbmms->dequeue_backoff, backoff_iteration );
IDK what that does, but it may invoke OS support after some tries. That's not really the point, though. The point is that if you find that you could get equivalent behaviour with an actual mutex, should you use a mutex? Or are you still gaining anything by using atomics. (In this case I think you are.) – Surrogateload
should use at leastacquire
ordering – Blitherm_write_index.load(relaxed);
. Theacq_rel
CAS is where order matters. It's typical to use a relaxed load to set up for a CAS. Readers don't touchm_write_index
, so it could only sync with other writers. In practice, the compiler will probably generate code where the firsts.sequence_number.load(mo_acquire)
is dependency-ordered (like whatmo_consume
guarantees) because the load address has a data dependency onwrite_index
. But I think it's safe even on Alpha AXP (which doesn't enforce ordering for dependent loads). – Surrogatem_write_index
are still ordered, so there are limits to what load-reordering can observe. – Surrogate