Lock-free stack with freelist: why don't the next pointers need to be atomic?

A lock-free stack can be implemented as a singly linked list. This seems simple until we have to think about what to do with nodes after they've been popped. One strategy is to simply move them to a per-stack LIFO freelist (from which nodes can be reused by subsequent push operations) until eventually all threads are done with the stack, at which point a single thread destroys all nodes in the stack and all nodes in the freelist. Boost.Lockfree uses this strategy. So does Chris Wellons's C11 implementation. I will refer to the latter, because it's easier to read and the details are essentially the same, since C11 atomics are very similar to C++11 atomics.

In Wellons's implementation, which can be found on GitHub here, all lstack_node objects are non-atomic. In particular, this means that all accesses to the next member of an lstack_node object are non-atomic. What I am unable to understand is: why is it that such accesses never race with each other?
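
For reference, the data layout is roughly the following (my paraphrase from the description here, with approximate names; see the linked source for the exact definitions):

    #include <stdatomic.h>
    #include <stddef.h>
    #include <stdint.h>

    struct lstack_node {
        void *value;
        struct lstack_node *next;     /* plain, non-atomic pointer: the member in question */
    };

    struct lstack_head {
        uintptr_t aba;                /* generation counter guarding against the ABA problem */
        struct lstack_node *node;     /* top node of the stack or of the freelist */
    };

    typedef struct {
        _Atomic struct lstack_head head, free;   /* stack top and freelist top, both atomic */
        _Atomic size_t size;                     /* tracked separately; not relevant to the race */
        /* ... plus a preallocated buffer of nodes ... */
    } lstack_t;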

The next member is read at lstack.c:30. It is written at lstack.c:39. If these two lines can execute concurrently on the same lstack_node object, then the program contains a race. Is this possible? It seems possible to me:

  • Thread 1 calls lstack_pop, which calls pop. It atomically loads the stack head into the local variable orig. Now, orig.node is a pointer to the node that was at the top of the stack just now. (Note that up until this point, only local variables have been modified, so it is impossible for anything that thread 1 has done so far to make a CAS fail in any other thread.) Meanwhile...
  • Thread 2 calls lstack_pop. pop succeeds and returns node, a pointer to the node that has just been excised from the stack; this is the same node that orig.node points to in thread 1. It then begins to call push in order to add node to the freelist. The freelist head node is atomically loaded, and node->next is set to point to the first node in the freelist.
  • Oops. When thread 1 now proceeds to read orig.node->next inside its pop loop, that read races with thread 2's write of the same next member (see the sketch below).
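
For concreteness, here is a rough paraphrase of the two helpers involved, using the types sketched above (reconstructed from the description, not copied from lstack.c; the marked lines are the read and the write of next referred to above):

    static struct lstack_node *pop(_Atomic struct lstack_head *head)
    {
        struct lstack_head next, orig = atomic_load(head);
        do {
            if (orig.node == NULL)
                return NULL;                      /* stack is empty */
            next.aba = orig.aba + 1;
            next.node = orig.node->next;          /* <-- the non-atomic READ of next */
        } while (!atomic_compare_exchange_weak(head, &orig, next));
        return orig.node;                         /* CAS succeeded: this thread now owns the node */
    }

    static void push(_Atomic struct lstack_head *head, struct lstack_node *node)
    {
        struct lstack_head next, orig = atomic_load(head);
        do {
            node->next = orig.node;               /* <-- the non-atomic WRITE of next */
            next.aba = orig.aba + 1;
            next.node = node;
        } while (!atomic_compare_exchange_weak(head, &orig, next));
    }

In the scenario above, thread 1 executes the READ line in pop on the same node on which thread 2, having already won its own CAS, executes the WRITE line in push; thread 1's CAS will subsequently fail, but the two accesses still overlap.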

Could Wellons's implementation simply be wrong? I doubt it. If his implementation is wrong, then so is the Boost one, because the only way to fix (what appears to me to be) the race condition is to make next atomic. But I don't think the Boost implementation could be wrong in such a basic way without it having been noticed and fixed by now. So I must have made a mistake in my reasoning.

Lydgate answered 2/3, 2022 at 2:56 Comment(1)
Maybe the algorithm assumed atomic operations on machine pointers. It marks larger-than-pointer structs as atomic, and size_t, which I can believe is 64 bits on some systems with 32-bit machine words.Gladstone

The key thing to note is that the next fields are read-only for every node that is currently in a linked list. The next field can only be modified once the node has been successfully removed from a linked list. Once that happens, the thread that removed it 'owns' it and no one else can sensibly look at it (they might read a value, but that value will be thrown away when their compare_and_set fails). So that owning thread can safely modify the next field as part of pushing it onto another list.

In your hypothetical, you're missing the fact that the two pops (done by the two threads) cannot both succeed and return the same node. If two threads try to simultaneously pop, they might get the same node pointer, but one will fail in the compare_and_set atomic instruction and will loop back with a different node pointer.

This does require that read/write races are "safe" -- that is, when two threads race on a read and a write, the reader might get any value but will not otherwise fail (no trap or other undefined behavior) and will not interfere with the write. That tends to be the case on most (all?) hardware. As long as the reader does not depend on the value read during a race, you can detect the race after the fact and disregard that value.
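
If you want exactly this behavior without a formal data race in the C11/C++11 sense, the minimal change is to make only the two racing accesses to next atomic with relaxed ordering. A sketch of that variation, adapted from the loops paraphrased in the question (this is not how the cited implementations are actually written):

    #include <stdatomic.h>
    #include <stdint.h>

    struct lstack_node {
        void *value;
        _Atomic(struct lstack_node *) next;   /* atomic only to remove the formal data race */
    };

    struct lstack_head {
        uintptr_t aba;
        struct lstack_node *node;
    };

    static struct lstack_node *pop(_Atomic struct lstack_head *head)
    {
        struct lstack_head next, orig = atomic_load(head);
        do {
            if (orig.node == NULL)
                return NULL;
            next.aba = orig.aba + 1;
            /* Relaxed load: the value may come from a concurrent writer, but it is
               only used if the CAS below succeeds, i.e. nobody removed orig.node. */
            next.node = atomic_load_explicit(&orig.node->next, memory_order_relaxed);
        } while (!atomic_compare_exchange_weak(head, &orig, next));
        return orig.node;
    }

    static void push(_Atomic struct lstack_head *head, struct lstack_node *node)
    {
        struct lstack_head next, orig = atomic_load(head);
        do {
            /* Relaxed store: node is exclusively owned by the pushing thread; a racing
               reader's CAS on head will fail, so whatever value it saw is discarded. */
            atomic_store_explicit(&node->next, orig.node, memory_order_relaxed);
            next.aba = orig.aba + 1;
            next.node = node;
        } while (!atomic_compare_exchange_weak(head, &orig, next));
    }

On typical hardware a relaxed pointer load or store compiles to an ordinary load or store, so this costs essentially nothing; it merely removes the undefined behavior discussed in the comments below.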

Magic answered 2/3, 2022 at 23:21 Comment(12)
"The next can only be modified when the node has been successfully removed from a linked list". Well, in my example, the node has been removed from the stack list and the thread that succeeded at doing so is in the process of adding it to the free list (which involves modifying next). But the other thread is still reading next. It is true that the other thread will not acquire ownership of the node because the CAS will fail, but the race occurs before that.Lydgate
True, but the race just means that it gets some value that it will use as the new value in the atomic_compare_exchange, and that value will be disregarded because the compare part will fail.Magic
The problem is that the mere existence of the race (a concurrent read and write of the same memory location) causes the program to have undefined behavior.Lydgate
@FrockLee: In ISO C++ yes, that's true. But it's more or less safe in practice on real hardware, as long as the memory hasn't actually been deallocated (potentially unmapped). See is a concurrent write and read to a non-atomic variable of fundamental type without using it undefined behavior? - It's a common real-world technique, used in stuff like a SeqLock. OTOH, clang -fsanitize=thread is a real-world C++ implementation which does some race detection, so might trip up on this.Jingoism
@PeterCordes The point is that if the code is in C or C++, we have to adhere to the rules of the language. And as @FrockLee has correctly pointed out, even if the value read from next is discarded because the CAS is doomed to fail, it is still a data race as defined in the standard. It might be safe on real hardware, but only as long as the compiler actually produces the correct code. And while this might be the case right now and probably also in future versions, there is simply no guarantee, because once you have UB in your code, all bets are off!Suicide
FWIW - correct SeqLock implementations in C++20 use atomic_ref to avoid just that.Suicide
@mpoeter: Depending on which set of implementations you care about, you don't need to limit yourself to just ISO C++. You can use behaviour that implementations define. For mainstream compilers that includes at least volatile being thread-safe (at least compilers that can compile the Linux kernel) and possibly compiling to more efficient asm for a wide read, for example movups or 2x movq instead of lock cmpxchg16b on x86-64. It is unfortunately a tradeoff between potential future-proofing vs. performance, though, when ISO C++ gets in the way and doesn't define enough stuff.Jingoism
@Suicide atomic_ref doesn't help you avoid UB, does it? The writers and readers both need to do atomic accesses to it to avoid data-race UB. So when would you ever be able to do non-atomic access? I assume the C++ only defines the behaviour if the writer and readers access the same memory using the same width atomic_ref, not like writer using 4-byte writes and reader using 8-byte atomic reads. I guess you could plausibly pick a type width to access with based on run-time CPU detection instead of compile-time, but the efficient atomic access width is usually a compile-time thing.Jingoism
(The problem with atomic or atomic_ref is that it forces you to tell the compiler this is 4 chunks of 4 bytes, for example, or 2 chunks of 8, potentially forcing it to do a more expensive 8-byte atomic load on a 32-bit machine, or gimping itself on a 64-bit machine. Modulo some clunky parameterized type stuff and adapting between the access width you pick vs. the actual data types. When really you want to let it choose, like it would to some degree for volatile.)Jingoism
@PeterCordes atomic_ref can help you avoid data races (if used correctly). If you have conflicting operations, then they must all be atomic (e.g. using atomic_ref). If they are not conflicting (i.e., they are ordered by happens-before), then simple non-atomic access is fine. Regarding seqlock and atomic_ref see open-std.org/jtc1/sc22/wg21/docs/papers/2019/p1478r1.htmlSuicide
@mpoeter: ok yes, the proposed atomic_source_memcpy is exactly what we want. It behaves roughly as if you used atomic_ref to read or write each byte separately, but without the garbage performance that would entail. Compilers don't optimize atomics (for now), so you'd be forcing the compiler to emit asm that loads 1 byte at a time. The exact opposite of what you want. If you were going to use existing C++ tools, you'd probably use atomic<unsigned long> chunks for your data, with memcpy between unsigned long temporaries and the actual structs or whatever if they aren't that type.Jingoism
atomic_ref (or atomic_source_memcpy) are much stronger (and thus probably more expensive) than what is really needed -- all that is needed is that whenever there is a race between 1 writer and 1 or more readers, the readers just get unspecified values (not necessarily the value before or after the write -- could be anything. Even if the writer is writing an unchanged value).Magic

I just wrote a long text trying to explain why there cannot be a race, until I took a closer look at how Wellons implemented the free list, and I came to the conclusion that you are correct!

The important point here is what you mentioned at the very beginning of your question:

This seems simple until we have to think about what to do with nodes after they've been popped. One strategy is to simply move them to a per-stack LIFO freelist until eventually all threads are done with the stack, at which point a single thread destroys all nodes in the stack and all nodes in the freelist.

But that is not how the freelist in Wellons's implementation works! Instead, it tries to reuse nodes from the freelist, and then next also needs to be atomic, as you correctly observed. If the freelist had been implemented as you described, i.e., popped nodes were added to some freelist unaltered (the freelist using a different pointer member than next) and only released once the stack is no longer used by any thread, then next could be a plain variable, since it would not change once the node has been pushed successfully.
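
To illustrate the distinction, here is a minimal sketch of the variant described above (hypothetical names, not code from either library): retired nodes are linked through a separate member and never handed out again, so the plain next member is never written once a node has been published on a list.

    #include <stdatomic.h>

    struct lstack_node {
        void *value;
        struct lstack_node *next;        /* plain: never modified after the node is on a list */
        struct lstack_node *free_next;   /* hypothetical member linking retired nodes */
    };

    /* Retire a node that the calling thread just popped.  No other thread reads
       free_next until a single thread tears the whole structure down, so it can
       stay non-atomic as well. */
    static void retire(_Atomic(struct lstack_node *) *freelist, struct lstack_node *node)
    {
        struct lstack_node *head = atomic_load(freelist);
        do {
            node->free_next = head;      /* node is exclusively owned by this thread */
        } while (!atomic_compare_exchange_weak(freelist, &head, node));
    }

A reader in pop that still holds a stale pointer to a retired node may still read its next member, but since nothing ever writes next again, there is no racing write and no need for next to be atomic.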

That does not necessarily mean that the Boost lock-free stack is also incorrect, but I don't know the code well enough to make a qualified statement about the Boost implementation.

FWIW - this is usually referred to as the memory reclamation problem. This freelist approach is a simple solution, though usually not practical for real-world scenarios. For a real-world scenario you probably want to use a memory reclamation scheme like hazard pointers or epoch-based reclamation. You can take a look at my xenium library, where I have implemented several different reclamation schemes as well as lock-free data structures that use them. More information about the memory reclamation problem and my implementations in xenium can also be found in my thesis Effective memory reclamation for lock-free data structures in C++.

Suicide answered 2/3, 2022 at 13:15 Comment(3)
I'll update the question to clarify. To me, a "freelist" inherently contains resources that are available for reuse.Lydgate
Good stuff. About xenium, I can't find a word about patented reclamation algorithms, which seemed to be quite a problem until recently (when I last looked at github.com/khizmax/libcds e.g.). Are you aware of the patent status/licensing concerns?Sirree
@Sirree unfortunately I do not know much about patent status/licenses. I know that there is a patent application for hazard pointers, but AFAIK this was abandoned.Suicide
