How can I implement ABA counter with c++11 CAS?
Asked Answered
A

1

7

I am implementing a lock-free queue based on this algorithm, which uses a counter to solve the ABA problem. But I don't know how to implement this counter with c++11 CAS. For example, from the algorithm:

E9:    if CAS(&tail.ptr->next, next, <node, next.count+1>)

It is an atomic operation, meaning if tail.ptr->next is equal to next, let tail.ptr->next point to node and simultaneously (atomically) make next.count+1. However, using C++11 CAS, I can only implement:

std::atomic_compare_exchange_weak(&tail.ptr->next, next, node);

which cannot make next.count+1 simultaneously happen.

Argon answered 16/8, 2016 at 20:44 Comment(10)
OK, you've pointed out that you can't do this with CAS. What's your question?Tide
@Tide The question is how to do this in c++11, if it is possible.Argon
I'm not even sure what the pseudo-code is doing. What does it mean to swap with a tuplle <nod, next.count+1>?Tide
You can atomically compare-exchange a whole struct, instead of just a pointer. x86 / x86-64 has hardware support for swapping things that are the size of two pointers, but no larger, so be careful with this or your code will actually use a lock. See for example https://mcmap.net/q/22932/-the-same-instances-of-the-same-class-but-different-behaviour-probable-ub/224132, a Q&A about compare_exchange on a struct with padding causing them to not be bitwise equal.Edifice
@PeterCordes it's not about CAS a whole struct. The problem is to execute a CAS operation and a plus operation in one atomic operation.Argon
@BugRepairMan: The whole idea of the ABA problem is that it is impossible to operate on two memory addresses atomically.Wherefrom
Do you have a link on something that explains what the brackets mean in the expression CAS(&tail.ptr->next, next, <node, next.count+1>) please ?Tug
@BugRepairMan: If you want to modify two things at once as a single atomic operation, your best option is to put them together into a struct and CAS that, because real hardware can do that. Your other options are using a lock, or using en.wikipedia.org/wiki/Transactional_memory (which is only available in hardware on very recent Intel CPUs).Edifice
@Tug The link above in the question includes a article which provides the original algorithm.Argon
@PeterCordes You are right. This problem requires DWCAS(cmpxchg16b) instructions instead of just CAS. As C++ does not provide this API, some one has implemented this link. Thanks.Argon
E
23

To atomically modify two things at once with a single atomic operation, you need to put them in adjacent memory, e.g. in a two-member struct. Then you can use std::atomic<my_struct> to get gcc to emit lock cmpxchg16b on x86-64, for example.

You don't need inline asm for this, and it's worth a bit of C++ syntax pain to avoid it. https://gcc.gnu.org/wiki/DontUseInlineAsm.

Unfortunately, with current compilers, you need to use a union to get efficient code for reading just one of the pair. The "obvious" way of doing an atomic load of the struct and then only using one member still leads to a lock cmpxchg16b to read the whole struct even though we only need one member. (Much slower, and dirties the cache line so readers contend with other readers). I'm confident that a normal 64b load of the pointer would still correctly implement the acquire memory-ordering semantics on x86 (as well as atomicity), but current compilers don't do that optimization even for std::memory_order_relaxed, so we trick them into it with a union.

(submitted GCC bug 80835 about this. TODO: same for clang if this is a useful idea.)


Checklist:

  • Make sure your compiler is generating efficient code for loading just one member in the read-only case, not a lock cmpxchg16b of the pair. e.g. using a union.

  • Make sure your compiler guarantees that accessing one member of a union after writing a different union member has well-defined behaviour in that implementation. Union type-punning is legal in in C99 (so this should work well with C11 stdatomic), but it's UB in ISO C++11. However, it's legal in the GNU dialect of C++ (supported by gcc, clang, and ICC, among others).

  • Make sure your object is 16B-aligned, or 8B-aligned for 32-bit pointers. More generally, alignas(2*sizeof(void*)) should work. Misaligned locked instructions can be very slow on x86, especially if they cross a cache-line boundary. clang3.8 even compiles it to a library call if the object isn't aligned.

  • Compile with -mcx16 for x86-64 builds. cmpxchg16b was not supported by the earliest x86-64 CPUs (AMD K8), but should be on everything after that. Without -mcx16, you get a library function call (which probably uses a global lock). The 32-bit equivalent, cmpxchg8b, is old enough that modern compilers assume its support. (And can use SSE, MMX, or even x87 to do 64b atomic loads/stores, so using a union is somewhat less important for good performance when reading one member).

  • Make sure that a pointer+uintptr_t atomic object is lock-free. This is pretty much guaranteed for x32 and 32-bit ABIs (8B object), but not for 16B objects. e.g. MSVC uses a lock for x86-64.

    gcc7 and later will call libatomic instead of inlining lock cmpxchg16b, and will return false from atomic_is_lock_free (for reasons including that it's so slow it's not what users expect is_lock_free to mean), but at least for now the libatomic implementation still uses lock cmpxchg16b on targets where that instruction is available. (It can even segfault for read-only atomic objects, so it's really not ideal. More importantly, readers contend with other readers for exclusive access to the cache line. That's why we're going to so much trouble to avoid lock cmpxchg16b for the read side here when we only want one 8-byte half.)


Here's an example of code with a CAS retry-loop that compiles to asm that looks right, and I think is free of UB or other unsafe C++ for implementations that allow union type punning. It's written in C style (non-member functions, etc.) but it would be the same if you wrote member functions.

See the code with asm output from gcc6.3 on the Godbolt compiler explorer. With -m32, it uses cmpxchg8b the same way 64-bit code uses cmpxchg16b. With -mx32 (32-bit pointers in long mode) it can simply use a 64-bit cmpxchg, and normal 64-bit integer loads to grab both members in one atomic load.

This is portable C++11 (except the union type-punning), with nothing x86-specific. It's only efficient on targets that can CAS an object the size of two pointers, though. e.g. it compiles to a call to a __atomic_compare_exchange_16 library function for ARM / ARM64 and MIPS64, as you can see on Godbolt.

It doesn't compile on MSVC, where atomic<counted_ptr> is larger than counted_ptr_separate, so the static_assert catches it. Presumably MSVC includes a lock member in the atomic object.

#include <atomic>
#include <stdint.h>

using namespace std;

struct node {
  // This alignas is essential for clang to use cmpxchg16b instead of a function call
  // Apparently just having it on the union member isn't enough.
  struct alignas(2*sizeof(node*)) counted_ptr {
    node *    ptr;
    uintptr_t count;  // use pointer-sized integers to avoid padding
  };
  
  // hack to allow reading just the pointer without lock-cmpxchg16b,
  // but still without any C++ data race
  struct counted_ptr_separate {
    atomic<node *>    ptr;
    atomic<uintptr_t> count_separate;  // var name emphasizes that accessing this way isn't atomic with ptr
  };

  static_assert(sizeof(atomic<counted_ptr>) == sizeof(counted_ptr_separate), "atomic<counted_ptr> isn't the same size as the separate version; union type-punning will be bogus");
  //static_assert(std::atomic<counted_ptr>{}.is_lock_free());
  
  union {  // anonymous union: the members are directly part of struct node
    alignas(2*sizeof(node*)) atomic<counted_ptr> next_and_count;
    counted_ptr_separate  next;
  };
  // TODO: write member functions to read next.ptr or read/write next_and_count

  int data[4];
};


// make sure read-only access is efficient.
node *follow(node *p) {   // good asm, just a mov load
  return p->next.ptr.load(memory_order_acquire);
}
node *follow_nounion(node *p) {  // really bad asm, using cmpxchg16b to load the whole thing
  return p->next_and_count.load(memory_order_acquire).ptr;
}


void update_next(node &target, node *desired)
{
  // read the old value efficiently to avoid overhead for the no-contention case
  // tearing (or stale data from a relaxed load) will just lead to a retry
  node::counted_ptr expected = {
      target.next.ptr.load(memory_order_relaxed),
      target.next.count_separate.load(memory_order_relaxed) };

  bool success;
  do {
    node::counted_ptr newval = { desired, expected.count + 1 };
    // x86-64: compiles to cmpxchg16b
    success = target.next_and_count.compare_exchange_weak(
                           expected, newval, memory_order_acq_rel);
    // updates exected on failure
  } while( !success );
}

The asm output from clang 4.0 -O3 -mcx16 is:

update_next(node&, node*):
    push    rbx             # cmpxchg16b uses rbx implicitly so it has to be saved/restored
    mov     rbx, rsi
    mov     rax, qword ptr [rdi]     # load the pointer
    mov     rdx, qword ptr [rdi + 8] # load the counter
.LBB2_1:                        # =>This Inner Loop Header: Depth=1
    lea     rcx, [rdx + 1]
    lock
    cmpxchg16b      xmmword ptr [rdi]
    jne     .LBB2_1
    pop     rbx
    ret

gcc does some clunky store/reloads, but is basically the same logic.

follow(node*) compiles to mov rax, [rdi] / ret, so read-only access to the pointer is as cheap as it should be, thanks to the union hack.


It does depend on writing a union through one member and reading it through another, to do efficient reads of just the pointer without using a lock cmpxchg16b. This is guaranteed to work in GNU C++ (and ISO C99/C11), but not in ISO C++. Many other C++ compilers do guarantee that union type-punning works, but even without that it would probably still work: We're always using std::atomic loads which have to assume that the value was modified asynchronously. So we should be immune from aliasing-like problems where values in registers are still considered live after writing the value through another pointer (or union member). Compile-time reordering of things the compiler thinks are independent could be a problem, though.

Atomically reading just the pointer after an atomic cmpxchg of the pointer+counter should still give you acquire/release semantics on x86, but I don't think ISO C++ says anything about it. I would guess that a wide release-store (as part of the compare_exchange_weak will synchronize with a narrower load from the same address on most architectures (like it does on x86), but AFAIK C++ std::atomic doesn't guarantee anything about type-punning.

Not relevant for pointer + ABA-counter, but could come up for other applications of using a union to allow access to subsets of larger atomic object: Don't use the union to allow atomic stores to just the pointer or just the counter. At least not if you care about synchronization with an acquire-load of the pair. Even strongly-ordered x86 can reorder a narrow store with a wider load that fully contains it. Everything is still atomic, but you get into weird territory as far as memory-ordering goes.

On x86-64, an atomic 16B load requires a lock cmpxchg16b (which is a full memory barrier, preventing a preceding narrow store from becoming globally visible after it). But you could easily have a problem if you were using this with 32-bit pointers (or 32-bit array indices), since both halves could be loaded with a regular 64b load. And I have no idea what problems you could see on other architectures, if you need synchronization with other threads and not just atomicity.


To learn more about std::memory_order acquire and release, see Jeff Preshing's excellent articles.

Edifice answered 17/8, 2016 at 8:38 Comment(5)
Great answer! I just have another question regarding memory_order_*. After some search, I found it too complex to understand their usage. From this qustion, what I understand is that these orderings are for optimising performance. If I use default ordering(sequentially consistent), it runs still correctly. right?Argon
@BugRepairMan: Yes, seq_cst is the safest/slowest. In this case, when targeting x86, you'll get identical asm for seq_cst, because loads / stores implicitly have acquire / release semantics, and atomic read-modify-write operations like CAS are always full memory barriers, so they always have seq_cst semantics even if you only require relaxed. To learn more about ordering semantics, read preshing.com/20120913/acquire-and-release-semantics, and Jeff's other block posts about lock-free programming. They're a gold mine.Edifice
Because of the way you are using unions, this is technically not valid C++. As you say "It does depend on writing a union through one member and reading it through another..." and that's not valid nor portable. But otherwise, to answer the original question, yeah, ptr + counter in a single atomic struct, such as in your example.Ransome
@Ransome It is guaranteed to work on a lot of real compilers, including everything that implements the GNU extensions to the language. Any suggestion for how to reword it to convey the same point without actually making the erroneous claim that it's valid C++? I think it will only break on "weird" C++ implementations where layout in memory is not like we expect on all normal ISAs (x86, ARM, MIPS, etc. etc.), so I thought I could get away with claiming it was portable. Anyway, feel free to make an edit if the wording is bothering you. I can't think of a way to still keep it short.Edifice
If you align counted_ptr to 16 bytes, clang will generate the cmpxchg16b instruction godbolt.org/g/5qGCNDPileus

© 2022 - 2025 — McMap. All rights reserved.