Bytewise atomic memcpy and sequence locks in C++23
I want to implement a sequence lock in C++23. If possible, it should not rely on non-standard extensions or undefined behavior.

There is the proposal P1478R8: Byte-wise atomic memcpy, which covers my exact use case. It proposes adding atomic_load_per_byte_memcpy and atomic_store_per_byte_memcpy in a new header bytewise_atomic_memcpy, which copy byte by byte with atomic semantics.

How are sequence locks correctly implemented in C++ up to and including C++23? How can the functions from P1478 be implemented today? I have not found a reference implementation of the proposal, nor any other sequence-lock implementation that handles this problem specifically. Of course, I could implement the copy manually, but that would probably not give the best performance, much like naive implementations of memcpy. Is there a better way?

I have the feeling that, although it is undefined behavior according to the C++ standard, in real life the problem is often ignored and plain memcpy is used anyway.

Shirring asked 22/8, 2024 at 21:14

How are sequence locks correctly implemented in C++ up to C++23?

They aren't, unless all the locked variables are atomics themselves.

How to implement the functions from P1478 currently?

You can't.

Sequence locks typically have data races on the payload; the race is tolerated by discarding the read result whenever the sequence number has changed.

However, in the current C++ memory model, a data race is undefined behavior by definition, so there is no correct way to implement a sequence lock in C++ today unless the payload is an atomic type.

That's the entire point of P1478: to allow it.

I have the feeling, that, although it's undefined behavior according to the C++ standard, in the real life, the problem is often ignored and plain memcpy gets just used.

Yes, and those implementations could break at any time.
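Until something like P1478 is standardized, the closest thing to a UB-free sequence lock is to store the payload as an array of atomic words and copy it element-wise with relaxed operations. Below is a minimal single-writer sketch along those lines; SeqLock and its member names are invented for this illustration, the fence placement follows the conventional seqlock pattern, and the word-at-a-time copy trades performance for defined behavior:

```cpp
#include <array>
#include <atomic>
#include <cstdint>
#include <cstring>
#include <type_traits>

template <class T>
class SeqLock {
    static_assert(std::is_trivially_copyable_v<T>, "payload must be memcpy-able");
    static_assert(sizeof(T) % sizeof(std::uint32_t) == 0,
                  "simplification: payload is a whole number of 32-bit words");
    static constexpr std::size_t N = sizeof(T) / sizeof(std::uint32_t);

    std::atomic<std::uint32_t> seq_{0};
    std::array<std::atomic<std::uint32_t>, N> words_{};

public:
    // Single writer assumed; multiple writers would need a separate mutex.
    void store(const T& value) {
        std::uint32_t buf[N];
        std::memcpy(buf, &value, sizeof(T));  // strict-aliasing-safe copy out
        std::uint32_t s = seq_.load(std::memory_order_relaxed);
        seq_.store(s + 1, std::memory_order_relaxed);  // odd = write in progress
        std::atomic_thread_fence(std::memory_order_release);
        for (std::size_t i = 0; i < N; ++i)
            words_[i].store(buf[i], std::memory_order_relaxed);
        seq_.store(s + 2, std::memory_order_release);  // even = write complete
    }

    T load() const {
        std::uint32_t buf[N];
        std::uint32_t s0, s1;
        do {
            s0 = seq_.load(std::memory_order_acquire);
            for (std::size_t i = 0; i < N; ++i)
                buf[i] = words_[i].load(std::memory_order_relaxed);
            std::atomic_thread_fence(std::memory_order_acquire);
            s1 = seq_.load(std::memory_order_relaxed);
        } while ((s0 & 1) || s0 != s1);  // retry if a write overlapped the read
        T out;
        std::memcpy(&out, buf, sizeof(T));
        return out;
    }
};
```

The reader retries whenever the sequence number is odd or changed across the copy, so a torn (but race-free, because per-word atomic) read is discarded rather than used.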

Maidinwaiting answered 22/8, 2024 at 21:21
Well, but seqlocks are definitely fully supported and reliable on at least one platform/compiler/architecture. So what do they do? – Buhl
@Buhl Which platform/compiler/architecture are you referring to? You can write sequence locks directly in assembly today, just not in C++. – Maidinwaiting
@Sneftel: some compilers define the behaviour of volatile strongly enough that using it for the payload might have actual guarantees. It definitely works in practice with known compilers, since in real compilers like GCC, memory barriers can order a load or store wrt. an earlier or later non-atomic access even in ways that aren't release or acquire. See "Implementing 64 bit atomic counter with 32 bit atomics" for my attempt at implementing it, which I think is fairly robust in GNU C or C++. – Beard
@Sneftel: see also "GCC reordering up across load with `memory_order_seq_cst`. Is this allowed?" (a GCC bug affecting seqlocks) and "Is the transformation of fetch_add(0, memory_order_relaxed/release) to mfence + mov legal?" (an example of hypothetical or real compiler optimizations that might affect a seqlock). – Beard
@Sneftel: But yes, the answer is correct: the only way to write a seqlock with no UB is for the payload to be multiple std::atomic<long> elements which you copy using memory_order_relaxed. This will prevent compilers from using a wider copy (like x86 movdqa to grab 16 bytes at a time). If you know the target ISA is e.g. 32-bit ARM and the payload is 64-bit, you could pick a 32-bit element size, since you probably want the payload in a pair of integer regs. But it sucks in general, preventing optimization to a single 64-bit register on 64-bit ISAs for uint64_t. – Beard
@Peter Cordes: But wouldn't that violate type-aliasing rules unless my real type is long, too? – Shirring
@Peter Cordes: Given we have x86-64, would a 16-byte load be a problem? I think the hardware is fine with this; the only point is that it's UB for C++ and potentially dangerous compiler optimizations could kick in. – Shirring
@sedor: a 16-byte load is only guaranteed atomic on CPUs that have AVX. So on GCC or clang, std::atomic<__int128>::load(relaxed) results in a call to a libatomic library function that uses vmovaps or similar if that CPU feature is available, otherwise uses lock cmpxchg16b. And on some implementations (like MSVC), std::atomic<> for a 16-byte struct uses an actual lock. (GCC7 and later reports that it's not lock-free, because the lock cmpxchg16b fallback doesn't have the expected performance, like read-side scalability, but GCC's version is technically lock-free.) – Beard
@sedor: Re: strict aliasing: right, you'd do tmp[i] = payload[i].load(relaxed); to copy to a local tmp array of uint32_t elements, then memcpy from there into your actual payload struct. (memcpy is strict-aliasing safe, like std::bit_cast.) Or do something manually for simpler cases, like (uint64_t(high) << 32) | low to recombine the u32 halves of a u64. – Beard
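For illustration, the copy-through-a-temporary technique from the comment above could look like the following sketch. Payload, words, and the function names are invented here, and a real seqlock would wrap these reads in the sequence-number retry loop:

```cpp
#include <atomic>
#include <cstdint>
#include <cstring>

struct Payload { std::uint32_t x; std::uint32_t y; };  // hypothetical 8-byte payload
static_assert(sizeof(Payload) == 2 * sizeof(std::uint32_t));

std::atomic<std::uint32_t> words[2];  // shared representation of the payload

Payload read_payload() {
    std::uint32_t tmp[2];
    for (int i = 0; i < 2; ++i)
        tmp[i] = words[i].load(std::memory_order_relaxed);  // race-free word loads
    Payload p;
    std::memcpy(&p, tmp, sizeof p);  // strict-aliasing-safe reinterpretation
    return p;
}

// Or, for simple cases, recombine two u32 halves of a u64 manually:
std::uint64_t read_u64() {
    std::uint32_t lo = words[0].load(std::memory_order_relaxed);
    std::uint32_t hi = words[1].load(std::memory_order_relaxed);
    return (std::uint64_t(hi) << 32) | lo;
}
```

memcpy (or std::bit_cast since C++20) is what keeps this within the type-aliasing rules: the atomics are only ever accessed as uint32_t, and the payload type is reconstructed from the raw bytes.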
@sedor: as far as compilers optimizing 4 adjacent std::atomic<uint32_t> loads into one 16-byte load, that's allowed in theory on targets where a 16-byte aligned load is guaranteed atomic. But there's unfortunately no documented guarantee in the Intel or AMD manuals about "Per-element atomicity of vector load/store and gather/scatter?", even though it's not plausible in reality to have tearing inside 4-byte boundaries when they're part of a wider load. – Beard
And regardless, compilers don't optimize atomics currently. See "Why don't compilers merge redundant std::atomic writes?" – the way they're not optimized is pretty similar to how real compilers deal with volatile: neither coalescing contiguous accesses nor eliminating dead stores, even with no intervening reads. – Beard
@Peter Cordes: But can the non-atomicity of 16-byte loads be a problem? The result can be bogus, but that would be detected by the seqlock. For the hardware part, if it doesn't raise an exception, it should be okay, shouldn't it? For the compiler part, we need to guard against UB. Or, asked the other way around: if I use 16-byte loads via inline assembly, would that be okay? – Shirring
@sedor: for a seqlock, no, it can't be a problem. But std::atomic<T> or std::atomic_ref<T> is our only option to avoid data-race UB for the payload in current C++. Of course you can do whatever is efficient if writing a seqlock in assembly, or using compiler-specific C++ extensions like inline asm or the semantics of volatile (which in GNU C is supported for stuff like this; it's how the Linux kernel rolls its own atomics). – Beard
Thanks for all the links, very interesting to read. – Shirring

© 2022 - 2025 — McMap. All rights reserved.