This is called a seqlock. It's a well-known pattern, and it works well for "publish occasionally, read often" usage. If you republish too often (especially for a buffer as large as 5000 bytes), you risk too many retries by the readers as they keep detecting possible tearing. It's commonly used to, e.g., publish a 64-bit or 128-bit timestamp from a timer interrupt handler to all cores, where the fact that the writer doesn't have to acquire a lock is great, and so is the fact that readers are read-only and have negligible overhead in the fast path.
You need `atomic_thread_fence(mo_acquire)` before the 2nd load of the sequence number in the reader, to make sure it doesn't happen earlier, before the memcpy finishes. And similarly, you need `atomic_thread_fence(mo_release)` in the writer, after the first store and before writing the data. Note that acquire / release *fences* are 2-way barriers, and do affect non-atomic variables¹. (Despite misconceptions to the contrary, fences really are 2-way barriers, unlike acquire or release *operations*. Jeff Preshing explains and debunks the confusion.)
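For concreteness, here's a minimal sketch of the reader side with that fence placement. The names (`sequence_number`, `payload`, `try_read`) are placeholders standing in for whatever your actual code uses:

```cpp
#include <atomic>
#include <cstring>

extern std::atomic<unsigned long> sequence_number;  // even = stable, odd = write in progress
extern char payload[5000];                          // the non-atomic shared buffer

bool try_read(char copy[5000]) {
    unsigned long before = sequence_number.load(std::memory_order_acquire);
    if (before & 1)
        return false;                  // a write is in progress; caller retries
    std::memcpy(copy, payload, 5000);  // may tear, but we detect that and discard
    std::atomic_thread_fence(std::memory_order_acquire);  // keep the reload below after the copy
    unsigned long after = sequence_number.load(std::memory_order_relaxed);
    return before == after;            // unchanged (and still even) => copy is consistent
}
```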
See also Implementing 64 bit atomic counter with 32 bit atomics for my attempt at a templated SeqLock class. I required the template class `T` to provide an assignment operator to copy itself around, but using `memcpy` might be better. I was using `volatile` for extra safety against the C++ UB we're committing. That works easily for `uint64_t`, but it's a huge pain in C++ for anything wider, unlike in C, where you can get the compiler to efficiently emit code to load from a `volatile` struct into a non-`volatile` temporary.
You're going to have C++ data-race UB either way (because C++ makes best efficiency impossible without UB: the whole point of a SeqLock is to let tearing potentially happen on `data[]`, but detect that and never actually look at the torn data). You could avoid UB by copying your data as an array of `atomic<unsigned long>` or something, but current compilers aren't smart enough to use SIMD for that, so access to the shared data would be inefficient. (And HW vendors fail to document Per-element atomicity of vector load/store and gather/scatter?, even though we all know that current CPUs do give that, and future CPUs almost certainly will too.)
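A minimal sketch of that UB-free alternative (the names and element-wise loop are illustrative; 5000 happens to be a multiple of `sizeof(unsigned long)` on common targets):

```cpp
#include <atomic>
#include <cstddef>

constexpr std::size_t N = 5000 / sizeof(unsigned long);
std::atomic<unsigned long> shared_data[N];  // payload stored as an array of atomics

void copy_out(unsigned long *dst) {
    // Element-at-a-time relaxed loads: no data-race UB, but compilers
    // currently won't vectorize this, unlike a plain memcpy.
    for (std::size_t i = 0; i < N; ++i)
        dst[i] = shared_data[i].load(std::memory_order_relaxed);
}
```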
A memory barrier is probably sufficient, but it would be nice to do something to "launder" the value, to make extra sure the compiler doesn't put another reload of the non-atomic data after the 2nd load. Like What is the purpose of glibc's atomic_forced_read function?. But as I said, I think `atomic_thread_fence()` is sufficient, at least in practice with compilers like GCC, which treat `thread_fence` like `asm("":::"memory")` that tells the compiler all values in memory might have changed.
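Such laundering might look something like this GNU C++ sketch (the `forced_read` name is hypothetical, modeled on glibc's macro; it assumes `T` fits in a register):

```cpp
// The empty asm makes the returned value opaque to the optimizer, so later
// uses can't be satisfied by re-reading the original (shared) object.
// It compiles to zero instructions.
template <typename T>
T forced_read(T x) {
    asm("" : "=r"(x) : "0"(x));
    return x;
}
```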
Footnote 1: Maxim points out that `atomic_thread_fence` may be sort of a hack, because ISO C++ specifies things only in terms of barriers and release-sequences synchronizing with loads that see the value stored.
But it's well known how fences and acq/rel loads/stores map to asm for any given target platform. It's implausible that a compiler will do enough whole-program inter-thread analysis to prove that it can break your code.
There might be an argument to be made in terms of the language used in the C++ standard about establishing happens-before relationships between the store of `tmp+1` and at least some hypothetical reader. In practice that's enough to stop a compiler from breaking the writer: it can't know what code will be reading the data it's writing, so it has to respect barriers. And probably the language in the standard is strong enough that a reader that sees an odd sequence number (and avoids reading `data[]`) can avoid data-race UB, so there would be a valid happens-before relationship between an atomic store that has to stay ahead of some non-atomic stores. So I'm not convinced that there's any room for a malicious all-seeing compiler to not respect `atomic_thread_fence()` here, let alone any real compiler.
In any case, you definitely do not want `_mm_lfence()` on x86. You want its compiler-barrier effect against compile-time reordering, but you definitely do not want the main effect of `lfence`: blocking out-of-order execution. See Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths and Are loads and stores the only instructions that gets reordered?

i.e. you just want GNU C `asm("":::"memory")`, aka `atomic_signal_fence(mo_seq_cst)`. That's also equivalent to `atomic_thread_fence(mo_acq_rel)` on x86, which only has to block compile-time reordering to control runtime ordering, because the only runtime reordering x86's strong memory model allows is StoreLoad (except for NT stores). x86's memory model is seq_cst plus a store buffer with store forwarding (which weakens seq_cst to acq/rel, and occasionally has other funky effects, especially for loads that partially overlap a store).
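As a sketch, all three of these forms compile to zero instructions on x86 with current GCC/Clang, acting purely as compiler barriers:

```cpp
#include <atomic>

void compiler_barriers() {
    asm volatile("" ::: "memory");                        // GNU C compiler barrier
    std::atomic_signal_fence(std::memory_order_seq_cst);  // portable spelling of the same thing
    std::atomic_thread_fence(std::memory_order_acq_rel);  // also just a compiler barrier on x86
}
```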
For more about `_mm_lfence()` and so on vs. the asm instructions, see When should I use _mm_sfence _mm_lfence and _mm_mfence.
Other tweaks
Your sequence number is unnecessarily wide: 64-bit atomics are less efficient on some 32-bit platforms, and very inefficient on a few. A 32-bit sequence number won't wrap in any reasonable thread-sleep time. (e.g. a 4GHz CPU would take about a whole second to do 2^32 stores at 1 store per clock, and that's with zero contention for writes to the cache line, and no cycles spent executing stores of the actual data. And practical use-cases don't have the writer in a tight loop publishing new values constantly: that could lead to something similar to livelock, with readers constantly retrying and making no progress.)
`unsigned long` is never (AFAIK) too wide to handle efficiently, except on CPUs narrower than 32-bit. So `atomic<long>` or `atomic<unsigned long>` would use a 64-bit counter on CPUs where that's fine, but definitely avoid the risk of using a 64-bit atomic in 32-bit code. And `long` is required to be at least 32 bits wide.
Also, you don't need two copies of the write sequence number. Just have the writer do an atomic load into a tmp var, then separate atomic stores of `tmp+1` and `tmp+2`. (You're correct in wanting to avoid `sequence_number++`; it would be a bad idea to ask the compiler to do two atomic RMWs.) The only advantage of a separate non-atomic var for the writer's private seq number is if this can inline into a write loop and keep it in a register, so the writer never reloads the value.
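Putting that together, a minimal sketch of the writer (same placeholder names as the reader sketch above, with the release fence placed as described earlier):

```cpp
#include <atomic>
#include <cstring>

extern std::atomic<unsigned long> sequence_number;
extern char payload[5000];

void publish(const char *src) {
    unsigned long tmp = sequence_number.load(std::memory_order_relaxed);  // single writer: plain load, no RMW
    sequence_number.store(tmp + 1, std::memory_order_relaxed);            // odd: mark write in progress
    std::atomic_thread_fence(std::memory_order_release);  // keep the seq store above ahead of the data stores
    std::memcpy(payload, src, 5000);
    sequence_number.store(tmp + 2, std::memory_order_release);            // even again: publish
}
```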