The ordering requirements you describe are exactly what release/acquire semantics provide. (http://preshing.com/20120913/acquire-and-release-semantics/).
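As a quick refresher, here is a minimal sketch of that pattern (the payload/ready names are just for illustration, not from your code): the writer publishes data with a release store, the reader waits with acquire loads.

#include <atomic>
#include <cstdint>

uint64_t payload;                    // plain (non-atomic) data
std::atomic<bool> ready{false};      // the flag that publishes it

void producer() {
    payload = 42;                                    // write the data first
    ready.store(true, std::memory_order_release);    // then publish: earlier stores can't sink below this
}

void consumer() {
    while (!ready.load(std::memory_order_acquire)) {}   // acquire pairs with the release store
    uint64_t v = payload;            // guaranteed to see 42: the write happens-before this read
    (void)v;
}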
The problem is that the unit of atomicity for efficient guaranteed-atomic loads/stores is at most 8 bytes: that's the limit on all x86 and on some ARM; other ARMs only guarantee 4 bytes. (Why is integer assignment on a naturally aligned variable atomic on x86?). Some Intel CPUs probably do have atomic 32-byte or even 64-byte (AVX512) stores in practice, but neither Intel nor AMD has ever made any official guarantee.
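You can at least ask your implementation what it promises. A quick sketch (the Big struct is just an illustrative 32-byte type):

#include <atomic>
#include <cstdint>
#include <cstdio>

struct Big { std::uint64_t a, b, c, d; };   // 32 bytes, trivially copyable

int main() {
    // Typically prints 1 then 0 on x86-64 / AArch64: 8-byte atomics are
    // lock-free, but a 32-byte std::atomic falls back to a hidden lock.
    std::printf("%d %d\n",
                (int)std::atomic<std::uint64_t>::is_always_lock_free,
                (int)std::atomic<Big>::is_always_lock_free);
}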
We don't even know whether SIMD vector stores have a guaranteed order when they potentially break a wide aligned store into multiple 8-byte aligned chunks, or even whether those chunks are individually atomic (see Per-element atomicity of vector load/store and gather/scatter?). There's every reason to believe that they are per-element atomic, even if the documentation doesn't guarantee it.
If having large "objects" is performance-critical, you could consider testing vector load/store atomicity on the specific servers you care about, but you're totally on your own as far as guarantees and getting the compiler to use it. (There are intrinsics.) Make sure you test between cores on different sockets, to catch cases like tearing at 8-byte boundaries because of HyperTransport between sockets on a K10 Opteron (see SSE instructions: which CPUs can do atomic 16B memory operations?). This is probably a really bad idea; you can't guess which microarchitectural conditions, if any, could make a wide vector store non-atomic in rare cases, even when it normally looks atomic.
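A hedged sketch of such a test follows (compile with AVX enabled, e.g. -mavx; the buffer size, iteration count, and the volatile __m256i trick are all illustrative, and GCC/Clang-specific). A clean run only shows that no tearing was observed on that machine under that workload; it proves nothing.

#include <immintrin.h>
#include <atomic>
#include <thread>
#include <cstdint>
#include <cstdio>

alignas(32) static uint64_t buf[4];          // one 32-byte "object"
static std::atomic<bool> stop{false};

void writer() {                              // alternate between two recognizable patterns
    __m256i zeros = _mm256_set1_epi64x(0);
    __m256i ones  = _mm256_set1_epi64x(-1);
    while (!stop.load(std::memory_order_relaxed)) {
        *(volatile __m256i *)buf = zeros;    // aligned 32-byte stores
        *(volatile __m256i *)buf = ones;
    }
}

void reader() {
    for (long i = 0; i < 200000000; ++i) {
        __m256i v = *(volatile const __m256i *)buf;   // aligned 32-byte load
        alignas(32) uint64_t e[4];
        _mm256_store_si256((__m256i *)e, v);
        if (e[0] != e[1] || e[1] != e[2] || e[2] != e[3])   // elements from different stores => torn
            std::printf("torn load: %llx %llx %llx %llx\n",
                        (unsigned long long)e[0], (unsigned long long)e[1],
                        (unsigned long long)e[2], (unsigned long long)e[3]);
    }
    stop.store(true, std::memory_order_relaxed);
}

int main() {                                 // pin the threads to cores on different sockets yourself
    std::thread w(writer), r(reader);
    r.join(); w.join();
}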
You can easily have release/acquire ordering for the elements of an array like alignas(64) atomic<uint64_t> arr[1024]; you just have to ask the compiler nicely:
#include <atomic>
#include <cstdint>
#include <cstddef>

// Element-wise copy with a release store per element.
void copy_to_atomic(std::atomic<uint64_t> *__restrict dst_a,
                    const uint64_t *__restrict src, size_t len) {
    const uint64_t *endsrc = src + len;
    while (src < endsrc) {        // loop until the end of the source array
        dst_a->store( *src, std::memory_order_release );
        dst_a++; src++;
    }
}
On x86-64 it doesn't auto-vectorize or anything, because compilers don't optimize atomics, and because there's no documentation that it's safe to use vectors to store consecutive elements of an array of atomic elements. :( So this basically sucks. See it on the Godbolt compiler explorer
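The reader side is the mirror image, with acquire loads (a sketch; copy_from_atomic is my name, not anything standard):

#include <atomic>
#include <cstdint>
#include <cstddef>

void copy_from_atomic(uint64_t *__restrict dst,
                      const std::atomic<uint64_t> *__restrict src_a, size_t len) {
    const uint64_t *enddst = dst + len;
    while (dst < enddst) {
        *dst = src_a->load(std::memory_order_acquire);   // element-wise acquire load
        dst++; src_a++;
    }
}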
I'd consider rolling your own with volatile __m256i* pointers (aligned load/store), and compiler barriers like atomic_thread_fence(std::memory_order_release) to prevent compile-time reordering. Per-element ordering/atomicity should be ok (but again not guaranteed). And definitely don't count on the whole 32 bytes being atomic, just that higher uint64_t elements are written after lower uint64_t elements (and those stores become visible to other cores in that order).
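A hedged sketch of what that hand-rolled version could look like (assuming GCC/Clang, 32-byte-aligned buffers, and a length that's a multiple of 4 uint64_t; none of this is guaranteed by any standard):

#include <immintrin.h>
#include <atomic>
#include <cstdint>
#include <cstddef>

void copy_avx_release(uint64_t *dst, const uint64_t *src, size_t len) {
    volatile __m256i *vdst = (volatile __m256i *)dst;
    const volatile __m256i *vsrc = (const volatile __m256i *)src;
    for (size_t i = 0; i < len / 4; ++i) {
        __m256i v = vsrc[i];     // aligned 32-byte load; volatile keeps the accesses in source order
        vdst[i] = v;             // aligned 32-byte store (hopefully per-element atomic, not guaranteed)
    }
    // Costs no instructions on x86, but stops the compiler from reordering a
    // later release-store (e.g. of a "ready" flag) ahead of these stores.
    std::atomic_thread_fence(std::memory_order_release);
}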
On ARM32: even an atomic store of a uint64_t is not great. gcc uses a ldrexd / strexd pair (LL/SC), because apparently there is no 8-byte atomic pure store. (I compiled with gcc7.2 -O3 -march=armv7-a. With armv8-a in AArch32 mode, store-pair is atomic. AArch64 also has atomic 8-byte load/store, of course.)
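In source terms that's just this (what the gcc7.2 -O3 -march=armv7-a observation above refers to):

#include <atomic>
#include <cstdint>

std::atomic<uint64_t> shared;

void store64_release(uint64_t v) {
    // -march=armv7-a: gcc emits a dmb barrier plus an ldrexd/strexd retry loop,
    // because there's no guaranteed-atomic plain 8-byte store.
    // armv8-a (AArch32 or AArch64): a plain 8-byte / store-pair is atomic, so no LL/SC loop is needed.
    shared.store(v, std::memory_order_release);
}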
You must avoid using a normal C library memcpy implementation. On x86, it can use weakly-ordered stores for large copies, allowing reordering between its own stores (but not with later stores that weren't part of the memcpy, because that could break later release-stores). movnt cache-bypassing stores in a vector loop, or rep movsb on a CPU with the ERMSB feature, could both create this effect (see Does the Intel Memory Model make SFENCE and LFENCE redundant?). Or a memcpy implementation could simply choose to do the last (partial) vector first, before entering its main loop.
Concurrent write+read or write+write on non-atomic types is UB in C and C++; that's why memcpy has so much freedom to do whatever it wants, including using weakly-ordered stores, as long as it uses sfence if necessary to make sure the memcpy as a whole respects the ordering the compiler expects when it emits code for later mo_release operations.
(i.e. current C++ implementations for x86 compile std::atomic with the assumption that there are no weakly-ordered stores for them to worry about. Any code that wants its NT stores to respect the ordering of compiler-generated atomic<T> code must use _mm_sfence(), or, if writing asm by hand, the sfence instruction directly. Or just use xchg if you want to do a seq_cst store and give your asm function the effect of an atomic_thread_fence(mo_seq_cst) as well.)
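So if you hand-roll a copy with NT stores, the fence is on you, before any release-store that publishes the data. A sketch (the ready flag, alignment, and multiple-of-4 size are my assumptions):

#include <immintrin.h>
#include <atomic>
#include <cstdint>
#include <cstddef>

void nt_copy_and_publish(uint64_t *dst, const uint64_t *src, size_t len,
                         std::atomic<bool> &ready) {
    for (size_t i = 0; i < len; i += 4) {                     // 32-byte aligned, len % 4 == 0
        __m256i v = _mm256_load_si256((const __m256i *)(src + i));
        _mm256_stream_si256((__m256i *)(dst + i), v);         // weakly-ordered NT store
    }
    _mm_sfence();                     // drain the NT stores before the release-store below
    ready.store(true, std::memory_order_release);             // now safe to publish
}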
Comments:

A memcpy operation may be interrupted (by various things, including I/O). In that case, you are going to have a reload of the cache. – Nonet

seq_cst, and then see if someone understands if a weaker level is still legal. – Barberabarberry

memcpy is compiler dependent, OS dependent and hardware dependent. For example, ARM has a specialized instruction that can load up to 16 32-bit registers from memory (not interruptable) and likewise one that writes. However, the compiler may refuse to use the instruction and instead loop (which is interruptable). It also depends on how the copying utilizes the processor's registers: the brute force is one byte at a time; more optimal is a word at a time. – Nonet

seq_cst is relatively expensive. I don't have all this stuff memorized, but refreshing my memory, it looks like rel + acq can do it cleanly, which is cheap on sane arches like x86 (yes, I just said that) - what arch are you using? Also, keep in mind that you can't have any non-atomic accesses - but also, you shouldn't worry about cheap atomics. – Barberabarberry

… dmb ish (full memory barrier). godbolt.org/z/r08GzK – Pyrimidine