You already have an answer to the language-lawyer part of this. But I want to answer the related question of how to understand why this can be possible in asm on a possible CPU architecture that uses LL/SC for RMW atomics.
It doesn't make sense for C++11 to forbid this reordering: it would require a store-load barrier in this case where some CPU architectures could avoid one.
It might actually be possible with real compilers on PowerPC, given the way they map C++11 memory-orders to asm instructions.
On PowerPC64, a function with an acq_rel exchange and an acquire load (using pointer args instead of static variables) compiles as follows with gcc6.3 -O3 -mregnames
. This is from a C11 version because I wanted to look at clang output for MIPS and SPARC, and Godbolt's clang setup works for C11 <atomic.h>
but fails for C++11 <atomic>
when you use -target sparc64
.
#include <stdatomic.h> // This is C11, not C++11, for Godbolt reasons
long foo(_Atomic long *a, _Atomic int *b) {
atomic_exchange_explicit(b, 1, memory_order_acq_rel);
//++*a;
return atomic_load_explicit(a, memory_order_acquire);
}
(source + asm on Godbolt for MIPS32R6, SPARC64, ARM 32, and PowerPC64.)
foo:
lwsync # with seq_cst exchange this is full sync, not just lwsync
# gone if we use exchage with mo_acquire or relaxed
# so this barrier is providing release-store ordering
li %r9,1
.L2:
lwarx %r10,0,%r4 # load-linked from 0(%r4)
stwcx. %r9,0,%r4 # store-conditional 0(%r4)
bne %cr0,.L2 # retry if SC failed
isync # missing if we use exchange(1, mo_release) or relaxed
ld %r3,0(%r3) # 64-bit load double-word of *a
cmpw %cr7,%r3,%r3
bne- %cr7,$+4 # skip over the isync if something about the load? PowerPC is weird
isync # make the *a load a load-acquire
blr
isync
is not a store-load barrier; it only requires the preceding instructions to complete locally (retire from the out-of-order part of the core). It doesn't wait for the store buffer to be flushed so other threads can see the earlier stores.
Thus the SC (stwcx.
) store that's part of the exchange can sit in the store buffer and become globally visible after the pure acquire-load that follows it. In fact, another Q&A already asked this, and the answer is that we think this reordering is possible. Does `isync` prevent Store-Load reordering on CPU PowerPC?
If the pure load is seq_cst
, PowerPC64 gcc puts a sync
before the ld
. Making the exchange
seq_cst
does not prevent the reordering. Remember that C++11 only guarantees a single total order for SC operations, so the exchange and the load both need to be SC for C++11 to guarantee it.
So PowerPC has a bit of an unusual mapping from C++11 to asm for atomics. Most systems put the heavier barriers on stores, allowing seq-cst loads to be cheaper or only have a barrier on one side. I'm not sure if this was required for PowerPC's famously-weak memory ordering, or if another choice was possible.
https://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html shows some possible implementations on various architectures. It mentions multiple alternatives for ARM.
On AArch64, we get this for the question's original C++ version of thread1:
thread1():
adrp x0, .LANCHOR0
mov w1, 1
add x0, x0, :lo12:.LANCHOR0
.L2:
ldaxr w2, [x0] @ load-linked with acquire semantics
stlxr w3, w1, [x0] @ store-conditional with sc-release semantics
cbnz w3, .L2 @ retry until exchange succeeds
add x1, x0, 8 @ the compiler noticed the variables were next to each other
ldar w1, [x1] @ load-acquire
str w1, [x0, 12] @ r1 = load result
ret
The reordering can't happen there because AArch64 acquire loads interact with release stores to give sequential consistency, not just plain acq/rel. Release stores can't reorder with later acquire loads.
(They can reorder with later plain loads, on paper and probably in some real hardware. AArch64 seq_cst can be cheaper than on other ISAs, if you avoid acquire loads right after release stores.
But unfortunately it makes acq/rel worse than x86. This is fixed with ARMv8.3-A LDAPR, a load that's just acquire not sequential-acquire. It allows earlier stores, even STLR, to reorder with it. So you get just acq_rel, allowing StoreLoad reordering but not other reordering. (It's also an optional feature in ARMv8.2-A).)
On a machine that also or instead had plain-release LL/SC atomics, it's easy to see that an acq_rel doesn't stop later loads to different cache lines from becoming globally visible after the LL but before the SC of the exchange.
If exchange
is implemented with a single transaction like on x86, so the load and store are adjacent in the global order of memory operations, then certainly no later operations can be reordered with an acq_rel
exchange and it's basically equivalent to seq_cst
.
But LL/SC doesn't have to be a true atomic transaction to give RMW atomicity for that location.
In fact, a single asm swap
instruction could have relaxed or acq_rel semantics. SPARC64 needs membar
instructions around its swap
instruction, so unlike x86's xchg
it's not seq-cst on its own. (SPARC has really nice / human readable instruction mnemonics, especially compared to PowerPC. Well basically anything is more readable that PowerPC.)
Thus it doesn't make sense for C++11 to require that it did: it would hurt an implementation on a CPU that didn't otherwise need a store-load barrier.
exchange(std::memory_order_acq_rel)
withexchange(std::memory_order_relaxed)
would not change the program correctness, but if I understand correctly it changes the program: compiler or CPU can now reorder the exchange with load and, thus, make outcomer1 == 0 && r2 == 0
possible. Am I wrong? – Adkinsexchange
. – Adkinsr1==0 && r2 == 0
- no any synchronization point between 2 threads – Fillr1==0 && r2 == 0
here, because acquire-release (and consume-release) are paired. Only one thread acquires (and releases) each atomic, so the following relaxed load can see an "old" value – Lamamemory_orders
withseq_cst
? – Adkinsmemory_orders
withseq_cst
you can never get r1 == 0 and r2 == 0, because you have one global order of modifications. My question is whether you can replaceseq_cst
with a more relaxed memory ordering. – Adkinsx
andy
are used by both threads. – Adkinsz.exchange(1, std::memory_order_acq_rel);
(M
) in both threads - all what you write beforeM
in thread1 visible afterM
in thread2 and visa versa. but no such common z. they.exchange(1, memory_order_acq_rel);
not synchronization point withx.exchange(1, memory_order_acq_rel);
because x and y different memory locations – Fillmemory_order_acq_rel
is useless and so all operations should be relaxed as they are already relaxed anyway. – Tournament