Yes, it's possible for both loads to get 0.
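For reference, here's a minimal, self-contained sketch of the two-thread test being discussed, reconstructed from the T1a/T1b/T2a/T2b operations shown further down (the exact code in the question may differ slightly):

#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> x{0}, y{0};
int r1, r2;

void t1() {
    x.store(1, std::memory_order_release);      // T1a
    r1 = y.load(std::memory_order_seq_cst);     // T1b
}

void t2() {
    y.fetch_add(1, std::memory_order_seq_cst);  // T2a
    r2 = x.load(std::memory_order_seq_cst);     // T2b
}

int main() {
    std::thread a(t1), b(t2);
    a.join();
    b.join();
    std::printf("r1=%d r2=%d\n", r1, r2);       // r1==0 && r2==0 is allowed
}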
Within thread 1, y.load can "pass" x.store(mo_release) because they're not both seq_cst. The global total order of seq_cst operations that ISO C++ guarantees must exist includes only seq_cst operations.
(In hardware / CPU-architecture terms, on a normal CPU the load can take a value from coherent cache before the release store has left the store buffer. In this case I found it much easier to reason in terms of how I know it compiles for x86 (or to generic release and acquire operations), and then apply asm memory-ordering rules. That reasoning assumes the normal C++->asm mappings are safe, i.e. always at least as strong as the C++ memory model. If you can find a legal reordering this way, you don't need to wade through the C++ formalism; but if you can't find one, that of course doesn't prove it's safe in the C++ abstract machine.)
Anyway, the key point to realize is that a seq_cst operation isn't like atomic_thread_fence(mo_seq_cst): individual seq_cst operations only have to recover/maintain sequential consistency in the way they interact with other seq_cst operations, not with plain acquire/release/acq_rel operations. (Similarly, acquire and release fences are stronger two-way barriers than acquire and release operations, as Jeff Preshing explains.)
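To illustrate that difference, a hedged sketch (my own example, not code from the question): if you actually wanted to forbid the 0, 0 outcome, you could either promote the store to seq_cst, or keep the release store and put a seq_cst fence between it and the load, since the fence (unlike the seq_cst load itself) does order the earlier store against the later load.

#include <atomic>

std::atomic<int> x{0}, y{0};

int thread1_no_reorder() {
    // Option A: make the store seq_cst so both of T1's operations join the
    // single total order S along with T2's seq_cst operations.
    //   x.store(1, std::memory_order_seq_cst);

    // Option B: keep the release store but add a seq_cst fence, which acts as a
    // full two-way barrier here (including StoreLoad), unlike a seq_cst operation.
    x.store(1, std::memory_order_release);
    std::atomic_thread_fence(std::memory_order_seq_cst);
    return y.load(std::memory_order_seq_cst);
}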
The reordering that makes this happen
That's the only reordering possible; the other possibilities are just interleavings of the program order of the two threads. Having the store "happen" (become visible) last leads to the 0, 0 result.
I renamed one and two to r1 and r2 (local "registers" within each thread), to avoid writing things like one == 0.
// T1a: the x=1 store nominally executes first in T1, but doesn't have to drain from the store buffer before later loads
auto r1 = y.load(std::memory_order_seq_cst); // T1b  r1 = 0  (y)
y.fetch_add(1, std::memory_order_seq_cst);   // T2a  y = 1 becomes globally visible
auto r2 = x.load(std::memory_order_seq_cst); // T2b  r2 = 0  (x)
x.store(1, std::memory_order_release);       // T1a  x = 1 eventually becomes globally visible
This can happen in practice on x86, but interestingly not on AArch64. x86 can do a release store without any extra barriers (just a normal store), and seq_cst loads compile the same as plain acquire loads: just a normal load.
On AArch64, release and seq_cst stores use STLR. seq_cst loads use LDAR, which has a special interaction with STLR, not being allowed to read cache until the last STLR drains from the store buffer. So release-store / seq_cst load on ARMv8 is the same as seq_cst store / seq_cst load. (ARMv8.3 added LDAPR, allowing true acquire / release by letting acquire loads compile differently; see this Q&A.)
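To make that contrast concrete, here's a sketch of the usual mappings for thread 1's two operations. These are the typical GCC/Clang instruction choices, not something the standard mandates, so treat them as illustrative:

#include <atomic>

std::atomic<int> x{0}, y{0};

int thread1_ops() {
    // x86-64: plain mov store (a release store needs no extra barrier).
    // AArch64: stlr.
    x.store(1, std::memory_order_release);

    // x86-64: plain mov load (same as acquire), so nothing forces the x=1 store
    //         out of the store buffer before this load executes.
    // AArch64: ldar, which can't take a value while an earlier stlr is still in
    //         the store buffer, so the pair acts like seq_cst store + seq_cst load.
    return y.load(std::memory_order_seq_cst);
}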
However, it can also happen on many ISAs that use separate barrier instructions, like ARM32: a release store is typically done as a barrier followed by a plain store, preventing reordering with earlier loads / stores but not stopping the store from reordering with later ones. If the seq_cst load avoids needing a full barrier before itself (which is the normal case), then the store can reorder after the load.
For example, a release store on ARMv7 is dmb ish; str, and a seq_cst load is ldr; dmb ish, so you have str / ldr with no barrier between them.
On PowerPC, a seq_cst load is hwsync; ld; cmp; bc; isync, so there's a full barrier before the load. (The heavyweight sync is, I think, part of preventing IRIW reordering: it blocks store-forwarding between SMT threads on the same physical core, so a core only sees stores from other cores once they actually become globally visible.)
… one or two before store or fetch_add), but this can't happen due to memory ordering. And absent such reordering, at least one of them must be non-0. – Fredela