That cppreference summary of SC is too weak, and indeed isn't strong enough to forbid this reordering.
What it says looks to me only as strong as x86-TSO (acq_rel plus no IRIW reordering, i.e. a total store order that all reader threads can agree on).
ISO C++ actually guarantees that there's a total order of all SC operations including loads (and also SC fences) that's consistent with program order. (That's basically the standard definition of sequential consistency in computer science; C++ programs that use only seq_cst atomic operations and are data-race-free for their non-atomic accesses execute sequentially consistently, i.e. "recover sequential consistency" despite full optimization being allowed for the non-atomic accesses.) Sequential consistency must forbid any reordering between any two SC operations in the same thread, even StoreLoad reordering.
This means either an expensive full barrier (including StoreLoad) after every seq_cst store, or special instructions: for example, AArch64 STLR / LDAR can't StoreLoad reorder with each other, but are otherwise only release and acquire wrt. reordering with other operations. (So cache-hit SC stores can be quite a lot cheaper on AArch64 than on x86, if you don't do any SC load or RMW operations in the same thread right afterwards.)
See https://eel.is/c++draft/atomics.order#4. That makes it clear that SC operations aren't reordered wrt. each other. The current draft standard says:
31.4 [atomics.order]

- There is a single total order S on all `memory_order::seq_cst` operations, including fences, that satisfies the following constraints. First, if A and B are `memory_order::seq_cst` operations and A strongly happens before B, then A precedes B in S.

  Second, for every pair of atomic operations A and B on an object M, where A is coherence-ordered before B, the following four conditions are required to be satisfied by S:

  - (4.1) if A and B are both `memory_order::seq_cst` operations, then A precedes B in S; and
  - (4.2 .. 4.4): basically the same thing for SC fences wrt. operations.
Sequenced before implies strongly happens before, so the opening paragraph guarantees that S is consistent with program order.
4.1 is about ops that are coherence-ordered before/after each other, for example a load that happens to see the value from a store (or sees a value from earlier in the modification order than some store). That ties inter-thread visibility into the total order S, making it match program order. The combination of those two requirements forces a compiler to use full barriers (including StoreLoad) to recover sequential consistency from whatever weaker hardware model it's targeting.
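As a worked example, applying both constraints to the store-buffer litmus test shows why SC forbids `r1 == r2 == 0`. Notation (introduced here for illustration): thread 1 runs $W_x$ = `x.store(1)` then $R_y$ = `r1 = y.load()`; thread 2 runs $W_y$ = `y.store(1)` then $R_x$ = `r2 = x.load()`; all seq_cst, with x and y initially 0. This assumes the initial value of each object counts as the first entry in its modification order, so a load returning 0 is coherence-ordered before the store of 1.

$$
\begin{aligned}
W_x &<_S R_y, \quad W_y <_S R_x && \text{(sequenced before, so strongly happens before)} \\
r_1 = 0 \implies R_y &<_S W_y && \text{(4.1: } R_y \text{ is coherence-ordered before } W_y\text{)} \\
r_2 = 0 \implies R_x &<_S W_x && \text{(4.1: } R_x \text{ is coherence-ordered before } W_x\text{)} \\
\therefore\ W_x &<_S R_y <_S W_y <_S R_x <_S W_x && \text{(a cycle)}
\end{aligned}
$$

A total order can't contain a cycle, so at least one of `r1`, `r2` must be 1.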
(In the standard, all of paragraph 4 is one paragraph. I split it to emphasize that there are two separate things here: one for strongly-happens-before, and the list of ops/barriers for coherence-ordered-before.)
These guarantees, plus synchronizes-with / happens-before, are enough to recover sequential consistency for the whole program, if it's data-race free (a data race would be UB) and if you don't use any weaker memory orders.
These rules do still hold if the program involves weaker orders, but, for example, an SC fence between two `relaxed` operations isn't as strong as two SC loads. For example, on PowerPC that wouldn't rule out IRIW reordering the way using only SC operations does; IIRC PowerPC needs barriers before SC loads, as well as after.
So having some SC operations isn't necessarily enough to recover sequential consistency everywhere; that's rather the point of using weaker operations, but it can be a bit surprising that other ops can reorder wrt. SC ops. SC ops aren't SC fences. See also this Q&A for an example with the same "store buffer" litmus test: weakening one store from `seq_cst` to `release` allows reordering.
`y.load()` yields `0`, then `y.store(1);` can't have run yet and, as such, `auto r2 = x.load();` will load `1` since `x.store(1);` has already run. – Countenance

`x.load()` comes before `y.store(1)`. However, each thread sees its instructions in program order, so for a global order to exist, the other threads must also agree to see this thread's loads/stores in program order. Since this holds for any thread, the end result is that the global order respects program order. – Maxma