Does C++11 sequential consistency memory order forbid store buffer litmus test?

Consider the store buffer litmus test with SC atomics:

// Initial
std::atomic<int> x(0), y(0);

// Thread 1           // Thread 2
x.store(1);           y.store(1);
auto r1 = y.load();   auto r2 = x.load();

Can this program end with both r1 and r2 being zero?

I can't see how this result is forbidden by the description about memory_order_seq_cst in cppreference:

A load operation with this memory order performs an acquire operation, a store performs a release operation, and read-modify-write performs both an acquire operation and a release operation, plus a single total order exists in which all threads observe all modifications in the same order

It seems to me that memory_order_seq_cst is just acquire-release plus a global store order. And I don't think the global store order comes into play in this specific litmus test.

That cppreference summary of SC is too weak, and indeed isn't strong enough to forbid this reordering.

What it says looks to me only as strong as x86-TSO (acq_rel plus no IRIW reordering, i.e a total store order that all reader threads can agree on).

ISO C++ actually guarantees that there's a total order of all SC operations including loads (and also SC fences) that's consistent with program order. (That's basically the standard definition of sequential consistency in computer science; C++ programs that use only seq_cst atomic operations and are data-race-free for their non-atomic accesses execute sequentially consistently, i.e. "recover sequential consistency" despite full optimization being allowed for the non-atomic accesses.) Sequential consistency must forbid any reordering between any two SC operations in the same thread, even StoreLoad reordering.

This means an expensive full barrier (including StoreLoad) after every seq_cst store, or for example AArch64 STLR / LDAR can't StoreLoad reorder with each other, but are otherwise only release and acquire wrt. reordering with other operations. (So cache-hit SC stores can be quite a lot cheaper on AArch64 than x86, if you don't do any SC load or RMW operations in the same thread right afterwards.)

See https://eel.is/c++draft/atomics.order#4 That makes it clear that SC operations aren't reordered wrt. each other. The current draft standard says:

31.4 [atomics.order]

There is a single total order S on all memory_order::seq_cst operations, including fences, that satisfies the following constraints. First, if A and B are memory_order::seq_cst operations and A strongly happens before B, then A precedes B in S.

Second, for every pair of atomic operations A and B on an object M, where A is coherence-ordered before B, the following four conditions are required to be satisfied by S:

(4.1) if A and B are both memory_order::seq_cst operations, then A precedes B in S; and

(4.2 .. 4.4) - basically the same thing for sc fences wrt. operations.

Sequenced before implies strongly happens before, so the opening paragraph guarantees that S is consistent with program order.

4.1 is about ops that are coherenced-ordered before/after each other. i.e. a load that happens to see the value from a store. That ties inter-thread visibility into the total order S, making it match program order. The combination of those two requirements forces a compiler to use full barriers (including StoreLoad) to recover sequential consistency from whatever weaker hardware model it's targeting.

(In the original, all of 4. is one paragraph. I split it to emphasize that there are two separate things here, one for strongly-happens-before and the list of ops/barriers for coherence-ordered-before.)

These guarantees, plus syncs-with / happens-before, are enough to recover sequential consistency for the whole program, if it's data-race free (that would be UB), and if you don't use any weaker memory orders.

These rules do still hold if the program involves weaker orders, but for example an SC fence between two relaxed operations isn't as strong as two SC loads. For example on PowerPC that wouldn't rule out IRIW reordering the way using only SC operations does; IIRC PowerPC needs barriers before SC loads, as well as after.

So having some SC operations isn't necessarily enough to recover sequential consistency everywhere; that's rather the point of using weaker operations, but it can be a bit surprising that other ops can reorder wrt. SC ops. SC ops aren't SC fences. See also this Q&A for an example with the same "store buffer" litmus test: weakening one store from seq_cst to release allows reordering.

Recommended topics

Hot tags