Does C++11 sequential consistency memory order forbid store buffer litmus test?
Asked Answered
H

1

1

Consider the store buffer litmus test with SC atomics:

// Initial
std::atomic<int> x(0), y(0);

// Thread 1           // Thread 2
x.store(1);           y.store(1);
auto r1 = y.load();   auto r2 = x.load();

Can this program end with both r1 and r2 being zero?

I can't see how this result is forbidden by the description about memory_order_seq_cst in cppreference:

A load operation with this memory order performs an acquire operation, a store performs a release operation, and read-modify-write performs both an acquire operation and a release operation, plus a single total order exists in which all threads observe all modifications in the same order

It seems to me that memory_order_seq_cst is just acquire-release plus a global store order. And I don't think the global store order comes into play in this specific litmus test.

Hob answered 2/12, 2021 at 18:7 Comment(13)
plus a single total order exists in which all threads observe all modifications in the same order . What single total order (which all threads see the same) could give 0 & 0?Nardone
I don't think both variables could be zero, but one of them sure could be. As soon as the store happens, all threads "get" the updated value. If y.load() yeilds 0, then y.store(1); can't have ran yet and as such, auto r2 = x.load(); will load 1 since x.store(1); has already ran.Countenance
A global order means all threads see (agree) a specific order of the loads and stores in the instructions of every single thread (i.e. they all see, agree, a specific sequence of load/store instructions). So far, this global order may be any sequence, even a sequence where x.load() comes before y.store(1). However, each thread sees its instruction in program order, so for a global order to be present, the other threads must also agree to see this thread's loads/stores in program order. Since this is valid for any thread, the end result is that a global order respect program order.Maxma
So you can interleave thread1 loads/stores with thread2's but you cannot reorder loads/stores in thread1 or thread2. With these conditions, there is no way both loads come before a store. In order to have x = 0 and y = 0 we need a slightly relaxed model called TSO where the store buffer (by delaying the global visibility of a store) is essentially performing a Store-Load reordering.Maxma
@MargaretBloom Thanks Margaret. But I have questions about "each thread sees its instruction in program order". Is this statement specific for sequential consistency or does it apply to other memory orders as well? If the former is true, is it in the standard? Else if the latter is the case, how comes all these "reorderings" permitted by other memory orders?Hob
seq_cst means that all behaviour must be explainable by some interleaving of program-order (for data-race-free programs! If you go looking at non-atomic variables when another thread might be writing them, that's UB). Anyway, that's stronger than acq_rel because it effectively means no StoreLoad reordering. preshing.com/20120710/… / preshing.com/20120515/memory-reordering-caught-in-the-actBandy
@PeterCordes Actually my question comes from when I read about this question #67694187, knowing that the store-sc/load-acquire reordering can happen. I think I didn't understand the single total order very well. Now it seems to me that the single total order includes load-sc, but for anything weaker than sc this single total order is no stronger than acquire-release - am I right here? It's still shocking to me that sc atomic doesn't put extra constraints on weaker atomic.Hob
I was looking for duplicates for this, and actually just found that question you linked as a related non-duplicate, was about to link it when you commented. That Q&A already assume an understanding of the fact that making all the operations SC would stop both threads from reading 0, because that's not possible with an interleaving of source order. Yeah, it's surprising that ISO C++'s model is weaker than what you get from compiling for ISAs where every SC store or RMW includes a full SC barrier that drains the store buffer, like x86 or 32-bit ARM.Bandy
But it allows ISAs like AArch64 to have significantly more efficient SC stores; memory operations only have to wait when there's an SC load in the same thread that recently did an SC store. (STLR then LDAR, as opposed to plain-acquire LDAPR which doesn't wait for previous STLR stores to become globally visible by committing to L1d cache.) It's very rare that programs need more than acq/rel sync, so most of the time full SC is just a burden, but languages like Java don't provide any memory orders other than SC.Bandy
@PeterCordes Actually before I read that related question, I was just wondering how a weaker hardware memory model, i.e. as strong as SC-DRF but no stronger, is more preferable in terms of performance (yeah I read it in Herb's atomic<> weapons talk). IT ALL MAKES SENSE NOW, by knowing that ISO C++'s SC is weaker than I (wrongly) understood before. Thank you Peter, you helped me over and over again!Hob
BTW, the answer to this question is that the cppreference quote is too weak What it says is actually only as strong as x86-TSO (acq_rel plus no IRIW reordering, i.e a total store order, rather than the total order of all SC operations that matches program order, which ISO C++ actually guarantees in eel.is/c++draft/atomics.order#4 That makes it clear that SC operations aren't reordered wrt. each other.) So you're right, that bad summary on cppref wouldn't forbid the reordering here.Bandy
@PeterCordes Wow! Would you mind to put that into an answer? I think that's exactly what this question is looking for.Hob
Guess I might as well post it. I think I've seen another Q&A that revolved around that insufficiently-strong description of SC on cppreference, so I was going to look for a duplicate. If we later find it, we can link them together as dups or with links in my answer.Bandy
B
3

That cppreference summary of SC is too weak, and indeed isn't strong enough to forbid this reordering.

What it says looks to me only as strong as x86-TSO (acq_rel plus no IRIW reordering, i.e a total store order that all reader threads can agree on).

ISO C++ actually guarantees that there's a total order of all SC operations including loads (and also SC fences) that's consistent with program order. (That's basically the standard definition of sequential consistency in computer science; C++ programs that use only seq_cst atomic operations and are data-race-free for their non-atomic accesses execute sequentially consistently, i.e. "recover sequential consistency" despite full optimization being allowed for the non-atomic accesses.) Sequential consistency must forbid any reordering between any two SC operations in the same thread, even StoreLoad reordering.

This means an expensive full barrier (including StoreLoad) after every seq_cst store, or for example AArch64 STLR / LDAR can't StoreLoad reorder with each other, but are otherwise only release and acquire wrt. reordering with other operations. (So cache-hit SC stores can be quite a lot cheaper on AArch64 than x86, if you don't do any SC load or RMW operations in the same thread right afterwards.)

See https://eel.is/c++draft/atomics.order#4 That makes it clear that SC operations aren't reordered wrt. each other. The current draft standard says:

31.4 [atomics.order]

  1. There is a single total order S on all memory_­order​::​seq_­cst operations, including fences, that satisfies the following constraints. First, if A and B are memory_­order​::​seq_­cst operations and A strongly happens before B, then A precedes B in S.

Second, for every pair of atomic operations A and B on an object M, where A is coherence-ordered before B, the following four conditions are required to be satisfied by S:

  • (4.1) if A and B are both memory_­order​::​seq_­cst operations, then A precedes B in S; and
  • (4.2 .. 4.4) - basically the same thing for sc fences wrt. operations.

Sequenced before implies strongly happens before, so the opening paragraph guarantees that S is consistent with program order.

4.1 is about ops that are coherenced-ordered before/after each other. i.e. a load that happens to see the value from a store. That ties inter-thread visibility into the total order S, making it match program order. The combination of those two requirements forces a compiler to use full barriers (including StoreLoad) to recover sequential consistency from whatever weaker hardware model it's targeting.

(In the original, all of 4. is one paragraph. I split it to emphasize that there are two separate things here, one for strongly-happens-before and the list of ops/barriers for coherence-ordered-before.)


These guarantees, plus syncs-with / happens-before, are enough to recover sequential consistency for the whole program, if it's data-race free (that would be UB), and if you don't use any weaker memory orders.

These rules do still hold if the program involves weaker orders, but for example an SC fence between two relaxed operations isn't as strong as two SC loads. For example on PowerPC that wouldn't rule out IRIW reordering the way using only SC operations does; IIRC PowerPC needs barriers before SC loads, as well as after.

So having some SC operations isn't necessarily enough to recover sequential consistency everywhere; that's rather the point of using weaker operations, but it can be a bit surprising that other ops can reorder wrt. SC ops. SC ops aren't SC fences. See also this Q&A for an example with the same "store buffer" litmus test: weakening one store from seq_cst to release allows reordering.

Bandy answered 3/12, 2021 at 7:35 Comment(1)
Actually my mind journey was a stack of questions: (bottom) I got curious about how can a hardware do cheaper (C++11) SC store -> I read about Q&A in #67694187 and got shocked by the allowed reordering of SC -> I read cppref's SC description and got curious about its weak wording about SC -> this SO question (top). This answer not only unwinds this stack, but also even empties it! Now I don't have to ask more questions that I was about to.Hob

© 2022 - 2024 — McMap. All rights reserved.