Can you split an atomic load-store with stricter memory ordering into separate relaxed load-stores along with memory barrier instructions?

Here is a simple example of acquire-release semantics used for data synchronization across threads.

// thread 1                                     // thread 2
data = 100;
flag.store(true, std::memory_order_release);
                                                while(!flag.load(std::memory_order_acquire));
                                                assert(data == 100);

As I understand it, this correctly demonstrates acquire-release memory ordering, and the program will work as intended.

But what is the case if I were to use standalone barriers?

// thread 1                                     // thread 2
data = 100;
std::atomic_thread_fence(std::memory_order_release);
flag.store(true, std::memory_order_relaxed);
                                                while(!flag.load(std::memory_order_relaxed))
                                                    std::atomic_thread_fence(std::memory_order_acquire);
                                                assert(data == 100);

I always thought that this is exactly equivalent to the first example.

But today I watched a talk at CppCon by Herb Sutter (C++ and Beyond 2012: Herb Sutter - atomic Weapons). At 1:07:10 in the video, he gives an example to show that standalone fences are suboptimal. After seeing that I am confused.

The example is this:

   // thread 1                                     // thread 2
   widget *temp = new widget();
XX mb();  XXXXXXXXXXXXXXXXXXXXX  // a
   global = temp;
                                                   temp2 = global;
                                                XX mb();  XXXXXXXXXXXXXXXXXXXXX  // b
                                                   temp2->do_something();
                                                   temp2 = global;
                                                XX mb();  XXXXXXXXXXXXXXXXXXXXX
                                                   temp2->do_something_else();

He says that, at a and b, you need a full barrier, not just release and acquire, since those are not associated with any particular store or load. Furthermore, he says that standalone acquire and release barriers don't make any sense. Is this correct? (For simplicity, it is assumed that reads and writes to global are indivisible, i.e. global never contains a torn value.)

Why does this not work?

   // thread 1                                     // thread 2
   widget *temp = new widget();
XX release();  XXXXXXXXXXXXXXXX  // a
   global = temp;
                                                   temp2 = global;
                                                XX acquire();  XXXXXXXXXXXXXXXX  // b
                                                   temp2->do_something();
                                                   temp2 = global;
                                                XX acquire();  XXXXXXXXXXXXXXXX
                                                   temp2->do_something_else();
Anabolism answered 16/7, 2022 at 13:5 Comment(22)
When you say an "atomic load-store", you mean "load or store", not RMW, right? Yes, your example is fine, differing only in performance. Re: Sutter's claim that acquire and release fences are broken, he was misreading the standard and forgetting the difference between an acquire fence and an acquire operation. preshing.com/20131125/… debunks this misconception, which apparently was common in the early days of C++11 at least, and unfortunately is present in Herb's otherwise excellent talk.Synder
@PeterCordes > Your example is fine, differing only in performance. Which two examples are you talking about? Do the first two examples have any difference? Even the slightest?Anabolism
I'm talking about the first 2 code blocks. Both are fine for correctness, but the 2nd one with separate barriers will compile less efficiently for AArch64 and especially 32-bit ARM with ARMv8 features. godbolt.org/z/567n1dcPW shows how ARMv8 can use ldar as an acquire load, but a separate barrier compiles to a separate barrier instruction. For 32-bit mode, it's dmb ish (a full barrier). AArch64 has an acquire-fence instruction so that's less bad.Synder
@PeterCordes Is it just a missed optimization by the compiler, or is there really some difference in semantics between those two examples? In other words, does splitting them into a separate barrier make any difference in any case?Anabolism
It makes all previous relaxed operations into acquire wrt. things after the barrier, unlike an acquire operation, which isn't a 2-way barrier. So it wouldn't be a valid optimization; the barrier form is stronger. (But yes, acquire_op -> (relaxed + acquire barrier) is legal as far as I understand, with the barrier version being stronger. On some ISAs, that's how the asm has to loop, although there can still be a difference in terms of compile-time reordering allowed. On other ISAs where there are acquire-load instructions, it's slower, so compilers don't.)Synder
@PeterCordes Thanks, Is there an exhaustive article/post/guide about these concepts?Anabolism
The ISO C++ standard is pretty detailed about the correctness properties, although it doesn't exhaustively cover the implications, just lays out the rules. cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html covers how std::atomic functions map to asm for various ISAs.Synder
@PeterCordes, SouravKannanthaB, I believe the second example is broken. The acquire fence is called inside the loop, but once the flag is set, the loop exits and it misses the fence. You'd want it after the loop, not in it. Imagine thread 2 running after thread 1 has already finished.Derange
@Derange OK, then the fence must be both in and after the loop, I think. If there is no fence inside the loop, then there is nothing preventing the compiler from checking the condition only once and going into an infinite loop.Anabolism
@LWimsey: Oh right, I knew something looked weird with the ARM asm (and the x86 asm, with a weird jump over nothing). When I first looked at the code, I had been assuming the fence was after the loop, because that's the obvious place to put it. (If you're "optimizing" so you can spin faster by not running a barrier instruction before checking again; generally a false optimization.) It wasn't until putting it on Godbolt that I realized it was inside the loop, but I missed the correctness problem.Synder
It's still true that load(acquire) can be replaced by load(relaxed)+fence(acquire) in every case, but you might need to wrap that in a function or statement-expression if you want to use it as a loop condition, because yeah, both have to run every time, otherwise one of the loads is only relaxed.Synder
@SouravKannanthaB The fence must be placed after the loop, not in it. An infinite loop is not possible since the relaxed load is always going to see the updated flag value.Derange
@Derange AFAIK, with relaxed loads, r1 = a.load(relaxed); r2 = a.load(relaxed); can be collapsed into r1 = a.load(relaxed); r2 = r1; as per the standard. So I guess while(a.load(relaxed)); can turn into r1 = a.load(relaxed); while(r1);. Correct me if I am wrong. This optimization is not possible if there is a barrier after load. (Apparently, currently no compiler is doing these things. Still it is allowed)Anabolism
@SouravKannanthaB Separate loads can be merged into a single operation (although AFAIK there is no compiler that does that). Hoisting a relaxed load out of a loop is illegal, as it would violate atomic visibility requirements. It's a common misconception that an acquire fence is necessary for the relaxed load to get an up-to-date value; this is not true. The fence puts ordering constraints on operations sequenced after the fence, wrt the relaxed atomic load itself.Derange
@SouravKannanthaB: I agree with LWimsey that the relaxed load cannot be hoisted, even without a barrier. The abstract machine that defines observable behavior has to perform every load. The real machine can only optimize them out if observable behavior does not change, or was undefined to begin with. That's the case with a non-atomic load: the compiler can assume there are no stores, because if there were, it would be a data race and behavior is undefined. But atomic loads and stores, even when relaxed, do not cause data races, so the exemption does not apply.Amphibolite
@SouravKannanthaB: To put it another way, it is legal to collapse two loads into one, since it is possible that by mere timing, no store would occur in between the two. Likewise for any finite number of loads. But that logic does not let us collapse infinitely many loads into one. The store is going to become visible eventually, and so no matter how long it takes, one of the abstract loads is going to observe it. In the abstract machine, the loop must terminate, and so the same must happen in the real program.Amphibolite
@NateEldredge Ohh. I thought loads could be hoisted because compilers only see a single thread at a time, and in that thread the loaded variable is invariant. Since this is the same case in this example with the relaxed load, I thought hoisting would happen here too. I never thought the data race was the reason hoisting is allowed. :)Anabolism
@SouravKannanthaB: "Compilers only see a single thread at a time" is not to be found in the standard anywhere; it is sort of just a corollary of the data race rule. For non-atomic variables, the compiler may assume that no other thread is accessing them, because to do so would be a data race, and so compiles as if there are no other threads. But for atomic variables, the compiler does have to assume that other threads may access them at any time, because doing so would not cause UB.Amphibolite
@NateEldredge In x86, naturally aligned integer loads and stores are atomic. So, if thread1 is doing while(!x); and thread2 concurrently stores x=1;, there is still no data race since x can never hold a torn value, isn't it? But compilers still hoist x out of the loop!Anabolism
@SouravKannanthaB: This is the distinction between the machine's memory model, and the language's. The x86 machine has well-defined behavior when separate threads do aligned integer loads and stores. But the C++ language does not. Therefore, the C++ compiler is not obliged to provide the behavior of the underlying x86 machine. It is allowed to compile in such a way as to provide the "illusion" of a machine that has worse semantics than the underlying machine actually has, and this allows it to do optimizations like the load hoist.Amphibolite
@SouravKannanthaB: So in reasoning about what a C++ program would do, you cannot base your reasoning on the x86 memory model, even if you will actually be compiling and running the code only on an x86 machine. This is a common mistake I have seen many times on this site. It's true that a single aligned store instruction on x86 is atomic and free of tearing. But the C++ language does not promise that x=1; is atomic and free of UB, and so when you compile x=1; on an x86 machine, the compiler does not have to compile it into a single aligned store instruction. [...]Amphibolite
@SouravKannanthaB: It could compile it into multiple sub-word stores, or an unaligned store of part of the word, or two stores, or (as in the hoisting example) no stores at all. I have seen all of these in real compiler output. Just so long as the semantics of the C++ language are preserved. And in the case of code that is a data race by the C++ definition, it is UB with no semantics at all, so the compiler can do literally whatever it wants.Amphibolite

© 2022 - 2024 — McMap. All rights reserved.