Can non-atomic-load be reordered after atomic-acquire-load?

Asked 30/7, 2016 at 18:13 Answered 26/2, 2020 at 23:2

Solved c++multithreading c++11 concurrency memory-fences

As is well known, since C++11 there are 6 memory orders, and the documentation says the following about std::memory_order_acquire:

http://en.cppreference.com/w/cpp/atomic/memory_order

memory_order_acquire

A load operation with this memory order performs the acquire operation on the affected memory location: no memory accesses in the current thread can be reordered before this load. This ensures that all writes in other threads that release the same atomic variable are visible in the current thread.

1. Non-atomic-load can be reordered after atomic-acquire-load:

I.e. the documentation does not guarantee that a non-atomic-load cannot be reordered after an atomic-acquire-load.

static std::atomic<int> X;
static int L;
...

void thread_func() 
{
    int local1 = L;  // load(L)-load(X) - can be reordered with X ?

    int x_local = X.load(std::memory_order_acquire);  // load(X)

    int local2 = L;  // load(X)-load(L) - can't be reordered with X
}

Can load int local1 = L; be reordered after X.load(std::memory_order_acquire);?

2. We can think that non-atomic-load can not be reordered after atomic-acquire-load:

Some articles contained a picture showing the essence of acquire-release semantics. That is easy to understand, but can cause confusion.

For example, we may think that std::memory_order_acquire prevents reordering of any series of Load-Load operations, so that even a non-atomic-load can't be reordered after an atomic-acquire-load.

3. Non-atomic-load can be reordered after atomic-acquire-load:

Fortunately, this is clarified here: Acquire semantics prevent memory reordering of the read-acquire with any read or write operation which follows it in program order. http://preshing.com/20120913/acquire-and-release-semantics/

But it is also known that: On strongly-ordered systems (x86, SPARC TSO, IBM mainframe), release-acquire ordering is automatic for the majority of operations.

And Herb Sutter on page 34 shows: https://onedrive.live.com/view.aspx?resid=4E86B0CF20EF15AD!24884&app=WordPdf&authkey=!AMtj_EflYn2507c

4. I.e. again, we can think that non-atomic-load can not be reordered after atomic-acquire-load:

I.e. for x86:

release-acquire ordering is automatic for the majority of operations
Reads are not reordered with any reads. (any - i.e. regardless of older or not)

So can non-atomic-load be reordered after atomic-acquire-load in C++11?

Oxy answered 30/7, 2016 at 18:13 Comment(6)

In C++11, it doesn't matter - a race-free program cannot detect whether or not the reordering has taken place (and a program containing races exhibits undefined behavior, at which point any outcome whatsoever is possible). In your example, if there's another thread that writes to L, you have a data race. Therefore, from the standpoint of the C++ standard, the question is moot. – Shulem 30/7, 2016 at 18:24

@Igor: I beg to differ. There is no data race if L is only read with a happens-before relationship to when it was written, which the load-acquire may provide (depending on how the store is done). It would be impossible to safely construct singletons, otherwise (every single member would have to be atomic, etc.). – Assailant 30/7, 2016 at 18:46

Concerning the free acquire/release support on x86: This is true at the hardware level, but (besides the benefit of portable code) this doesn't mean you can omit the correct memory ordering in your code, because the compiler is free to re-order stuff otherwise, not just the CPU. – Assailant 30/7, 2016 at 18:48

@Assailant By what mechanism, in the example shown, would a happens-before relationship be established? What could another thread possibly do before or after modifying L to ensure such a relationship? I just don't see it. – Shulem 31/7, 2016 at 1:30

@Assailant In any case, if we assume there's some synchronization added to ensure that writes to L are synchronized with this read, then it wouldn't matter whether or not it gets reordered with the load of X. Once again, a program with such synchronization wouldn't be able to detect whether the reordering occurred, so why would anyone care? Can you show an example that a) is race-free, and b) would produce different output depending on whether the reordering occurred? – Shulem 31/7, 2016 at 1:39

Is thread_func() your whole thread? – Abyssal 14/12, 2019 at 4:22

The reference you cited is pretty clear: you can't move reads before this load. In your example:

static std::atomic<int> X;
static int L;


void thread_func() 
{
    int local1 = L;  // (1)
    int x_local = X.load(std::memory_order_acquire);  // (2)
    int local2 = L;  // (3)
}

memory_order_acquire means that (3) cannot happen before (2) (the load in (2) is sequenced before the load in (3)). It says nothing about the relationship between (1) and (2).

Imaginal answered 30/7, 2016 at 18:40 Comment(5)

Yep! Also note that even if (1) is ordered before (2), the value read by (2) may be less recent than the value read by (1). So even if (1) stays before (2), it doesn't really matter. All you know is that the value read at (3) is guaranteed to be at least as recent as the value read at (2), provided the stores to X are done with a release fence. – Assailant 30/7, 2016 at 18:51

Uh, (1) is happily sequenced-before (2). I'm not quite sure what you are trying to say, but the statements "(1) is sequenced-before (2)" and "(2) is sequenced-before (3)" are obviously true. – Shulem 31/7, 2016 at 1:34

I think you are being, shall we say, somewhat sloppy in your terminology. And once you try to express yourself in the proper standardese, you'll discover that there's really nothing to say. The standard terminology won't even let you express the concept of (1) and (2) being reordered, because, I believe, such reordering is undetectable to a race-free program. Either there is a data race and then the program exhibits undefined behavior; or there isn't, and then the program behaves in sequentially consistent manner. – Shulem 31/7, 2016 at 1:52

In fact, the standard has a non-normative note saying just that: "[intro.multithread]/15 [ Note: This states that operations on ordinary objects are not visibly reordered. This is not actually detectable without data races, but it is necessary to ensure that data races, as defined below, and with suitable restrictions on the use of atomics, correspond to data races in a simple interleaved (sequentially consistent) execution. —end note ]" – Shulem 31/7, 2016 at 1:54

The only significance of acquire at (2) is that it synchronizes-with a release operation on X in another thread that precedes it in X's modification order; this in turn may help establish a happens-before relationship between (3) and a modification of L (one that happens-before that release operation). But (2) has no such effect on (1). I guess that's how you reason about this example within the standard. – Shulem 31/7, 2016 at 2:28

I believe this is the correct way to reason about your example within the C++ standard:

X.load(std::memory_order_acquire) (let's call it "operation (A)") may synchronize with a certain release operation on X (operation (R)) - roughly, the operation that assigned the value to X that (A) is reading.

[atomics.order]/2 An atomic operation A that performs a release operation on an atomic object M synchronizes with an atomic operation B that performs an acquire operation on M and takes its value from any side effect in the release sequence headed by A.

This synchronizes-with relationship may help establish a happens-before relationship between some modification of L and the assignment local2 = L. If that modification of L happens-before (R), then, due to the fact that (R) synchronizes-with (A) and (A) is sequenced-before the read of L, that modification of L happens-before this read of L.
But (A) has no effect whatsoever on the assignment local1 = L. It neither causes data races involving this assignment, nor helps prevent them. If the program is race-free, then it must necessarily employ some other mechanism to ensure that modifications of L are synchronized with this read (and if it's not race-free, then it exhibits undefined behavior and the standard has nothing further to say about it).
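
A minimal sketch of the reasoning above, assuming a hypothetical writer thread that modifies L and then performs the release store (R) on X. The names writer, reader and run_message_passing are illustrative and not from the question. The read of L placed after (A) is race-free; a read of L placed before the spin loop would get no help from (A) and would race with the writer.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical names mirroring the question: L is plain data, X is the flag.
static int L = 0;
static std::atomic<int> X{0};

// Writer: the modification of L is sequenced-before the release store (R).
void writer() {
    L = 42;                                  // plain, non-atomic write
    X.store(1, std::memory_order_release);   // (R)
}

// Reader: returns the value of L read after the acquire load (A) has seen 1.
int reader() {
    // (A): spin until the load takes its value from (R).
    // Reading L here, before the loop, would be a data race with writer().
    while (X.load(std::memory_order_acquire) != 1) {
        std::this_thread::yield();
    }
    // (R) synchronizes-with (A), and (A) is sequenced-before this read,
    // so the write L = 42 happens-before it: no data race.
    return L;
}

// Runs both threads once and returns what the reader saw.
int run_message_passing() {
    L = 0;
    X.store(0, std::memory_order_relaxed);
    int seen = -1;
    std::thread t1([&] { seen = reader(); });
    std::thread t2(writer);
    t1.join();
    t2.join();
    return seen;
}
```

Calling run_message_passing() returns 42 on every run, because the only read of L is ordered after the write by the happens-before chain described above.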

It is meaningless to talk about "instruction reordering" within the four corners of the C++ standard. One may talk about machine instructions generated by a particular compiler, or the way those instructions are executed by a particular CPU. But from the standard's standpoint, these are merely irrelevant implementation details, as long as that compiler and that CPU produce observable behavior consistent with one possible execution path of an abstract machine described by the standard (the As-If rule).

Shulem answered 31/7, 2016 at 3:2 Comment(2)

> and takes its value from any side effect in the release sequence headed by A But X.load value is never actually used/read. Suppose it's not even assigned to the local variable, will this acquire load then provide any sort of guarantees/restrictions or could it theoretically be safely removed by compiler? – Corunna 21/4, 2022 at 6:13

@DanM. Yes, I believe the load can be removed under as-if rule. I can't think of any way to extend the example so that a race-free program is produced that would be able to detect, via a change in observable behavior, whether or not that load was actually performed. – Shulem 22/4, 2022 at 1:22

A load operation with this memory order performs the acquire operation on the affected memory location: no memory accesses in the current thread can be reordered before this load.

That's like a rule of thumb of compiler code generation.

But that's absolutely not an axiom of C++.

There are many cases, some trivially detectable, some requiring more work, where an operation on memory Op on V can be provably reordered with an atomic operation X on A.

The two most obvious cases:

when V is a strictly local variable: one that can't be accessed by any other thread (or signal handler) because its address is not made available outside of the function;
when A is such a strictly local variable.

(Note that these two reorderings by the compiler are valid for any of the possible memory orderings specified for X.)

In any case, the transformation is not visible, it doesn't change the possible executions of valid programs.
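
As an illustration of the first case, here is a small sketch with a strictly local V. The function name sum_with_local and the initial value of A are assumptions made for the example. Whether a given compiler actually moves the loop across the load is not guaranteed; the point is only that it is allowed to, since the result is the same either way.

```cpp
#include <atomic>
#include <cassert>

static std::atomic<int> A{7};   // hypothetical shared atomic, initial value 7

// V is strictly local: its address never leaves the function, so no other
// thread (or signal handler) can observe it. A compiler may compute V before,
// after, or interleaved with the acquire load; no valid program can tell.
int sum_with_local() {
    int V = 0;
    for (int i = 1; i <= 4; ++i) {
        V += i;                                  // operations on the local V
    }
    int a = A.load(std::memory_order_acquire);   // atomic operation X on A
    return V + a;                                // 10 + 7 when A is still 7
}
```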

There are less obvious cases where these types of code transformations are valid. Some are contrived, some are realistic.

I can easily come up with this contrived example:

#include <atomic>

using namespace std;

static atomic<int> A;

int do_acq() {
  return A.load(memory_order_acquire);
}

void do_rel() {
  A.store(0, memory_order_release);
} // that's all folks for that TU

Note:

the use of a static variable makes it possible to see all operations on the object, even with separately compiled code; the functions which access the atomic synchronization object are not static and can be called from anywhere in the program.

As a synchronization primitive, operations on A establish synchronizes-with relations: there is one between:

thread X that calls do_rel() at point pX
and thread Y that calls do_acq() at point pY

There is a well defined order of modification M of A corresponding to the calls to do_rel() in different threads. Each call to do_acq() either:

observes the result of a call to do_rel() at pX_i and synchronizes with thread X by pulling in the history of X at pX_i
observes the initial value of A

On the other hand, the value is always 0, so the calling code only gets a 0 from do_acq() and cannot determine what happened from the return value. It can know a priori that a modification of A has already happened, but it can't learn that a posteriori from the value it reads. The a priori knowledge can come from another synchronization operation. A priori knowledge is part of the history of thread Y. Either way, the acquire operation does not add knowledge and does not add a past history: the known part of the acquire operation is empty, it doesn't reliably acquire anything that was in the past of thread Y at pY_i. So the acquire on A is meaningless and can be optimized out.

In other words: A program valid for all possible values of M must be valid when do_acq() sees the most recent do_rel() in the history of Y, the one that is before all modifications of A that can be seen. So do_rel() adds nothing in general: do_rel() can add a non-redundant synchronizes-with in some executions, but the minimum of what it adds to Y is nothing, so a correct program, one that doesn't have a race condition (expressed as: its behavior depends on M, such that its correctness is a function of getting some subset of the allowable values for M), must be prepared to handle getting nothing from do_rel(); so the compiler can make do_rel() a NOP.

[Note: This line of argument doesn't easily generalize to all RMW operations that read a 0 and store a 0. It probably can't work for acq-rel RMW. In other words, acq+rel RMW are more powerful than separate loads and stores, for their “side effect”.]

Summary: in that particular example, not only can the memory operations move up and down with respect to an atomic acquire operation, the atomic operations can be removed completely.

Abyssal answered 14/12, 2019 at 4:22 Comment(0)

Just to answer your headline question: yes, any loads (whether atomic or non-atomic) can be re-ordered after an atomic load. Similarly any stores can be re-ordered before an atomic store.

However, an atomic store is not necessarily allowed to be re-ordered after an atomic load or vice versa (atomic load re-ordered before atomic store).

See Herb Sutter's talk around 44:00.
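
A sketch of the store/load case, assuming the statement above refers to sequentially consistent atomics (the default memory order). It is the classic store-buffering litmus test; the names P, Q and both_read_zero are hypothetical. With memory_order_seq_cst the outcome where both threads read 0 is forbidden, which would be possible if either store could be reordered after the following load. With only release stores and acquire loads that outcome would be allowed.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

static std::atomic<int> P{0}, Q{0};   // hypothetical flags

// Runs the store-buffering litmus test once with seq_cst operations.
// Returns true if both threads read 0, which seq_cst forbids.
bool both_read_zero() {
    P.store(0);
    Q.store(0);
    int r1 = -1, r2 = -1;
    std::thread t1([&] {
        P.store(1, std::memory_order_seq_cst);    // store first...
        r1 = Q.load(std::memory_order_seq_cst);   // ...then load the other flag
    });
    std::thread t2([&] {
        Q.store(1, std::memory_order_seq_cst);
        r2 = P.load(std::memory_order_seq_cst);
    });
    t1.join();
    t2.join();
    // In any single total order of the four operations, at least one load
    // comes after the other thread's store, so at least one of r1, r2 is 1.
    return r1 == 0 && r2 == 0;
}
```

Repeated calls to both_read_zero() always return false with these orderings.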

Maharanee answered 26/2, 2020 at 23:2 Comment(0)
