Memory ordering with multiple releases and a single acquire

Asked 7/6, 2024 at 10:37 Answered 7/6, 2024 at 15:25

Solved c++multithreading atomic memory-barriers

Consider the following code:

#include <atomic>
#include <thread>
#include <cassert>
#include <memory>

int i = 0;
std::atomic_int a{0};

int main()
{
    std::thread thr1{[]
    {
        i = 1; // A
        a.store(1, std::memory_order::release); // B
    }};
    std::thread thr2{[]
    {
        while (a.load(std::memory_order::relaxed) != 1); // C
        a.store(2, std::memory_order::release); // D
    }};
    std::thread thr3{[]
    {
        while (a.load(std::memory_order::acquire) != 2); // E
        assert(i == 1); // F
        
    }};
    thr1.join();
    thr2.join();
    thr3.join();
}

My assumption is that the assert may or may not fail and the behavior here is undefined.
Although we have the happens-before relationships such as A->B, C->D, D->E, E->F, we don't have such a relationship for B->C because of the relaxed load in C. On the other hand, https://en.cppreference.com/w/cpp/atomic/memory_order says that

All memory writes (including non-atomic and relaxed atomic) that happened-before the atomic store from the point of view of thread A, become visible side-effects in thread B. That is, once the atomic load is completed, thread B is guaranteed to see everything thread A wrote to memory. This promise only holds if B actually returns the value that A stored, or a value from later in the release sequence.

But we can't say that B->D is a release sequence headed by B because there are no read-modify-write operations on a whatsoever, so this paragraph doesn't work here.

Am I right in my understanding?

Calque answered 7/6, 2024 at 10:37 Comment(3)

Your assumption is wrong. Because of the while loops, the variable i can only be tested after the code in threads 1 and 2 have executed. The assert can never fail. That is even independent of the memory orders used. – Whitby 7/6, 2024 at 12:10

@MichaëlRoy, yes, I agree that it can only be tested after the code in the previous 2 threads, but how does this guarantee that the result of i = 1 will be visible there? – Calque 7/6, 2024 at 12:26

That is inverted. The assert not firing (the ordering) is that which must be proven. The absence of that means the assert may fire in the memory model. – Brittaney 7/6, 2024 at 14:9

You are correct. Either the assertion succeeds, or the behavior is undefined (though in practice, that means the assertion should fail).

There is an intuitive way to explain it and a formal way.

Intuition

In an acquire/release memory model, every thread views the memory changes done by other threads as a timeline, and observes some point on that timeline. Threads can disagree over the value that objects hold because they each see their own time point for other threads' modification timeline.

While thr3 can only reach assert(i == 1); once thr2 has completed all its work due to the busy-wait in thr3, at the point of the assertion, thr3 can still have an outdated view of thr1 (where i is set). This outdated view is older than the one thr2 has.

Note that thr1 has to complete all of its work before thr2 can do its work, and thr2 has to complete all of its work before thr3 can do its work. However, this restriction on the sequence of events doesn't restrict what memory operations become visible between threads. Even if thr2 did a.load(std::memory_order::acquire), this wouldn't impose that thr3 has to have a more recent view of thr1.

Formal explanation

i = 1; and assert(i == 1); are possibly a data race. Either the assertion succeeds, or the behavior is undefined.

The problem is that it is unspecified whether a.store(1, std::memory_order::release); synchronizes with a.load(std::memory_order::acquire). The condition for this is found in [atomics.order] p2

An atomic operation A that performs a release operation on an atomic object M synchronizes with an atomic operation B that performs an acquire operation on M and takes its value from any side effect in the release sequence headed by A.

If the load takes its value from the store(1) at some point, then a happens before relationship is formed because i = 1 is sequenced before that store(1) and store(1) synchronizes with load() in thr3.

However, there is no guarantee that this happens ([intro.races] p14):

The value of an atomic object M, as determined by evaluation B, is the value stored by some unspecified side effect A that modifies M, where B does not happen before A.

A plausible sequence of events is that thr2 completes all of its work before thr3 has started. thr3 then loads its value for a from the side effect a.store(2, std::memory_order::release); // D, so it never takes its value from the store in thr1.

In that event, you have a data race (between i = 1 and i == 1), and the behavior is undefined.

Yarvis answered 7/6, 2024 at 15:25 Comment(2)

On most hardware, an acquire load does sync-with all earlier stores to that location. (i.e. a pure store can continue a release-sequence). This is a natural consequence of MESI for CPUs where the only way for a store to be seen by another thread is for the writer to get MESI exclusive ownership of the cache line and then for the reader to get a copy of it. But some ISAs are weaker, allowing hardware implementations that can e.g. forward stores between logical cores of the same physical core (such as POWER); the same mechanism allows IRIW reordering. – Lunchroom 8/6, 2024 at 16:20

So the C++ rules are that weak because of a few ISAs like POWER and NVidia GPUs; most are stronger including x86 and ARMv8. (And no real ARMv7 CPUs were as weak as ARMv7 allowed on paper, since there hasn't been an ARM with SMT (multiple logical cores per physical).) See Why release sequence can only contain read-modify-write but not pure write / What does "release sequence" mean? – Lunchroom 8/6, 2024 at 16:22

You are correct.

The following are sequenced before: A << B, C << D, E << F
The following are coherence ordered before: B << C, C << D, D << E
The following synchronize with: B ~ E

There are two flavors of orderings: happens before and coherence ordered before. Happens before is consistent with program order, and it is how we reason about release-acquire synchronizations. Coherence ordered before is consistent with the coherence ordering, the modification order on a single atomic and on all sequentially consistent atomic operations.

The two orderings do not agree. And that is intentional. See P0668 for the problems in making them agree, and how it even caused problems on certain real architectures.

The reasoning that one would like to apply in the OP program essentially mixes the two: the modification order of a plus the release-acquire and program ordering. This is why one cannot piece together an ordering.

Brittaney answered 7/6, 2024 at 14:34 Comment(1)

Thank you for the answer. The only things I do not understand are: 1). Doesn't coherence ordered before come into play only when the memory_order_seq_cst mode is engaged and 2). If B synchronizes-with E after all then it obviously seems that I'm not correct :( – Calque 7/6, 2024 at 15:54

On any typical implementation the assert should be true.

The compiler should add 2 memory barriers,

in thread1, B will be flushed to the main memory after i due to the release.
in thread3, i can only be loaded after E due to the acquire.

If we can prove that E will happen after B then this assert will always be true.

In thread2 the branch prediction cannot save to the main memory before the branch is taken, so on any typical hardware the effect of D will be visible after the effect of B is visible.

so the assert will always be True for any typical implementation.

this is only about typical implementations, i cannot prove whether the C++ memory model will always make this assert true or not.

Edit: from the comments apparently this is not portable to all platforms.

Richrichara answered 7/6, 2024 at 13:18 Comment(4)

I agree this is what a typical implementation would do, but I'm not convinced that the C standard actually requires it. Remember that the standard's memory model doesn't work in terms of "memory barriers" or "branch prediction" or even "after", only in terms of happens-before, and like OP I do not see any way to establish a happens-before ordering for this code. – Supernumerary 7/6, 2024 at 13:41

I mean C++ rather than C above. – Supernumerary 7/6, 2024 at 13:47

@Nate is correct, and this isn't just a hypothetical problem if you want full portability. Only an RMW continues a release-sequence; a pure store doesn't. So a load doesn't necessarily sync-with all earlier stores in the modification-order. (It does on most ISAs, but at least one mainstream ISA can violate that, probably PowerPC. C++20 even weakened the rules further; pure stores from the same thread used to also continue a release-sequence like RMWs from any thread.) – Lunchroom 7/6, 2024 at 14:25

Why release sequence can only contain read-modify-write but not pure write / What does "release sequence" mean? – Lunchroom 7/6, 2024 at 14:26

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Intuition

Formal explanation

Recommended topics

Hot tags