Using an atomic read-modify-write operation in a release sequence
Say, I create an object of type Foo in thread #1 and want to be able to access it in thread #3.
I can try something like:

std::atomic<int> sync{10};
Foo *fp;

// thread 1: modifies sync: 10 -> 11
fp = new Foo;
sync.store(11, std::memory_order_release);

// thread 2a: modifies sync: 11 -> 12
while (sync.load(std::memory_order_relaxed) != 11);
sync.store(12, std::memory_order_relaxed);

// thread 3
while (sync.load(std::memory_order_acquire) != 12);
fp->do_something();
  • The store/release in thread #1 orders Foo with the update to 11
  • thread #2a increments the value of sync to 12 using a separate (relaxed) load and store, so the increment as a whole is not atomic
  • the synchronizes-with relationship between thread #1 and #3 is only established when #3 loads 11

Synchronization of Foo

The scenario is broken because thread #3 spins until it loads 12, which may arrive out of order (wrt 11), and Foo is not ordered with 12 (due to the relaxed operations in thread #2a).
This is somewhat counter-intuitive, since the modification order of sync is 10 → 11 → 12.

The standard says (§ 1.10.1-6):

an atomic store-release synchronizes with a load-acquire that takes its value from the store (29.3). [ Note: Except in the specified cases, reading a later value does not necessarily ensure visibility as described below. Such a requirement would sometimes interfere with efficient implementation. —end note ]

It also says in (§ 1.10.1-5):

A release sequence headed by a release operation A on an atomic object M is a maximal contiguous subsequence of side effects in the modification order of M, where the first operation is A, and every subsequent operation
- is performed by the same thread that performed A, or
- is an atomic read-modify-write operation.

Now, thread #2a is modified to use an atomic read-modify-write operation:

// thread 2b: modifies sync: 11 -> 12
int val;
while ((val = 11) && !sync.compare_exchange_weak(val, 12, std::memory_order_relaxed));

If this release sequence is correct, Foo is synchronized with thread #3 when it loads either 11 or 12. My questions about the use of an atomic read-modify-write are:

  • Does the scenario with thread #2b constitute a correct release sequence?

And if so:

  • What are the specific properties of a read-modify-write operation that ensure this scenario is correct?
Embank answered 15/8, 2017 at 14:0 Comment(32)
Do you have any particular reason to doubt that store(11) and compare_exchange(11, 12) constitute a release sequence? They satisfy all the requirements in the paragraph you quoted.Goerke
@user3290797 Well, maybe because I have seen these chains before with RMW's at the end, but never in the middle. You are right, it should be correct per the standard. I guess it is more about the follow-up questions.Embank
RMW is special because the store part can't happen ahead of the load part. Conditional branches / speculative execution could let the sync.store(12, mo_relaxed); execute and become globally visible before the spin-loop had actually loaded an 11, violating causality. There can't be a control dependency as part of the implementation of an atomic RMW, only a true data dependency from load to store, so it can't violate causality this way (or any other way, because of that C++ rule allowing an atomic RMW to be part of a release sequence!)Lycanthropy
I should probably post that as an answer, but I'm not 100% confident my reasoning is correct.Lycanthropy
@PeterCordes The way I interpret the RMW scenario is not that Foo is ordered wrt 12 (I don't think it is), but that 12 is guaranteed to arrive 'after' 11 on thread #3. But even then, on a weak platform, what guarantees this ordering. If the RMW store part cannot do speculative execution, how would that impact the ordering between 11 and 12 ?Embank
I think that in your first code block (with while(load) then store), the store(12) could become globally visible first (before the store that made the loop condition false), and the store(11) could step on it. e.g. branch prediction predicts that the spin-loop ends, the store runs, then eventually the load happens and the branch condition is evaluated and found to have gone the right way. I think x86 won't do this, because it disallows LoadStore reordering, but weakly-ordered ISAs could.Lycanthropy
Yes, I agree.. I can see why the first is broken, but how the RMW fixes this is not clear to meEmbank
But for an atomic RMW, even on a platform that normally has no total order for separate loads or stores, atomic RMW ops have to make sure that all other threads/cores see their output after the operation (from whatever thread) that produced their input. Imagine if thread 3 could see 12 and then 11. On a system that worked like that, three threads incrementing a shared counter could lose counts (by incrementing a value that had already been incremented). Since atomic increments have to not lose counts, RMW atomics can't be that weakly ordered.Lycanthropy
Yes, but the counter example works with RMW stores and RMW loads, which guarantee access to the latest value in the modification order. A plain load does not have to return the latest value. In this case, we only have an RMW storeEmbank
Somehow, the RMW store (12) seems to order the value globally with the earlier store (11).. Not sure if the load is related to that; it simply returns whatever shows up firstEmbank
You said "In this case, we only have an RMW store". Did you mean the cmpxchg? A CAS / cmpxchg is an RMW; it has to atomically load / check / store, so it's equivalent to load/inc/store as far as dependencies go. An RMW is always a load + store. There's no such thing as "an RMW store" in isolation, not when we're talking about atomic RMW. The load and store always appear back-to-back in the global order, which is what allows the operation to appear atomic to all observers in the system. See #39394350.Lycanthropy
I meant the store(12) part of the RMW which is seen by the load in thread #3. I was wondering how the RMW enforces the ordering between 11 and 12. The load in #3 returns whatever shows up first.Embank
@LWimsey: (don't forget to @ ping me. I don't get notified if you leave that out. You do because it's your post.) The load in #3 might never see the 11, but if it sees 12 it can't later see the 11 (because the 12 came from an RMW atomic that had the 11 as input.)Lycanthropy
@PeterCordes My wording was a bit clumsy, but I agree.. thread #3 may never see 11; that applies to both scenarios 2a and 2b. But in the 2a case, Foo only becomes (reliably) visible when (and if) thread #3 loads 11. If it loads 12, it has become impossible to access Foo because it is unordered wrt 12, and 11 is 'lost' (I referred to that scenario in the question as 'broken')Embank
@PeterCordes In scenario 2b, the RMW somehow enforces ordering so that you can reliably access Foo in #3 on loading either 11 or 12. Of course, in both scenarios, spinning to load 11 would be a bug (race condition) since nothing guarantees #3 will ever see 11, but if it did, accessing Foo would be fine.Embank
Oh right, I lost track of the big picture. Yes, I think in 2b, the RMW preserves causality, because it can't make 12 globally visible before 11 was. So seeing 12 means that Foo is ready. A separate store doesn't have this property in C++11.Lycanthropy
In asm for real hardware, I think it's usually safe to atomically load, then atomically store something that has a data dependency on the load. But value-prediction for loads is a theoretical possibility that would break this the same way a speculative control dependency does. C++ rules are conservative here, and disallow anything but atomic-RMW propagating a dependency.Lycanthropy
Weakly-ordered ISAs have specific rules about which instructions carry a data dependency for mo_consume ordering (loading and then dereferencing a pointer only needs a LoadLoad barrier on DEC Alpha). I guess it would also apply to storing a new value back into the shared variable, even with relaxed ordering. But like I said, in C++11 this always breaks a release sequence. C++11's memory model is as weak as Alpha. A consume and release store might work.Lycanthropy
@PeterCordes the way the RMW preserves causality, why don't you pour that into an answerEmbank
Yup, was just thinking that it's about time to put this into an answer, now that we cleared up exactly which part you were wondering about.Lycanthropy
BeeOnRope wrote this up nicely :) Are you sure it's safe for Foo not to be an atomic type? In some cases, non-atomic variables aren't synchronized by atomic operations or barriers. e.g. #40579842 shows that atomic_thread_fence doesn't order non-atomics, but atomic_signal_fence does (at least as an implementation detail on gcc).Lycanthropy
@PeterCordes Yes, I'm positive Foo needs no atomicity. The question you are referring to may seem somewhat surprising with the thread_fence behavior, but in that case B is not atomic, so the compiler does not have to take inter-thread behavior into account. If you change B to atomic<int>, it is a different story, because then you are releasing the first store to A to become visible to another threadEmbank
Thanks. I know the question there is misguided, but the fact that signal_fence "works" while thread_fence didn't had me wondering if I was missing something. This clears it up some (although I should probably post a separate question about whether signal_fence barriering non-atomic ops in gcc is an implementation detail or required.)Lycanthropy
@PeterCordes I want to give it some more thought anyway, but let's move this discussion to that questionEmbank
@PeterCordes Regarding your reasoning on why the first case might fail: "...branch prediction predicts that the spin-loop ends, the store runs, then eventually the load happens and the branch condition is evaluated and found to have gone the right way...", but before the moment branch condition becomes true, store instruction (sync=12) isn't yet retired, so it isn't yet globally visible and at the moment branch condition becomes true, sync=11 store is already (globally?) visible, and so is fp = new Foo, so it seems thread 3 receives non-null pointer even in such case.Oligoclase
@PeterCordes "even on a platform that normally has no total order for separate loads or stores" what kind of platform would that be?Natalee
@curiousguy: ARM and PowerPC memory models for example both allow that. Real PowerPC hardware exists that can lead to 2 readers disagreeing about the order of 2 stores done by 2 other threads. Will two atomic writes to different locations in different threads always be seen in the same order by other threads? This is impossible on x86 (Concurrent stores seen in a consistent order)Lycanthropy
@PeterCordes So to clarify, on these memory models, there is no global order on operations of all locations, but there is one of each particular location, right?Natalee
@curiousguy: I'm not sure if even that's true. C++'s memory model is weak enough to allow value-prediction (for relaxed loads), so reads can happen before the write of that value. One thread might have correct value-prediction and see a write before it happens, but another thread might not. So there's probably not a total order of reads + writes for a single location. But I think there's probably an order of writes + atomic-RMWs that all threads can agree on for any given atomic object. (It doesn't have to make logical sense, though.) I'm not sure about PPC / ARM asm for this.Lycanthropy
@PeterCordes Reads of values coming from the future is crazy and needs to be disallowed, if it won't happen in any real CPU.Natalee
@curiousguy: Why? In most cases there's no difference from the reader being delayed relative to the writer. The value-prediction does have to get verified before the load can retire, just like branch prediction. Value-prediction is still mostly a theoretical idea in computer architecture, but I don't see why you'd disallow it for relaxed loads. Also, Alpha 21264's cache banking is another way to violate causality and have what looks like a load from the future ahead of a data-dependent load of a pointer. I forget exactly how that works, but it's in real HW any why mo_consume != relaxed.Lycanthropy
@curiousguy: if you care about ordering, use mo_acquire or mo_consume. But on most C++ implementations on real HW there's no way loads from the future can happen in practice; it's not something that compilers can create at compile time generally. But IIRC, PowerPC's on-paper memory model is weak enough that code compiled now but run on a hypothetical / future PPC with value prediction could do that. Code compiled now with mo_acquire already has to use enough barriers that there's no problem.Lycanthropy

Does the scenario with thread #2b constitute a correct release sequence ?

Yes, per your quote from the standard.

What are the specific properties of a read-modify-write operation that ensure this scenario is correct?

Well, the somewhat circular answer is that the only important specific property is that "The C++ standard defines it so".

As a practical matter, one may ask why the standard defines it like this. I don't think you'll find that the answer has a deep theoretical basis: I think the committee could have also defined it such that the RMW doesn't participate in the release sequence, or (perhaps with more difficulty) have defined so that both the RMW and the separate mo_relaxed load and store participate in the release sequence, without compromising the "soundness" of the model.

They already give a performance-related reason as to why they didn't choose the latter approach:

Such a requirement would sometimes interfere with efficient implementation.

In particular, on any hardware platform that allowed load-store reordering, it would imply that even mo_relaxed loads and/or stores might require barriers! Such platforms exist today. Even on more strongly ordered platforms, it may inhibit compiler optimizations.

So why didn't they then take the other "consistent" approach of not requiring an mo_relaxed RMW to participate in the release sequence? Probably because existing hardware implementations of RMW operations provide such guarantees, and the nature of RMW operations makes it likely that this will remain true in the future. In particular, as Peter points out in the comments above, RMW operations, even with mo_relaxed, are conceptually and practically1 stronger than separate loads and stores: they would be quite useless if they didn't have a consistent total order.

Once you accept that this is how hardware works, it makes sense from a performance angle to align the standard: if you didn't, you'd have people using more restrictive orderings such as mo_acq_rel just to get the release sequence guarantees, but on real hardware that has weakly ordered CAS, this doesn't come for free.


1 The "practically" part means that even the weakest forms of RMW instructions are usually relatively "expensive" operations taking a dozen cycles or more on modern hardware, while mo_relaxed loads and stores generally just compile to plain loads and stores in the target ISA.

Pickmeup answered 1/9, 2017 at 21:41 Comment(2)
The circular reasoning reaffirms what the standard defines, but it lacks detail on 'why'. You say: Probably because existing hardware implementations of RMW operations provide such guarantees.. Yes, apparently the behavior of RMWs is so consistent across platforms that the committee decided to include it in the standard. But the standard probably does not hold the answer, since it is not a textbook; it does not (have to) explain things.Embank
Also, you refer to a consistent total order.. If you mean the modification order of sync, that is 10 → 11 → 12 in both scenarios (with and without the RMW), which may be somewhat confusingEmbank

© 2022 - 2024 — McMap. All rights reserved.