Difference between memory_order_consume and memory_order_acquire

I have a question regarding a GCC wiki article. Under the heading "Overall Summary" the following code example is given:

Thread 1:

y.store (20);
x.store (10);

Thread 2:

if (x.load() == 10) {
  assert (y.load() == 20);
  y.store (10);
}

It is said that, if all stores are release and all loads are acquire, the assert in thread 2 cannot fail. This is clear to me (because the store to x in thread 1 synchronizes with the load from x in thread 2).

But now comes the part that I don't understand. It is also said that, if all stores are release and all loads are consume, the results are the same. Wouldn't it be possible that the load from y is hoisted before the load from x (because there is no dependency between these variables)? That would mean that the assert in thread 2 actually can fail.
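
For clarity, here is how I read the consume variant, with the memory orders spelled out in C++11 syntax (my paraphrase of the article's snippet, assuming std::atomic<int> x and y):

Thread 1:

y.store (20, std::memory_order_release);
x.store (10, std::memory_order_release);

Thread 2:

if (x.load (std::memory_order_consume) == 10) {
  assert (y.load (std::memory_order_consume) == 20);
  y.store (10, std::memory_order_release);
}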

Osi answered 13/8, 2015 at 16:16 Comment(17)
That part of the comment from the article doesn't apply to the code you posted.Benedikt
Why not? The comment from the article refers to the code I posted. I only skipped the code from "Thread 3" (because that part was not relevant for this question).Osi
The article literally tells you that the assert can fail when using Release/acquire mode: d(), thread 3's assert can fail.Benedikt
@Benedikt But thread 3 is irrelevant for the OP's example. The article also says that "The assert in thread 2 must still be true since thread 1 and 2 synchronize with x.load()" when using release+acquire, and then "If the stores are release and loads are consume the results are the same as release/acquire except there may be less hardware synchronization required."Mongo
@Mongo Yes. I clarified my question.Osi
It is a pity that we have two almost equally highly upvoted answers here, which are contradicting.Mongo
@Mongo That's true. But if someone reads both answers (and all comments), everything should be clear. Both answers are helpful and I am glad that this question is answered now.Osi
@Osi I don't quite understand your point: If Iwillnotexist's answer is correct, then (at least) half of Jens's answer is incorrect. Btw, I've sent the author of that GCC article an email, but either I've gotten the address wrong (weird SPAMFREE thing) or he hasn't responded yet.Mongo
@Mongo What I meant is that Jens's code example is helpful to understand the difference between acquire and consume. But regarding this particular question, I also think that the explanation is not completely correct. Thank you for sending an email. I hope he received the mail.Osi
@Mongo I am now even more certain of the correctness of my answer now that I've checked the code in question with a CppMem model. Everything is in an edit of my answer below.Agave
@IwillnotexistIdonotexist Thank you for the proof. I didn't know this tool and I could not reproduce it on my machine (x86-64), because all loads are automatically acquire operations and all stores are automatically release operations on this architecture.Osi
@Mongo Any news from the GCC article author?Agave
@IwillnotexistIdonotexist Unfortunately not. You might want to send an email yourself, which should decrease the risks of me getting the address wrong and the mail getting stuck in a spam filter.Mongo
@Mongo I didn't get a reply yet but by searching through the mailing list I found this email by Paul E. McKenney with a discussion of a not-dissimilar example, and he believes the assert won't hold, citing exactly the same architectures (ARM and PowerPC) as a link within my answer below.Agave
@IwillnotexistIdonotexist Wow that's a long thread. I quickly glanced over the linked email, but failed to find the second consume-load.Mongo
@Mongo It's the return b statement in the first quoted code block, effectively. Immediately after said code block, he says, "The above example can have a return value of 0 if translated straightforwardly into either ARM or Power, right? Both of these can speculate a read into a conditional, and both can translate a consume load into a plain load if data dependencies remain unbroken."Agave
@IwillnotexistIdonotexist As far as I can tell, the return b is supposed to load b non-atomically (that's why the suggested alternative of using acquire makes sense). Jens Gustedt claims that there's a guaranteed order between two atomic consume-loads.Mongo
Answer (score: 21)

The C11 Standard's ruling is as follows.

5.1.2.4 Multi-threaded executions and data races

  15. An evaluation A is dependency-ordered before an evaluation B if:

    — A performs a release operation on an atomic object M, and, in another thread, B performs a consume operation on M and reads a value written by any side effect in the release sequence headed by A, or

    — for some evaluation X, A is dependency-ordered before X and X carries a dependency to B.

  16. An evaluation A inter-thread happens before an evaluation B if A synchronizes with B, A is dependency-ordered before B, or, for some evaluation X:

    — A synchronizes with X and X is sequenced before B,

    — A is sequenced before X and X inter-thread happens before B, or

    — A inter-thread happens before X and X inter-thread happens before B.

  17. NOTE 7 The ‘‘inter-thread happens before’’ relation describes arbitrary concatenations of ‘‘sequenced before’’, ‘‘synchronizes with’’, and ‘‘dependency-ordered before’’ relationships, with two exceptions. The first exception is that a concatenation is not permitted to end with ‘‘dependency-ordered before’’ followed by ‘‘sequenced before’’. The reason for this limitation is that a consume operation participating in a ‘‘dependency-ordered before’’ relationship provides ordering only with respect to operations to which this consume operation actually carries a dependency. The reason that this limitation applies only to the end of such a concatenation is that any subsequent release operation will provide the required ordering for a prior consume operation. The second exception is that a concatenation is not permitted to consist entirely of ‘‘sequenced before’’. The reasons for this limitation are (1) to permit ‘‘inter-thread happens before’’ to be transitively closed and (2) the ‘‘happens before’’ relation, defined below, provides for relationships consisting entirely of ‘‘sequenced before’’.

  18. An evaluation A happens before an evaluation B if A is sequenced before B or A inter-thread happens before B.

  19. A visible side effect A on an object M with respect to a value computation B of M satisfies the conditions:

    — A happens before B, and

    — there is no other side effect X to M such that A happens before X and X happens before B.

    The value of a non-atomic scalar object M, as determined by evaluation B, shall be the value stored by the visible side effect A.

(emphasis added)


In the commentary below, I'll use the following abbreviations:

  • Dependency-ordered before: DOB
  • Inter-thread happens before: ITHB
  • Happens before: HB
  • Sequenced before: SeqB

Let us review how this applies. We have 4 relevant memory operations, which we will name Evaluations A, B, C and D:

Thread 1:

y.store (20);             //    Release; Evaluation A
x.store (10);             //    Release; Evaluation B

Thread 2:

if (x.load() == 10) {     //    Consume; Evaluation C
  assert (y.load() == 20); //   Consume; Evaluation D
  y.store (10);
}

To prove the assert never trips, we in effect seek to prove that A is always a visible side effect at D. In accordance with 5.1.2.4 (15) and (16), we have:

A SeqB B DOB C SeqB D

which is a concatenation ending in DOB followed by SeqB. By (17), such a concatenation is explicitly ruled not to be an ITHB relationship, despite what (16) alone might suggest.

We know that since A and D are not in the same thread of execution, A is not SeqB D; and the only candidate ITHB concatenation was ruled out above, so A is not ITHB D either. Hence neither of the two conditions in (18) for HB is satisfied, and A does not HB D.

It follows that A is not guaranteed to be a visible side effect at D, since the first condition of (19) is not met. The assert may therefore fail.


How this could play out in practice is described in the C++ standard's memory model discussion and in Section 4.2, Control Dependencies:

  1. (Some time ahead) Thread 2's branch predictor guesses that the if will be taken.
  2. Thread 2 approaches the predicted-taken branch and begins speculative fetching.
  3. Thread 2 out-of-order and speculatively loads 0xGUNK from y (Evaluation D). (Maybe it was not yet evicted from cache?).
  4. Thread 1 stores 20 into y (Evaluation A).
  5. Thread 1 stores 10 into x (Evaluation B).
  6. Thread 2 loads 10 from x (Evaluation C)
  7. Thread 2 confirms the if is taken.
  8. Thread 2's speculative load of y == 0xGUNK is committed.
  9. Thread 2 fails assert.
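
In compiler terms, the hardware speculation above is equivalent to hoisting the load of y ahead of the branch, roughly as follows (a hypothetical rewrite for illustration, not actual compiler output):

int tmp = y.load (std::memory_order_consume); // hoisted: carries no dependency on x
if (x.load (std::memory_order_consume) == 10) {
  assert (tmp == 20);                         // may observe the stale value
  y.store (10, std::memory_order_release);
}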

The reason why it is permitted for Evaluation D to be reordered before C is that a consume does not forbid it. This is unlike an acquire-load, which prevents any load or store after it in program order from being reordered before it. Again, 5.1.2.4 (17) states that a consume operation participating in a ‘‘dependency-ordered before’’ relationship provides ordering only with respect to operations to which this consume operation actually carries a dependency, and there most definitely is not a dependency between the two loads.
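
For contrast, here is a minimal sketch (my example, with hypothetical names) of the pattern consume is designed for, in which the second access does carry a dependency on the consume load. Note that mainstream compilers currently implement consume by promoting it to acquire:

struct Data { int payload; };
std::atomic<Data*> guard {nullptr};

// Thread 1: construct, then publish with a release store
Data* d = new Data {42};
guard.store (d, std::memory_order_release);

// Thread 2: the dereference carries a dependency on the consume load,
// so it cannot be speculated ahead of it
Data* p = guard.load (std::memory_order_consume);
if (p != nullptr)
  assert (p->payload == 42);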


CppMem verification

CppMem is a tool that helps explore shared data access scenarios under the C11 and C++11 memory models.

For the following code that approximates the scenario in the question:

int main() {
  atomic_int x, y;
  y.store(30, mo_seq_cst);
  {{{  { y.store(20, mo_release);
         x.store(10, mo_release); }
  ||| { r3 = x.load(mo_consume).readsvalue(10);
        r4 = y.load(mo_consume); }
  }}};
  return 0; }

The tool reports two consistent, race-free scenarios, namely:

[Figure: Consume, success scenario]

In which y=20 is successfully read, and

[Figure: Consume, failure scenario]

In which the "stale" initialization value y=30 is read.

By contrast, when mo_acquire is used for the loads, CppMem reports only one consistent, race-free scenario, namely the correct one:

[Figure: Acquire, success scenario]

in which y=20 is read.
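
For readers who want to experiment outside CppMem, below is a roughly equivalent standalone C++11 program (my approximation of the model above). As noted in the comments, on x86-64 every load is effectively an acquire, so the failing execution will not be observed there in practice; an ARM or PowerPC machine would be needed.

#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int> x {0}, y {30};

void writer () {
  y.store (20, std::memory_order_release);
  x.store (10, std::memory_order_release);
}

void reader () {
  if (x.load (std::memory_order_consume) == 10) {
    // This load carries no dependency on the load of x, so with consume
    // semantics it may be reordered before it; the assert is allowed to fail.
    assert (y.load (std::memory_order_consume) == 20);
  }
}

int main () {
  std::thread t1 (writer), t2 (reader);
  t1.join ();
  t2.join ();
  return 0;
}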

Agave answered 17/8, 2015 at 7:59 Comment(3)
This is really helpful! Thanks! I still have a dumb question, in the example you gave here, why can't we reason that operations A and D form a release/consume relation which, according to 15 and 16, can be considered that A inter-thread happens before D? In other words, why must we take operations B and C into consideration rather than reasoning only on A and D?Dewain
@PJ.Hades The example I gave with speculative loads is an example of why. Because the CPU is allowed to reorder the consumes C & D wrt each other (given neither carries a dependency to the other), D can conceivably happen long before A and B even begin. Therefore, C reading 10 does not constitute proof that D will read 20 (IOW, that synchronization always happens properly). The scenario you describe (A, D, B, C) would result in a correct sync, but the one I mention (D, A, B, C) is also possible under the same conditions.Agave
It makes sense. I think I misunderstood something in the definition. Thanks for the explanation!Dewain
Answer (score: 11)

Both establish a transitive "visibility" order on atomic stores, unless they have been issued with memory_order_relaxed. If a thread reads an atomic object x with one of these modes, it can be sure that it sees all modifications to all atomic objects y that were done before the write to x.

The difference between "acquire" and "consume" is in the visibility of non-atomic writes to some variable z, say. For acquire all writes, atomic or not, are visible. For consume only the atomic ones are guaranteed to be visible.

thread 1                               thread 2
z = 5 ... store(&x, 3, release) ...... load(&x, acquire) ... z == 5 // we know that z is written
z = 5 ... store(&x, 3, release) ...... load(&x, consume) ... z == ? // we may not have last value of z
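
A compilable sketch of the acquire row above (my example, not part of the original answer): the release store publishes the non-atomic write to z, and the acquire load that reads 3 guarantees the write to z is visible.

#include <atomic>
#include <cassert>
#include <thread>

int z = 0;                  // non-atomic
std::atomic<int> x {0};

void thread1 () {
  z = 5;                                  // non-atomic write...
  x.store (3, std::memory_order_release); // ...published by the release store
}

void thread2 () {
  if (x.load (std::memory_order_acquire) == 3)
    assert (z == 5); // guaranteed: the release store synchronizes-with the
                     // acquire load, so the write to z happens before this read
}

int main () {
  std::thread a (thread1), b (thread2);
  a.join ();
  b.join ();
  return 0;
}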
Eardrop answered 13/8, 2015 at 16:49 Comment(12)
First of all thank you for your answer! Could you please give me a reference to the corresponding part of the C standard for the fact that all atomic writes are visible for consume?Osi
Unless Anthony Williams is wrong in "C++ Concurrency in Action" p139-140, not all atomic writes are visible after consume. The example in Williams' book combines a relaxed atomic write/read with a release/consume operation, though. I don't quite see why that should be different for release/consume + release/consume, since the OP's example has no dependency ordering.Mongo
Hmm while that reconciles the example in the book with your answer, I don't quite understand how that is enforced in the specification. As far as I understand it, we need the synchronizes-with relationship between the stores and loads to x to invoke the "A synchronizes with X and X is sequenced before B" rule. But that isn't present for release/consume, and I can't see what other guarantee we have between the two stores in thread 1 besides the thread-internal sequenced before, which would also apply to non-atomic operations.Mongo
@splotz90, all of this is buried in the hard-to-digest section 5.1.2.4. In essence "consume" operations contribute to the "dependency-ordered before" relation. Whereas the "acquire" operations contribute to the "inter-thread happens before" relation, that is a superset of this. (And then memory_order_seq_cst is a total ordering of all this.)Eardrop
@dyp, memory_order_relaxed operations are not among the synchronizing operations. There is an explanation in NOTE 2 (that is, p6) of the standard. They only guarantee consistency of the data, not synchronization.Eardrop
I was referring to this particular example; your answer seems to suggest there's an exception from synchronisation for non-atomic writes and relaxed atomic writes, but I can't find any such exception in the spec (the dependency ordering even implies that some non-atomic writes become visible, even if they do not occur on the same object, e.g. the dependency that is carried in p->x from p to p->x). If there is no such exception, then there must be two distinct ways for rel+acq and rel+con, but I can't find a way for rel+con to establish the happens-before between the store and load to y.Mongo
I actually have the same problem to understand why the standard enforces a happens-before relation between the store to y and load from y.Osi
@Mongo The whole point of "synchronization" is to have a past. W/o synchro there is no past and no future: things in other threads may be seen or maybe not. But events in the future are never seen and those in the past are always done. If non-atomic events were not part of that "past" then object creation would never enter history and you couldn't even publish an object and expose it to another thread: even the most simple atomic publish idiom (in MT pseudo-code) {x = new T; a.store(true);} ||| { if (a.load()) x->f(); } would fail.Grimy
@Grimy Woah, resurrecting a four-year-old thread, thanks for the nostalgia :D -- Indeed, I believe your example only holds for release+acq, but not for release+con nor for relaxed+anything on a - since there's no dependency ordering between x = new T; and a.store(true). Modeling this on CppMem via x = 42; instead of x = new T; shows that release+acq is race free but the other two contain a data race. You need acq+rel or mutex etc to publish a new object. "But even in the future are never seen and those in the past are always done." I don't get that sentence :(Mongo
@Mongo If an event (a memory SE) is in our past, we will always "see" it: the effect is either testable or clobbered by another event that isn't older. An event in the future is never visible. An event in neither is potentially seen. When you program w/ mutexes only (and no try_lock), you can use only the past, so stuff is clean. When you use atomics, you can see something before it's in the "past": you can test whether an atomic was changed by a parallel computation, which is a race. Then by acq (or another thread's rel) you can put stuff in the past.Grimy
@Mongo It should be possible to use consume for a new object, as it's a typical use case. But compiler writers have found it very difficult to get the correct consume semantics past the compiler front end in cases that might be optimized. (Linus wrote a list of problematic cases where compilers might break code data/addr dependencies in code that has read such "volatile" values: think x[dep] where x has type int[1] and dep has a "consume"-like value dependency.)Grimy
@Grimy I have some trouble understanding the point you'd like to make, and maybe some more space (word limit) might help. So I've created this chat room.Mongo
