What formally guarantees that non-atomic variables can't see out-of-thin-air values and create a data race like atomic relaxed theoretically can?
This is a question about the formal guarantees of the C++ standard.

The standard points out that the rules for std::memory_order_relaxed atomic variables allow "out of thin air" / "out of the blue" values to appear.

But for non-atomic variables, can this example have UB? Is r1 == r2 == 42 possible in the C++ abstract machine? Neither variable is 42 initially, so you'd expect neither if body to execute, meaning no writes to the shared variables.

// Global state
int x = 0, y = 0;

// Thread 1:
int r1 = x;
if (r1 == 42) y = r1;

// Thread 2:
int r2 = y;
if (r2 == 42) x = 42;

The above example is adapted from the standard, which explicitly says such behavior is allowed by the specification for atomic objects:

[Note: The requirements do allow r1 == r2 == 42 in the following example, with x and y initially zero:

// Thread 1:
r1 = x.load(memory_order_relaxed);
if (r1 == 42) y.store(r1, memory_order_relaxed);
// Thread 2:
r2 = y.load(memory_order_relaxed);
if (r2 == 42) x.store(42, memory_order_relaxed);

However, implementations should not allow such behavior. – end note]

What part of the so-called "memory model" protects non-atomic objects from these interactions caused by reads seeing out-of-thin-air values?


When a race condition would exist with different values for x and y, what guarantees that a read of a shared (normal, non-atomic) variable cannot see such values?

Can not-executed if bodies create self-fulfilling conditions that lead to a data race?

Rearmost answered 19/6, 2019 at 18:41 Comment(36)
AFAIK the standard doesn't give you that protection.Bree
I do not even fully understand what guarantees you think you're supposed to get. The wording is convoluted and hard to parse. If you can clarify that, I can write you an answer that explains why you don't get them in practical terms.Atropos
std::atomic_thread_fence should guarantee the order of read and write but that doesn't change anything about the race.Whitmore
@Atropos Ideally a guarantee that the sequence of operations can't happen so x and y keep their original zero value;Rearmost
The C++ memory model allows value-prediction (which no real hardware does AFAIK), so the y = r1; store really can happen before the load result of r1 = x is ready even if you did use std::atomic with mo_relaxed to make this code free from data-race UB. Some models of the famously-weak DEC Alpha really can violate causality for pointed-to data (mo_consume), but not I think for just loading and then storing the load result.Kauppi
Oh, and in this case a compiler can easily prove that inside the if(r1 == 42), it can do y = 42; instead of y = r1; breaking the data dependency. So normal branch speculation can let the store happen before the load, on a weakly-ordered ISA like ARM or PowerPC. (Again assuming std::atomic with mo_relaxed, or that the unsafe C was basically transliterated to asm using plain loads/stores, for some particular ISA where we can then reason about that hardware's memory model.)Kauppi
@PeterCordes What does "violate causality for pointed-to data (mo_consume)" look like?Rearmost
Like int *p = atomic_load(a_pointer); int value = atomic_load(p); (but with mo_relaxed, i.e. ordinary asm loads). Even if the writing thread used barriers to make sure the pointed-to data was globally visible before storing the pointer, the read side could still reorder and read the pointed-to data before reading the pointer (thus ending up with value = old contents of *p). See also Memory order consume usage in C11. Also Guarantees in kernel.org/doc/Documentation/memory-barriers.txtKauppi
Update: my earlier comments are kind of off-topic (not wrong, though). Cross-thread speculation and memory reordering are both needed to make this happen in practice. And since neither condition should be taken, I think having this happen for non-atomic variables would violate the as-if rule. I posted an answer to that effect.Kauppi
@Rearmost I don't think the result is allowed. It breaks the fundamental causality relation. The causality relation has nothing to do with any memory model (be it language's or processors'). It is the basic logic and is the foundation of programming language design. It is the fundamental contract between human and computer. Any memory model should abide by it. Otherwise it is a bug.Driskill
@Xiao-FengLi "has nothing to do with any memory model" OK then 1) is specified formally? 2) how does it interact w/ the memory model? 3) can't a "weak" memory model cause effects that seem to violate "causality"?Rearmost
@Rearmost the causality here refers to the intra-thread dependence between operations of one thread, such as data dependence (e.g., read after write in same location) and control dependence (e.g., operation in a branch). They cannot be violated by any processor design in its committed result (i.e., externally visible result). Memory model is only about memory operation ordering among multi-processors, they should never violate the intra-thread dependence, although a weak model may allow the causality happening in one processor to be violated (or unseen) in another processor.Driskill
(Cont.) In your code snippet, both threads have (intra-thread) data dependence (load->check) and control dependence (check->store) that ensure their respective executions (in one thread) are ordered. That means, we can check the later op's output to determine if the earlier op has executed. Then we can use simple logic to deduce that, if both r1 and r2 are 42, there must be a dependence cycle, which is impossible, unless you remove the condition checks, which essentially removes the dependence. This has nothing to do with memory model, but intra-thread data dependence.Driskill
@Xiao-FengLi That causality requirement seems to be elementary, but is it plainly written in the C++ std?Rearmost
@Rearmost Causality (or more accurately, dependence here) is defined in C++ std, but not so explicitly, because dependence is more of micro-architecture and compiler terminology. In language spec, it is usually defined using terms like "name scope", "if statement", etc, to define the operational semantics. The semantics implicitly satisfies the causality. For example, the control dependence formed by "if statement" is defined in timsong-cpp.github.io/cppwp/n3337/stmt.if, "If the condition yields true the first substatement is executed. "Driskill
@Xiao-FengLi You seem to imply that a read inside a then clause can't be executed before the condition is true.Rearmost
@Rearmost Yes, that is the semantics requirement. That said, the compiler and processor can schedule one or more operations of the if-branch to be executed before the if-condition is resolved. But no matter how the compiler and processor schedule the operations, the result of the if-branch cannot be committed (i.e., become visible to the program) before the if-condition is resolved. One should distinguish between semantics requirement and implementation details. One is language spec, the other is how the compiler and processor implement the language spec.Driskill
@Xiao-FengLi So in if (a.load(relaxed)) { r = b.load(relaxed); acquire_fence(); c.store(r,relaxed); } the value of b can't be loaded before the value of a, or if it is, the read must be cancelled if a is true so that the incorrect value in r isn't written to c?Rearmost
@Rearmost The compiler/processor may load b before a's value is resolved - because both a and b are relaxed atomics. But b's value cannot be written to r before the true branch is taken. That is, the value of b is speculatively loaded. The loading itself is useless - in this context. Actually with relaxed order, b's value can even be loaded ahead of a's value - but still it cannot be stored into r. If there is a fence before b's loading, then it cannot be loaded before a's loading, but this is where the memory model kicks in.Driskill
@Rearmost What you described is largely correct but that, "loading of b" is not the same as "writing of r", because r is program visible. Writing to r means the result of speculative loading b is committed. The compiler/processor may write b's value to somewhere temporarily and internally in the system, but not r.Driskill
@Xiao-FengLi Even with relaxed loads?Rearmost
@Rearmost Would you please clarify your question a little bit? I don't know which of my statements you are referring to.Driskill
@curiousguy: updated my answer with more evidence from later standards that out-of-thin-air values are intended to be disallowed, but the formalism fails to do that so they added a note. Later standards word it more clearly. Your question is worded confusingly because your examples don't actually contain a data race. They would if out-of-thin-air values were allowed, so the question is what protects you from out-of-thin-air values, not what protects you from data races. My answer reconciles that mo_relaxed note/example with sane non-atomic behaviour. I'd suggest accepting it.Kauppi
@Xiao-FengLi "The compiler/processor may write b's value to somewhere temporarily and internally in the system, but not r." Even w/ the use of a relaxed load, the CPU can't do that?Rearmost
@PeterCordes The example contains a race if ... it contains a race, allowing other values than the one in the past to be read. Also I don't think there is a universally accepted definition for "out-of-thin-air values" (although some things are universally considered out-of-thin-air), like there is maybe no universal definition of what counts as a German shepherd (or a chihuahua) but no one calls a chihuahua what others call a German shepherd.Rearmost
Anything that's less obviously "out-of-thin-air" would be potential data-race UB from an assignment unsynchronized with a load. So I don't think there'd be any case where you could say that a result was because of an out-of-thin-air value and that data-race UB definitely didn't happen. (I'm assuming that doing some actual stores to the objects would be necessary to "prime" value-prediction, along with cross-thread mutually-confirming speculation.)Kauppi
Speaking of definitions, cross-thread mutually-confirming speculation is required (I think) for this on CPUs anything like what we're used to. That's a pretty clear demarcation.Kauppi
@Rearmost Writing into r here means to cause effect in the system that the program can see and use. No matter how relaxed the memory model it is, the value of new r should never be seen by the program before the if-condition is resolved. This has nothing to do with memory model, as evidenced by the fact that the code you gave is just single-threaded. It is very easy to mix these two concepts (dependence vs. memory order). OTOH, if we really want to say r can be written before the if-condition is resolved, it is fine - as long as r's value is not seen/used by the program.Driskill
I rephrased your question significantly to avoid the confusion that led to the existing answers and the upvotes on them. At the time they looked correct to me, too. Only after carefully looking at the example was it clear that this wasn't obviously broken.Kauppi
@PeterCordes "a compiler can easily prove that inside the if(r1 == 42), it can do y = 42; instead of y = r1; breaking the data dependency. So normal branch speculation can let the store happen before the load, on a weakly-ordered ISA" but it's not allowed to make speculative stores visible to other threads? r1 = x and y = 42 can be executed out of order but never retire out of order, not even on a weak ISA? Reference hereDorren
@DanielNitzan: That's correct, but it still means the store value isn't dependency-ordered after the load. Loads interact with global state when they execute, not when they retire. (However, stores can't become globally visible until after retirement, and the load + branch must execute before retirement. But if this thread reloads y, it can see its own store before it becomes globally visible (store forwarding). So maybe a better example could be cooked up, but there's some effect here I think? I'd have to look at it again more carefully)Kauppi
@PeterCordes to be fair, I've written the previous comment before reading your answer, especially the section about a hypothetical SMT cross-thread speculation architecture that can actually reproduce such a scenario (but would break the no-UB for DRF guarantee of C++). Also, there seems to have been a confusion around the original question that had led to confusing comments.Dorren
@DanielNitzan: right, yeah, cross-confirming speculation would break the if (global_id == mine) shared_var = 123; example I argued is legal in my answer. Unless implementations work around it by using anti-speculation barriers every time they load and/or store non-atomic vars, with only mo_relaxed exposing the underlying DeathStation 9000 creatively terrible hardware behaviour. Hypothetical implementations don't have to be efficient at all, let alone commercially viable. :P Not that we needed to invent one, since we don't want an impl like this to exist, and ISO C++ says it shouldn't :/Kauppi
@PeterCordes LOLDorren
@PeterCordes: //DeathStation 9000 creatively terrible hardware behaviour. Hypothetical implementations don't have to be efficient at all, let alone commercially viable. :P Not that we needed to invent one, since we don't want an impl like this to exist, and ISO C++ says it shouldn't.// The authors of the Standard place a higher priority on avoiding mandating corner-case behaviors that might be impractical on some conceivable implementations, than on avoiding characterizing as UB constructs that most implementations should obviously be processed meaningfully most if not all of the time.Recollect
@PeterCordes: If an implementation that behaves weirdly in some obscure corner case could be much more efficient than one that follows the behavioral example set by all implementations, and if the implementation's customers wouldn't care about how it handles that corner case, mandating the common behavior would prevent the implementation from optimally serving its customers' needs, while characterizing such behaviors as UB shouldn't interfere with implementations' continuing to behave in commonplace fashion in the 99.99% of situations where that would be practical.Recollect
The text of your question seems to be missing the point of the example and out-of-thin-air values. Your example does not contain data-race UB. (It might if x or y were set to 42 before those threads ran, in which case all bets are off and the other answers citing data-race UB apply.)

There is no protection against real data races, only against out-of-thin-air values.

I think you're really asking how to reconcile that mo_relaxed example with sane and well-defined behaviour for non-atomic variables. That's what this answer covers.


The note is pointing out a hole in the atomic mo_relaxed formalism, not warning you of a real possible effect on some implementations.

This gap does not (I think) apply to non-atomic objects, only to mo_relaxed.

They say "However, implementations should not allow such behavior. – end note]". Apparently the standards committee couldn't find a way to formalize that requirement, so for now it's just a note, but it is not intended to be optional.

It's clear that even though this isn't strictly normative, the C++ standard intends to disallow out-of-thin-air values for relaxed atomics (and in general, I assume). Later standards discussion, e.g. 2018's p0668r5: Revising the C++ memory model (which doesn't "fix" this; it's an unrelated change), includes juicy side-notes like:

We still do not have an acceptable way to make our informal (since C++14) prohibition of out-of-thin-air results precise. The primary practical effect of that is that formal verification of C++ programs using relaxed atomics remains unfeasible. The above paper suggests a solution similar to http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2013/n3710.html . We continue to ignore the problem here ...

So yes, the normative parts of the standard are apparently weaker for relaxed_atomic than they are for non-atomic. This seems to be an unfortunate side effect of how they define the rules.

AFAIK no implementations can produce out-of-thin-air values in real life.


Later versions of the standard phrase the informal recommendation more clearly, e.g. in the current draft: https://timsong-cpp.github.io/cppwp/atomics.order#8

  8. Implementations should ensure that no “out-of-thin-air” values are computed that circularly depend on their own computation.
    ...
  9. [ Note: The recommendation [of 8.] similarly disallows r1 == r2 == 42 in the following example, with x and y again initially zero:

       // Thread 1:
       r1 = x.load(memory_order::relaxed);
       if (r1 == 42) y.store(42, memory_order::relaxed);
       // Thread 2:
       r2 = y.load(memory_order::relaxed);
       if (r2 == 42) x.store(42, memory_order::relaxed);
    

    — end note ]


(This rest of the answer was written before I was sure that the standard intended to disallow this for mo_relaxed, too.)

I'm pretty sure the C++ abstract machine does not allow r1 == r2 == 42.
Every possible ordering of operations in the C++ abstract machine leads to r1=r2=0 without UB, even without synchronization. Therefore the program has no UB, and any non-zero result would violate the "as-if" rule.

Formally, ISO C++ allows an implementation to implement functions / programs in any way that gives the same result as the C++ abstract machine would. For multi-threaded code, an implementation can pick one possible abstract-machine ordering and decide that's the ordering that always happens. (e.g. when reordering relaxed atomic stores when compiling to asm for a strongly-ordered ISA. The standard as written even allows coalescing atomic stores but compilers choose not to). But the result of the program always has to be something the abstract machine could have produced. (Only the Atomics chapter introduces the possibility of one thread observing the actions of another thread without mutexes. Otherwise that's not possible without data-race UB).

I think the other answers didn't look carefully enough at this. (And neither did I when it was first posted). Code that doesn't execute doesn't cause UB (including data-race UB), and compilers aren't allowed to invent writes to objects. (Except in code paths that already unconditionally write them, like y = (x==42) ? 42 : y; which would obviously create data-race UB.)

For any non-atomic object, if we don't actually write it then other threads might also be reading it, regardless of code inside not-executed if blocks. The standard allows this and doesn't allow a variable to suddenly read as a different value when the abstract machine hasn't written it. (And for objects we don't even read, like neighbouring array elements, another thread might even be writing them.)

Therefore we can't do anything that would let another thread temporarily see a different value for the object, or step on its write. Inventing writes to non-atomic objects is basically always a compiler bug; this is well known and universally agreed upon because it can break code that doesn't contain UB (and has done so in practice for a few compiler bugs that created invented writes, e.g. IA-64 GCC I think had such a bug at one point that broke the Linux kernel). IIRC, Herb Sutter mentioned such bugs in part 1 or 2 of his talk "atomic<> Weapons: The C++ Memory Model and Modern Hardware", saying that it was already usually considered a compiler bug before C++11, but C++11 codified that and made it easier to be sure.

Or another recent example with ICC for x86: Crash with icc: can the compiler invent writes where none existed in the abstract machine?


In the C++ abstract machine, there's no way for execution to reach either y = r1; or x = r2;, regardless of sequencing or simultaneity of the loads for the branch conditions. x and y both read as 0 and neither thread ever writes them.

No synchronization is required to avoid UB because no order of abstract-machine operations leads to a data-race. The ISO C++ standard doesn't have anything to say about speculative execution or what happens when mis-speculation reaches code. That's because speculation is a feature of real implementations, not of the abstract machine. It's up to implementations (HW vendors and compiler writers) to ensure the "as-if" rule is respected.


It's legal in C++ to write code like if (global_id == mine) shared_var = 123; and have all threads execute it, as long as at most one thread actually runs the shared_var = 123; statement. (And as long as synchronization exists to avoid a data race on non-atomic int global_id). If things like this broke down, it would be chaos. For example, you could apparently draw wrong conclusions like reordering atomic operations in C++

Observing that a non-write didn't happen isn't data-race UB.

It's also not UB to run if(i<SIZE) return arr[i]; because the array access only happens if i is in bounds.

I think the "out of the blue" value-invention note only applies to relaxed-atomics, apparently as a special caveat for them in the Atomics chapter. (And even then, AFAIK it can't actually happen on any real C++ implementations, certainly not mainstream ones. At this point implementations don't have to take any special measures to make sure it can't happen for non-atomic variables.)

I'm not aware of any similar language outside the atomics chapter of the standard that allows an implementation to allow values to appear out of the blue like this.

I don't see any sane way to argue that the C++ abstract machine causes UB at any point when executing this. But seeing r1 == r2 == 42 would imply that an unsynchronized read+write had happened, and that's data-race UB. If that can happen, can an implementation invent UB because of speculative execution (or some other reason)? The answer has to be "no" for the C++ standard to be usable at all.

For relaxed atomics, inventing the 42 out of nowhere wouldn't imply that UB had happened; perhaps that's why the standard says it's allowed by the rules? As far as I know, nothing outside the Atomics chapter of the standard allows it.


A hypothetical asm / hardware mechanism that could cause this

(Nobody wants this, hopefully everyone agrees that it would be a bad idea to build hardware like this. It seems unlikely that coupling speculation across logical cores would ever be worth the downside of having to roll back all cores when one detects a mispredict or other mis-speculation.)

For 42 to be possible, thread 1 has to see thread 2's speculative store, and the store from thread 1 has to be seen by thread 2's load. (Confirming both branch speculations as good, allowing this path of execution to become the real path that was actually taken.)

i.e. speculation across threads: Possible on current HW if they ran on the same core with only a lightweight context switch, e.g. coroutines or green threads.

But on current HW, memory reordering between threads is impossible in that case. Out-of-order execution of code on the same core gives the illusion of everything happening in program order. To get memory reordering between threads, they need to be running on different cores.

So we'd need a design that coupled together speculation between two logical cores. Nobody does that because it means more state needs to rollback if a mispredict is detected. But it is hypothetically possible. For example an OoO SMT core that allows store-forwarding between its logical cores even before they've retired from the out-of-order core (i.e. become non-speculative).

PowerPC allows store-forwarding between logical cores for retired stores, meaning that threads can disagree about the global order of stores. But waiting until they "graduate" (i.e. retire) and become non-speculative means it doesn't tie together speculation on separate logical cores. So when one is recovering from a branch miss, the others can keep the back-end busy. If they all had to rollback on a mispredict on any logical core, that would defeat a significant part of the benefit of SMT.

I thought for a while I'd found an ordering that led to this on a single core of a real weakly-ordered CPU (with user-space context switching between the threads), but the final-step store can't forward to the first-step load because that's program order and OoO exec preserves it.

  • T2: r2 = y; stalls (e.g. cache miss)

  • T2: branch prediction predicts that r2 == 42 will be true, so x = 42 should run.

  • T2: x = 42 runs. (Still speculative; r2 = y hasn't obtained a value yet, so the r2 == 42 compare/branch is still waiting to confirm that speculation.)

  • a context switch to Thread 1 happens without rolling back the CPU to retirement state or otherwise waiting for speculation to be confirmed as good or detected as mis-speculation.

    This part won't happen on real C++ implementations unless they use an M:N thread model, not the more common 1:1 C++ thread to OS thread. Real CPUs don't rename the privilege level: they don't take interrupts or otherwise enter the kernel with speculative instructions in flight that might need to rollback and redo entering kernel mode from a different architectural state.

  • T1: r1 = x; takes its value from the speculative x = 42 store

  • T1: r1 == 42 is found to be true. (Branch speculation happens here, too, not actually waiting for store-forwarding to complete. But along this path of execution, where the x = 42 did happen, this branch condition will execute and confirm the prediction).

  • T1: y = 42 runs.

  • this was all on the same CPU core so this y=42 store is after the r2=y load in program-order; it can't give that load a 42 to let the r2==42 speculation be confirmed. So this possible ordering doesn't demonstrate this in action after all. This is why threads have to be running on separate cores with inter-thread speculation for effects like this to be possible.

Note that x = 42 doesn't have a data dependency on r2 so value-prediction isn't required to make this happen. And the y=r1 is inside an if(r1 == 42) anyway so the compiler can optimize to y=42 if it wants, breaking the data dependency in the other thread and making things symmetric.

Note that the arguments about Green Threads or other context switch on a single core isn't actually relevant: we need separate cores for the memory reordering.


I commented earlier that I thought this might involve value-prediction. The ISO C++ standard's memory model is certainly weak enough to allow the kinds of crazy "reordering" that value-prediction can create, but it's not necessary for this reordering. y=r1 can be optimized to y=42, and the original code includes x=42 anyway, so there's no data dependency of that store on the r2=y load. Speculative stores of 42 are easily possible without value prediction. (The problem is getting the other thread to see them!)

Speculating because of branch prediction instead of value prediction has the same effect here. And in both cases the loads need to eventually see 42 to confirm the speculation as correct.

Value-prediction doesn't even help make this reordering more plausible. We still need inter-thread speculation and memory reordering for the two speculative stores to confirm each other and bootstrap themselves into existence.


ISO C++ chooses to allow this for relaxed atomics, but AFAICT it disallows this for non-atomic variables. I'm not sure I see exactly what in the standard allows the relaxed-atomic case, beyond the note saying it's not explicitly disallowed. If there were any other code that did anything with x or y then maybe, but I think my argument applies to the relaxed-atomic case as well: no path through the source in the C++ abstract machine can produce it.

As I said, it's not possible in practice AFAIK on any real hardware (in asm), or in C++ on any real C++ implementation. It's more of an interesting thought-experiment into crazy consequences of very weak ordering rules, like C++'s relaxed-atomic. (Those ordering rules don't disallow it, but I think the as-if rule and the rest of the standard does, unless there's some provision that allows relaxed atomics to read a value that was never actually written by any thread.)

If there is such a rule, it would only be for relaxed atomics, not for non-atomic variables. Data-race UB is pretty much all the standard needs to say about non-atomic vars and memory ordering, and there's no data race here.

Kauppi answered 22/9, 2019 at 12:6 Comment(10)
Relaxed-atomics should be no more relaxed than non-atomics. And no matter what, speculation should only be confirmed by a non-speculative result, instead of cyclic self-proof. But your answer is a good thought exercise anyway. :)Driskill
@Xiao-FengLi: "should be" - yes, that's why the C++ standard says implementations should not allow this. Also why designers of real HW have never AFAIK built HW that could do this. Yes, it's a thought exercise about the kind of insanity that's possible if the rules are too weak, and I think I've heard of it in a CPU-architecture context (outside C++). As I said in the answer, the ordering rules in the Atomics chapter might allow this, but perhaps not when combined with other parts of the C++ standard. I'm not sure it needed to be mentioned as a possibility in the atomics chapter.Kauppi
Relaxed-atomics should be no more relaxed than non-atomics. Yes, agreed. That's part of why I think relaxed atomics probably can't do this either, because it makes no sense for non-atomic vars to be able to do this because there's no UB, therefore relaxed atomics shouldn't be able to do it either. So it's sort of a reductio ad absurdum argument. Fortunately that note is just a note, not normative. And it just leaves the door open, doesn't require that it's possible on any implementation.Kauppi
@Xiao-FengLi: I found some more evidence that the note is non-normative only because they couldn't find an acceptable way to formalize it. Updated my answer. And yes, the formalism for mo_relaxed ends up lacking this guarantee where I think non-atomic objects still have it. This is something the committee would like to fix, but for now we can take it as a given that it's actually disallowed. This is only a problem for formal verification, not real life.Kauppi
"interesting thought-experiment into crazy consequences of very weak ordering rules" That's what ppl said re: things that are UB but "work in practice": It's a crazy to think you don't get 2compl on those CPU as the only asm instr mult instr is in 2compl... until the analyser determines that x>0 so that xa>xb means a>b and your code relying on 2compl mult is broken. Of course naive compilation of MT doesn't produce anything funny, but what about a future aggressive compilers? My no race code was very straightforward so that the issue should have clear cut, but other examples are less clearRearmost
@curiousguy: My main argument here is not based on compilation for any specific ISA. It's based on the rules of the C abstract machine itself, which set the boundaries for what the "as-if" rule allows optimizers to do. I do have a section about plausible HW that could make this happen, which couldn't run C++ because it could invent values out of thin air. But that's not the point of this answer at all.Kauppi
You say that crazytown isn't the real world; OTOH, fixing the semantics would lead to changes in the asm emitted: "This change would effectively imply stronger semantics for memory_order_relaxed, which would imply additional overhead for memory_order_relaxed on a number of architectures, including ARM, current Nvidia architectures, and possibly POWER."Rearmost
@curiousguy: this change being the proposal in p0668r5: Revising the C++ memory model. The bullet point makes it clear that it does not close this loophole. It's an unrelated change. They simply note the out-of-thin-air problem while talking about a fix for an unrelated issue. I cited it as evidence about the intent of the standards committee. So no, closing the out-of-thin-air loophole would not result in any change to the asm. The change they're talking about would disallow IRIW reordering so all threads can agree on a global order of stores, i.e. a multi-copy-atomic model.Kauppi
@PeterCordes n3710 confuses everything: low level/high level definitions of "carries a dependency" which are only vaguely connected. The main idea is buried: whether a SE becomes unavoidable at some point. At some point the writers really seemed lost ("require recompilation of existing code"... hello? that's really not the most serious issue here). The crux of the matter: authors think you need to define "dependency chains" to exclude unwanted behaviors; doing so is either syntax-based (horrible) or general ("we are treating every relaxed store as though it were dependent on every prior load")Rearmost
Let us continue this discussion in chat.Rearmost
B
8

When a race condition potentially exists, what guarantees that a read of a shared variable (normal, non atomic) cannot see such values?

There is no such guarantee.

When a data race exists, the behaviour of the program is undefined:

[intro.races]

Two actions are potentially concurrent if

  • they are performed by different threads, or
  • they are unsequenced, at least one is performed by a signal handler, and they are not both performed by the same signal handler invocation.

The execution of a program contains a data race if it contains two potentially concurrent conflicting actions, at least one of which is not atomic, and neither happens before the other, except for the special case for signal handlers described below. Any such data race results in undefined behavior. ...

The special case is not very relevant to the question, but I'll include it for completeness:

Two accesses to the same object of type volatile std::sig_­atomic_­t do not result in a data race if both occur in the same thread, even if one or more occurs in a signal handler. ...

Breathless answered 19/6, 2019 at 18:52 Comment(9)
That special case is basically grandfathered in from C90 with the language updated.Atropos
@Atropos and is pretty much the only portable way of communicating between a signal handler and rest of the program.Breathless
Does that meant that almost all programs are actually broken?Rearmost
@Rearmost Most multithreaded programs use a mutexes or other synchronization primitives (or std::atomic types) to protect shared data. If you don't then yes, your program is broken.Ferrel
@MilesBudnek MT programs typically use mutexes for code like the one in the question?Rearmost
@Rearmost - If x and y are truly the same piece of memory being accessed by more than one thread, then often they will, yes. Some very carefully written code for lock-free data structures will use multiple atomic variables in very specific ways without using mutexes. But that is very tricky code to write and get correct. In this particular case, if your main concerns is that if both x and y are 0 before either thread enters, that they both stay 0, you could probably just use atomics and the more constrained memory orders.Atropos
Minor note: data races and race conditions are not the same thing. Data races are undefined behavior, race conditions are not. In the case of a race condition, the order specific commands occur in is unspecified (leading to (potentially) different results on different runs), but the behavior is indeed defined.Immensity
Downvoted for not answering the part of the question shown in the examples, only the clumsy wording. If you want to undo Nathan's roll back to re-add my edit, or find a less clumsy way of making the same change, I'd be happy to undo the downvote. Neither example contains data-race UB, most obviously the 2nd one which uses atomic<T>Kauppi
@MilesBudnek: Implementations can only support standard mutex and locking primitives if the code is running on an OS that supplies such primitives, and the compiler knows about such OS mechanisms. In many cases, a program for a freestanding implementation will be its own operating system, and thus any locking primitives will have to be implemented in C, using an implementation whose documented behavior for volatile has strong enough semantics to be suitable for such purposes.Recollect
K
8

The text of your question seems to be missing the point of the example and out-of-thin-air values. Your example does not contain data-race UB. (It might if x or y were set to 42 before those threads ran, in which case all bets are off and the other answers citing data-race UB apply.)

There is no protection against real data races, only against out-of-thin-air values.

I think you're really asking how to reconcile that mo_relaxed example with sane and well-defined behaviour for non-atomic variables. That's what this answer covers.


The note is pointing out a hole in the atomic mo_relaxed formalism, not warning you of a real possible effect on some implementations.

This gap does not (I think) apply to non-atomic objects, only to mo_relaxed.

They say However, implementations should not allow such behavior. – end note]. Apparently the standards committee couldn't find a way to formalize that requirement, so for now it's just a note, but it is not intended to be optional.

It's clear that even though this isn't strictly normative, the C++ standard intends to disallow out-of-thin-air values for relaxed atomics (and in general I assume). Later standards discussion, e.g. 2018's p0668r5: Revising the C++ memory model (which doesn't "fix" this, it's an unrelated change) includes juicy side-notes like:

We still do not have an acceptable way to make our informal (since C++14) prohibition of out-of-thin-air results precise. The primary practical effect of that is that formal verification of C++ programs using relaxed atomics remains unfeasible. The above paper suggests a solution similar to http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2013/n3710.html . We continue to ignore the problem here ...

So yes, the normative parts of the standard are apparently weaker for relaxed_atomic than they are for non-atomic. This seems to be an unfortunate side effect of how they define the rules.

AFAIK no implementations can produce out-of-thin-air values in real life.


Later versions of the standard phrase the informal recommendation more clearly, e.g. in the current draft: https://timsong-cpp.github.io/cppwp/atomics.order#8

  8. Implementations should ensure that no “out-of-thin-air” values are computed that circularly depend on their own computation.
    ...
  9. [ Note: The recommendation [of 8.] similarly disallows r1 == r2 == 42 in the following example, with x and y again initially zero:

       // Thread 1:
       r1 = x.load(memory_order::relaxed);
       if (r1 == 42) y.store(42, memory_order::relaxed);
       // Thread 2:
       r2 = y.load(memory_order::relaxed);
       if (r2 == 42) x.store(42, memory_order::relaxed);
    

    — end note ]


(This rest of the answer was written before I was sure that the standard intended to disallow this for mo_relaxed, too.)

I'm pretty sure the C++ abstract machine does not allow r1 == r2 == 42.
Every possible ordering of operations in the C++ abstract machine operations leads to r1=r2=0 without UB, even without synchronization. Therefore the program has no UB and any non-zero result would violate the "as-if" rule.

Formally, ISO C++ allows an implementation to implement functions / programs in any way that gives the same result as the C++ abstract machine would. For multi-threaded code, an implementation can pick one possible abstract-machine ordering and decide that's the ordering that always happens. (e.g. when reordering relaxed atomic stores when compiling to asm for a strongly-ordered ISA. The standard as written even allows coalescing atomic stores but compilers choose not to). But the result of the program always has to be something the abstract machine could have produced. (Only the Atomics chapter introduces the possibility of one thread observing the actions of another thread without mutexes. Otherwise that's not possible without data-race UB).

I think the other answers didn't look carefully enough at this. (And neither did I when it was first posted). Code that doesn't execute doesn't cause UB (including data-race UB), and compilers aren't allowed to invent writes to objects. (Except in code paths that already unconditionally write them, like y = (x==42) ? 42 : y; which would obviously create data-race UB.)

For any non-atomic object, if we don't actually write it then other threads might also be reading it, regardless of code inside not-executed if blocks. The standard allows this and doesn't allow a variable to suddenly read as a different value when the abstract machine hasn't written it. (And for objects we don't even read, like neighbouring array elements, another thread might even be writing them.)

Therefore we can't do anything that would let another thread temporarily see a different value for the object, or step on its write. Inventing writes to non-atomic objects is basically always a compiler bug; this is well known and universally agreed upon because it can break code that doesn't contain UB (and has done so in practice for a few cases of compiler bugs that created it, e.g. IA-64 GCC I think had such a bug at one point that broke the Linux kernel). IIRC, Herb Sutter mentioned such bugs in part 1 or 2 of his talk, "atomic<> Weapons: The C++ Memory Model and Modern Hardware", saying that it was already usually considered a compiler bug before C++11, but C++11 codified that and made it easier to be sure.

Or another recent example with ICC for x86: Crash with icc: can the compiler invent writes where none existed in the abstract machine?


In the C++ abstract machine, there's no way for execution to reach either y = r1; or x = r2;, regardless of sequencing or simultaneity of the loads for the branch conditions. x and y both read as 0 and neither thread ever writes them.

No synchronization is required to avoid UB because no order of abstract-machine operations leads to a data-race. The ISO C++ standard doesn't have anything to say about speculative execution or what happens when mis-speculation reaches code. That's because speculation is a feature of real implementations, not of the abstract machine. It's up to implementations (HW vendors and compiler writers) to ensure the "as-if" rule is respected.


It's legal in C++ to write code like if (global_id == mine) shared_var = 123; and have all threads execute it, as long as at most one thread actually runs the shared_var = 123; statement. (And as long as synchronization exists to avoid a data race on non-atomic int global_id). If things like this broke down, it would be chaos. For example, you could apparently draw wrong conclusions like reordering atomic operations in C++

Observing that a non-write didn't happen isn't data-race UB.

It's also not UB to run if(i<SIZE) return arr[i]; because the array access only happens if i is in bounds.

I think the "out of the blue" value-invention note only applies to relaxed-atomics, apparently as a special caveat for them in the Atomics chapter. (And even then, AFAIK it can't actually happen on any real C++ implementations, certainly not mainstream ones. At this point implementations don't have to take any special measures to make sure it can't happen for non-atomic variables.)

I'm not aware of any similar language outside the atomics chapter of the standard that allows an implementation to allow values to appear out of the blue like this.

I don't see any sane way to argue that the C++ abstract machine causes UB at any point when executing this, yet seeing r1 == r2 == 42 would imply that an unsynchronized read+write had happened, which is data-race UB. If that can happen, can an implementation invent UB because of speculative execution (or some other reason)? The answer has to be "no" for the C++ standard to be usable at all.

For relaxed atomics, inventing the 42 out of nowhere wouldn't imply that UB had happened; perhaps that's why the standard says it's allowed by the rules? As far as I know, nothing outside the Atomics chapter of the standard allows it.


A hypothetical asm / hardware mechanism that could cause this

(Nobody wants this, hopefully everyone agrees that it would be a bad idea to build hardware like this. It seems unlikely that coupling speculation across logical cores would ever be worth the downside of having to roll back all cores when one detects a mispredict or other mis-speculation.)

For 42 to be possible, thread 1 has to see thread 2's speculative store and the store from thread 1 has to be seen by thread 2's load. (Confirming that branch speculation was good, allowing this path of execution to become the real path that was actually taken.)

i.e. speculation across threads: Possible on current HW if they ran on the same core with only a lightweight context switch, e.g. coroutines or green threads.

But on current HW, memory reordering between threads is impossible in that case. Out-of-order execution of code on the same core gives the illusion of everything happening in program order. To get memory reordering between threads, they need to be running on different cores.

So we'd need a design that coupled together speculation between two logical cores. Nobody does that because it means more state needs to rollback if a mispredict is detected. But it is hypothetically possible. For example an OoO SMT core that allows store-forwarding between its logical cores even before they've retired from the out-of-order core (i.e. become non-speculative).

PowerPC allows store-forwarding between logical cores for retired stores, meaning that threads can disagree about the global order of stores. But waiting until they "graduate" (i.e. retire) and become non-speculative means it doesn't tie together speculation on separate logical cores. So when one is recovering from a branch miss, the others can keep the back-end busy. If they all had to rollback on a mispredict on any logical core, that would defeat a significant part of the benefit of SMT.

I thought for a while I'd found an ordering that led to this on a single core of a real weakly-ordered CPU (with user-space context switching between the threads), but the final step's store can't forward to the first step's load because that's program order and OoO exec preserves it.

  • T2: r2 = y; stalls (e.g. cache miss)

  • T2: branch prediction predicts that r2 == 42 will be true, so x = 42 should run.

  • T2: x = 42 runs. (Still speculative; r2 = y hasn't obtained a value yet, so the r2 == 42 compare/branch is still waiting to confirm that speculation).

  • a context switch to Thread 1 happens without rolling back the CPU to retirement state or otherwise waiting for speculation to be confirmed as good or detected as mis-speculation.

    This part won't happen on real C++ implementations unless they use an M:N thread model, not the more common 1:1 C++ thread to OS thread. Real CPUs don't rename the privilege level: they don't take interrupts or otherwise enter the kernel with speculative instructions in flight that might need to rollback and redo entering kernel mode from a different architectural state.

  • T1: r1 = x; takes its value from the speculative x = 42 store

  • T1: r1 == 42 is found to be true. (Branch speculation happens here, too, not actually waiting for store-forwarding to complete. But along this path of execution, where the x = 42 did happen, this branch condition will execute and confirm the prediction).

  • T1: y = 42 runs.

  • this was all on the same CPU core so this y=42 store is after the r2=y load in program-order; it can't give that load a 42 to let the r2==42 speculation be confirmed. So this possible ordering doesn't demonstrate this in action after all. This is why threads have to be running on separate cores with inter-thread speculation for effects like this to be possible.

Note that x = 42 doesn't have a data dependency on r2 so value-prediction isn't required to make this happen. And the y=r1 is inside an if(r1 == 42) anyway so the compiler can optimize to y=42 if it wants, breaking the data dependency in the other thread and making things symmetric.

Note that the arguments about Green Threads or other context switch on a single core isn't actually relevant: we need separate cores for the memory reordering.


I commented earlier that I thought this might involve value-prediction. The ISO C++ standard's memory model is certainly weak enough to allow the kinds of crazy "reordering" that value-prediction can create, but it's not necessary for this reordering. y=r1 can be optimized to y=42, and the original code includes x=42 anyway so there's no data dependency of that store on the r2=y load. Speculative stores of 42 are easily possible without value prediction. (The problem is getting the other thread to see them!)

Speculating because of branch prediction instead of value prediction has the same effect here. And in both cases the loads need to eventually see 42 to confirm the speculation as correct.

Value-prediction doesn't even help make this reordering more plausible. We still need inter-thread speculation and memory reordering for the two speculative stores to confirm each other and bootstrap themselves into existence.


ISO C++ chooses to allow this for relaxed atomics, but AFAICT it disallows this for non-atomic variables. I'm not sure I see exactly what in the standard does allow the relaxed-atomic case in ISO C++ beyond the note saying it's not explicitly disallowed. If there was any other code that did anything with x or y then maybe, but I think my argument does apply to the relaxed atomic case as well. No path through the source in the C++ abstract machine can produce it.

As I said, it's not possible in practice AFAIK on any real hardware (in asm), or in C++ on any real C++ implementation. It's more of an interesting thought-experiment into crazy consequences of very weak ordering rules, like C++'s relaxed-atomic. (Those ordering rules don't disallow it, but I think the as-if rule and the rest of the standard does, unless there's some provision that allows relaxed atomics to read a value that was never actually written by any thread.)

If there is such a rule, it would only be for relaxed atomics, not for non-atomic variables. Data-race UB is pretty much all the standard needs to say about non-atomic vars and memory ordering, and this example doesn't have any.

Kauppi answered 22/9, 2019 at 12:6 Comment(10)
Relaxed-atomics should be no more relaxed than non-atomics. And no matter what, speculation should only be confirmed by non-speculative result, instead of cyclic self-proof. But your answer is a good thoughts exercise anyway. :)Driskill
@Xiao-FengLi: "should be" - yes, that's why the C++ standard says implementations should not allow this. Also why designers of real HW have never AFAIK built HW that could do this. Yes, it's a thought exercise about the kind of insanity that's possible if the rules are too weak, and I think I've heard of it in a CPU-architecture context (outside C++). As I said in the answer, the ordering rules in the Atomics chapter might allow this, but perhaps not when combined with other parts of the C++ standard. I'm not sure it needed to be mentioned as a possibility in the atomics chapter.Kauppi
Relaxed-atomics should be no more relaxed than non-atomics. Yes, agreed. That's part of why I think relaxed atomics probably can't do this either, because it makes no sense for non-atomic vars to be able to do this because there's no UB, therefore relaxed atomics shouldn't be able to do it either. So it's sort of a reductio ad absurdum argument. Fortunately that note is just a note, not normative. And it just leaves the door open, doesn't require that it's possible on any implementation.Kauppi
@Xiao-FengLi: I found some more evidence that the note is non-normative only because they couldn't find an acceptable way to formalize it. Updated my answer. And yes, the formalism for mo_relaxed ends up lacking this guarantee where I think non-atomic objects still have it. This is something the committee would like to fix, but for now we can take it as a given that it's actually disallowed. This is only a problem for formal verification, not real life.Kauppi
"interesting thought-experiment into crazy consequences of very weak ordering rules" That's what ppl said re: things that are UB but "work in practice": It's a crazy to think you don't get 2compl on those CPU as the only asm instr mult instr is in 2compl... until the analyser determines that x>0 so that xa>xb means a>b and your code relying on 2compl mult is broken. Of course naive compilation of MT doesn't produce anything funny, but what about a future aggressive compilers? My no race code was very straightforward so that the issue should have clear cut, but other examples are less clearRearmost
@curiousguy: My main argument here is not based on compilation for any specific ISA. It's based on the rules of the C abstract machine itself, which set the boundaries for what the "as-if" rule allows optimizers to do. I do have a section about plausible HW that could make this happen, which couldn't run C++ because it could invent values out of thin air. But that's not the point of this answer at all.Kauppi
You say that crazytown isn't the real world, OTOH fixing the semantics would be lead to changes in the asm emitted: "This change would effectively imply stronger semantics for memory_order_relaxed, which would imply additional overhead for memory_order_relaxed on a number of architectures, including ARM, current Nvidia architectures, and possibly POWER."Rearmost
@curiousguy: this change being the proposal in p0668r5: Revising the C++ memory model. The bullet point makes is clear that does not close this loophole. It's an unrelated change. They simply note the out-of-thin-air problem while talking about a fixes for an unrelated issue. I cited it as evidence about the intent of the standards committee. So no, closing the out-of-thin-air loophole would not result in any change to the asm. The change they're talking about would disallow IRIW reordering so all threads can agree on a global order of stores, i.e. multi-copy-atomic model.Kauppi
@PeterCordes n3710 confuses everything: low level/high level definitions of "carries a dependency" which are only vaguely connected. The main idea is buried: whether a SE becomes unavoidable at some point. At some point the writers really seemed lost ("require recompilation of existing code"... hello? that's really not the most serious issue here). The crux of the matter: authors think you need to define "dependency chains" to exclude unwanted behaviors; doing so is either syntax-based (horrible) or general ("we are treating every relaxed store as though it were dependent on every prior load")Rearmost
Let us continue this discussion in chat.Rearmost
R
5

What part of the so called "memory model" protects non atomic objects from these interactions caused by reads seeing out-of-thin-air values?

None. In fact, you get the opposite and the standard explicitly calls this out as undefined behavior. In [intro.races]/21 we have

The execution of a program contains a data race if it contains two potentially concurrent conflicting actions, at least one of which is not atomic, and neither happens before the other, except for the special case for signal handlers described below. Any such data race results in undefined behavior.

which covers your second example.


The rule is that if you have shared data in multiple threads, and at least one of those threads writes to that shared data, then you need synchronization. Without that you have a data race and undefined behavior. Do note that volatile is not a valid synchronization mechanism. You need atomics/mutexes/condition variables to protect shared access.

Rapier answered 19/6, 2019 at 18:54 Comment(22)
If that's the case, I can't imagine any real world MT program being correctRearmost
@Rearmost As long as you use sequentially consistent mode you are guaranteed to have a single total order of your code. This is offered by C++ so it is perfectly capable of writing multi-threaded code that is 100% portable and guaranteed.Rapier
How do I activate that "mode"?Rearmost
@Rearmost - Well, that's not true at all. There's a reason the standard added atomics and thread synchronization primitives to the language. If you use them, you get guarantees that ordinarily wouldn't apply at all.Atropos
@Rearmost - Use memory_order_seq_cst instead of memory_order_relaxed.Atropos
@Rearmost Generally by just using the defaults. If you have a std::atomic<int> for example and you do ++name_of_atomic_int in multiple threads it is guaranteed that the result will be correct since by default the operators are sequentially consistent.Rapier
@Rapier For a store you can use memory_order_seq_cst but what about the destructor?Rearmost
@Rearmost Two threads won't both try to destroy an atomic. If they do, you have a larger problem. It's the same as if two threads tried to both destroy anything else.Antilles
@FrançoisAndrieux If only one thread is destroying an object, you don't have an issue, but if only one thread is writing to an object, you have one?Rearmost
@FrançoisAndrieux But in my question, only one thread is writing to each variable.Rearmost
@Rearmost - I think you're tying yourself in knots trying to wrap your head around some complex ideas. Instead of trying to come up with some top-down understanding of things, try out a few very specific examples (ideally code that can actually run). Maybe post them on SO and ask what the expected behavior is. Build your understanding from the bottom up until it clicks.Atropos
@Rearmost No. If you have one thread destroying the object, it's fine. If you have two threads trying to destroy the object, it's not fine. But this is not a multithreading issue. It's just as bad as if you had a single thread try to destroy the object twice.Antilles
@Rearmost You can't have an object shared by multiple threads that would be destroyed by both threads. If it is a global it is destroyed at the end of the program. If it is passed by reference, then it is destroyed by the call site, not the threads. If you pass a shared_ptr to the threads then once they end they destroy their own copy of the shared_ptr and shared_ptr uses atomic reference counting to determine when it should destroy the pointed to object.Rapier
@FrançoisAndrieux If you have one thread trying to destroy an object and the other might use (not destroy) the object, then what precaution should you use?Rearmost
@Rearmost That is what std::shared_ptr is for. It allows two or more things to share ownership and will only destroy the pointed to object once all shared_ptr are destroyed.Rapier
@Rapier Yes but that's essentially pushing the problem to the implementer. How can the C++ code of a smart ptr ensure correct sequential behavior? If all operations on an atomic object must be memory_order_seq_cst, how does std::shared_ptr ensure same destruction? Is the destruction memory_order_seq_cst?Rearmost
@Rearmost Yes. The C++ standard mandates that shared_ptr does the right thing. You don't need to worry about it.Rapier
@Rapier I need to worry in order to understand how to ever write correct code. Do I need to use memory_order_seq_cst in the destructor of the object managed by std::shared_ptr?Rearmost
@Rearmost No. shared_ptr handles all of this behind the scenes for you. It use an atomic reference counter to keep track of how man instances there are. The destructor checks the reference count and if it is more than one it just atomically decreases it by one. If the reference counter is at one, then the destructor knows it is the only object that owns the pointer so it deletes the pointer it holds.Rapier
@Rearmost If you really want to understand multi-threaded code in C++ I would suggest getting a book dedicated to that like this one. It can do a much better job of building a solid foundation than we can.Rapier
@Rapier "As long as you use sequentially consistent mode you are guaranteed to have a single total order of your code" If you only ever use atomics, yes. But then, you can't even have polymorphism! (Calling a ctor isn't atomic.)Rearmost
In the second example (quoted from the ISO C++ standard, C++14 I think), the only shared variables are atomic<T>. There is no data-race UB, and in general unsynchronized access to atomic variables is only maybe a garden variety logic bug, not UB. Sorry about the mess my edits left your answer in (I made the first one before I understood the question as being about out-of-thin-air values), but this state is even worse than the state after my last edit, IMO. Going to have to downvote for making incorrect claims about these specific examples.Kauppi
A
2

Note: The specific examples I give here are apparently not accurate. I've assumed the optimizer can be somewhat more aggressive than it's apparently allowed to be. There is some excellent discussion about this in the comments. I'm going to have to investigate this further, but wanted to leave this note here as a warning.

Other people have given you answers quoting the appropriate parts of the standard that flat out state that the guarantee you think exists, doesn't. It appears that you're interpreting a part of the standard that says a certain weird behavior is permitted for atomic objects if you use memory_order_relaxed as meaning that this behavior is not permitted for non-atomic objects. This is a leap of inference that is explicitly addressed by other parts of the standard that declare the behavior undefined for non-atomic objects.

In practical terms, here is an order of events that might happen in thread 1 that would be perfectly reasonable, but result in the behavior you think is barred even if the hardware guaranteed that all memory access was completely serialized between CPUs. Keep in mind that the standard has to not only take into account the behavior of the hardware, but the behavior of optimizers, which often aggressively re-order and re-write code.

Thread 1 could be re-written by an optimizer to look this way:

old_y = y; // old_y is a hidden variable (perhaps a register) created by the optimizer
y = 42;
if (x != 42) y = old_y;

There might be perfectly reasonable reasons for an optimizer to do this. For example, it may decide that it's far more likely than not for 42 to be written into y, and for dependency reasons, the pipeline might work a lot better if the store into y occurs sooner rather than later.

The rule is that the apparent result must look as if the code you wrote is what was executed. But there is no requirement that the code you write bears any resemblance at all to what the CPU is actually told to do.

The atomic variables impose constraints on the ability of the compiler to re-write code as well as instructing the compiler to issue special CPU instructions that impose constraints on the ability of the CPU to re-order memory accesses. The constraints involving memory_order_relaxed are much stronger than what is ordinarily allowed. The compiler would generally be allowed to completely get rid of any reference to x and y at all if they weren't atomic.

Additionally, if they are atomic, the compiler must ensure that other CPUs see the entire variable as either with the new value or the old value. For example, if the variable is a 32-bit entity that crosses a cache line boundary and a modification involves changing bits on both sides of the cache line boundary, one CPU may see a value of the variable that is never written because it only sees an update to the bits on one side of the cache line boundary. But this is not allowed for atomic variables modified with memory_order_relaxed.

That is why data races are labeled as undefined behavior by the standard. The space of the possible things that could happen is probably a lot wilder than your imagination could account for, and certainly wider than any standard could reasonably encompass.

Atropos answered 19/6, 2019 at 19:27 Comment(12)
"The compiler would generally be allowed to completely get rid of any reference to x and y at all" how so? "if they weren't atomic." What is they are, then the compiler can't "completely get rid" of atomic variables?Rearmost
@Rearmost - No, the compiler cannot elide operations on atomic variables out of existence. Though if the atomic variables are local and the compiler can prove they are never shared, it can get rid of the variables completely. Non atomic variables that are shared (like they're global, or you've given a pointer to them to someone or something) can't be gotten rid of either. But any access to them can be removed as long as the final result looks as if the code you wrote is what was executed. That is not true for atomic variables. All accesses you put in the code must appear in the assembly.Atropos
@Rearmost - Of course, that 'no eliding accesses' rule is also true for volatile variables. But atomic variables have a few more constraints (like the cache line boundary constraint) that volatile variables don't.Atropos
"All accesses you put in the code must appear in the assembly." Why exactly?Rearmost
@Rearmost - I think I'm wrong about that actually. It is true for volatile. That's the rule there. For atomics it's more complicated. The memory orders you use dictate exactly which access may be elided. If you use sequential access, none of them may be. The semantics dictate that. It's because memory_order_seq_cst requires that the value you write be visible to all other threads that might read with memory_order_acquire or higher. And similarly, any read with memory_order_seq_cst must pick up any value written by another thread with memory_order_release or higher.Atropos
@Rearmost - For memory_order_relaxed accesses may be freely reordered or even elided. The only requirement is that any actual read or write must write or read the entire value and may not read or write a partial value.Atropos
@Rearmost and Omni: ISO C++11/14/17 as written allows compilers to optimize away multiple back-to-back atomic stores, but current compilers choose not to do so (treating them like volatile atomic) because there's no obvious way to do that without possibly doing things we don't want, like collapsing all the stores to update a progress bar counter into one at the end. See Why don't compilers merge redundant std::atomic writes? for details on current compilers and standards discussion / ideas.Kauppi
@Rearmost - You might find this question interesting - #12346987Atropos
The mechanism you propose (doing y=42 and then conditionally setting it back to the old value) is not generally legal. Compilers can't invent writes along paths that don't (in the C++ abstract machine) write y at all. That would introduce correctness problems if it turned out this thread shouldn't have written y and another thread was writing y at the same time. (@Rearmost we were talking about this problem in comments on another thread). IDK if value-prediction for loads + other crazy stuff could allow it on a hypothetical ISA.Kauppi
Update: posted an answer. I don't think r1=r2=42 is allowed for non-atomic variables. There's no UB in the C++ abstract machine: given those starting x and y values, neither thread writes x or y. Code that doesn't write a variable isn't allowed to disturb what other threads read from it, even if it conditionally might have.Kauppi
@PeterCordes Yes. This answer is incorrect in that it exposes the possible processor's "internal" speculative operation state to the program and assumes the compiler can do the same thing. Processor's internal state should be hidden from the program execution result, and should never be visible, let alone be "implemented" by the compiler. If they do, it is a bug no matter whether it is introduced by the processor design or the compiler implementation.Driskill
@Omnifarious: software speculation is allowed in some cases. e.g. if y was already unconditionally written with one value or another, e.g. y = condition ? a : b; could be compiled to y=b; then a conditional store of b if a compiler wanted to. But like I commented earlier, inventing writes to objects that aren't written along the correct path of execution is not legal.Kauppi
D
1

(Stackoverflow complains about too many comments I put above, so I gathered them into an answer with some modifications.)

The excerpt you cite from C++ standard working draft N3337 was wrong.

[Note: The requirements do allow r1 == r2 == 42 in the following example, with x and y initially zero:

// Thread 1:
r1 = x.load(memory_order_relaxed);
if (r1 == 42) y.store(r1, memory_order_relaxed);
// Thread 2:
r2 = y.load(memory_order_relaxed);
if (r2 == 42) x.store(42, memory_order_relaxed);

A programming language should never allow this "r1 == r2 == 42" outcome to happen. This has nothing to do with the memory model. It is required by causality, which is the basic methodology of logic and the foundation of any programming language design. It is the fundamental contract between human and computer. Any memory model should abide by it; otherwise it is a bug.

The causality here is reflected by the intra-thread dependences between operations within a thread, such as data dependence (e.g., a read after a write to the same location) and control dependence (e.g., an operation inside a branch). These cannot be violated by any language specification. Any compiler/processor design must respect these dependences in its committed result (i.e., the externally visible, program-visible result).

A memory model is mainly about memory operation ordering among multiple processors, and it should never violate intra-thread dependences, although a weak model may allow the causality established in one processor to appear violated (or unobserved) in another processor.

In your code snippet, both threads have (intra-thread) data dependence (load -> check) and control dependence (check -> store) that ensure their respective executions (within a thread) are ordered. That means we can check the later operation's output to determine whether the earlier operation has executed.

Then we can use simple logic to deduce that if both r1 and r2 are 42, there must be a dependence cycle, which is impossible unless you remove one of the condition checks, which essentially breaks the cycle. This has nothing to do with the memory model, only with intra-thread dependence.

Causality (or, more accurately, intra-thread dependence here) is defined in the C++ standard, but not so explicitly in early drafts, because dependence is more of a micro-architecture and compiler term. In a language spec it is usually expressed as operational semantics. For example, the control dependence formed by an "if statement" is defined in the same version of the draft you cited as "If the condition yields true the first substatement is executed." That defines the sequential execution order.

That said, the compiler and processor may schedule one or more operations of the if-branch to execute before the if-condition is resolved. But no matter how they schedule the operations, the result of the if-branch cannot be committed (i.e., become visible to the program) before the if-condition is resolved. One should distinguish between semantic requirements and implementation details: the former are the language spec, the latter are how the compiler and processor implement that spec.

Actually the current C++ standard draft has corrected this bug in https://timsong-cpp.github.io/cppwp/atomics.order#9 with a slight change.

[ Note: The recommendation similarly disallows r1 == r2 == 42 in the following example, with x and y again initially zero:

// Thread 1:
r1 = x.load(memory_order_relaxed);
if (r1 == 42) y.store(42, memory_order_relaxed);
// Thread 2:
r2 = y.load(memory_order_relaxed);
if (r2 == 42) x.store(42, memory_order_relaxed);

Driskill answered 23/11, 2019 at 20:38 Comment(10)
The causality here is reflected by the intra-thread dependences between operations within a thread, such as ... control dependence. That's a bit too strong. From within the same CPU core, you would see the operations in program order, but other cores don't have to. Branch prediction + speculative execution breaks control dependencies. If those were both loads, the second load could happen before the first, despite being inside a branch controlled by the first. (So for example, two ordered stores could be seen in the opposite order: LoadLoad reordering).Kauppi
But for a store, yes, it's necessary that all previous control and data dependencies are non-speculative before making a store visible to other threads, on any sane hardware.Kauppi
Note that formally the standard still only says "should" not "must". The disallows you bolded only applies if the implementation follows the recommendation in the previous bullet. But yes, this is much more forcefully worded than the previous "should disallow" at the bottom. Good idea to quote the new wording, though; I did the same in my answer (with a different choice of what to bold). Upvoted for the reasoning about sequential execution for non-atomic stuff; I don't think all your reasoning fully holds up but overall the right idea.Kauppi
@PeterCordes Yes, two loads for if-condition and if-branch can happen out of order (either scheduled by the compiler or by the processor pipeline), but the result cannot be visible to the program. That is, the loaded value in the if-branch cannot be stored into a variable of the program. This is (intra-thread) causality, not related to other thread or core. Other core does not have to see this causality (unless in a causality memory consistency model). They may see out-of-order. The point here is that, the semantics of a program (within thread) should always satisfy "intra-thread causality".Driskill
the loaded value in the if-branch cannot be stored into a variable of the program Yes it can. I'm pretty sure you can observe LoadLoad reordering in real life on a weakly-ordered ISA even with the 2nd load inside a dependent conditional branch in the asm, without compile-time hoisting.Kauppi
With all the loads + the branch not executed yet in the out-of-order backend, this order is possible: 2nd load takes a value from cache. 1st load takes a value from cache. cmp+branch confirms the branch speculation based on the 1st load's value. Then later the 2nd load value can get stored from a register into memory. The store has to wait for the branch to become non-speculative to leave the store buffer, but the critical time for loads is when they read L1d cache during exec, not when they retire.Kauppi
@PeterCordes It depends on how you define "visible". If the value is only stored into a variable but never used anywhere, i.e., does not lead to any effect, that is fine. From the processor's internal point of view this is always possible, while my point is from the external program's point of view. I guess this is where our definitions were not fully aligned.Driskill
Adding to my previous comment: or the thread could store both load results to global variables where other threads could see that it had observed loads out of order with some known store order. This is very much a real thing. Branch speculation breaks data dependencies. The asm dependency ordering that mo_consume was designed to model doesn't work across control dependencies, only ALU data dependencies. Anyway, I'm not sure if you were disagreeing with this concrete example or if we were just misunderstanding each other.Kauppi
@PeterCordes Yes, for observing threads, it is possible to see out-of-order loads, unless in a causality consistency model which is apparently not what the C++ std relaxed-order defines.Driskill
There's a reason the default is seq_cst: so intuition about causality does work. For LoadLoad it would also work if the first load is an acquire-load.Kauppi
