r1 == 0
is allowed by C++, and possible on most non-x86 hardware. At least if the two atomic variables end up in different cache lines, which you could force with alignas(64)
on each.
TL:DR:
x86 can't create this runtime reordering effect so you won't see it on Godbolt.
On other real hardware, the val=1
store can still be sitting in the private store buffer of t1 when the flag=true
store becomes visible to t2.
[atomics.order] p10 in the C++ standard doesn't actually say that atomic RMWs take the "latest value" in the mod order or any absolute sense, just last value before the store from the RMW. I and others have been mis-quoting it or going along with that misreading for years >.<
On real CPUs, loads always take the newest value this core can see at the moment when the load actually samples L1d cache or forwards from the store buffer (or both), whether it's the load side of an RMW or not. They don't intentionally take "stale" values, and it's hard to define "stale" in a useful way. (But they can load early, so by the time everything before them in program order has completed, the value they loaded might not still be the newest the CPU can see, especially on weakly-ordered ISAs.)
The modification order doesn't simply grow incrementally as the program runs. At least the C++ standard doesn't make any reference to that. You could say that a store committing to coherent cache (on a normal CPU) is when the mod order is nailed down, so order of cores getting MESI exclusive ownership to commit their stores is what establishes it. But newer values can still be flying around due to store-forwarding.
I think the intended way to look at a modification orders is that it covers the whole lifetime of a variable. Don't look at it as growing as the program runs. The mechanism for an order being established is an aspect of a program compiled for a real CPU, not the abstract machine.
Think about order, not time, and stop obsessing over "latest value". CPUs aren't out to screw you over by loading older values than necessary. For performance, inter-thread latency is something you can measure e.g. with a spin-wait loop and a store after leaving the loop, and measuring either timestamps on both cores or round-trip time. Not loading early so the value isn't "stale" by the time other loads have finished isn't better, and whether it's a problem depends on ordering requirements not performance. The value from a store isn't going to be available any sooner, it's just a question of whether your load ran early and took the value before it (which is pretty much the same as this thread being farther ahead of the thread doing the store, but CPUs want to load early and store late).
On x86 r1 == 0
can only happen due to compile-time reordering of the ops in thread t1. In x86 asm, the hardware memory model is program-order plus a store-buffer with store forwarding, so once the compiler chooses a static order for the operations, every asm store is effectively release
, every asm load is acquire
or actually seq_cst
, every RMW is seq_cst
and includes atomic_thread_fence(seq_cst)
. You need a CPU (or simulator) of a weakly-ordered ISA to see most interesting runtime reordering effects other than StoreLoad.
however, it can guarantee that #1 has already happened when #4 is evaluated.
It certainly hasn't "happened-before" in the sense of the C++ standard because both ops in thread t1 are relaxed
. #1 is only sequenced-before, not inter-thread happens-before #2. Other threads are allowed to observe those stores in either order1.
In the C++ formalism, this is a lack of any happens-before relationship between #1 and #2 when observed by other threads, including a lack of happens-before for a thread that has observed #2.
On real hardware where stores only become visible to other threads by committing to cache (which in most real HW is the only mechanism for inter-thread visibility), this is StoreStore reordering in the core doing the stores. x86 doesn't allow that, but most other ISAs are weakly-ordered and their hardware can do that in practice. After executing both stores, their values and addresses will be in the store buffer of the core running t1, waiting for exclusive ownership of the relevant cache line so the store can commit and become globally visible. If it gets ownership of the line holding flag
first, #2 can commit and become visible to other threads (in response to MESI share requests), all while the store to val
(#1) is still in the private store buffer of t1's core2 waiting for ownership of the cache line.
This is sufficient to lead to r1 == 0
. In thread t2, the spin-loop and exchange could both be seq_cst
. We don't need to invoke any weird trickery about speculative execution of the load side of the exchange
before the spin-wait loop saw the value its waiting for, and the exchange can fully commit val=2
to cache before before t1 commits val=1
.
So the final modification-order for val
once the dust settles will be 0, 2, 1
in that case. As we debated in another comment thread, you might think that means the modification order of val
only contained 0
when val.exchange
's load took a value so 0
was the last value, and say that the val=1
store hadn't yet become part of the modification order.
But t1 could have done tmp = val.load(relaxed)
and bar.store(tmp, relaxed)
after the other 2 stores (Godbolt), which can reload its own val=1
store via store-forwarding before it's visible to any other cores/threads, and then make that 1
it loaded from val
visible to t2 before it sees the val=1
store. (e.g. if bar
was in the same cache line as flag
, where t1 commits stores first in the runs that are interesting for this case.)
alignas(64) std::atomic<bool> flag {false}; // variables in separate cache lines
alignas(64) std::atomic<int> val {0};
alignas(64) std::atomic<int> bar {0};
using std::memory_order::relaxed, std::memory_order::seq_cst;
// Thread 1
val.store(1, relaxed);
flag.store(true, relaxed);
int tmp = val.load(relaxed); // can read val before the value is available to other threads
bar.store(tmp, relaxed); // this store can be the first one visible to t2
// Thread 2
while(!flag.load(relaxed)){} // #3
int rb = bar.load(seq_cst);
int rv = val.fetch_add(rb+10, seq_cst); // definitely can't reorder at compile time with bar.load
// rb == 1 && rv == 0 is allowed, and possible on AArch64 for example
So from t2's perspective in that case, it could be using seq_cst loads or RMWs and see val==0
, flag==true
, and bar==1
. We know that bar==1
came from t1 reading it from val
. Its position in the mod order of val
isn't nailed down until it actually commits to cache; before that all we can say is that it's later in the mod order than any stores that have committed to cache already, but there's still room for other entries to be come between those stores and this.
So do we have a conflict with C++ requiring the load side of atomic RMWs to take the latest value in the modification order? No, that's actually not what the C++ standard says.
[atomics.order]#10
Atomic read-modify-write operations shall always read the last value (in the modification order) written before the write associated with the read-modify-write operation.
Note that it does not say that no later values can exist in the mod order. It just says that the load must see the value from the mod order that immediately precedes the value from the store side of the RMW. This is all that's required for atomicity, which is the point of this requirement. (No store from another thread can come between the RMW's load and its store.)
It doesn't require anything like that later values in the mod order can't already be visible to other loads in this or other threads (although my phrasing here allows some possibilities that aren't plausible on real hardware).
The modification-order of each variable is a way to explain what happened during an execution of a multi-threaded program, but thinking about how it grows is less simple than you thought. Values can be read before all preceding entries in the mod order are established, and before their relative position is nailed down. (On most hardware, only by the thread that did the store. Or on some, by store-forwarding of graduated stores to other SMT logical cores.)
The C++ standard just talks about order and what values a load must or could see, not time. I don't see anywhere in the standard in [intro.races] or [atomics.order] that makes any reference to the current state of the modification order. The rules are only in terms of things like "where X precedes B in the modification order of M", so it's talking in terms of the mod order including all later values for the full run of the program. Nothing about a snapshot of a state when a load happens. (You can talk about that for real hardware, but the C++ standard has nothing to say about that because its formalism is very different from accessing coherent cache state that exists even if there are no readers.)
As [intro.races]#20 says:
Note 20: The value observed by a load of an atomic depends on the “happens before” relation, which depends on the values observed by loads of atomics. The intended reading is that there must exist an association of atomic loads with modifications they observe that, together with suitably chosen modification orders and the “happens before” relation derived as described above, satisfy the resulting constraints as imposed here
Footnote 1: C++'s abstract machine even allows different threads using .load(acquire)
in opposite orders to disagree about which store became visible first. On real hardware disagreement about the order of two stores is normally only talked about when they're done by two separate threads, the IRIW litmus test which IBM POWER hardware can demonstrate.
But it should also be possible on POWER with a single writer thread doing both relaxed stores: One sees the first but not the second (program order) from store-forwarding between SMT threads. Another thread on a separate physical core sees them only through coherent cache and sees the second but not the first if they were in different cache lines and StoreStore reordering happened.
Footnote 2: practical demonstration of the effect on real HW
We can make that more likely in practice by e.g. having t1 write another variable in the same cache line as flag
before it does this, and/or something to create extra contention for the cache line holding val
, like another thread spamming stores or RMWs to it.
To actually observe in practice, you also need t2 to run during this small time window, unlikely if you just start them both with std::thread
constructors, so you might instead have them both spin-wait for a variable set by the main thread after a short sleep, so there's a decent chance they actually start processing the relevant machine instructions within a few tens of clock cycles of each other. Or set things up so you can do multiple tries per invocation of the program, since one attempt per program startup and thread creation is a lot of overhead for the amount of tries you need to maybe actually see something.
e.g. see https://preshing.com/20120515/memory-reordering-caught-in-the-act/ for an example of repeating a test within the same program.
Update: bonus questions
Q1: yes, in the abstract machine at least, #2 can read the 3
stored by #5. Since all of 1-3 are relaxed
, the program can execute as-if #3 was first in source order, followed by #1 and #2 (which can't be reordered with each other because they both access the same variable.)
I'm not 100% confident that real AArch64 hardware for example can let #2 read the 3
stored by #5 (without compile-time reordering which is allowed and can make it happen anywhere). It would require the flag.store
to commit to cache before an earlier RMW exchange, and committing to cache (or forwarding to another SMT logical core on PowerPC) can only happen after an instruction retires. And that can only happen after preceding instructions retire, since in-order retirement is how CPUs make exceptions precise and know that any mis-speculation has been detected.
I don't know if an ARMv8.1 single-instruction RMW can retire without having finished even the load part; I know a pure load can retire before the value is actually available on some non-x86 microarchitectures. But the store doesn't have its data ready yet so it be a special case in the store buffer. So even though exchange
doesn't need any ALU work to produce the value from the load result, it would still be a special case that an architect might choose not to support, but could I think choose to allow.
Some ISAs (including a planned x86 extension, RAO-INT) allow "remote atomics" where the CPU tells an L3 cache slice to do the atomic operation. x86's RAO-INT doesn't even send back a result, just fire and forget e.g. for a kernel incrementing an event counter or CPU-usage total that user-space might want to read later, not something it needs the results of now. It's 100% plausible that a CPU core could send out such a request which gets queued somewhere until after it commits its plain store to flag
. If we don't actually use the return value of #2, the compiler can still use such an instruction and we can still talk about what value it read based on what preceded it in the modification order.
Other hardware designs could support a remote-atomic that does the RMW closer to memory and sends a result back to the core, so once the core sent it out it would just be waiting for a value, like a load, and thus could let it retire from out-of-order exec.
Q2: #2 reading 1
and #5 reading 0
is allowed by C++, and should be possible on real hardware. t1's flag.store
(#3) commits before anything to val
, allowing #5 to read 0
and write 3
. Then #1 and #2 follow it in the modification order of val
so #2 reads the 1
written by #1.
It would also be possible for just #1 to commit to cache and be visible to #5, then #2 and #3 enter the store buffer and #3 commits first. e.g. if an interrupt happens after #1, or if the core gives up MESI ownership after committing #1 but before #2. So a mod order for val
of 0, 1, 3, 2
.