Is the order of a side effect in the modification order determined by when the side effect is produced?

#include <memory>
#include <atomic>
#include <iostream>
#include <thread>
#include <cassert>
int main()
{
    std::atomic<bool> flag = {false};
    std::atomic<int> val = {0};
    std::thread t1([&](){
        val.store(1,std::memory_order::relaxed);  // #1
        flag.store(true,std::memory_order::relaxed);  // #2
    });
    std::thread t2([&](){
        while(!flag.load(std::memory_order::relaxed)){}  // #3
        auto r = val.exchange(2,std::memory_order::relaxed); // #4
        assert(r!=0);  // #5
    });
    t1.join();
    t2.join();
}

In this example, #3 does not synchronize with #2, however, it can guarantee that #1 has already happened when #4 is evaluated. Does that mean the modification order of val is {0, 1} when #4 is evaluated, such that the read part of this read-modify-write operation is guaranteed to read 1, as per [atomics.order] p10?

Atomic read-modify-write operations shall always read the last value (in the modification order) written before the write associated with the read-modify-write operation.

Is the assert at #5 guaranteed not to fire? It is not triggered in this run: https://godbolt.org/z/G8KzK3o8W

Update:

#include <memory>
#include <atomic>
#include <iostream>
#include <thread>
#include <cassert>
int main()
{
    std::atomic<bool> flag = {false};
    std::atomic<int> val = {0};
    std::thread t1([&](){
        val.store(1,std::memory_order::relaxed);  // #1
        val.exchange(2,std::memory_order::relaxed); //#2
        flag.store(true,std::memory_order::relaxed);  // #3
    });
    std::thread t2([&](){
        while(!flag.load(std::memory_order::relaxed)){}  // #4
        // auto r = val.load(std::memory_order::relaxed);
        auto r = val.exchange(3,std::memory_order::relaxed); // #5
    });
    t1.join();
    t2.join();
}

Q1:

In this example, what are the possible values #2 can return? Returning the 1 stored by #1 is the intuitive case. Can #2 read the value 3 stored by #5?

Q2:

If #2 reads 1, does that mean #5 must read the 2 stored by #2, or can it also read 0? In the same situation, if #5 were a pure load, as written in the comment, could the result be 0, 1, or 2?

Calan answered 19/9, 2024 at 14:22. Comments (3):
x86 CPUs don't support relaxed ordering, and treat it the same as acquire/release (or even sequential consistency; I forget which). You need more exotic hardware to be able to actually observe the quirks of relaxed ordering. "Assert is not triggered when I run this program" doesn't really prove much.Modie
"can guarantee that #1 has already happened when #4 is evaluated" how does that follow?Yachting
What do you mean when you say 2 does not synchronize with 3? Of all the uncertainty in the example, 2 and 3 are certainly "synchronized" (meaning 3 is guaranteed to see TRUE at some point)Moscow

r == 0 is allowed by C++, and is possible on most non-x86 hardware, at least if the two atomic variables end up in different cache lines, which you could force with alignas(64) on each.

TL:DR:

  • x86 can't create this runtime reordering effect so you won't see it on Godbolt.

  • On other real hardware, the val=1 store can still be sitting in the private store buffer of t1 when the flag=true store becomes visible to t2.

  • [atomics.order] p10 in the C++ standard doesn't actually say that atomic RMWs take the "latest value" in the mod order in any absolute sense, just the last value before the store side of the RMW. I and others have been mis-quoting it, or going along with that misreading, for years >.<

    On real CPUs, loads always take the newest value this core can see at the moment when the load actually samples L1d cache or forwards from the store buffer (or both), whether it's the load side of an RMW or not. They don't intentionally take "stale" values, and it's hard to define "stale" in a useful way. (But they can load early, so by the time everything before them in program order has completed, the value they loaded might not still be the newest the CPU can see, especially on weakly-ordered ISAs.)

  • The modification order doesn't simply grow incrementally as the program runs. At least the C++ standard doesn't make any reference to that. You could say that a store committing to coherent cache (on a normal CPU) is when the mod order is nailed down, so the order in which cores get MESI exclusive ownership to commit their stores is what establishes it. But newer values can still be flying around due to store-forwarding.

    I think the intended way to look at a modification order is that it covers the whole lifetime of a variable. Don't look at it as growing as the program runs. The mechanism for an order being established is an aspect of a program compiled for a real CPU, not of the abstract machine.

  • Think about order, not time, and stop obsessing over "latest value". CPUs aren't out to screw you over by loading older values than necessary. For performance, inter-thread latency is something you can measure, e.g. with a spin-wait loop and a store after leaving the loop, comparing either timestamps on both cores or the round-trip time (see the sketch just below this list). Refusing to load early so the value isn't "stale" by the time other loads have finished isn't better; whether loading early is a problem depends on ordering requirements, not performance. The value from a store isn't going to be available any sooner; it's just a question of whether your load ran early and took the value before it (which is pretty much the same as this thread being farther ahead of the thread doing the store, but CPUs want to load early and store late).
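
To make that measurement idea concrete, here is a minimal ping-pong round-trip sketch (my own construction, not code from this answer): one thread stores, the other spins until it sees the store and replies, and the first thread times the round trip.

    #include <atomic>
    #include <chrono>
    #include <iostream>
    #include <thread>

    int main()
    {
        std::atomic<int> ping{0}, pong{0};
        std::thread responder([&]{
            while (ping.load(std::memory_order::relaxed) == 0) {}  // spin until the store is visible
            pong.store(1, std::memory_order::relaxed);             // reply as soon as we see it
        });
        auto t0 = std::chrono::steady_clock::now();
        ping.store(1, std::memory_order::relaxed);
        while (pong.load(std::memory_order::relaxed) == 0) {}      // wait for the reply
        auto t1 = std::chrono::steady_clock::now();
        responder.join();
        // The round trip is roughly twice the one-way inter-thread latency, plus
        // thread-startup noise; a real measurement would repeat this many times.
        std::cout << std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count()
                  << " ns round trip\n";
    }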


On x86, r == 0 can only happen due to compile-time reordering of the ops in thread t1. In x86 asm, the hardware memory model is program order plus a store buffer with store forwarding, so once the compiler chooses a static order for the operations, every asm store is effectively release, every asm load is acquire (in fact seq_cst), and every RMW is seq_cst and includes the equivalent of atomic_thread_fence(seq_cst). You need a CPU (or simulator) of a weakly-ordered ISA to see most interesting runtime reordering effects other than StoreLoad.
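
To make the compile-time half of that concrete, here is a sketch (my illustration, not the OP's code) of the transformation a compiler is allowed to make to t1's body, since the two stores are relaxed and to different objects:

    #include <atomic>

    std::atomic<bool> flag{false};
    std::atomic<int> val{0};

    // What a compiler may emit for t1's body: the two relaxed stores are to
    // different objects, so swapping them is a legal as-if transformation,
    // and it makes r == 0 possible even on strongly-ordered x86.
    void t1_as_compiled()
    {
        flag.store(true, std::memory_order::relaxed);  // #2, hoisted above #1
        val.store(1, std::memory_order::relaxed);      // #1
    }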

however, it can guarantee that #1 has already happened when #4 is evaluated.

It certainly hasn't "happened-before" in the sense of the C++ standard, because both ops in thread t1 are relaxed. #1 is only sequenced before #2; there is no release/acquire pair to create an inter-thread happens-before edge, so other threads are allowed to observe those stores in either order.¹

In the C++ formalism, this is a lack of any happens-before relationship between #1 and the operations of other threads, including a lack of happens-before for a thread that has observed #2.
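
For contrast, here is a variation of mine (not the OP's code) where the assert genuinely cannot fire: upgrading #2 to release and #3 to acquire creates a synchronizes-with edge, so #1 inter-thread happens-before #4.

    #include <atomic>
    #include <cassert>
    #include <thread>

    int main()
    {
        std::atomic<bool> flag{false};
        std::atomic<int> val{0};
        std::thread t1([&]{
            val.store(1, std::memory_order::relaxed);      // #1
            flag.store(true, std::memory_order::release);  // #2: release store
        });
        std::thread t2([&]{
            while (!flag.load(std::memory_order::acquire)) {}      // #3: acquire, synchronizes with #2
            auto r = val.exchange(2, std::memory_order::relaxed);  // #4: #1 happens-before this
            assert(r == 1);  // guaranteed: #1 precedes #4 in val's mod order, so the load sees 1
        });
        t1.join();
        t2.join();
    }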

On real hardware where stores only become visible to other threads by committing to cache (which in most real HW is the only mechanism for inter-thread visibility), this is StoreStore reordering in the core doing the stores. x86 doesn't allow that, but most other ISAs are weakly-ordered and their hardware can do that in practice. After executing both stores, their values and addresses will be in the store buffer of the core running t1, waiting for exclusive ownership of the relevant cache line so the store can commit and become globally visible. If it gets ownership of the line holding flag first, #2 can commit and become visible to other threads (in response to MESI share requests), all while the store to val (#1) is still in the private store buffer of t1's core² waiting for ownership of the cache line.

This is sufficient to lead to r == 0. In thread t2, the spin-loop and exchange could even both be seq_cst. We don't need to invoke any weird trickery about speculative execution of the load side of the exchange before the spin-wait loop saw the value it's waiting for, and the exchange can fully commit val=2 to cache before t1 commits val=1.

So the final modification-order for val once the dust settles will be 0, 2, 1 in that case. As we debated in another comment thread, you might think that means the modification order of val only contained 0 when val.exchange's load took a value so 0 was the last value, and say that the val=1 store hadn't yet become part of the modification order.

But t1 could have done tmp = val.load(relaxed) and bar.store(tmp, relaxed) after the other 2 stores (Godbolt), which can reload its own val=1 store via store-forwarding before it's visible to any other cores/threads, and then make that 1 it loaded from val visible to t2 before it sees the val=1 store. (e.g. if bar was in the same cache line as flag, where t1 commits stores first in the runs that are interesting for this case.)

    #include <atomic>
    #include <thread>

    int main()
    {
        alignas(64) std::atomic<bool> flag {false};  // variables in separate cache lines
        alignas(64) std::atomic<int> val {0};
        alignas(64) std::atomic<int> bar {0};
        using std::memory_order::relaxed, std::memory_order::seq_cst;

        std::thread t1([&]{
            val.store(1, relaxed);
            flag.store(true, relaxed);
            int tmp = val.load(relaxed);  // can read val before the value is available to other threads
            bar.store(tmp, relaxed);      // this store can be the first one visible to t2
        });
        std::thread t2([&]{
            while(!flag.load(relaxed)){}  // #3
            int rb = bar.load(seq_cst);
            int rv = val.fetch_add(rb+10, seq_cst);  // definitely can't reorder at compile time with bar.load
            // rb == 1 && rv == 0 is allowed, and possible on AArch64 for example
            (void)rv;
        });
        t1.join();
        t2.join();
    }

So from t2's perspective in that case, it could be using seq_cst loads or RMWs and see val==0, flag==true, and bar==1. We know the bar==1 came from t1 reading it from val. Its position in the mod order of val isn't nailed down until it actually commits to cache; before that, all we can say is that it's later in the mod order than any stores that have already committed to cache, but there's still room for other entries to come between those stores and this one.

So do we have a conflict with C++ requiring the load side of atomic RMWs to take the latest value in the modification order? No, that's actually not what the C++ standard says.

[atomics.order]#10

Atomic read-modify-write operations shall always read the last value (in the modification order) written before the write associated with the read-modify-write operation.

Note that it does not say that no later values can exist in the mod order. It just says that the load must see the value from the mod order that immediately precedes the value from the store side of the RMW. This is all that's required for atomicity, which is the point of this requirement. (No store from another thread can come between the RMW's load and its store.)
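
As a concrete illustration of what that atomicity buys (a minimal sketch of my own, not from the question): with concurrent relaxed increments, each fetch_add's load reads the value immediately preceding its own store in the mod order, so no increment can be lost.

    #include <atomic>
    #include <cassert>
    #include <thread>
    #include <vector>

    int main()
    {
        std::atomic<int> counter{0};
        std::vector<std::thread> threads;
        for (int i = 0; i < 4; ++i)
            threads.emplace_back([&]{
                for (int j = 0; j < 100000; ++j)
                    // The load side reads the value immediately before this RMW's own
                    // store in counter's mod order; no other store can slip between.
                    counter.fetch_add(1, std::memory_order::relaxed);
            });
        for (auto& t : threads)
            t.join();
        assert(counter.load() == 400000);  // every increment is accounted for
    }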

It doesn't require that later values in the mod order not already be visible to other loads in this or other threads (although my phrasing here allows some possibilities that aren't plausible on real hardware).

The modification-order of each variable is a way to explain what happened during an execution of a multi-threaded program, but thinking about how it grows is less simple than you might think. Values can be read before all preceding entries in the mod order are established, and before their own relative position is nailed down. (On most hardware, only by the thread that did the store; or on some, by store-forwarding of graduated stores to other SMT logical cores.)

The C++ standard just talks about order and what values a load must or could see, not time. I don't see anything in [intro.races] or [atomics.order] that makes any reference to the current state of the modification order. The rules are only in terms of things like "where X precedes B in the modification order of M", so it's talking in terms of the mod order including all later values for the full run of the program; nothing about a snapshot of its state when a load happens. (You can talk about that for real hardware, but the C++ standard has nothing to say about it because its formalism is very different from accessing coherent cache state that exists even if there are no readers.)
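
Write-write coherence is one such rule; a trivial sketch of my own: two stores sequenced in one thread must appear in that order in the modification order, no matter when or whether any other thread observes them.

    #include <atomic>

    std::atomic<int> v{0};

    void writer()
    {
        v.store(1, std::memory_order::relaxed);
        v.store(2, std::memory_order::relaxed);
        // Write-write coherence: v's mod order must be {0, 1, 2}. Any thread
        // that has loaded 2 can never subsequently load 1, regardless of timing.
    }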

As [intro.races]#20 says:

Note 20: The value observed by a load of an atomic depends on the “happens before” relation, which depends on the values observed by loads of atomics. The intended reading is that there must exist an association of atomic loads with modifications they observe that, together with suitably chosen modification orders and the “happens before” relation derived as described above, satisfy the resulting constraints as imposed here.


Footnote 1: C++'s abstract machine even allows two reader threads, each using .load(acquire) on the two variables in opposite orders, to disagree about which store became visible first. On real hardware, disagreement about the order of two stores is normally only talked about when they're done by two separate threads: the IRIW litmus test, which IBM POWER hardware can demonstrate.

But it should also be possible on POWER with a single writer thread doing both relaxed stores: one reader sees the first but not the second (in program order) via store-forwarding between SMT threads; another thread on a separate physical core sees them only through coherent cache, and sees the second but not the first if they were in different cache lines and StoreStore reordering happened.
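
The IRIW litmus test mentioned above looks like this (my sketch, with acquire loads rather than seq_cst so the disagreement is permitted):

    #include <atomic>
    #include <cstdio>
    #include <thread>

    int main()
    {
        std::atomic<int> x{0}, y{0};
        int r1 = 0, r2 = 0, r3 = 0, r4 = 0;
        std::thread w1([&]{ x.store(1, std::memory_order::relaxed); });
        std::thread w2([&]{ y.store(1, std::memory_order::relaxed); });
        std::thread rd1([&]{
            r1 = x.load(std::memory_order::acquire);
            r2 = y.load(std::memory_order::acquire);
        });
        std::thread rd2([&]{
            r3 = y.load(std::memory_order::acquire);
            r4 = x.load(std::memory_order::acquire);
        });
        w1.join(); w2.join(); rd1.join(); rd2.join();
        // r1==1 && r2==0 && r3==1 && r4==0 means the two readers disagreed about
        // which store became visible first: allowed here, but forbidden if all
        // six operations were seq_cst.
        std::printf("%d %d %d %d\n", r1, r2, r3, r4);
    }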

Footnote 2: practical demonstration of the effect on real HW
We can make that more likely in practice by e.g. having t1 write another variable in the same cache line as flag before it does this, and/or something to create extra contention for the cache line holding val, like another thread spamming stores or RMWs to it.

To actually observe it in practice, you also need t2 to run during this small time window, which is unlikely if you just start both threads with std::thread constructors. So you might instead have them both spin-wait for a variable set by the main thread after a short sleep, so there's a decent chance they start processing the relevant machine instructions within a few tens of clock cycles of each other. Or set things up so you can do multiple tries per invocation of the program, since one attempt per program startup and thread creation is a lot of overhead for the number of tries you need to maybe actually see something; see https://preshing.com/20120515/memory-reordering-caught-in-the-act/ for an example of repeating a test within the same program.
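
Here is a rough harness sketch of my own (not the Preshing code) combining both ideas for the question's first example: a per-round "starting gun" plus many trials per process.

    #include <atomic>
    #include <cstdio>
    #include <thread>

    constexpr int ITERS = 100000;
    alignas(64) std::atomic<bool> flag{false};
    alignas(64) std::atomic<int> val{0};
    alignas(64) std::atomic<int> go{0};    // round number: the "starting gun"
    alignas(64) std::atomic<int> done{0};

    int main()
    {
        int zeros = 0;  // rounds where the exchange read 0 (expect 0 on x86)
        std::thread t1([]{
            for (int round = 1; round <= ITERS; ++round) {
                while (go.load(std::memory_order::acquire) < round) {}  // wait for the gun
                val.store(1, std::memory_order::relaxed);               // #1
                flag.store(true, std::memory_order::relaxed);           // #2
                done.fetch_add(1, std::memory_order::release);
            }
        });
        std::thread t2([&]{
            for (int round = 1; round <= ITERS; ++round) {
                while (go.load(std::memory_order::acquire) < round) {}
                while (!flag.load(std::memory_order::relaxed)) {}       // #3
                if (val.exchange(2, std::memory_order::relaxed) == 0)   // #4
                    ++zeros;
                done.fetch_add(1, std::memory_order::release);
            }
        });
        for (int round = 1; round <= ITERS; ++round) {
            flag.store(false, std::memory_order::relaxed);  // reset test state
            val.store(0, std::memory_order::relaxed);
            go.store(round, std::memory_order::release);    // fire: both threads start together
            while (done.load(std::memory_order::acquire) < 2 * round) {}  // wait for both
        }
        t1.join();
        t2.join();
        std::printf("r == 0 observed in %d of %d rounds\n", zeros, ITERS);
    }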


Update: bonus questions

Q1: yes, in the abstract machine at least, #2 can read the 3 stored by #5. Since all of #1-#3 are relaxed, the program can execute as if #3 were first in source order, followed by #1 and #2 (which can't be reordered with each other because they both access the same variable).
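
Concretely, that as-if execution of t1's body from the update could look like this (my rewriting; the prev name is one I introduced):

    #include <atomic>

    std::atomic<bool> flag{false};
    std::atomic<int> val{0};

    // t1's body, as-if transformed: #3 needs no ordering with the val operations,
    // but #1 and #2 must keep their relative order (same variable).
    void t1_as_if()
    {
        flag.store(true, std::memory_order::relaxed);  // #3 hoisted to first
        // t2 can now leave its spin loop; if its #5 (val.exchange(3)) lands between
        // #1 and #2 in val's mod order, giving {0, 1, 3, 2}, then:
        val.store(1, std::memory_order::relaxed);      // #1 (the value #5 reads)
        int prev = val.exchange(2, std::memory_order::relaxed);  // #2: reads the 3 from #5
        (void)prev;
    }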

I'm not 100% confident that real AArch64 hardware, for example, can let #2 read the 3 stored by #5 (without compile-time reordering, which is allowed and can make it happen anywhere). It would require the flag.store to commit to cache before an earlier RMW (the exchange), and committing to cache (or forwarding to another SMT logical core on PowerPC) can only happen after an instruction retires. That in turn can only happen after preceding instructions retire, since in-order retirement is how CPUs make exceptions precise and know that any mis-speculation has been detected.

I don't know if an ARMv8.1 single-instruction RMW can retire without having finished even the load part; I know a pure load can retire before the value is actually available on some non-x86 microarchitectures. But the store doesn't have its data ready yet, so it would be a special case in the store buffer. So even though exchange doesn't need any ALU work to produce the value from the load result, it would still be a special case that an architect might choose not to support, but could, I think, choose to allow.

Some ISAs (including a planned x86 extension, RAO-INT) allow "remote atomics" where the CPU tells an L3 cache slice to do the atomic operation. x86's RAO-INT doesn't even send back a result, just fire and forget e.g. for a kernel incrementing an event counter or CPU-usage total that user-space might want to read later, not something it needs the results of now. It's 100% plausible that a CPU core could send out such a request which gets queued somewhere until after it commits its plain store to flag. If we don't actually use the return value of #2, the compiler can still use such an instruction and we can still talk about what value it read based on what preceded it in the modification order.

Other hardware designs could support a remote-atomic that does the RMW closer to memory and sends a result back to the core, so once the core sent it out it would just be waiting for a value, like a load, and thus could let it retire from out-of-order exec.

Q2: #2 reading 1 and #5 reading 0 is allowed by C++, and should be possible on real hardware. t1's flag.store (#3) commits before anything to val, allowing #5 to read 0 and write 3. Then #1 and #2 follow it in the modification order of val so #2 reads the 1 written by #1.

It would also be possible for just #1 to commit to cache and be visible to #5, then #2 and #3 enter the store buffer and #3 commits first. e.g. if an interrupt happens after #1, or if the core gives up MESI ownership after committing #1 but before #2. So a mod order for val of 0, 1, 3, 2.

Moravia answered 19/9, 2024 at 22:56. Comments (100):
After reading your answer, I raised another question, see my updated part.Calan
@xmh0511: Q1: yes, in the abstract machine at least, #2 can read the 3 stored by #5. Q2: #2 reading 1 and #5 reading 0 is allowed by C++, and should be possible on real hardware. t1's flag.store (#3) commits before anything to val, allowing #5 to read 0 and write 3. Then #1 and #2 follow it in the modification order of val. It would also be possible for just #1 to commit to cache and be visible to #5, then #2 follows. e.g. if an interrupt happens after #1 so that it commits, or if the core gives up MESI ownership after committing #1 but before #2.Moravia
That is to say, if #2 reads 1, then #5 can return 0 or 2 but not 1. If #5 were auto r = val.load(std::memory_order::relaxed);, in the same situation where #2 reads 1, can r be 0, 1, or 2, i.e. one extra possible result (1) compared with the original #5, right?Calan
@xmh0511: switched from commenting to editing after one comment, so that was just a partial answer. Is your last comment still anything I didn't already address? You're correct that only one of the RMWs can read 1, because they're both RMWs so whichever goes first replaces the value before the other can read.Moravia
OK, the answer seems to solve my confusion: The modification order doesn't grow incrementally as the program runs, in other words, the side effect produced by a store operation is not immediately placed/nailed into the modification order. The modification order is determined/nailed down by some observed load operations and other coherent constraints.Calan
<<As we debated in another comment thread, you might think that means the modification order of val only contained 0 when val.exchange's load took a value so 0 was the last value>> If val.exchange read 0 in the first example, it does mean the side effect 1 is not in the modification order during the read, doesn't it? Even though the same thread can read the value it wrote, that is regulated by [intro.races] p18.Calan
@xmh0511: As I said in this answer, the modification order of a variable includes all modifications made to it over its lifetime. You quickly run into things that don't make sense when trying to talk about a mod order that's only partially constructed at the point of some operation taking place, and I don't think it's useful. Looking again at the language in the standard made me realize that the authors of the standard probably didn't intend anyone to think about them that way.Moravia
@xmh0511: so TL:DR: I'm not interested in discussing whether a side-effect is or isn't part of the mod-order yet at some point. There's no language-lawyer basis for such a discussion because the C++ formalism doesn't work that way. If you want to talk about how real CPUs work and when changes commit to cache or become visible to other threads by another mechanism, or visible to OoO exec of loads in the same thread, then there's something to talk about. The store of a 1 isn't visible to anything until the load side of the RMW gets a value. Unless a CPUs can store-forward it to a later load?Moravia
Ok, the modification order of an atomic object is defined in terms of the whole completed program; there is no snapshot at some time point (i.e. no partial modification order exists midway). However, this intended meaning is not embodied in the current wording, which is a language-lawyer question.Calan
@PeterCordes why did you write "The C++ standard doesn't actually say that atomic RMWs take the "latest value" in the mod order"? Here is the corresponding quote from the C++17 standard: "Atomic read-modify-write operations shall always read the last value (in the modification order) written before the write associated with the read-modify-write operation." That sounds pretty clear to me?Angstrom
@mpoeter: "latest value" is a different thing than "last value before something else". "Latest" implies newest in time, as if there was some way to deem this value the "current" one, which is how I had thought of this guarantee until this question made me look at it more closely. "Last before the value from the store" only implies ordering, not freshness or anything about time. Also, "latest value in the mod order" implies that the mod order grows as the program runs, instead of covering the past and future for the whole lifetime of the variable.Moravia
"All modifications to a particular atomic object M occur in some particular total order, called the modification order of M" and "... shall always read the last value (in the modification order) written before the write associated with the read-modify-write". The write of the RMW is obviously also a modification that goes into the modification order (MO). And since the MO is a total order, IMHO this simply means that the RMW will read the value that precedes its own write in MO (which is the latest value from the perspective of this particular RMW operation).Angstrom
@mpoeter: Yes, what you said is true, and isn't what I was disagreeing with. What's incorrect is the idea that when the load side of an RMW runs, there are no values in the modification order after the one it reads. That's how many people (including myself) were thinking about the phrase "latest value" (not "latest value before x"). As this answer discusses, other threads can already have read later values in the mod order. (On real hardware where we can talk about time, those stores by other cores might not have established their position in the mod order yet.)Moravia
Ah ok, got you! You are more rooted in the actual hardware and how the ISAs work while I am more focused on the C++ definition. :) Even though the MO has a single total order, it does not mean that different threads cannot have different views of the MO. They can't see different orders but different "states". I.e., some values can already be visible to thread T1 but not yet to thread T2. However, MOs are per value, so threads can have different views per value. E.g, given values A and B, T1 may have a more current view on A but an "old" view on B, while for T2 it is the other way round.Angstrom
@PeterCordes Do you mean a modification can be inserted between two other modifications? For example, given mod order {v0, v1}, if a RMW operation reads v0, will its write be inserted between v0 and v1? In other words, a modification to the modification order does not necessarily append at the end?Calan
@xmh0511: In a language-lawyer sense, no, that's a nonsensical statement, the mod-order doesn't grow incrementally in any order, either its own or some other. It's part of an explanation for what happened once the dust settles. e.g. on real hardware (so stepping outside the language-lawyer world), all pending stores have committed from store-buffers where they were visible to one or a couple cores but not all.Moravia
@PeterCordes So, from the perspective of the language-lawyer, an executed store operation does not mean that its side effect is already in the mod order? "the mod-order doesn't grow incrementally in any order" I didn't say that. I meant a side effect can be inserted between other modifications as long as the final order is valid.Calan
@xmh0511: you're still talking about the mod-order growing over time. There's no language-lawyer basis for ever talking about a partially-constructed mod-order.Moravia
@PeterCordes you're still talking about the mod-order growing over time. I talked about the mod-order changing as the corresponding store operation is evaluated; this is defined in [intro.races] p4. If there is no such progress, how is the mod-order formed?Calan
@xmh0511: My answer quotes [intro.races] p20, which is a note talking about how to think about the mod order as a way to explain what happened and constraints on what is possible, not as something built as the program runs. To me, it implies a way of thinking about it as including all future values that will exist in the lifetime of the object. Progress doesn't add things to the mod order. You could say it makes later entries visible I guess. (To this thread first, then potentially to some threads before others).Moravia
@xmh0511: As for how the mod order is determined, that depends on implementation details that the standard doesn't specify. On real CPUs, which use some flavour of MESI cache coherence, the mod order for a variable depends on the order of cores gaining exclusive ownership of the cache line to commit stores. When multiple cores want access, some kind of hardware arbitration happens. As I showed in this answer, this happens long after the values are visible to the thread that did the storeMoravia
And of course we're far outside anything in the standard. It doesn't say how implementations pick an order, it just lays down some constraints that mod orders must satisfy (e.g. being consistent with program-order for stores by the same thread (write-write coherence), and consistent with happens-before relationships.)Moravia
@PeterCordes From the perspective of the standard, talking about mod-order: given //thread 1: v.store(1,relaxed); //thread 2: v.store(2,relaxed); // thread 3: v.exchange(3, relaxed); and v initialized to 0, what is the mod-order of this program? I think of it this way: first, the pure stores have no constraints on their side effects in the mod-order, hence the mod-order could be {0,1,2} or {0,2,1}; then the RMW operation in thread 3 has the constraint that its read must immediately precede its write, so if the RMW read 0, the mod order is {0,3,1,2} or {0,3,2,1}, the write is inserted.Calan
the final possible mod-orders for this program vary with what value the RMW read, so the possible mod-orders could be {0,3,1,2}, {0,1,3,2}, {0,1,2,3}, {0,3,2,1}, {0,2,3,1}, {0,2,1,3}Calan
@PeterCordes It seems that, with the current mental model about mod-order, I can understand What's incorrect is the idea that when the load side of an RMW runs, there are no values in the modification order after the one it reads. .. other threads can already have read later values in the mod order. For example: atomic<int>v = {0}; //thread 1: v.exchange(2,relaxed); // thread 2: v.store(3,relaxed); v.load(relaxed); In thread 2, its load can see the value 3, hence this side effect is in the mod-order. When the RMW in thread 1 reads 0, in its perspective, the mod-order can be {0, 3}Calan
given the guarantee of [atomics.order] p10, the whole mod-order in this program is as if the write of the RMW is inserted between 0 and 3, finally forming the mod-order {0,2,3}, and this mental model admits that there can exist values in the modification order after the one the RMW reads. Is this a correct understanding of mod-order?Calan
@xmh0511: Yes, all 6 permutations are possible mod orders for your first example. We have three independent writes with nothing to order them even if they were seq_cst. The load side of the exchange has no effect on the value it will store (unlike other RMWs like fetch_add) so it's a separate question what values it might have loaded. If you construct trivial examples like that, sure they can be explained in terms of a mod-order that grows as the program runs.Moravia
The modification-order doesn't have to be the same as the order of stores becoming visible to other threads as the program runs. e.g. after a v.store, later loads of v in this thread can only see that store or later modifications, but this modification potentially became visible to this thread well before stores from other threads that are earlier in the mod order. So those values were never able to be loaded in this run of the program. Talking about stores becoming visible with time makes sense, talking about the mod order growing doesn't seem useful. There's nothing to gain from it.Moravia
@PeterCordes The modification-order doesn't have to be the same as the order of stores becoming visible to other threads as the program runs. Yes, but the modification should be in the modification order once the store operation is evaluated, per [intro.races] p4. talking about the mod order growing doesn't seem useful. There's nothing to gain from it. Do you mean the mod order of an atomic object has no process of forming?Calan
@xmh0511: Yes, I do mean the mod order has no process of forming (in the abstract machine). It simply exists, spanning the past and future for the full lifetime of the object. Forming it is an implementation detail. As I've mentioned before, on real hardware with MESI cache coherency, it's the order stores eventually committed to cache, long after values have potentially been read. ... should [be] in the modification order once the store operation is evaluated per [intro.races] p4 - I'm arguing that the mod order includes the whole lifetime, so this is trivially satisfied.Moravia
@PeterCordes All right, so the correct way to talk about what the possible mod orders are is according to the observable behavior (i.e. the values load operations read), using these outputs together with other coherence constraints to determine a plausible mod order (i.e. no contradiction), and not the other way around (i.e. not inferring what the outputs are by assuming a mod order). For example: //atomic<int> v = 0; //thread 1: v.store(1); //thread 2: v.exchange(2); //t3: v.store(3);....Calan
The correct way to figure out the mod order for the program is: if v.exchange in t2 reads 1, the order may be {0,1,2,3} or {0,3,1,2}. That is, the mod order of v over its whole lifetime is inferred from the observable behavior (i.e. the value read by the load part of the RMW operation), instead of first assuming the order would be {0,1,3,2} and then inferring that the load part of v.exchange read 3, right?Calan
@xmh0511: That method works in this case. In many more-complex cases, it would be a lot of work to come up with all the possible mod-orders compatible with a small set of observations, but yes that's always possible in theory. And yeah, with the only way for a 2 to appear in the mod-order being the store side of an RMW, we can just look for it in the mod order and the previous value must be what it loaded, that's [atomics.order] p10.Moravia
@Calan Personally, I often tend to think in terms of coherent cache existing and threads reordering accesses to it when thinking about what orders are possible, but of course that's not how the C++ formalism works, and it can potentially rule out some really weird stuff that is allowed on paper in the C++ abstract machine but can't plausibly happen on any normal hardware. Finding a correctness problem in terms of that model (or of how C++ compiles to AArch64 asm) can prove that a bug or interesting effect exists (assuming the C++ to asm mapping has no defects), but can't prove correctness.Moravia
@PeterCordes So, what is the formal way to prove correctness? Is it by observing behavior and reasoning out a mod order that has no contradiction?Calan
@xmh0511: To prove your algorithm does what you want and that there are no legal execution orders that would be a problem, normally you look at happens-before relationships. Formal verification to prove correctness is normally done by machines, where I guess you tell it what's ok and what isn't. For humans it's usually too complex to write proofs for programs that call functions in loops, instead we usually make arguments about allowed orderings. (Sometimes in terms of whether one operation can reorder with another, despite the C++ formalism not working that way.)Moravia
@PeterCordes Happens-before sometimes depends on synchronization; for atomic objects, that depends on the release sequence, which in turn depends on the modification order, so how a modification order is inferred for a program is crucial here.Calan
@xmh0511: The constraints on possible mod orders can be crucial, but you can talk about that without ever talking about a partially-constructed one. To talk about what values a certain load is allowed to see, given other things seen by earlier acquire loads and so on, you also need to look at ordering between values produced by ops on different variables, and that's usually the trickier part. If you do multiple accesses to the same variable in a single thread, the coherence rules are pretty straightforward about seeing the same or a later value from loads, etc.Moravia
@PeterCordes So, for figuring out the possible mod orders for a single atomic object, we can count all modifications of that atomic object in the whole program, pick out the unconstrained side effects and permute them to form several candidate mod orders, then according to the observable behavior/values, arrange the relevant **constrained side effects** (i.e. constrained by [intro.races] coherence and other rules) into these candidate mod orders, which finally gives all the possible mod orders; this is my approach.Calan
@xmh0511: yup, or enumerate all permutations and rule out ones forbidden by any other ordering constraints. Although keep in mind that many real programs involve loops so the length of a modification order is unbounded. (And more relevantly, so is the number of executions of one store that come between two other stores of interest.) Much more often, the useful thing to discuss is whether anything constrains a modification to be before or after another in the mod order of the object they modify, not the complete set of possibilities for mod orders. And it's rare to have unconstrained stores.Moravia
(I'm talking about things like writing an SPSC or MPMC queue using a circular buffer, real-world code trying to implement a useful algorithm. With tiny programs like the examples we've used in this Q&A, it is practical to enumerate all valid mod orders.)Moravia
@PeterCordes Ok. Since the mod order is irrelevant to the timeline, consider this example godbolt.org/z/6Ys4xc15W. From the perspective of the C++ abstract machine, can the output be t3 read 1, then a wait, and after a long time t2 read 0, such that the mod order is {0, 2, 1, 3}?Calan
@xmh0511: Yeah, obviously. That's possible with sequential consistency, just simple interleaving of statements from different threads. Which is plausible with long-enough delays for different threads at the right moments. e.g. a big delay between val.exchange(2) and calling cout::operator<<(const char*) to print the start of that line of output. And a bigger delay before val.store(1), then finally val.exchange(3) running last but its cout<< calls running first. Delays this long would be very unlikely on real systems unless single-stepping each thread with a debugger.Moravia
@PeterCordes I am curious why you say that the mod order does not grow over time? Obviously, the mod order is just some abstract concept that is used to specify the semantics of the memory model and has no real representation in actual hardware. I actually picture it as something that evolves over time - that is, any operation that modifies a value adds an entry to the mod order. I agree that there is no language lawyer basis for this, but I would argue that there is also nothing to refute it. The only relevant question is whether this model conforms to the requirements imposed by the standardAngstrom
@mpoeter: I used to picture it that way until I realized that multiple threads can have stores in flight which have all been read (by store-forwarding) before their relative order in the mod-order is known (when they commit to cache). It doesn't make sense to talk about loads reading values that aren't in the mod order. I guess you could talk about the mod order growing as stores become visible to their own thread, using future knowledge to know whether to append or insert before some other in-flight stores. But I don't see why that's useful vs. a mod order with a complete view of the future.Moravia
@PeterCordes Thanks very much. Asking a variant version question std::atomic<bool> flag = {false}; std::atomic<int> val = {0}; std::thread t1([&](){ val.store(1,std::memory_order::seq_cst); // #1 flag.store(true,std::memory_order::relaxed); // #2 }); std::thread t2([&](){ while(!flag.load(std::memory_order::relaxed)){} // #3 auto r = val.exchange(2,std::memory_order::seq_cst); // #4 assert(r!=0); // #5 }); t1.join(); t2.join(); change the memory order of 1 and 4 to seq_cst, can the assertion fails?Calan
@xmh0511: the code in that comment is an unreadable mess. Use shorthand like r = v.xchg(2, SC) if you must put code into comments.Moravia
@PeterCordes See godbolt.org/z/cdfcjoKKE; the memory orders of #1, #2 and #4 in the original example are changed to seq_cst but #3 remains unchanged. Can the assertion fail in the new example?Calan
@xmh0511: On real hardware probably not. In the abstract machine, yes, I think so. #3 isn't an acquire load so it can let later operations happen before eventually getting a value for flag and confirming the branch-prediction for leaving the loop. (I know the abstract machine is only described by the C++ formalism in terms of ordering and stuff, but imagine an actual CPU that allows speculative exec of atomic RMWs.)Moravia
@PeterCordes "multiple threads can have stores in flight which have all been read (...) before their relative order in mod-order is know". I don't see how this conflicts with a growing mod order. See my comment #79003517Angstrom
Even if you have multiple in-flight stores, the values that have been read already must be consistent with some total mod order. It's just that threads have different views on that order and may not yet see all values. It's not like anyone can "see" the history of the MO, but all threads must agree on the MO in the sense that the values they read are consistent with the order in which they appear in MO. A value becoming visible to a thread basically means that the mod order of this value is known to this thread up to this change; any later read must return this or a newer value.Angstrom
@mpoeter: I don't see how it's easier to work with or think about different threads seeing different partially-constructed mod-orders than to just talk about the eventual mod-order for the lifetime of a variable, and for any given load talk about what values from the mod order are actually visible to them because of other constraints or timing. Sure it sounds easy to talk casually about partial mod orders, but trying to figure out how it fits into the existing C++ formalism is tricky, and there are probably corner cases that are hard or weird for any attempt to codify any rules.Moravia
For me the concept of partially-constructed mod-orders comes naturally when reading the standard and how it defines the semantics based on the modification order. But as I said, the modification order is just an abstract concept, so everybody is free to use whatever mental model works best for him/her as long as that model conforms to the requirements imposed by the standard. :) In either case, I do not want to derail from the original question. I am happy to continue this discussion, but perhaps in a different setting.Angstrom
@mpoeter: Well that's what I used to think, in terms of mod-orders growing, until this question made me think about it. Now I think that's not the model the standard was proposing, as I explained in this answer. And I think that growing model has major disadvantages which I've already discussed in comments here. IDK if I should expand the answer to try to make a case for it; being in the mod order never guaranteed visibility so I don't see a downside to either only talking about mod order extending all the way into the future, or only after its order is established.Moravia
@PeterCordes IDK if I should expand the answer to try to make a case for it; It would be better to extend the answer. The original post's confusion is about how to think about the modification order of an atomic object: whether to imagine it as partially constructed, growing as the program runs, or as fixed once the whole program has executed. For example, godbolt.org/z/3crE3P648, the assertion on the CAS operation can fail; intuitively, we thought the expected value should be the latest one set before (on the timeline)...Calan
... and the "stale" value(stored before the latest store on the timeline) shouldn't be read during the atomic comparison, however, it's not true, the timeline is just intuitive to human, and the mod order does not work in that way, IIUC. Moreover, if we consider the mod order partially constructed exists, then we will get the confusion concept that the latest side effect and relative "stale" side effect(i.e. preceded the latest one in the snippet) in that partially constructed mod order. I think we can extend the answer to clarify these points.Calan
@xmh0511: My answer does say "I think the intended way to look at a modification orders is that it covers the whole lifetime of a variable. Don't look at it as growing as the program runs." I don't know why people were reluctant to accept that since it seems to make sense of all the problems for me. So IDK what more to say that would convince people who've already read this answer that's already pretty long.Moravia
@PeterCordes I just reread your answer to try and fully understand your reasoning, but I am confused about this paragraph: "But t1 could have done tmp = val.load(relaxed) and bar.store(tmp, relaxed) after the other 2 stores, which can reload its own val=1 store via store-forwarding (...)" Where is this bar coming from? It is not part of the original question, so I am a bit lost here.Angstrom
@mpoeter: It's another atomic global variable, which is implied by the fact that I'm talking about doing .store() on it and that another thread can see it. (I chose bar instead of foo since there was already a variable whose name started with f.) The val.load alone is enough to show the store being seen by the thread that did the store, but storing that value and making it visible to the other thread proves this could be before flag or val were visible to that thread, and thus it sees something dependent on a value in the mod-order of val which it can't see directly yet.Moravia
Ok, so if I understand you correctly you mean something like t1: val.store(1); flag.store(true); bar.store(val.load()); and t2: while(!flag.load()); bar.load(); val.exchange(2); where the bar.load() can return 1 while the val.exchange still returns 0. But I don't see the problem with this. Like for the stores in t1, the load and the exchange are not ordered by a happens-before relation. It just means that the xchg has to be ordered before the store in val's mod order. Unfortunately SO comments are fundamentally ill suited for such a discussion. :(Angstrom
@Angstrom Yes, that's what I meant. And there's no problem with it if you're thinking about mod orders correctly and reading [atomics.order] p10 correctly. But if you aren't, e.g. if you think p10 says that RMWs read "the latest value in the mod order" (absolute statement, not relative to the store side), then it forces you to rethink that, or do some mental gymnastics to justify a value already having been read from the mod order but not be visible to this thread's RMW. This is part of my argument in favor of the mod order including all past and future modifications.Moravia
@PeterCordes But if you aren't, e.g. if you think p10 says that RMWs read "the latest value in the mod order" Makes sense. Even if we think of it as "the latest value in the partially constructed mod order", it will be wrong, since bar.load() == 1 implies 1 is already in that "partially constructed mod order", and the RMW should then read this value. So, thinking of the mod order as growing while the program runs does not make sense; otherwise we cannot explain why the RMW in the same thread may not read 1.Calan
@xmh0511: Yes, exactly.Moravia
I think we are largely on the same page, but not completely. :D As pointed out, all operations are relaxed, so there are no orderings whatsoever between any of them. The modifications to val happen in some total order. It is potentially completely nondeterministic which order that is; the only requirement is that threads agree on an order. The introduction of bar doesn't change any of that. It may seem that because we write to bar after val in t1 and read from bar before val in t2 that this somehow affects the mod order of val, but it doesn't.Angstrom
I cannot go into the specifics of how this works in hardware (that's @PeterCordes domain), but the important detail here is that nothing is ordering all these operations. So effectively the compiler would be free to reorder the operations so that now t2 reads from bar after val, which would make it perfectly plausible that it would observe a value that was written after t1 wrote a value to val that is ordered after our xchg in val's mod order. Whether such a scenario is also possible without compiler reorderings and only due to ISA implementation details is not really relevant.Angstrom
IMO "Atomic RMW operations shall always read the last value (in the modification order) written before the write associated with the RMW operation" is pretty clear. The last value written before the write associated with our RMW is the value that precedes our own write in the mod order. Everything else would mess up the semantics of atomic RMW operations. "Read-modify-write" implies that we read something and write back a modified value, so obviously whatever value we read cannot be ordered after our own write.Angstrom
@mpoeter: "Atomic RMW operations shall always read the last value (in the modification order) written before the write associated with the RMW operation" is pretty clear. - Yes, totally clear when you actually read the standard's wording. But there's a common misconception based on summarizing it as "the latest value" (implying some sort of absolute sense, not just relative to another value in the mod order). This is what I'm arguing against.Moravia
@mpoeter: You're 100% correct that the introduction of bar doesn't change anything about val except that another thread has already a value in its mod order after the one the RMW reads. This is a proof by contradiction against the idea that a partially-constructed mod-order (which all threads still agree on during its construction) can explain everything. It removes the possibility of arguing that the store hasn't entered the mod order until it's visible to all threads.Moravia
"there's a common misconception based on summarizing it as "the latest value" (implying some sort of absolute sense, not just relative to another value in the mod order)." I suppose I just don't understand how this could be misinterpreted. ;) "This is a proof by contradiction against the idea that a partially-constructed mod-order (which all threads still agree on during its construction) can explain everything. It removes the possibility of arguing that the store hasn't entered the mod order until it's visible to all threads." I am not sure I fully understand. Should we take this to chat?Angstrom
Let us continue this discussion in chat.Moravia
@PeterCordes Maybe a nit-pick on the example But t1 could have done tmp = val.load(relaxed) and bar.store(tmp, relaxed) after the: [intro.races] p18 says the load operation can read the value stored by a side effect that happens before the load, which means tmp can read val.store(1) without requiring that the modification 1 be in the modification order of val (i.e. happens-before is sufficient). So, a crucial precondition here is that the side effect is in the modification order after the associated store happens...Calan
In other words, from the perspective of the C++ abstract machine, once a store is executed, its modification is immediately within the modification order, unlike a real CPU, where the side effect of an executed store can remain uncommitted to coherent cache (on a normal CPU), after which commit it is considered within the mod order.Calan
@xmh0511: I'd prefer to say that the C++ mod order "knows the future" when talking about real CPUs, not that stores can have happened and be visible without being part of the mod order. The mod order matches cache-coherence order which isn't determined until commit to cache. (coherence-ordered before is a thing in the C++ standard: eel.is/c++draft/atomics.order#3.2 - the definition isn't interesting but the name is.) In a case like with tmp and bar on a real CPU, the val = 1 via store-forwarding is later in the mod order than anything that's already committed to cache.Moravia
@PeterCordes Ok, from the perspective of the pure C++ standard, when a store operation is expected, the side effect produced by it is within the modification order, right?Calan
@xmh0511: Yes, the mod order includes all past and future modifications for this execution of the program, whether they've "happened" or not. The C++ formalism isn't designed to make it intuitive to think about taking a snapshot of a moment in time across all threads, and an all-inclusive mod-order seems like the easiest way to think about it and avoid misleading ideas about visibility.Moravia
@PeterCordes A typo in my above comment: from the perspective of the pure C++ standard, when a store operation is ~expected~ executed, the side effect produced by it is within the modification order, right?Calan
@PeterCordes 1.) Coming back to the answers here, consider this example; the output can be failure, current = 2 / failure, current = 3 / now: 4, or just now: 4. Since the mod order includes all past and future modifications...Calan
... 2.) Assuming the modification order excluding the modification of the RMW is {2, 3}: for the first outcome, could the reason be that the load part of the RMW reads a "stale" value (i.e. 2) that is not equal to the expected 3, so expected is set to 2; in the second comparison its load part reads the latest value 3 and the comparison fails again; and the third time, the comparison succeeds, so it writes the value 4?Calan
@xmh0511: I've already answered tons of followup questions in comments here. I'm not particularly interested in yet another one. Is there something that makes this test-case interesting, or at least a way you can simply summarize a key question?Moravia
@PeterCordes I just want to know, if we consider the mod order to include all past and future modifications at the end of program execution, how we use this concept to explain the outcome failure, current = 2 failure, current = 3 now: 4, thanks. I am trying to use this mental model to understand the example.Calan
Ok, that's reasonable, I wish you'd explained it that way in the first place instead of just launching into a bunch of technical details. As I said before, "stale" values aren't a useful concept either. A load might or might not see the value from a store that happens at a nearby time, depending on timing and inter-thread latency of the hardware, but that doesn't make a value "stale" either way; it just means the load (or RMW) ended up coherence-ordered before the store. Here, the CAS can go first and see 2, then during the slow cout<< the v2.store(3) becomes visible, thus another CAS failure.Moravia
If you didn't have such slow function calls in the CAS retry loop, you could also see executions where cas(2,4) succeeded after the first CAS updated expected to 2. But that's not practically possible since you start both threads at almost the same time, and std::cout<< is plenty of time for t2 to have always finished before the next CAS. I don't see what question there is about the mod order. It's {2,3,4} for the execution where CAS fails twice, and for executions where t2's store goes first.Moravia
Without the slow cout<< in the retry loop, {2,4,3} could also happen, with now: printing 4 or 3, if the store of 3 ever became visible during the short time window between CAS(2,4) success and the reload. All these operations are seq_cst so there's nothing tricky at all about mod orders.Moravia
@PeterCordes "stale" may have an ambiguous meaning in this context, the "stale" value I mentioned in this example is intended to refer to the value RMW read that is not the last value other than its written one in the modification order. The subject of this answer suggests that we should think mod order as the one including past and future side effects at the end of the program. For the multiple retry loop, if excluding its written one, the eventual mod order is {2, 3}, the first time cas read 2 that is a "stale" value in the mod order, and the second time it reads 3...Calan
... which is the new value other than its written one, the complete mod order is {2, 3, 4} for the first outcome. This is what I am thinking about how to consider modification order for the atomic object with the mental model that modification order includes all past and future modifications. Is it reasonable?Calan
@xmh0511: That definition of stale makes no sense. In a long-running program, pretty much every load result would be "stale" because the atomic object is going to change value again later, so there are lots of later values in the modification order. (Except cases like this where the object lifetime ends after a couple mods). Even with a different name that didn't have negative connotations, not being the very end of the mod order (except for the write side for an RMW) isn't a useful thing to talk about with mod orders that extend far into the future.Moravia
@xmh0511: Also note that a failed CAS is not an RMW operation, it's actually just a load. So there is no write side. The 3 is from a later CAS. IDK if you're wanting to call the 2 stale because it's not adjacent to the 4 that a later CAS will write? Or just because it's a pure load not an RMW? If your program hadn't done the later CAS at all, the first CAS would still fail and could still have read a 2 without storing anything. CAS retry loops aren't a primitive operation in the C++ formalism, they're a software construct.Moravia
@PeterCordes IDK if you're wanting to call the 2 stale because it's not adjacent to the 4 that a later CAS will write? Yes, I meant it. The "stale" value is only for this example. That is, the first failed CAS purely loads the "stale" value 2 since the expected 3 was or will be in the modification order from the perspective of including all modifications in the whole lifetime of the atomic object, that is, the CAS at that moment didn't see the future 3 or saying it just read the "stale" value 2 such that the comparison is a failure.Calan
@xmh0511: Since you asked how to think about this with a mod-order that includes the future, no, "stale" is not a useful concept. The 2 isn't stale, it's just the value that was available when this load ran. The way I think about it, the load ran too early to see the store, rather than saying the store was visible too late. Stores become visible on their own, and you can't really do anything to speed it up. Just retry a load or do something else if you didn't see a value you were waiting for. (Or do something with whatever you did load in use-cases that don't need to spin-wait.)Moravia
@PeterCordes The way I think about it, the load ran too early to see the store, rather than saying the store was visible too late. However, [atomics.order] p11 does imply that a store can/may be invisible to a load within a time lag. Moreover, [intro.races] p14 and [intro.races] p18 do not require a produced side effect to be immediately/instantaneously visible to a loadCalan
@xmh0511: Yes, of course. Unlike with a thread reloading its own stores, they become visible to other threads after some time. The point at which that happens is the point at which a load in the other thread will see it. If the load runs before that, it's coherence-ordered before the store and doesn't see it. If it runs after, it's coherence-ordered after and does. Both are possible, neither are inherently "bad". You have to design lock-free algorithms to work and perform well even though inter-thread latency is a thing.Moravia
@PeterCordes If the load runs before that, it's coherence-ordered before the store and doesn't see it. If it runs after, it's coherence-ordered after and does. What do these emphasized words refer to? From the perspective of implementations, they presumably refer to order in time, but from the perspective of the C++ standard, what do they refer to? After all, the "or" in [intro.races] p18 implies that the load can still read the happened-before side effect many times instead of the later side effects in the modification order...Calan
... this conclusion can embody the lag time at the implementation level. However, from the abstract machine, there is no concept of time. So, do these before and after only mean the order of coherence-order at the C++ standard's level?Calan
@xmh0511: Correct, from the standard's perspective, the only question is whether the load is coherence-ordered before or after the store. That's up to implementation details and timing so the standard has nothing more to say about it, and it doesn't matter (from the standard's perspective) exactly why a load didn't see a store (even one that's already been seen by its own or some other threads).Moravia
@PeterCordes Thanks. So, saying that a load that is coherence-ordered before a store operation saw a "stale" value is meaningless when we artificially consider that invisible store a fresh value. BTW, IIUC, the definition of coherence-order is just used to determine the single total order in [atomics.order] p4; can this concept be used to properly/independently describe this case that has no single total order? Moreover, the notes around [intro.races] p18 mention write-write coherence and so forth; is this coherence the same concept as coherence-order?Calan
@xmh0511: Yes, the definition of coherence-ordered-before involves the same sorts of things as the coherence rules like write-read. So yes I think it's fair to say it's the same concept, despite the standard not formally connecting those definitions. And the name is evocative of actual implementation details (cache coherence) that lead to loads seeing stores or not, which is part of why I chose to use that term. Without seq_cst imposing more limits on coherence order, other things indirectly restrict it by e.g. requiring that some loads see some stores without mentioning coherence-orderedMoravia
@PeterCordes And the name is evocative of actual implementation details (cache coherence) that lead to loads seeing stores or not I think this is the underlying logic why [intro.races] p18 uses "or" to relax the requirement. v =0; //thread 1: sleep(1000s); v.load(); //thread 2: v.exchange(1); On a sane implementation, the store in t2 should be seen by load in t1 after 1000 seconds. However, on an insane implementation, after 1000 seconds, the load can still not see the value 1(no other extra requirements or limits here to impose load must see the store, only p14 and p18)..Calan
...So, [intro.races] p18 starts from an abstract point of view to describe the behavior of such two possible implementations. The "or" makes the second implementation theoretically validCalan
@xmh0511: The "or" in the write-read coherence rule isn't just to allow for inter-thread latency! It's for cases like atomic_int v=0, flag=0; t1: v=1; v=2; flag=1 (release); t2: spin-wait(flag) (acquire); v.load(); Both the v=1 and v=2 modifications happen-before t2's v.load(). The load is required to see v=2 (or later if a third thread did something) to satisfy intro.races p13.2 and write-write coherence, etc. Write-read coherence also has to be satisfied between the v=1 and v.load(), which it is by reading a later value in the mod order.Moravia
@PeterCordes Yes, I see. I just meant how "or" is applied for this single case(i.e. inter-thread latency). Thanks for your clarification. We are paging you for answering this question #79118590 ^_^Calan
