(I'm answering the updated question; Nicol answered the original question, which specified "after" in C++ "happens-before" terms, including synchronization, which means the reader is guaranteed to see stuff the writer did. Not that they're running in lock-step, cycle for cycle; C++ doesn't have any notion of "cycles".)
I'm answering for how C++ runs on normal modern CPUs. ISO C++ of course says nothing about CPU architecture, other than mentioning (in a note) that normal hardware has coherent caches, when explaining the purpose of the atomic<> coherence guarantees.
"By before, I mean in the absolute sense of time."
If you mean the store becomes globally visible just before the load executes, then yes, by definition the load will see it. But if you mean "execute" in the normal computer-architecture sense, then no, there's no guarantee. Stores take some time to become visible to other threads if they're both running simultaneously on different cores.
Modern CPUs use a store buffer to decouple store execution from visibility to other cores, so execution can be speculative and out-of-order without making that mess visible outside the core, and so execution doesn't have to stall on cache-miss stores. Cache is coherent; you can't read "stale" values from it, but it takes some time for a store to become visible to other cores. (In computer-architecture terminology, a store "executes" by writing data+address into the store buffer. It becomes globally visible after it's known to be non-speculative, when it commits from the store buffer to L1d cache.)
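A minimal sketch of the consequence (variable names and structure are my own, not from the question): even after the writer's store has "executed" into the store buffer, the reader can still load the old value, because the store may not have committed to cache yet. Both outcomes are allowed:

    #include <atomic>
    #include <thread>

    std::atomic<int> x{0};

    int main() {
        std::thread writer([] {
            x.store(1, std::memory_order_relaxed);   // "executes" by entering the store buffer
        });
        std::thread reader([] {
            // May see 0 or 1: the store only becomes visible to this core once it
            // commits from the writer core's store buffer to L1d cache.
            int v = x.load(std::memory_order_relaxed);
            (void)v;
        });
        writer.join();
        reader.join();
    }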
A core needs to get exclusive ownership of a cache line before it can modify it (MESI Exclusive or Modified state), so it will send out an RFO (Read For Ownership) if it doesn't already own the line when it needs to commit a store from the store buffer to L1d cache. Until a core sees that RFO, it can keep letting loads read that line (i.e. "execute" loads - note that loads and stores are fundamentally different inside a high-performance CPU, with the core wanting load data as early as possible, but doing stores late).
Related: the store buffer is also how you get StoreLoad reordering if thread 1 also does some later loads, even on a CPU with a strongly-ordered memory model like x86 that maintains the illusion of everything happening in program order, with the store buffer as the one exception.
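A sketch of the classic store-buffering litmus test (my own variable names; relaxed atomics so nothing forbids the reordering): each thread's store can still be sitting in its core's store buffer when its later load executes, so both loads reading 0 is a legal outcome.

    #include <atomic>
    #include <thread>

    std::atomic<int> X{0}, Y{0};
    int r1, r2;

    int main() {
        std::thread t1([] {
            X.store(1, std::memory_order_relaxed);
            r1 = Y.load(std::memory_order_relaxed);   // can take its value before X=1 is visible anywhere
        });
        std::thread t2([] {
            Y.store(1, std::memory_order_relaxed);
            r2 = X.load(std::memory_order_relaxed);   // can take its value before Y=1 is visible anywhere
        });
        t1.join();
        t2.join();
        // r1 == 0 && r2 == 0 is an allowed result: that's StoreLoad reordering.
    }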
Memory barriers just order this core's operations wrt. each other; for example, a full barrier blocks later loads from executing until earlier stores and loads have executed and the store buffer has drained up to the point of the barrier, so it contains only later stores, if anything.
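Applied to the litmus test above (a sketch, with my own names): adding a full barrier between each store and the following load rules out the both-zero outcome, because each load can't take its value until the earlier store is globally visible.

    #include <atomic>
    #include <thread>

    std::atomic<int> X{0}, Y{0};
    int r1, r2;

    int main() {
        std::thread t1([] {
            X.store(1, std::memory_order_relaxed);
            // Full barrier: on typical hardware, waits for the store buffer to
            // drain before the later load can execute.
            std::atomic_thread_fence(std::memory_order_seq_cst);
            r1 = Y.load(std::memory_order_relaxed);
        });
        std::thread t2([] {
            Y.store(1, std::memory_order_relaxed);
            std::atomic_thread_fence(std::memory_order_seq_cst);
            r2 = X.load(std::memory_order_relaxed);
        });
        t1.join();
        t2.join();
        // With both threads fenced, r1 == 0 && r2 == 0 is no longer a permitted outcome.
    }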
Barriers have no effect on whether another core sees a store or not, except given the pre-condition that the other core has already seen some other store. Then with barriers (or equivalently release/acquire) you can guarantee the other core will also see everything else from before the release store.
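For example (a sketch with my own names): the release/acquire pair doesn't make flag visible any sooner, but once the reader has seen flag == true, it's guaranteed to also see the data written before the release store.

    #include <atomic>
    #include <cassert>
    #include <thread>

    int data = 0;                      // plain non-atomic payload
    std::atomic<bool> flag{false};

    int main() {
        std::thread writer([] {
            data = 42;                                    // sequenced before the release store
            flag.store(true, std::memory_order_release);
        });
        std::thread reader([] {
            while (!flag.load(std::memory_order_acquire)) {}  // spin until we see the flag
            // Guaranteed: the release store synchronizes-with our acquire load,
            // so everything before it (data = 42) is visible here.
            assert(data == 42);
        });
        writer.join();
        reader.join();
    }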
Jeff Preshing's analogy of memory operations as source-control operations accessing a remote server is a useful mental model: you can order your own operations relative to each other, but requests in flight from different cores can hit the server (shared memory) in different orders.
This is why C++ only specifies visibility as "eventually" / "promptly", with a guarantee of seeing earlier stuff if you've already seen (with an acquire load) the value from a release store. (It's up to hardware what "promptly" means: typically under 100 ns on modern multi-core systems, depending on what exactly you're measuring, although multi-socket can be slower. See also: If I don't use fences, how long could it take a core to see another core's writes?)
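If you want a ballpark figure on your own machine, one crude (and definitely not rigorous) approach is to ping-pong a value between two threads through a single atomic and divide the total time by the number of round trips; the result depends heavily on the hardware, core placement, and what exactly you count as "latency". A sketch:

    #include <atomic>
    #include <chrono>
    #include <cstdio>
    #include <thread>

    std::atomic<int> token{0};
    constexpr int iters = 100000;

    int main() {
        std::thread other([] {
            for (int i = 0; i < iters; ++i) {
                while (token.load(std::memory_order_acquire) != 2 * i + 1) {}  // wait for ping
                token.store(2 * i + 2, std::memory_order_release);             // send pong
            }
        });
        auto t0 = std::chrono::steady_clock::now();
        for (int i = 0; i < iters; ++i) {
            token.store(2 * i + 1, std::memory_order_release);                 // send ping
            while (token.load(std::memory_order_acquire) != 2 * i + 2) {}      // wait for pong
        }
        auto t1 = std::chrono::steady_clock::now();
        other.join();
        std::printf("avg round trip: %.1f ns\n",
                    std::chrono::duration<double, std::nano>(t1 - t0).count() / iters);
    }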
Seeing the store itself (release, seq_cst, or even relaxed if you don't need to synchronize other loads/stores) either happens or not, and is what creates the notion of before/after between threads. Since CPUs can only see each other's operations via shared memory (or inter-processor interrupts), there aren't many good ways to establish any notion of simultaneity. It's very much like how relativity in physics makes it hard to say two things happened at the same time if they didn't happen in the same place: it depends on the observer because of delays in being able to see either event.
(On a machine such as a modern x86 with TSC synchronized between cores (which is common especially in a single-socket multi-core system, and apparently also most(?) multi-socket motherboards), you actually can find absolute timestamps to establish which core is executing what when, but out-of-order execution is still a big confounding factor. Pipelined CPUs make it hard to say exactly when any given instruction "executed". And since communication via memory isn't zero latency, it's not usually useful to even try to establish simultaneity this way.)
… memory_order_seq_cst … – Brunson
… join T1 then start T2? – Cachexia
… a_i, including 7. T2 will never see a value that was not stored (which could happen if a_i was not atomic). But there is no guarantee which of the values it will see if the surrounding code does not guarantee the order of the operations on a_i. This guarantee must be established using the happens-before (intra-thread) and synchronizes-with (inter-thread) relations between C++ expressions, and the latter is achieved using acquire and release operations. So you still need acquire/release ops somewhere in your code. – Politian