(I'm answering the updated question; Nicol answered the original question, which specified "after" in C++ "happens-before" terms, including synchronization, which means the reader is guaranteed to see stuff the writer did. Not that they're running in lock-step, cycle for cycle; C++ doesn't have any notion of "cycles".)
I'm answering for how C++ runs on normal modern CPUs. ISO C++ of course says nothing about CPU architecture, other than mentioning (in a note) that normal hardware has coherent caches, when explaining the purpose of the atomic<> coherence guarantees.
"By before, I mean in the absolute sense of time."
If you mean the store becomes globally visible just before the load executes, then yes, by definition the load will see it. But if you mean "execute" in the normal computer-architecture sense, then no, there's no guarantee. Stores take some time to become visible to other threads if they're both running simultaneously on different cores.
Modern CPUs use a store buffer to decouple store execution from visibility to other cores, so execution can be speculative and out-of-order without making that mess visible outside the core, and so execution doesn't have to stall on cache-miss stores. Cache is coherent; you can't read "stale" values from it, but it takes some time for a store to become visible to other cores. (In computer-architecture terminology, a store "executes" by writing data+address into the store buffer. It becomes globally visible after it's known to be non-speculative, when it commits from the store buffer to L1d cache.)
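A minimal sketch of the consequence (variable names and structure are my own, not from the question): even after the writer's store has "executed" into the store buffer, the reader can still load the old value, because the store may not have committed to cache yet. Both outcomes are allowed:

    #include <atomic>
    #include <thread>

    std::atomic<int> x{0};

    int main() {
        std::thread writer([] {
            x.store(1, std::memory_order_relaxed);   // "executes" by entering the store buffer
        });
        std::thread reader([] {
            // May see 0 or 1: the store only becomes visible to this core once it
            // commits from the writer core's store buffer to L1d cache.
            int v = x.load(std::memory_order_relaxed);
            (void)v;
        });
        writer.join();
        reader.join();
    }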
A core needs to get exclusive ownership of a cache line before it can modify it (MESI Exclusive or Modified state), so it will send out an RFO (Read For Ownership) if it doesn't already own the line when it needs to commit a store from the store buffer to L1d cache. Until a core sees that RFO, it can keep letting loads read that line (i.e. "execute" loads - note that loads and stores are fundamentally different inside a high-performance CPU, with the core wanting load data as early as possible, but doing stores late).
Related: the store buffer is also how you get StoreLoad reordering if thread 1 also does some later loads, even on a CPU with a strongly-ordered memory model like x86 that maintains the illusion of everything happening in program order, with the store buffer as the one exception.
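A sketch of the classic store-buffering litmus test (my own variable names; relaxed atomics so nothing forbids the reordering): each thread's store can still be sitting in its core's store buffer when its later load executes, so both loads reading 0 is a legal outcome.

    #include <atomic>
    #include <thread>

    std::atomic<int> X{0}, Y{0};
    int r1, r2;

    int main() {
        std::thread t1([] {
            X.store(1, std::memory_order_relaxed);
            r1 = Y.load(std::memory_order_relaxed);   // can take its value before X=1 is visible anywhere
        });
        std::thread t2([] {
            Y.store(1, std::memory_order_relaxed);
            r2 = X.load(std::memory_order_relaxed);   // can take its value before Y=1 is visible anywhere
        });
        t1.join();
        t2.join();
        // r1 == 0 && r2 == 0 is an allowed result: that's StoreLoad reordering.
    }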
Memory barriers just order this core's operations wrt. each other; for example, a full barrier blocks later loads from executing until earlier stores and loads have executed and the store buffer has drained up to the point of the barrier, so it contains only later stores, if anything.
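Applied to the litmus test above (a sketch, with my own names): adding a full barrier between each store and the following load rules out the both-zero outcome, because each load can't take its value until the earlier store is globally visible.

    #include <atomic>
    #include <thread>

    std::atomic<int> X{0}, Y{0};
    int r1, r2;

    int main() {
        std::thread t1([] {
            X.store(1, std::memory_order_relaxed);
            // Full barrier: on typical hardware, waits for the store buffer to
            // drain before the later load can execute.
            std::atomic_thread_fence(std::memory_order_seq_cst);
            r1 = Y.load(std::memory_order_relaxed);
        });
        std::thread t2([] {
            Y.store(1, std::memory_order_relaxed);
            std::atomic_thread_fence(std::memory_order_seq_cst);
            r2 = X.load(std::memory_order_relaxed);
        });
        t1.join();
        t2.join();
        // With both threads fenced, r1 == 0 && r2 == 0 is no longer a permitted outcome.
    }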
Barriers have no effect on whether another core sees a store or not, except given the pre-condition that the other core has already seen some other store. Then with barriers (or equivalently release/acquire) you can guarantee the other core will also see everything else from before the release store.
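For example (a sketch with my own names): the release/acquire pair doesn't make flag visible any sooner, but once the reader has seen flag == true, it's guaranteed to also see the data written before the release store.

    #include <atomic>
    #include <cassert>
    #include <thread>

    int data = 0;                      // plain non-atomic payload
    std::atomic<bool> flag{false};

    int main() {
        std::thread writer([] {
            data = 42;                                    // sequenced before the release store
            flag.store(true, std::memory_order_release);
        });
        std::thread reader([] {
            while (!flag.load(std::memory_order_acquire)) {}  // spin until we see the flag
            // Guaranteed: the release store synchronizes-with our acquire load,
            // so everything before it (data = 42) is visible here.
            assert(data == 42);
        });
        writer.join();
        reader.join();
    }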
Jeff Preshing's analogy of memory operations as source-control operations accessing a remote server is a useful mental model: you can order your own operations relative to each other, but requests in flight from different cores can hit the server (shared memory) in different orders.
This is why C++ only specifies visibility as "eventually" / "promptly", with a guarantee of seeing earlier stuff if you've already seen (with an acquire load) the value from a release store. (It's up to hardware what "promptly" means: typically under 100 ns on modern multi-core systems, depending on what exactly you're measuring, although multi-socket can be slower. See also: If I don't use fences, how long could it take a core to see another core's writes?)
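If you want a ballpark figure on your own machine, one crude (and definitely not rigorous) approach is to ping-pong a value between two threads through a single atomic and divide the total time by the number of round trips; the result depends heavily on the hardware, core placement, and what exactly you count as "latency". A sketch:

    #include <atomic>
    #include <chrono>
    #include <cstdio>
    #include <thread>

    std::atomic<int> token{0};
    constexpr int iters = 100000;

    int main() {
        std::thread other([] {
            for (int i = 0; i < iters; ++i) {
                while (token.load(std::memory_order_acquire) != 2 * i + 1) {}  // wait for ping
                token.store(2 * i + 2, std::memory_order_release);             // send pong
            }
        });
        auto t0 = std::chrono::steady_clock::now();
        for (int i = 0; i < iters; ++i) {
            token.store(2 * i + 1, std::memory_order_release);                 // send ping
            while (token.load(std::memory_order_acquire) != 2 * i + 2) {}      // wait for pong
        }
        auto t1 = std::chrono::steady_clock::now();
        other.join();
        std::printf("avg round trip: %.1f ns\n",
                    std::chrono::duration<double, std::nano>(t1 - t0).count() / iters);
    }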
Seeing the store itself (release, seq_cst, or even relaxed if you don't need to synchronize other loads/stores) either happens or not, and is what creates the notion of before/after between threads. Since CPUs can only see each other's operations via shared memory (or inter-processor interrupts), there aren't many good ways to establish any notion of simultaneity. It's very much like how relativity in physics makes it hard to say two things happened at the same time if they didn't happen in the same place: it depends on the observer because of delays in being able to see either event.
(On a machine such as a modern x86 with TSC synchronized between cores (which is common especially in a single-socket multi-core system, and apparently also most(?) multi-socket motherboards), you actually can find absolute timestamps to establish which core is executing what when, but out-of-order execution is still a big confounding factor. Pipelined CPUs make it hard to say exactly when any given instruction "executed". And since communication via memory isn't zero latency, it's not usually useful to even try to establish simultaneity this way.)
… memory_order_seq_cst … – Brunson
… join T1 then start T2? – Cachexia
… a_i, including 7. T2 will never see a value that was not stored (which could happen if a_i was not atomic). But there is no guarantee which of the values it will see if the surrounding code does not guarantee the order of the operations on a_i. This guarantee must be established using the happens-before (intra-thread) and synchronizes-with (inter-thread) relations between C++ expressions, and the latter is achieved using acquire and release operations. So you still need acquire/release ops somewhere in your code. – Politian