Is cache coherency only an issue when storing and not when loading?

I came across this code emission for x64 where "Atomic Load" uses a simple movq whereas "Atomic Store" uses xchgq.

This link explains that atomic loads/stores on aligned addresses are atomic by default. I'm assuming that's why the Atomic Load in the above link uses a simple movq.
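(For reference, a minimal C++ sketch of the kind of code that produces this pattern; this is my assumption of what the linked code roughly looks like, not the actual source from the link. The asm in the comments is what GCC/Clang typically emit at -O2 on x86-64.)

    #include <atomic>

    std::atomic<long> v{0};

    long atomic_load_it()  { return v.load(); }   // default memory_order_seq_cst
    void atomic_store_it() { v.store(1); }        // default memory_order_seq_cst

    // Typical x86-64 output (GCC/Clang, -O2):
    //   atomic_load_it:   movq v(%rip), %rax     ; a plain load is enough
    //   atomic_store_it:  movl $1, %eax
    //                     xchgq %rax, v(%rip)    ; implicitly locked, full barrier
    //   (some older compilers emit movq + mfence for the store instead)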

I have the following questions:

  • Is Atomic Store using an xchgq (which implies LOCK by default) to fix any issues with cache lines? Essentially, is it making sure all cache lines are updated properly? If cache lines weren't an issue, could they have just used movq?

  • Does it also mean cache coherency is only an issue when storing, since the load above is not using a locked instruction?

Countryside answered 30/1, 2023 at 19:22

No. seq_cst stores use xchg (or mov + mfence, but that's slower on recent CPUs) for ordering with respect to other operations. release or relaxed atomic stores can just use mov and will still be promptly visible to other cores (though not before later loads in this thread may already have executed).
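To make that difference concrete, here's a minimal sketch (the function names are mine); the asm in the comments is what current GCC/Clang typically emit for x86-64:

    #include <atomic>

    std::atomic<int> flag{0};

    void store_seq_cst() {
        flag.store(1, std::memory_order_seq_cst);
        // typically: movl $1, %eax ; xchgl %eax, flag(%rip)
        // (or movl + mfence with some older compilers; both act as a full barrier)
    }

    void store_release() {
        flag.store(1, std::memory_order_release);
        // typically: movl $1, flag(%rip)   -- a plain store, no barrier needed on x86
    }

    int load_seq_cst() {
        return flag.load(std::memory_order_seq_cst);
        // typically: movl flag(%rip), %eax -- a plain load already satisfies seq_cst
        // on x86 when the seq_cst *stores* carry the barrier
    }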

Cache coherence isn't the cause of memory reordering; that's local to each core. (For x86, the memory model is program order plus a store buffer with store-forwarding. It's the store buffer that causes stores to not become visible until after the store instruction has retired from out-of-order exec.)
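A sketch of the classic store-buffer litmus test (my own example, not from the question) shows the reordering that the store buffer allows; with anything weaker than seq_cst, both threads can end up reading 0:

    #include <atomic>
    #include <thread>

    std::atomic<int> x{0}, y{0};
    int r1, r2;

    void t1() {
        x.store(1, std::memory_order_release);   // may sit in this core's store buffer...
        r1 = y.load(std::memory_order_acquire);  // ...while this later load already executes
    }

    void t2() {
        y.store(1, std::memory_order_release);
        r2 = x.load(std::memory_order_acquire);
    }

    int main() {
        std::thread a(t1), b(t2);
        a.join();
        b.join();
        // r1 == 0 && r2 == 0 is a legal (and real) outcome here: each store can
        // still be in its core's store buffer when the other core's load runs.
        // Making all four operations memory_order_seq_cst forbids that outcome;
        // the xchg (full barrier) on the seq_cst store is what pays for it.
    }

(A single run will rarely show the reordered result; in practice you'd loop many iterations. The point is that release stores and acquire loads on x86 still compile to plain mov and permit this StoreLoad reordering, while seq_cst does not.)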

The answer you linked is somewhat misleading where it says "if I set this to true (or false), no other thread will read a different value after I've set it" and "that's not quite such a certainty - you need a 'lock' prefix to guarantee that". They mean that (implicit-lock) xchg includes a full memory barrier, so no code in the storing thread can access memory until after the store is actually committed to cache and globally visible.

A clearer way to state it is that it makes this thread wait, without doing anything, until the store is visible, i.e. it stalls this thread until the store buffer has finished committing all previous stores. That would eventually happen on its own anyway, so it's really about ordering of this thread relative to store visibility, not about other threads. Other threads (cores) can locally do their own early loading / late storing, although on x86 all loads happen in program order. That's why I commented on the answer you linked, to disagree with the way it was presenting things.


Fard answered 5/2, 2023 at 17:18 Comment(2)
Thank you so much. So it's all about memory ordering and has nothing to do with cache coherency. x86 loads are seq consistent by default, so LOCK prefixes are not needed; stores, however, do need a lock to make them sequentially consistent. – Countryside
@Dan: Yes, for seq_cst stores only. (That's the standard way of recovering sequential consistency on top of the x86 memory model; the other way would be a full barrier before every SC load, but cheap loads and expensive seq_cst stores are much better, which is why compilers / ABIs don't do it the other way. cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html) foo.store(1, memory_order_release) only needs mov, so prefer that unless you actually need seq_cst; it's much cheaper for the performance of surrounding code on most ISAs. – Fard
