Is cache coherency only an issue when storing and not when loading?

I came across this code emission for x64 where "Atomic Load" uses a simple movq whereas "Atomic Store" uses xchgq.

This link explains that atomic loads/stores on aligned addresses are atomic by default. I'm assuming that's why the Atomic Load in the above link uses a simple movq.
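(For reference, a minimal C++ sketch of the kind of code that produces this pattern; this is my assumption of what the linked code roughly looks like, not the actual source from the link. The asm in the comments is what GCC/Clang typically emit at -O2 on x86-64.)

    #include <atomic>

    std::atomic<long> v{0};

    long atomic_load_it()  { return v.load(); }   // default memory_order_seq_cst
    void atomic_store_it() { v.store(1); }        // default memory_order_seq_cst

    // Typical x86-64 output (GCC/Clang, -O2):
    //   atomic_load_it:   movq v(%rip), %rax     ; a plain load is enough
    //   atomic_store_it:  movl $1, %eax
    //                     xchgq %rax, v(%rip)    ; implicitly locked, full barrier
    //   (some older compilers emit movq + mfence for the store instead)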

I have the following questions:

  • Is Atomic Store using an xchgq (which implies LOCK by default) to fix any issues with cache lines? Essentially, is it making sure all cache lines are updated properly? If cache lines weren't an issue, could they have just used movq?

  • Does it also mean cache coherency is only an issue when storing, since the load above is not using a locked instruction?

Countryside answered 30/1, 2023 at 19:22

No. seq_cst stores use xchg (or mov + mfence, but that's slower on recent CPUs) for ordering with respect to other operations. release or relaxed atomic stores can just use mov and will still be promptly visible to other cores (though not before later loads in this thread may already have executed).
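To make that difference concrete, here's a minimal sketch (the function names are mine); the asm in the comments is what current GCC/Clang typically emit for x86-64:

    #include <atomic>

    std::atomic<int> flag{0};

    void store_seq_cst() {
        flag.store(1, std::memory_order_seq_cst);
        // typically: movl $1, %eax ; xchgl %eax, flag(%rip)
        // (or movl + mfence with some older compilers; both act as a full barrier)
    }

    void store_release() {
        flag.store(1, std::memory_order_release);
        // typically: movl $1, flag(%rip)   -- a plain store, no barrier needed on x86
    }

    int load_seq_cst() {
        return flag.load(std::memory_order_seq_cst);
        // typically: movl flag(%rip), %eax -- a plain load already satisfies seq_cst
        // on x86 when the seq_cst *stores* carry the barrier
    }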

Cache coherence isn't the cause of memory reordering; that's local to each core. (For x86, the memory model is program order plus a store buffer with store-forwarding. It's the store buffer that causes stores to not become visible until after the store instruction has retired from out-of-order exec.)
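A sketch of the classic store-buffer litmus test (my own example, not from the question) shows the reordering that the store buffer allows; with anything weaker than seq_cst, both threads can end up reading 0:

    #include <atomic>
    #include <thread>

    std::atomic<int> x{0}, y{0};
    int r1, r2;

    void t1() {
        x.store(1, std::memory_order_release);   // may sit in this core's store buffer...
        r1 = y.load(std::memory_order_acquire);  // ...while this later load already executes
    }

    void t2() {
        y.store(1, std::memory_order_release);
        r2 = x.load(std::memory_order_acquire);
    }

    int main() {
        std::thread a(t1), b(t2);
        a.join();
        b.join();
        // r1 == 0 && r2 == 0 is a legal (and real) outcome here: each store can
        // still be in its core's store buffer when the other core's load runs.
        // Making all four operations memory_order_seq_cst forbids that outcome;
        // the xchg (full barrier) on the seq_cst store is what pays for it.
    }

(A single run will rarely show the reordered result; in practice you'd loop many iterations. The point is that release stores and acquire loads on x86 still compile to plain mov and permit this StoreLoad reordering, while seq_cst does not.)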

The answer you linked is somewhat misleading where it says "if I set this to true (or false), no other thread will read a different value after I've set it" and "that's not quite such a certainty - you need a 'lock' prefix to guarantee that". They mean that (implicit-lock) xchg includes a full memory barrier, so no code in the storing thread can access memory until after the store is actually committed to cache and globally visible.

A clearer way to state it is that it makes this thread wait, without doing anything, until the store is visible, i.e. it stalls this thread until the store buffer has finished committing all previous stores. That would eventually happen on its own anyway, so it's really about ordering of this thread relative to store visibility, not about other threads. Other threads (cores) can locally do their own early loading / late storing, although on x86 all loads happen in program order. That's why I commented on the answer you linked, to disagree with the way it was presenting things.


Fard answered 5/2, 2023 at 17:18 Comment(2)
Thank you so much. So it's all about memory ordering and has nothing to do with cache coherency. x86 loads are seq consistent by default, so LOCK prefixes are not needed; stores, however, do need a lock to make them sequentially consistent. – Countryside
@Dan: Yes, for seq_cst stores only. (That's the standard way of recovering sequential consistency on top of the x86 memory model; the other way would be a full barrier before every SC load, but cheap loads and expensive seq_cst stores are much better, which is why compilers / ABIs don't do it the other way. cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html) foo.store(1, memory_order_release) only needs mov, so prefer that unless you actually need seq_cst; it's much cheaper for the performance of surrounding code on most ISAs. – Fard
