How is the transitivity/cumulativity property of memory barriers implemented micro-architecturally?

I've been reading about how the x86 memory model works and the significance of barrier instructions on x86, and comparing it to other architectures such as ARMv8. Both the x86 and ARMv8 memory models appear (no pun intended) to respect transitivity/cumulativity: if CPU1 sees stores by CPU0, and CPU2 sees stores by CPU1 that could only have occurred if CPU1 saw CPU0's stores, then CPU2 must also see CPU0's stores. The examples I'm referring to are examples 1 and 2 in section 6.1 of Paul McKenney's famous paper (relevant albeit old; the same material exists in his later perf cookbook): http://www.puppetmastertrading.com/images/hwViewForSwHackers.pdf. If I understand correctly, x86 uses store queues (or store order buffers) to order stores (and for other micro-architectural optimizations) before they become globally visible (i.e. written to L1D).

My question is: how does the x86 architecture (and other architectures) implement the transitivity property micro-architecturally? The store queue ensures that a particular CPU's stores become globally visible in a particular order, but what ensures that stores made by one CPU are ordered with respect to stores made by other CPUs?
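
To make the scenario concrete, here is a rough C++11 sketch of the example I have in mind (the names and the acquire/release choices are mine, not McKenney's):

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// WRC-style litmus test for the transitivity/cumulativity scenario.
std::atomic<int> a{0}, b{0};

void cpu0() {                                         // CPU0 stores a
    a.store(1, std::memory_order_release);
}

void cpu1() {                                         // CPU1 sees CPU0's store, then stores b
    while (a.load(std::memory_order_acquire) == 0) {} // wait until a == 1
    b.store(1, std::memory_order_release);
}

void cpu2() {                                         // CPU2 sees CPU1's store...
    while (b.load(std::memory_order_acquire) == 0) {} // wait until b == 1
    // ...so with a transitive/cumulative memory model (and these orderings)
    // it must also see CPU0's store to a; this assert must never fire.
    assert(a.load(std::memory_order_acquire) == 1);
}

int main() {
    std::thread t0(cpu0), t1(cpu1), t2(cpu2);
    t0.join(); t1.join(); t2.join();
    return 0;
}
```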

Versicular asked 19/9, 2019 at 20:23 Comment(1)
I had the exact same question when I read the paper. And just when I was done with my question formulation I saw your post. And better yet, the answer. SO rocks :) – Heathcote

On x86, there is only one coherency domain. Stores become visible to all other cores at exactly the same time, when they commit to L1d cache. That along with MESI in general is enough to give us a total store order that all threads can agree on.
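
Here's a toy sketch of that point, under the (big) simplification of a single cache line and a flat invalidate-everyone RFO; nothing here is modeled on a specific CPU, it just illustrates why commit-to-L1d under a MESI-style protocol makes a store visible to everyone at once: the committing core must first gain exclusive ownership of the line, which invalidates every other copy, so no stale copy survives for another core to read.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Toy model, not any real CPU's implementation: one cache line, one copy per core.
enum class State { Invalid, Shared, Exclusive, Modified };

struct CacheLine {
    State         state = State::Invalid;
    std::uint64_t data  = 0;
};

constexpr int kCores = 4;
std::array<CacheLine, kCores> l1d;   // each core's private L1d copy of "the line"

// Read-For-Ownership: invalidate every other copy, then own the line exclusively.
void rfo(int core) {
    for (int c = 0; c < kCores; ++c)
        if (c != core) l1d[c].state = State::Invalid;
    l1d[core].state = State::Exclusive;
}

// Commit a retired store from the store buffer into L1d.
void commit_store(int core, std::uint64_t value) {
    if (l1d[core].state != State::Exclusive && l1d[core].state != State::Modified)
        rfo(core);                   // can't write a line that's Shared or Invalid
    l1d[core].data  = value;
    l1d[core].state = State::Modified;
    // From here on, any other core's load of this line misses and has to fetch
    // the up-to-date copy from this core, so all cores see the new value "at once".
}

int main() {
    l1d[0].state = l1d[3].state = State::Shared;   // stale copies in two other cores
    commit_store(2, 42);                           // core 2 commits a store
    assert(l1d[0].state == State::Invalid && l1d[3].state == State::Invalid);
    assert(l1d[2].state == State::Modified && l1d[2].data == 42);
    return 0;
}
```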

A few ISAs (including PowerPC) don't have that property (in practice because of store-forwarding of retired stores between SMT threads within a physical core). So mo_relaxed stores from 2 threads can in practice be seen in different orders by 2 other readers on POWER hardware: Will two atomic writes to different locations in different threads always be seen in the same order by other threads? (Presumably barriers on PowerPC block that forwarding.)
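
For reference, a minimal C++ sketch of that IRIW litmus test (names and result variables are illustrative; a real litmus harness would run this millions of times and tally the observed outcomes):

```cpp
#include <atomic>
#include <thread>

// IRIW litmus test: two independent writers, two independent readers.
std::atomic<int> x{0}, y{0};
int r1, r2, r3, r4;

void writer_x() { x.store(1, std::memory_order_relaxed); }
void writer_y() { y.store(1, std::memory_order_relaxed); }

void reader_a() {                          // reads x, then y
    r1 = x.load(std::memory_order_acquire);
    r2 = y.load(std::memory_order_acquire);
}

void reader_b() {                          // reads y, then x
    r3 = y.load(std::memory_order_acquire);
    r4 = x.load(std::memory_order_acquire);
}

int main() {
    std::thread t0(writer_x), t1(writer_y), t2(reader_a), t3(reader_b);
    t0.join(); t1.join(); t2.join(); t3.join();
    // The interesting outcome is r1==1, r2==0, r3==1, r4==0: the two readers
    // disagree about the order of the two stores.  Multi-copy-atomic hardware
    // (x86, ARMv8) never produces it; POWER can with these orderings.  Making
    // the loads seq_cst forbids it in C++ (the usual POWER mapping puts a full
    // sync barrier in front of each such load).
    return 0;
}
```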

The ARM memory model used to allow this IRIW (Independent Reader Independent Writer) reordering, but in practice no ARM HW ever existed that did it. ARM was able to strengthen their memory model to guarantee that all cores agree on a global order for stores done by multiple other cores.

(Store forwarding still means that the core doing the store sees its own store right away, long before it becomes globally visible. And of course load ordering is required for the reading cores to be able to make any claim about the order in which they observed the independent writes.)


If all cores must agree on the global ordering of stores, then (in your example) seeing the store from Core2 implies that Core1's store must have already happened, and that you can see it, too.

(Assuming that Core2 used appropriate barriers or acquire-load or release-store to make sure its store happened after its load that saw Core1's store.)


Sitra answered 19/9, 2019 at 20:55 Comment(6)
Recently, ARM has decided to specify that their model is multicopy atomic, so I think IRIW is no longer possible there. – Rancor
This does answer my question, I think. To clarify: transitivity and TSO (at least on x86) are effectively implemented using MESI, since any write that commits to L1D will RFO the cache line, effectively invalidating/removing all other copies of it. So if any CPU (other than the committing CPU) is able to read the particular location, all CPUs will be able to read it, i.e. there is no delay in propagation of a globally-visible store between different CPUs in this case. Is my understanding correct? – Versicular
@BeeOnRope, correct. ARM did move to other-multi-copy-atomicity (based on the paper you have linked in different answers). Most ARM systems use MESI/MOESI/AMBA-type protocols which RFO a cache line for a store. Based on the answers to this question and all the linked questions, using such protocols generally removes the possibility of stores being seen by different CPUs in different orders, i.e. transitivity of stores is guaranteed (ignoring the PowerPC case of store-to-load forwarding between logical cores). Is that a fair statement? – Versicular
@Raghu: All mainstream ISAs use MESI (or a variant of it), creating a single coherence domain. And yes, that's why commit to L1d makes a store globally visible to all other cores at the same time. The only mechanism for other cores to read that line is by sending requests to Share that Modified line. (Except for PowerPC's store-forwarding between SMT threads.) x86 gets TSO by also restricting the order of store commits to program order within each core. (TSO is a stronger guarantee than the mere existence of a global order for all stores.) – Sitra
@PeterCordes Makes sense. Thanks again for your response here and in all the other questions. Last question: is it fair to say that ARM implementations can potentially optimize or have a simpler store-buffer implementation, since they don't need to ensure that stores are committed in program order unless they see a barrier, whereas x86 HAS to since that is the published memory model? – Versicular
@Raghu: yes, a weakly-ordered ISA like ARM can do store coalescing of non-adjacent stores before commit, while x86 can only coalesce stores to the same line if they were back-to-back. And if the oldest entry's line isn't in E or M state yet, ARM can scan the store buffer for an entry that can commit out of order. IDK how aggressive it's practical to make that: checking cache-state tags for N buffer entries every clock cycle seems impractical, but maybe it can notice when RFO responses arrive (line entering E state). – Sitra
