How do modern Intel x86 CPUs implement the total order over stores

x86 guarantees a total order over all stores due to its TSO memory model. My question is whether anyone has an idea of how this is actually implemented.

I have a good impression of how all 4 fences are implemented, so I can explain how local order is preserved. But the 4 fences just give you program order; they won't give you TSO (I know TSO allows an older store to be reordered after a newer load, so only 3 of the 4 fences are implicitly needed).

A total order over all memory actions on a single address is the responsibility of cache coherence. But I would like to know how Intel (Skylake in particular) implements a total order on stores across multiple addresses.

Denicedenie answered 19/6, 2020 at 7:31 Comment(3)
did you see this video? youtube.com/watch?v=WUfvvFD5tAA maybe it's relevant .. no idea though, just trying to help ^^Awlwort
It is a nice watch. But it doesn't answer my question. I know the contract provided by Intel x86. But I would like to know how it is actually implemented.Denicedenie
I think I know how it works with the help of beeonrope. I'll write down an explanation as soon as I have time and put my thoughts on paper.Denicedenie

The x86 TSO memory model basically amounts to program-order plus a store buffer with store-forwarding. (486 hardware was that simple; later CPUs didn't introduce new reordering.)

Most of the resulting guarantees are fairly easy in theory for hardware to implement by simply having a store buffer and coherent shared memory: the store buffer insulates OoO exec from the in-order commit requirement (and from cache-miss stores), and makes it possible to execute stores speculatively, and (via store-to-load forwarding) to reload those stores while they're still speculative.
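
To make that concrete, here is a minimal C++ sketch (my illustration, not part of the original answer) of the one reordering the store buffer introduces: an older store passing a younger load (StoreLoad). Relaxed ordering keeps the example simple; on x86 these compile to plain loads and stores, and the store buffer alone is enough to produce the reordered outcome.

```
#include <atomic>
#include <thread>

std::atomic<int> x{0}, y{0};
int r1, r2;

void thread_1() {
    x.store(1, std::memory_order_relaxed);   // older store: goes into the store buffer
    r1 = y.load(std::memory_order_relaxed);  // younger load: can read L1d before the
                                             // store above has drained (StoreLoad reordering)
}

void thread_2() {
    y.store(1, std::memory_order_relaxed);
    r2 = x.load(std::memory_order_relaxed);
}

int main() {
    std::thread t1(thread_1), t2(thread_2);
    t1.join(); t2.join();
    // r1 == 0 && r2 == 0 is possible: each store can still be sitting in its core's
    // store buffer when the other core's load reads L1d. Making the operations
    // seq_cst puts a full barrier between the store and the load on x86, which
    // drains the store buffer and forbids that outcome.
}
```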

  • All cores can agree on a total order in which all stores happened. Or more accurately, cores can't disagree on any part of the total order they can actually observe. Stores to 2 different lines can be truly simultaneous, so any observations are compatible with either order in a hypothetical total order.

    This happens automatically if the only way to make a store visible to any other core also makes it visible to all cores simultaneously, i.e. by committing to coherent L1d. This makes IRIW reordering impossible. (MESI ensures that a store can't commit to L1d unless it's exclusively owned by this core: no other core has a valid copy.) (A core observing its own stores needs a full barrier, or it will observe its own stores via store forwarding rather than the global total order. Typical IRIW litmus tests consider 4 total threads, so there are no local reloads.)

    In fact it's rare for any hardware not to have this property; however, some POWER CPUs can store-forward between SMT threads on the same physical core, making it possible for 2 readers to disagree about the order of stores by 2 writers (IRIW reordering). Even though x86 CPUs also often have SMT (e.g. Intel's HyperThreading), the memory model requires them not to store-forward between logical cores. That's fine; they statically partition the store buffer anyway. See What will be used for data exchange between threads are executing on one Core with HT?, and also What are the latency and throughput costs of producer-consumer sharing of a memory location between hyper-siblings versus non-hyper siblings? for experimental testing. (A C++ sketch of the IRIW litmus test follows this bullet.)
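
For reference, here is the IRIW litmus test written with C++ atomics (my sketch; the thread and variable names are invented). The acquire loads only rule out local LoadLoad reordering in the readers; the C++ abstract machine by itself forbids the outcome below only when all the operations are seq_cst, but x86 hardware rules it out regardless, because a store becomes visible to every other core at once when it commits to L1d.

```
#include <atomic>

std::atomic<int> x{0}, y{0};
int r1, r2, r3, r4;

void writer_x() { x.store(1, std::memory_order_relaxed); }
void writer_y() { y.store(1, std::memory_order_relaxed); }

void reader_1() {                              // sees x's store, then checks y
    r1 = x.load(std::memory_order_acquire);
    r2 = y.load(std::memory_order_acquire);
}

void reader_2() {                              // sees y's store, then checks x
    r3 = y.load(std::memory_order_acquire);
    r4 = x.load(std::memory_order_acquire);
}

// IRIW reordering would be the outcome
//   r1 == 1 && r2 == 0 && r3 == 1 && r4 == 0
// i.e. reader_1 concludes x's store happened first while reader_2 concludes
// y's store happened first. With 4 separate threads no store forwarding is
// involved, so on x86 this can't happen: each store only becomes visible by
// committing to coherent L1d, which makes it visible to all cores at once.
```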

The only reordering that happens is local, within each CPU core, between its accesses to that globally coherent shared state. (That's why local memory barriers that just make this core wait for stuff to happen, e.g. for the store buffer to drain, can recover sequential consistency on top of x86 TSO. The same applies even to weaker memory models, BTW: just local reordering on top of MESI coherency.)
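
As an aside (my illustration of typical, though not mandated, code generation): compilers recover sequential consistency on x86 by putting the full barrier only on the seq_cst store side, since TSO already provides every other ordering.

```
#include <atomic>

std::atomic<int> flag{0};

void publisher() {
    // seq_cst store: typically compiled to  xchg [flag], reg  (or mov + mfence)
    // on x86-64. The full barrier is purely local: it makes this core wait for
    // its own store buffer to drain before later loads execute.
    flag.store(1, std::memory_order_seq_cst);
}

int consumer() {
    // seq_cst load: typically just a plain  mov reg, [flag]  on x86-64, because
    // TSO already forbids LoadLoad and LoadStore reordering.
    return flag.load(std::memory_order_seq_cst);
}
```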

The rest of these guarantees apply to each (logical) CPU core individually. (Q&A about how this can create synchronization between cores.)

  • Stores become visible in program order: in-order commit from the store buffer to L1d cache. (Store buffer entries are allocated in program order during issue/rename). This means cache miss stores must stall the store buffer, not letting younger stores commit. See Why doesn't RFO after retirement break memory ordering? for a simple mental model of this, and some detail on what Skylake may actually do (with committing data from store misses into LFBs while waiting for the cache lines to arrive).

  • Loads don't reorder with later stores: easy: require loads to fully complete (have taken data from L1d cache) before they can retire. Since retirement is in order, and a store can't commit to L1d until after it retires (becomes non-speculative), we get LoadStore ordering for free1.

  • Loads take data from coherent cache (memory) in program order. This is the hard one: loads access global state (cache) when they execute, unlike stores where the store buffer can absorb the mismatch between OoO exec and in-order commit. Actually making every load dependent on previous loads would prevent hit-under-miss and kill a lot of the benefits of out-of-order execution for code that involves memory.

    In practice, Intel CPUs speculate aggressively that a cache line that's present now will still be present when it's architecturally allowed for the load to happen (after earlier loads execute). If that isn't the case, nuke the pipeline (memory order mis-speculation). There's a perf counter event for this. (A toy model of this check, together with the in-order store commit from the bullet above, is sketched after this list.)
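
The following is a toy C++ model of the bookkeeping described in the two bullets above (purely my sketch of the idea, not how any real core is organized): stores commit to the coherent cache only from the head of a FIFO store buffer and only after retirement, and a load that executed early must be re-checked if the line it read is invalidated or evicted before the load is architecturally allowed to have happened.

```
#include <cstdint>
#include <deque>
#include <unordered_map>
#include <vector>

using Addr = std::uint64_t;
using Line = std::uint64_t;                        // cache-line number (Addr >> 6)

struct StoreBufEntry { Addr addr; std::uint64_t data; bool retired = false; };

struct CoreModel {
    std::deque<StoreBufEntry> store_buffer;          // entries allocated in program order
    std::unordered_map<Addr, std::uint64_t> l1d;     // stand-in for the coherent cache
    std::vector<Line> lines_read_by_inflight_loads;  // loads executed but not yet retired

    // Stores commit to (coherent) L1d only from the head of the buffer, and only
    // after the store has retired: this keeps stores visible to other cores in
    // program order. A cache-miss store at the head stalls all younger stores.
    void try_commit_one_store() {
        if (!store_buffer.empty() && store_buffer.front().retired) {
            const StoreBufEntry& e = store_buffer.front();
            l1d[e.addr] = e.data;                    // the store becomes globally visible here
            store_buffer.pop_front();
        }
    }

    // Coherence traffic (invalidation or eviction) hits a line that some in-flight
    // load already read: the value that load used might differ from what it would
    // see at the point it's architecturally allowed to happen, so the core has to
    // nuke the pipeline and re-execute (memory-order mis-speculation).
    bool must_nuke_on_invalidate(Line line) const {
        for (Line l : lines_read_by_inflight_loads)
            if (l == line) return true;
        return false;
    }
};
```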

In practice everything can be more complicated to chase a bit more performance, or a lot more for speculative early loads.

(In C++ terms, this is at least as strong as acq_rel, but also covers behaviour of things that might be UB in C++. For example, a load partially overlapping a recent store to a location another thread might also be reading or writing, allowing this core to load a value that never appeared or will appear in memory for other threads to load. Globally Invisible load instructions)

Footnote 1:
Some OoO exec weakly-ordered CPUs can do LoadStore reordering, presumably by letting loads retire from the ROB as long as the load checked permissions and requested the cache line (for a miss), even if the data hasn't actually arrived yet. Some separate tracking of the destination register not being ready yet is needed, outside the usual instruction scheduler.

LoadStore reordering is actually easier to understand on an in-order pipeline, where we know special handling for cache-miss loads is needed for acceptable performance. How is load->store reordering possible with in-order commit?
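
For concreteness, here is a LoadStore litmus test in C++ atomics (my sketch; names invented). With relaxed operations, the r1 == 1 && r2 == 1 outcome is allowed by the C++ model and can show up on weakly ordered hardware that does LoadStore reordering; x86 hardware can't produce it, because a store can't commit to cache before the older load has completed and the store has retired.

```
#include <atomic>
#include <thread>

std::atomic<int> x{0}, y{0};
int r1, r2;

void thread_1() {
    r1 = x.load(std::memory_order_relaxed);   // older load
    y.store(1, std::memory_order_relaxed);    // younger store
}

void thread_2() {
    r2 = y.load(std::memory_order_relaxed);
    x.store(1, std::memory_order_relaxed);
}

int main() {
    std::thread t1(thread_1), t2(thread_2);
    t1.join(); t2.join();
    // r1 == 1 && r2 == 1 requires each store to become visible before the older
    // load in the same thread took its value (LoadStore reordering). x86 hardware
    // never produces this; a weakly ordered CPU (or the compiler, given relaxed
    // ordering) may.
}
```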

Snowball answered 20/6, 2020 at 1:9 Comment(24)
The key part is the ordering (serialization) point in the cache coherence protocol. This creates the total order. This was the essential part I was missing. Thanks for your explanation; I hope to find some time today to analyze it.Denicedenie
I have been banging my head on the IRIW example in combination with coherent cache. po = program order, cc = cache coherence order for some address: `W(A=1)--[cc]-->R(A=1)--[po]-->R(B=0)--[cc]-->W(B=1)--[cc]-->R(B=1)--[po]-->R(A=?)`. A must be 1. So it seems that just having a coherent cache + [LoadLoad] is sufficient to prevent the IRIW problem? What bugs me is that we had a total order of all stores on a single address and now we magically have a total order on all stores over all addresses.Denicedenie
@pveentjer: Note that unlike the single-location case, you can have simultaneous stores to different locations. You don't need to serialize, you just need to ensure that no threads can disagree. It's not required that any real or hypothetical reader be able to see a state where one store has happened and another hasn't. Having both writer cores commit in the same clock cycle (assuming synced clocks across cores...) means that any order of share requests for those lines will either see neither or both stores. If they're not truly simultaneous, then it becomes possible to see one first.Snowball
@pveentjer: I don't really understand your notation or what you're trying to say. But also, what prevents IRIW is no private forwarding: stores become visible to all other cores at once, not some limited set first. On a weakly-ordered ISA, you use acquire loads when checking for IRIW; if you only used 2 relaxed loads you can't say anything about the order the stores might have happened in, because of possible LoadLoad reordering.Snowball
Thanks for your reply. My primary issue is with: what makes sure that cores agree upon the order. I know the actual order isn't relevant. But what mechanism makes sure that everyone will agree upon the order of the stores. Is it sufficient to have a coherent cache whereby the cache only guarantees a total order of all memory actions on a single address? This is the big thing that is bugging me. How can we jump from total order of all memory actions from a single address to total order over all stores of multiple addresses.Denicedenie
@pveentjer: Being coherent means that a read can never observe a stale value. So there's no "latency" for a write to propagate out to other cores which would give a window for re-ordering. MESI coherency gives the same correctness / ordering behaviour as a zero-latency interconnect, or as if all cores were literally directly sharing a single many-ported cache. It should be obvious that this can't reorder any accesses to shared memory, so any reordering is limited to that inside single cores. IDK why you say that a cache only guarantees separate total orders for single addresses.Snowball
Sure. I have zero problems with the coherent part; SWMR. MESI etc. My big problem is that this will only give you a total order of memory operations of a single address. But for TSO we need to have a total order of memory operations (stores in this case) over all addresses. I'm failing to see how we go from having total orders on single addresses, to total order over all addresses.Denicedenie
"IDK why you say that a cache only guarantees separate total orders for single addresses. " Perhaps here lies the source of my problem. My understanding was that e.g. MESI will only give you a total order of operations on a single address. But it will not give any order between operations of separate addresses. That is the primary distinction between coherence and consistency.Denicedenie
@pveentjer: not introducing any reordering means that the true order of commit to cache becomes the total order of stores. The instant a store commits to a line of L1d, it's part of the coherent state maintained by MESI. Therefore a total store order exists. Using ordered loads that observe that shared state in a known order, readers can observe that order. Maybe I'm not using the right formal terminology here (coherency vs. consistency), but think about how cache actually works instead of just formal definitions of terms like coherency.Snowball
But what if 2 stores commit to a line in the L1d at exactly the same moment as is the case with the IRIW example. I guess that is where the order is determined. And perhaps this is the magical ingredient I'm missing.Denicedenie
@pveentjer: You mean to separate lines, I assume. Nothing can ever say that one happened before the other, therefore no observers can disagree on the order. That satisfies x86 TSO. The hypothetical total order (which isn't guaranteed to be directly observable, just to exist at least theoretically) can pick either order if you don't like having events be truly simultaneous in it. Perhaps my answer should have said "cores can't disagree about the total order", instead of phrasing it as cores being able to establish what the total order actually was.Snowball
From what I understand, it is when the state of a cache line gets changed due to a conflicting access (e.g. the line was M, but another core wants to read it) that cache coherence orders the coherence traffic. This is done at the 'global ordering point' ("Modern Processor Design: Fundamentals of Superscalar Processors", 11.3.6.2). What I don't understand however is the IRIW example in this case. Even if the cache failed to provide any ordering between different addresses and only gave a total order on a single address, AFAIK the IRIW reordering isn't possible.Denicedenie
I'm fine with not caring about the specific total order. A topological sort is needed. There are many possible; they are all fine.. as long as they don't disagree about which one it was.Denicedenie
@pveentjer: re: when a store is globally observable: I'm assuming that cache is designed to respond to MESI share requests at any time, so committing to a line in M state creates the possibility of it being shared with another core in the next cycle. So you can't do that until it's architecturally allowed. (Or you could maybe build your cache to delay responding to MESI requests, but in practice that's only done as part of atomic RMW. Committing stores out of order and then maybe blocking MESI would be bad for perf).Snowball
@pveentjer: IRIW reordering is hard to make possible; as I said in my answer, the real life mechanism is making stores visible via a side-channel to some cores (store forwarding) before they commit to cache at all. As I keep saying, if the only communication channel between cores is via a MESI coherent cache, IRIW reordering can't happen. Ordering (or lack of disagreement at least) just comes for free as part of being coherent.Snowball
@PeterCordes - not sure I agree with "MESI is enough" and "IRIW is hard to make possible". I think the shared caching layer (or the "coherence layer") basically needs to be very careful to preserve ordering, apart from the single-line guarantees given by MESI. Consider IRIW on a 2-socket system, with readers and the involved lines on different sockets. If the two read requests arrive at the LLC (using LLC as shorthand for "caching agent") and the LLC just serves them naively, each reader will get the response ...Schuck
... for its "local" line first, getting the initial state (the stores haven't happened yet) and only later for the "remote" line, which could be after both stores happened. If the local line was second in the read order, this would result in the disallowed IRIW reordering. So the LLCs must coordinate somehow to keep these reads in order: MESI alone doesn't do it. For example, maybe it has to wait for the remote request to come back before proceeding with the local one. Or perhaps there is some MESI-like protocol even between the LLCs, in coordination with coherency directories.Schuck
A search turned up an excerpt from a book on HyperTransport which covers some of the ordering complexities and how cacheable requests have to be strongly ordered, and some rules for when certain request types can "pass" others. Even within a single-socket system, this type of thing would have to be considered, but there I suppose you can rely more heavily on known/fixed latencies to avoid reordering.Schuck
@BeeOnRope: The readers in IRIW have to be doing acquire loads, otherwise it's indistinguishable from local load reorder. The core receiving those incoming lines has to check that the 2nd load result is still valid after the 1st arrives, if the 2nd load's line arrived earlier. (i.e. block LoadLoad reordering). Maybe doing that efficiently requires some assistance from the interconnect / cache coherency mechanism. Maybe I've shown that IRIW is ruled out for any system that supports acq / rel ordering, which does(?) require the interconnect to avoid reordering, leaving that only for CPU cores.Snowball
I think my scenario is actually bogus, because I also wasn't considering the part where the writes get the line in the E state and how this interacts with the reading core. Well anyway I still believe that the interconnect needs to keep certain things in order but I don't have a good example of why.Schuck
If SWMR is maintained, no additional ordering is required and there will be a total store order. If the cache coherence relaxes SWMR, then it could be that the implementation requires additional ordering. An example of such a relaxation is 'early acknowledgement'; i.e. instead of waiting for all acknowledgement for invalidation to return, the CPU will just continue. patents.google.com/patent/US6493809Denicedenie
@PeterCordes. > All cores can agree on a total order in which all stores happened. This is true except for the writing cores, according to stackoverflow.com/questions/20907811/…Incest
@DanielNitzan: Well, code running on the writing core should know that it needs to use mfence or other full barrier if it wants to observe the global order instead of store forwarding. Just like on weakly ordered machines, the readers need to order their loads with a load-load barrier if they want to rule out local reordering. But yes, good point, making observations of the total order is less trivial than my phrasing implied, I'll see if I can think of a way to add something without over-complicating the statement.Snowball
@PeterCordes Right, a reminder to self and others that store forwarding is a completely separate issue from (StoreLoad) memory re-ordering introduced by the store buffer in and of itself. SLF affects the observable global store order. As you pointed out it requires careful attention if global store order is to be guaranteed.Incest
