how are barriers/fences and acquire, release semantics implemented microarchitecturally?

A lot of questions on SO, as well as articles/books such as https://mirrors.edge.kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook.2018.12.08a.pdf and Preshing's series of articles such as https://preshing.com/20120710/memory-barriers-are-like-source-control-operations/, talk about memory ordering abstractly, in terms of the ordering and visibility guarantees provided by different barrier types. My question is: how are these barriers and memory-ordering semantics implemented on x86 and ARM, microarchitecturally?

For store-store barriers, it seems like on x86 the store buffer maintains program order of stores and commits them to L1D (hence making them globally visible in the same order). If the store buffer is not ordered, i.e. does not maintain stores in program order, how is a store-store barrier implemented? Is it just "marking" the store buffer in such a way that stores before the barrier commit to the cache-coherent domain before stores after it? Or does the memory barrier actually flush the store buffer and stall all instructions until the flushing is complete? Could it be implemented both ways? (The message-passing sketch after the next paragraph illustrates what these StoreStore and LoadLoad guarantees are protecting.)

For load-load barriers, how is load-load reordering prevented? It is hard to believe that x86 executes all loads in order! I assume loads can execute out of order but commit/retire in order. If so, if a CPU executes 2 loads to 2 different locations, how does one load ensure that it got a value from, say, time T100 and the next one got its value at or after T100? What if the first load misses in the cache and is waiting for data while the second load hits and gets its value? When load 1 finally gets its value, how does it ensure that the value it got is not from a store newer than the one load 2's value came from? If loads can execute out of order, how are violations of memory ordering detected?
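To make the writer and reader sides of those two questions concrete, here is the classic message-passing pattern that StoreStore ordering (on the producer) and LoadLoad ordering (on the consumer) exist to protect. This is a minimal C++ sketch added for illustration, not from the original question; the names data/flag are made up. On x86 both the release store and the acquire load compile to plain mov, precisely because the hardware already enforces these orderings:

    #include <atomic>

    std::atomic<int> data{0}, flag{0};

    // Producer: needs StoreStore ordering so 'data' becomes globally
    // visible no later than 'flag'.
    void producer() {
        data.store(42, std::memory_order_relaxed);
        flag.store(1, std::memory_order_release);   // orders the earlier store before it
    }

    // Consumer: needs LoadLoad ordering so the read of 'data' is not
    // satisfied with a value older than the 'flag' value it observed.
    int consumer() {
        while (flag.load(std::memory_order_acquire) == 0) { }
        return data.load(std::memory_order_relaxed);  // guaranteed to see 42
    }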

Similarly, how are load-store barriers (implicit in all loads on x86) implemented, and how are store-load barriers (such as mfence) implemented? I.e., what do the dmb ld/st and plain dmb instructions do microarchitecturally on ARM, and what do every load, every store, and the mfence instruction do microarchitecturally on x86 to ensure memory ordering?
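And for the StoreLoad case, the usual litmus test looks like the following (again a minimal C++ sketch for illustration, not from the question). Without the fences, each thread's store can still be sitting in its own store buffer when the other thread's load executes, so r1 == 0 && r2 == 0 is allowed even on x86; the seq_cst fences (which compilers typically implement on x86 with mfence or a locked instruction) forbid that outcome:

    #include <atomic>

    std::atomic<int> x{0}, y{0};
    int r1, r2;

    void thread0() {
        x.store(1, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_seq_cst);  // StoreLoad barrier
        r1 = y.load(std::memory_order_relaxed);
    }

    void thread1() {
        y.store(1, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_seq_cst);  // StoreLoad barrier
        r2 = x.load(std::memory_order_relaxed);
    }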

Rasp answered 23/9, 2019 at 21:29 Comment(1)
Is that Q about memory operations, or normal C objects in normal memory, that is, operations on addresses that always end up in the cache?Prisca

Much of this has been covered in other Q&As (especially the later C++ How is release-and-acquire achieved on x86 only using MOV?), but I'll give a summary here. Still, good question; it's useful to collect this all in one place.


On x86, every asm load is an acquire-load. To implement that efficiently, modern x86 HW speculatively loads earlier than allowed and then checks that speculation. (Potentially resulting in a memory-order mis-speculation pipeline nuke.) To track this, Intel calls the combination of load and store buffers the "Memory Order Buffer".
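The "checks that speculation" part can be pictured roughly like this. This is only a conceptual C++ sketch of the mechanism described above and in the comments (snooping invalidations against completed-but-not-retired loads); it is not a real hardware interface, and all names are hypothetical:

    #include <cstdint>

    // Hypothetical per-load tracking state in the load buffer / MOB.
    struct LoadBufferEntry {
        uint64_t line_addr;   // cache line the load took its data from
        bool     completed;   // data already delivered to the core
        bool     retired;
    };

    void flag_memory_order_misspeculation(int entry) {
        (void)entry;  // pipeline nuke + re-execution from this load, not modeled here
    }

    // Called when another core's store invalidates one of our lines:
    // any load that already has data from that line but hasn't retired
    // yet may now violate x86's load ordering, so it must be re-executed.
    void on_snoop_invalidate(LoadBufferEntry lb[], int n, uint64_t invalidated_line) {
        for (int i = 0; i < n; ++i) {
            if (lb[i].completed && !lb[i].retired && lb[i].line_addr == invalidated_line) {
                flag_memory_order_misspeculation(i);
            }
        }
    }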

Weakly-ordered ISAs don't have to speculate, they can just load in any order.


x86 store ordering is maintained by only letting stores commit from the store buffer to L1d in program order.

On Intel CPUs at least, a store-buffer entry is allocated for a store when it issues (from the front-end into the ROB + RS). All uops need to have a ROB entry allocated for them, but some uops also need to have other resources allocated, like load or store buffer entries, RAT entries for registers they read/write, and so on.

So I think the store buffer itself is ordered. When a store-address or store-data uop executes, it merely writes an address or data into its already-allocated store-buffer entry. Since commit (freeing SB entries) and allocate are both in program order, I assume it's physically a circular buffer with a head and tail, like the ROB. (And unlike the RS).
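As a way to picture that, here is a minimal C++ sketch of such an ordered (circular) store buffer: entries are allocated at the tail in program order at issue, filled in whenever the store-address/store-data uops execute, and committed to L1d only from the head, so stores become globally visible in program order. The capacity and names are made up for illustration, not taken from any real design:

    #include <cstddef>
    #include <cstdint>

    struct StoreBufferEntry {
        uint64_t addr = 0;
        uint64_t data = 0;
        bool addr_ready = false, data_ready = false, retired = false;
    };

    struct StoreBuffer {
        static const size_t N = 64;            // capacity chosen arbitrarily for the sketch
        StoreBufferEntry e[N];
        size_t head = 0, tail = 0, count = 0;

        // Allocation happens at issue, in program order (the front-end stalls if full).
        size_t allocate() {
            size_t idx = tail;
            tail = (tail + 1) % N;
            ++count;
            return idx;
        }

        // Commit only the oldest entry, and only once it's retired and complete,
        // so the globally visible order of stores matches program order.
        bool try_commit() {
            if (count == 0) return false;
            StoreBufferEntry &s = e[head];
            if (!(s.retired && s.addr_ready && s.data_ready)) return false;
            // write s.data to L1d at s.addr here (not modeled)
            head = (head + 1) % N;
            --count;
            return true;
        }
    };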


Avoiding LoadStore reordering is basically free: a load can't retire until it's executed (taken data from the cache). A store can't commit until after it retires. In-order retirement automatically means that all previous loads are done before a store is "graduated" and ready for commit.
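For reference, LoadStore reordering is what would allow the outcome r1 == 1 && r2 == 1 in a litmus test like the following (a minimal C++ sketch for illustration; relaxed atomics are used so only the hardware-level pattern is in view, although a compiler is also free to reorder relaxed operations):

    #include <atomic>

    std::atomic<int> x{0}, y{0};
    int r1, r2;

    // r1 == 1 && r2 == 1 is only possible if at least one core makes its
    // store visible before its earlier load has taken its value, i.e.
    // LoadStore reordering. x86 forbids it; weaker models may allow it.
    void thread0() {
        r1 = x.load(std::memory_order_relaxed);
        y.store(1, std::memory_order_relaxed);
    }

    void thread1() {
        r2 = y.load(std::memory_order_relaxed);
        x.store(1, std::memory_order_relaxed);
    }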

A weakly-ordered uarch that can in practice do load-store reordering might scoreboard loads as well as track them in the ROB: let them retire once they're known to be non-faulting, even if the data hasn't arrived.

This seems more likely on an in-order core, but IDK. So you could have a load that's retired but whose destination register will still stall anything that tries to read it before the data actually arrives. We know that in-order cores do in practice work this way, not requiring loads to complete before later instructions can execute. (That's why software-pipelining using lots of registers is so valuable on such cores, e.g. to implement a memcpy. Reading a load result right away on an in-order core destroys memory-level parallelism.)
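For instance, a software-pipelined copy loop on such a core might look like this minimal C++ sketch (the name and depth-1 pipelining are just for illustration): the next load is issued before the previous result is consumed, so a cache miss only stalls the loop when its data is actually needed, not when the load executes.

    #include <cstddef>
    #include <cstdint>

    void copy_pipelined(uint64_t *dst, const uint64_t *src, size_t n) {
        if (n == 0) return;
        uint64_t cur = src[0];                // first load issued early
        for (size_t i = 0; i + 1 < n; ++i) {
            uint64_t next = src[i + 1];       // start the next load first...
            dst[i] = cur;                     // ...then consume the one already in flight
            cur = next;
        }
        dst[n - 1] = cur;
    }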

How is load->store reordering possible with in-order commit? goes into this more deeply, for in-order vs. out-of-order.


Barrier instructions

The only barrier instruction that does anything for regular stores is mfence, which in practice stalls memory ops (or the whole pipeline) until the store buffer is drained. Are loads and stores the only instructions that gets reordered? covers the Skylake-with-updated-microcode behaviour of acting like lfence as well.
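This is also why a compiler only needs a barrier instruction for the seq_cst cases on x86; a sketch for reference (the exact instruction choice varies by compiler, typically mov + mfence or a single xchg for a seq_cst store):

    #include <atomic>

    std::atomic<int> g{0};

    void publish(int v) {
        // release store: a plain `mov` on x86, no barrier instruction needed
        g.store(v, std::memory_order_release);
    }

    void publish_seq_cst(int v) {
        // seq_cst store: typically `mov` + `mfence`, or a single `xchg`,
        // because ordering against later loads (StoreLoad) requires waiting
        // for the store buffer, which x86 otherwise never does
        g.store(v, std::memory_order_seq_cst);
    }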

lfence mostly exists for the microarchitectural effect of blocking later instructions from even issuing until all previous instructions have left the out-of-order back-end (retired). The use-cases for lfence for memory ordering are nearly non-existent.

Related:

Kirimia answered 23/9, 2019 at 22:46 Comment(20)
Thanks Peter. 1) Can you elaborate more on "checks that speculation" for the acquire-load part?Rasp
@Raghu: look up the things that can cause memory-order mis-speculation. I think it involves noticing if the cache line was invalidated between the load execution and load retirement, maybe by snooping LFB activity and flagging that load buffer. This is pretty hand-wavy; if I knew something more concrete I'd put it in the answer.Kirimia
Thanks Peter. Will try and find out more about the MOB etc. I think store-store and store-load are clear. For load-store, any CPU that implements a ROB should get load-store ordering for free. Is there any implementation that can even theoretically do load-store reordering, given that they all have to do in-order retirement to maintain the illusion of program order for the same core?Rasp
@Raghu: Yes, I think so. Like I said, it's easy for an in-order core that scoreboards loads after checking that they're non-faulting. (i.e. will definitely happen, similar to a retired store that's sitting in the store buffer waiting to commit). A load can delay arbitrarily long while waiting for a cache miss as long as no instruction tries to read the target register. x86 has a strongly-ordered memory model (and needs load-load ordering) so no x86 will attempt this, but weakly-ordered cores might be designed to do that. Maybe even with OoO as well. I'd have to google for real examples.Kirimia
Thanks Peter. Just so I understand your in-order example clearly, you are talking about in-order cores (i.e. no speculative execution, branch prediction or OoO execution) but with a weak memory model?Rasp
@Raghu: Yes, like an ARM Cortex-A53 found in most smart phones: in order but otherwise fairly high performance. Such cores will still have branch prediction to avoid fetch bubbles, though! Instructions begin executing in order but can complete out of order once it's known that they won't fault. But yes, no speculative execution, only speculative fetch/decode. No ROB, just a superscalar pipeline.Kirimia
Here is another post by Peter relevant to the above discussion: stackoverflow.com/questions/52215031/…Rasp
@Raghu: yes, thanks for finding that. Added some links to my answer.Kirimia
Does the pipeline wait for the store buffer to be drained as soon as it runs into an MFENCE? Or will it keep issuing instructions until it runs into the first load?Legere
@pveentjer: Depends on the microarchitecture. On Skylake with microcode updates, mfence includes lfence-like behaviour so it stalls the front-end until the store buffer drains. (Are loads and stores the only instructions that gets reordered?). But locked instructions, and mfence on some other uarches, only delay exec of loads. I haven't tested if xchg [mem], reg ; load ; unrelated ALU lets the independent ALU instruction execute before the store buffer drains, but I'd hope so, with the MOB (memory order buf) tracking order.Kirimia
@PeterCordes re load mis-speculation, is it safe to say that it can happen in StoreLoad and LoadLoad scenarios? i.e. the load executes earlier than the store (or load), but another core does a store to the same address before the load retires?Restrictive
@DanielNitzan: LoadLoad yes, but x86 architecturally allows StoreLoad reordering. It's not speculation to do a load before an older store commits. (If you mean across an mfence or locked instruction, I assume it doesn't try to speculate later loads, since you normally only use barriers near access to memory that another core might be messing with. Unlike most code that spends most of its time touching non-shared or read-only cache lines.)Kirimia
There are other cases where mem-order machine nukes can happen, e.g. misprediction that a load will/won't reload a store from an unknown address. It's possible to get mem-order machine clears in single-threaded code because of that.Kirimia
@PeterCordes So you're saying a StoreLoad re-ordering on x86 is possible because of loads executing early (but retiring in order), in addition to the store buffer. (And good point on the mfence!)Restrictive
@DanielNitzan: Not really. If store data or store-address aren't ready, it has to wait, so a younger load might be the oldest-ready-first uop for a port. That shouldn't stop later loads from executing! Executing a store just means writing the address+data to the store buffer. That does necessitate dealing with a load when not all previous store-addresses are known, though, making memory disambiguation harder. (github.com/travisdowns/uarch-bench/wiki/… is good and has tons of links to other good general articles, especially Henry Wong's)Kirimia
@DanielNitzan: But anyway, that's not really introducing any new StoreLoad reordering because like I said, executing a store in the first place is just writing an entry in the store buffer. (Or parts of it, executing store-address and store-data separately when their inputs are ready.) I guess if you were expecting all later loads to have a data dependency on a store address, that would be new, but memory reordering is about the order other threads can observe our actions (or the relative order of our operations touching L1d$), and the existence of the SB already means stores = post retirementKirimia
@PeterCordes If a speculatively executed load happens to retire after the store commits to L1d$, doesn't it qualify as a new StoreLoad reordering? I understand that it may not happen in practice though?Restrictive
@DanielNitzan: I guess you could call it another mechanism if you want. I think my objection is that it doesn't introduce the possibility of StoreLoad reordering in any new places it wasn't already allowed. Retirement of the load relative store commit also isn't necessarily interesting, although I guess that is when x86 CPUs check a load buffer to verify speculation for load ordering wrt. other loads. Any time a load executes before an older store commits, and doesn't get rolled back as mis-speculation, that's StoreLoad reordering, regardless of retirement timing.Kirimia
@DanielNitzan: It certainly could happen in practice, though, especially with a store, then a long dep chain of imul or sqrtpd, then an independent load. The store will retire much earlier than the load, but if the dep chain is under the RS size then the load can be in the back-end and execute. pause would also delay a long time, but probably in the front-end which would delay the load.Kirimia
Q&A relevant to x86 StoreLoad reordering root cause (storebuf vs OoO exec)Restrictive
