Does a memory barrier act both as a marker and as an instruction?
I have read different things about how a memory barrier works.

For example, the user Johan's answer in this question says that a memory barrier is an instruction that the CPU executes.

While the user Peter Cordes's comment in this question says the following about how the CPU reorders instructions:

It reads faster than it can execute, so it can see a window of upcoming instructions. For details, see some of the links in the x86 tag wiki, like Agner Fog's microarch pdf, and also David Kanter's writeup of Intel's Haswell design. Of course, if you had simply googled "out of order execution", you'd find the wikipedia article, which you should read.

So based on the above comment, I'm guessing that if a memory barrier sits between instructions, the CPU will see the barrier and refrain from reordering the instructions across it. That would mean a memory barrier is a "marker" for the CPU to see rather than to execute.


Now my guess is that a memory barrier acts both as a marker and as an instruction for the CPU to execute.

For the marker part, the CPU sees the memory barrier between the instructions, which causes the CPU not to reorder the instructions.

As for the instruction part, the CPU will execute the memory barrier instruction, which causes the CPU to do things like flushing the store buffer, and then the CPU will continue to execute the instructions after the memory barrier.

Am I correct?

Equalitarian answered 14/5, 2018 at 20:8 Comment(9)
Regarding the tags, the assembly tag should only be used for questions about assembly programming or for questions about how instructions work at the ISA specification level, not the implementation level. Similarly, the cpu tag should only be used for questions about stuff like CPU utilization or configuration or virtualization, but not about the internals of CPUs. You can use the cpu-architecture tag instead for questions about how instructions are implemented or how CPUs work, like this question for example.Teresaterese
What do you mean by "marker"? What's the difference between a marker and an instruction exactly? In the Intel manual, there is no such thing as a marker instruction.Teresaterese
Strictly speaking, by definition, a memory barrier is only guaranteed to provide ordering for certain memory operations, but not necessarily instructions.Teresaterese
@Hadi Brais "What do you mean by "marker"? What's the difference between a marker and an instruction exactly? In the Intel manual, there is no such a thing as a marker instruction." I mean by "marker" an instruction that the CPU only sees but does not execute (it is a term that I made up).Equalitarian
But then how can a memory barrier act both as a marker and as an instruction? How can it get executed and not get executed?Teresaterese
@Hadi Brais I said in my previous comment: "I mean by "marker" an instruction that the CPU only sees but does not execute". This is not what I meant, I meant that the presence of the marker causes the CPU to do something (not to reorder the instructions in this case) but it can also be executed (the execution of the marker causes the CPU to do something else, in this case it causes the store buffer to be flushed, among other things I think).Equalitarian
I can't wrap my head around this marker thing, but let me present an alternative way to understand barriers. When the CPU decodes the barrier instruction (potentially together with neighboring instructions) and passes all these instructions to the scheduling unit, the scheduler sees the barrier and it creates new dependencies between the instructions as follows. It creates dependencies between the barrier and any previous instructions that are supposed to be ordered by the barrier...Teresaterese
...It also creates dependencies between any later instructions that are supposed to be ordered by the barrier and the barrier. I think you can draw a diagram for that. So barriers tell the CPU to create certain dependencies between instructions that may have not been there without it. That's exactly what they do.Teresaterese
@Equalitarian This definition of “marker” doesn't make a lot of sense. Certainly, flushing buffers is a form of execution. I think there is no point in considering the term “marker” at all. It doesn't map to the CPU in any reasonable way and doesn't give us any new insights either.Keeney

No, mfence is not serializing on the instruction stream, and lfence (which is) doesn't flush the store buffer.

(In practice on Skylake, mfence does block out-of-order execution of later ALU instructions, not just loads. (Proof: experiment details at the bottom of this answer). So it's implemented as an execution barrier, even though on paper it's not required to be one. But lock xchg doesn't, and is also a full barrier.)

I'd suggest reading Jeff Preshing's Memory Barriers Are Like Source Control Operations article, to get a better understanding of what memory barriers need to do, and what they don't need to do. They don't (need to) block out-of-order execution in general.


A memory barrier restricts the order that memory operations can become globally visible, not (necessarily) the order in which instructions execute. Go read @BeeOnRope's updated answer to your previous question again: Does an x86 CPU reorder instructions? to learn more about how memory reordering can happen without OoO exec, and how OoO exec can happen without memory reordering.

Stalling the pipeline and flushing buffers is one (low-performance) way to implement barriers, used on some ARM chips, but higher-performance CPUs with more tracking of memory ordering can have cheaper memory barriers that only restrict ordering of memory operations, not all instructions. And for memory ops, they control order of access to L1d cache (at the other end of the store buffer), not necessarily the order that stores write their data into the store buffer.

x86 already needs lots of memory-order tracking for normal loads/stores for high performance while maintaining its strongly-ordered memory model where only StoreLoad reordering is allowed to be visible to observers outside the core (i.e. stores can be buffered until after later loads). (Intel's optimization manual uses the term Memory Order Buffer, or MOB, instead of store buffer, because it has to track load ordering as well. It has to do a memory-ordering machine clear if it turns out that a speculative load took data too early.) Modern x86 CPUs preserve the illusion of respecting the memory model while actually executing loads and stores aggressively out of order.

mfence can do its job just by writing a marker into the memory-order buffer, without being a barrier for out-of-order execution of later ALU instructions. This marker must at least prevent later loads from executing until the mfence marker reaches the end of the store buffer. (As well as ordering NT stores and operations on weakly-ordered WC memory).

(But again, simpler behaviour is a valid implementation choice, for example not letting any stores after an mfence write data to the store buffer until all earlier loads have retired and earlier stores have committed to L1d cache. i.e. fully drain the MOB / store buffer. I don't know exactly what current Intel or AMD CPUs do.)


On Skylake specifically, my testing shows mfence is 4 uops for the front-end (fused domain), and 2 uops that actually execute on execution ports (one for port2/3 (load/store-address), and one for port4 (store-data)). Presumably it's a special kind of uop that writes a marker into the memory-order buffer. The 2 uops that don't need an execution unit might be similar to lfence. I'm not sure if they block the front-end from even issuing a later load, but hopefully not because that would stop later independent ALU operations from being executed.


lfence is an interesting case: as well as being a LoadLoad + LoadStore barrier (even for weakly-ordered loads; normal loads/stores are already ordered), lfence is also a weak execution barrier (note that mfence isn't, just lfence). It can't execute until all earlier instructions have "completed locally". Presumably that means "retired" from the out-of-order core.

But a store can't commit to L1d cache until after it retires anyway (i.e. after it's known to be non-speculative), so waiting for stores to retire from the ROB (ReOrder Buffer for uops) isn't the same thing as waiting for the store buffer to empty. See Why is (or isn't?) SFENCE + LFENCE equivalent to MFENCE?.

So yes, the CPU pipeline does have to "notice" lfence before it executes, presumably in the issue/rename stage. My understanding is that lfence can't issue until the ROB is empty. (On Intel CPUs, lfence is 2 uops for the front-end, but neither of them need execution units, according to Agner Fog's testing. http://agner.org/optimize/.)

lfence is even cheaper on AMD Bulldozer-family: 1 uop with 4-per-clock throughput. IIRC, it's not partially-serializing on those CPUs, so you can only use lfence; rdtsc to stop rdtsc from sampling the clock early on Intel CPUs.


A fully serializing instruction like cpuid or iret would also wait until the store buffer has drained. (They're full memory barriers, as strong as mfence.) Or something like that; they're multiple uops, so maybe only the last one does the serializing. I'm not sure which side of the barrier the actual work of cpuid happens on (or whether it can overlap with either earlier or later instructions). Anyway, the pipeline itself has to notice serializing instructions, but the full memory-barrier effect might come from uops that do what mfence does.


Bonus reading:

On AMD Bulldozer-family, sfence is as expensive as mfence, and may be as strong a barrier. (The x86 docs set a minimum for how strong each kind of barrier is; they don't prevent them from being stronger because that's not a correctness problem). Ryzen is different: sfence has one per 20c throughput, while mfence is 1 per 70c.

sfence is very cheap on Intel (a uop for port2/port3, and a uop for port4), and just orders NT stores wrt. normal stores, not flushing the store buffer or serializing execution. It can execute at one per 6 cycles.

sfence doesn't drain the store buffer before retiring. It doesn't become globally visible itself until all preceding stores have become globally visible first, but this is decoupled from the execution pipeline by the store buffer. The store buffer is always trying to drain itself (i.e. commit stores to L1d) so sfence doesn't have to do anything special, except for putting a special kind of mark in the MOB that stops NT stores from reordering past it, unlike the marks that regular stores put which only order wrt. regular stores and later loads.


It reads faster than it can execute, so it can see a window of upcoming instructions.

See this answer I wrote which is a more detailed version of my comment. It goes over some basics of how a modern x86 CPU finds and exploits instruction-level parallelism by looking at instructions that haven't executed yet.

In code with high ILP, recent Intel CPUs can actually bottleneck on the front-end fairly easily; the back-end has so many execution units that it's rarely a bottleneck unless there are data dependencies or cache misses, or you use a lot of a single instruction that can only run on limited ports (e.g. vector shuffles). But any time the back-end doesn't keep up with the front-end, the out-of-order window starts to fill with instructions to find parallelism in.

Jam answered 14/5, 2018 at 20:40 Comment(9)
I don't understand the two bold parts about memory barriers and mfence. As I understand it, the barriers mfence and sfence prevent younger stores from executing and drain the store buffer (this is what Intel says and what must be done to make a store globally visible). What is the role of the marker? Could it be a command to the store buffer instead?Kaylor
@MargaretBloom: sfence doesn't drain the store buffer before retiring. It doesn't become globally visible until all preceding stores have become globally visible first, but this is decoupled from the execution pipeline by the store buffer. The store buffer is always trying to drain itself (i.e. commit stores to L1d) so sfence doesn't have to do anything special, except for putting a special kind of mark in the MOB that stops NT stores from reordering past it, unlike the marks that regular stores put which only order wrt. regular stores and later loads.Jam
@MargaretBloom: updated with more stuff, does that help? Do I need to explain what I mean by "marker" for the store buffer in more detail? Loads and stores have to write the MOB so it can track their ordering, and mfence + sfence apparently do something similar on Intel CPUs, writing some kind of marker into the MOB. So it's sort of like a command for the store buffer.Jam
Very much appreciate the effort, but I think the answer is too complicated for the posed question. I think it would be sufficient to just consider only one instruction as an example (the simplest one). The stuff about AMD processors and the differences between the different fence instructions and the stuff about serializing instructions make the answer hard to follow within the context of the posed question. Not to mention that there are nine links in the answer about all kinds of related stuff, which include even more links to even more stuff. That is a little overwhelming.Teresaterese
@HadiBrais: Thanks for the feedback, moved that AMD SFENCE stuff to a "bonus reading" footnote section. My answers are always full of links, but I try to say enough that you don't need to follow them unless you want to learn more about that subject. I don't know what the question is even asking, whether it's about current microarchitectures, or whether it's about the x86 architecture on paper. So I ended up answering both, with a description of the minimum necessary to implement mfence according to the on-paper guarantees (i.e. not serializing at all, just a memory barrier.)Jam
Like for example, the first sentence itself mentions two different fence instructions, and the terms "serializing", "partially serializing", and "store buffers", none of which are even mentioned in the question. From the perspective of a beginner, this would be immediately overwhelming. Then the answer goes into more stuff like MOB, ROB, WC memory, NT writes, different kinds barriers (LoadLoad, etc.), different fence instructions, different processors from different vendors, pipeline ports and uops. The OP may not have a solid understanding of any of these.Teresaterese
@HadiBrais: feel free to try to answer the question more simply. I may take some more time to make a simpler summary. But like I said, I don't know whether the OP is asking about pure memory barriers in a theoretical clean/simple architecture, or whether they're actually asking about x86's barriers on real x86 CPUs, which are complicated by all these warts. I'm not sure how to make it any more clear than BeeOnRope's answer on the OP's previous question, which IMO clearly explained the difference between OoO exec vs. memory reordering.Jam
@HadiBrais: I did add a link to Jeff Preshing's barrier article, so the first couple section of the answer stands on its own as a simple version of the answer, without considering all the details of x86 barrier semantics.Jam
Thank you very much Peter. I really appreciate your efforts.Kaylor
