Is a memory barrier an instruction that the CPU executes, or is it just a marker?

I am trying to understand what a memory barrier is, exactly. Based on what I know so far, a memory barrier (for example: mfence) is used to prevent instructions from being reordered across it, i.e. from before the barrier to after it and from after the barrier to before it.

This is an example of a memory barrier in use:

instruction 1
instruction 2
instruction 3
mfence
instruction 4
instruction 5
instruction 6

Now my question is: is the mfence instruction just a marker telling the CPU in what order to execute the instructions? Or is it an instruction that the CPU actually executes, like it executes other instructions (for example: mov)?

Edlyn answered 10/3, 2017 at 9:20 Comment(4)
It's an instruction that the CPU executes; there's no other kind of instruction.Ballyrag
Note that compiler memory barriers like std::atomic_signal_fence() or GNU C asm("":::"memory") are purely markers in the source code, and compile to zero instructions. They exist to block reordering at compile time, and are especially useful when the target architecture has a stronger memory model than the source language (e.g. C++ -> x86 asm). preshing.com/20120625/memory-ordering-at-compile-time explains more.Elfin
I wonder what you expect from that bounty. The answer you got is clear. If you have further questions, make sure to actually state them! Nobody can guess what part of the answer is unsatisfying to you.Algor
Note that you are setting up a possibly false dichotomy between "an instruction" and "a marker". Why can't it be both? Yes, it is undeniably an instruction, but why can't it be an instruction that largely serves as a marker?Phail
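
To illustrate the comment above about compiler-only barriers, here is a minimal sketch (the function names are made up for illustration; exact output depends on compiler and optimization level):

#include <stdatomic.h>

void compiler_barrier_only(void) {
    atomic_signal_fence(memory_order_seq_cst); /* blocks compile-time reordering, emits no instruction */
    asm volatile("" ::: "memory");             /* GNU C equivalent: also compiles to zero instructions */
}

void cpu_barrier(void) {
    asm volatile("mfence" ::: "memory");       /* an actual instruction the CPU executes */
}

Compiling with gcc -O2 -S should show an mfence only in cpu_barrier; the compiler-only fences leave no trace in the generated asm, they merely constrain how the compiler may reorder the surrounding memory accesses.
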
26

Every byte sequence that the CPU encounters in its instruction stream is an instruction that the CPU executes. There are no other kinds of instructions.

You can see this clearly in both the Intel instruction set reference and the specific page for mfence.

MFENCE
Performs a serializing operation on all load-from-memory and store-to-memory instructions that were issued prior the MFENCE instruction. This serializing operation guarantees that every load and store instruction that precedes the MFENCE instruction in program order becomes globally visible before any load or store instruction that follows the MFENCE instruction.

The MFENCE instruction is ordered with respect to all load and store instructions, other MFENCE instructions, any LFENCE and SFENCE instructions, and any serializing instructions (such as the CPUID instruction). MFENCE does not serialize the instruction stream. Weakly ordered memory types can be used to achieve higher processor performance through such techniques as out-of-order issue, speculative reads, write-combining, and write-collapsing. The degree to which a consumer of data recognizes or knows that the data is weakly ordered varies among applications and may be unknown to the producer of this data. The MFENCE instruction provides a performance-efficient way of ensuring load and store ordering between routines that produce weakly-ordered results and routines that consume that data.

Processors are free to fetch and cache data speculatively from regions of system memory that use the WB, WC, and WT memory types. This speculative fetching can occur at any time and is not tied to instruction execution. Thus, it is not ordered with respect to executions of the MFENCE instruction; data can be brought into the caches speculatively just before, during, or after the execution of an MFENCE instruction.

As you can see from the excerpt, the MFENCE instruction does quite a bit of work, rather than just being a marker of some sort.
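
For instance, the portable C11 fence below is only a marker at the source level, but GCC and Clang on x86-64 typically lower it to an actual mfence instruction between the two stores (a minimal sketch; the exact code generated varies by compiler and version):

#include <stdatomic.h>

int data;
atomic_int ready;

void publish(int value) {
    data = value;                               /* plain store */
    atomic_thread_fence(memory_order_seq_cst);  /* usually compiles to mfence on x86-64 */
    atomic_store_explicit(&ready, 1, memory_order_relaxed);
}

Compiling with gcc -O2 -S and looking at the output for publish should show the fence as a real instruction, with all the ordering work described in the excerpt above attached to it.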

Jennifer answered 10/3, 2017 at 19:43 Comment(0)
23

I'll explain the impact that mfence has on the flow of the pipeline, taking the Skylake pipeline as an example. Consider the following sequence of instructions:

inst1
store1
inst2
load1
inst3
mfence
inst4
store2
load2
inst5

The instructions get decoded into a sequence of uops in the same program order. Then all uops are passed in order to the scheduler. Normally, without fences, all uops get issued for execution out of order. However, when the scheduler receives the mfence uop, it needs to make sure that no memory uops downstream of the mfence get executed until all upstream memory uops become globally visible (which means that the stores have retired and the loads have at least completed). This applies to all memory accesses irrespective of the memory type of the region being accessed. This can be achieved either by having the scheduler not issue any downstream store or load uops to the store or load buffers, respectively, until the buffers get drained, or by issuing downstream store or load uops and marking them so that they can be distinguished from all existing memory uops in the buffers. All non-memory uops above or below the fence can still be executed out of order. In the example, once store1 retires and load1 completes (by receiving the data and holding it in some internal register), the mfence instruction is considered to have completed execution. I think that mfence may or may not occupy any resources in the backend (ROB or RS), and it may get translated to more than one uop.

Intel has a patent, filed in 1999, that describes how mfence works. Since this is a very old patent, the implementation might have changed, or it might be different in different processors. I'll summarize the patent here. mfence gets decoded into three uops. Unfortunately, it's not clear exactly what these uops are used for. Entries are then allocated in the reservation station to hold the uops, and entries are also allocated in the load and store buffers. This means that a load buffer can hold entries for either true load requests or for fences (which are basically bogus load requests). Similarly, the store buffer can hold entries for true store requests and for fences. The mfence uop is not dispatched until all earlier load or store uops (in the respective buffers) have been retired. When that happens, the mfence uop itself gets sent to the L1 cache controller as a memory request. The controller checks whether all previous requests have completed. If so, the uop is simply treated as a NOP and gets deallocated from the buffers. Otherwise, the cache controller rejects the mfence uop.

Niacin answered 11/5, 2018 at 21:8 Comment(26)
Skylake decodes mfence into 4 uops (fused and unfused) that run on p2/p3 and p4 (the AGU and store-data ports). agner.org/optimize. On paper at least, mfence doesn't have to stop later stores from executing (and placing store data into the store buffer), it just has to stop those stores from becoming globally visible (committing to L1d cache) ahead of anything after the fence. mfence does need a mechanism to stop loads from executing before all earlier loads/stores are globally visible, though. If Intel's implementation is more restrictive, then that's just a design choice.Elfin
@PeterCordes Do you know the purpose of each of these 4 uops? Just curious.Niacin
No, I don't. Perf counters can tell us what ports they run on, but anything more than that is pure guesswork (or digging up Intel patents or conference papers).Elfin
@PeterCordes My best guess for sfence is that its two uops correspond to a “complex AGU” uop on p2 or p3 (p7 can only do simple AGU store-address ops, while p2 and p3 are identical, can do both loads and address generation for both load and store, and support complex AGU ops), plus a store or magic flush-store-buffer uop on p4 (the store-data port). Maybe mfence builds on top of that a magic load-store-load sequence targeting a single, complex magic address?Newman
@IwillnotexistIdonotexist: port 2/3 also have load-data execution units. In broad terms, probably the fence uops write markers into the memory-order buffer (which loads also do), or maybe a special kind of marker, or just do something special to the MOB. Random guess: maybe the mfence uops "turn off" the load ports (p2/p3), and having an extra load port is why mfence is now 4 uops, not 3. But it's probably not that simple, because OoO exec dispatches oldest-ready-first, and some loads older than the mfence might not have their address registers ready yet.Elfin
@PeterCordes You're getting somewhere. If it was only 3 uops, then we could expect one uop to go to each of the ports p234. But there are 4 uops. They don't turn off the ports as you suspected, at least not according to the patent I referenced. Instead, the MOB logic takes care of the ordering when it sees these special uops. It might be the case that even though there are 3 ports, there might be 4 load/store buffers in the MOB, and so each uop will go to one of the buffers.Niacin
@HadiBrais: I put mfence in a loop and looked at the uop->port distribution on SKL. In a tight loop (times 4 mfence / dec ebx/jnz), I see (per mfence): 4 fused-domain (uops_issued.any:u) and 2 uops_executed.thread:u (unfused domain, user-space only). (9 total executed/dispatched for the loop, including 1 macro-fused branch that runs on p6). p4: 1 uop per mfence. p2: 0.625 uops per mfence, p3: 0.375 uops per mfence. (other ports negligible, except p6 for the loop branch). The 2.5/1.5 out of 4 split for p2/p3 is consistent across runs with no other loads/stores.Elfin
TL:DR: Agner Fog's table has an error: it's 4 fused-domain / 2 unfused-domain, so 2 of the uops that issue don't need an execution port. The 2 that do run on (p4) and (either of p2 or p3).Elfin
@PeterCordes Nice. So 4 uops occupy the RS, but only two get issued, right? Can you also measure MEM_INST_RETIRED.ALL_LOADS and MEM_INST_RETIRED.ALL_STORES? I'm not sure if there is a memory uop perf event for all memory ops. It'd be interesting to know whether the fence uops have an impact on these perf events.Niacin
@HadiBrais: That's backwards: 4 uops are issued (Intel terminology) into the ROB, but 2 of them (like xor-zeroing) don't need to be executed so AFAIK they don't get issued into the RS at all. The other two do need to execute and are issued/renamed into the ROB and RS. mem_inst_retired.all_loads:u = 0.5 counts per mfence. (super weird, but yes I'm sure. 200M counts for 100M loop iters with 4x mfence). mem_inst_retired.all_stores:u = 0 counts.Elfin
@PeterCordes Awesome, man. Can you experiment with different numbers of mfences in the body of the loop? Like 1, 4, 8, 16, and measure the same perf counters as above.Niacin
Huh, good idea, it's not quite constant! times 1 mfence: 0.498 +- 0.14% counts of all_loads per mfence. (but saw one run at 0.486). times 2 mfence: 0.497 +- 0.06%, but saw one run at 0.462 +- 7.9% (ocperf.py stat ... -r3). Maybe one of those runs was competing with another hyperthread? times 4 mfence: 0.467 +- 2.1%, or 0.506 +- 3.46%. times 16 mfence: 0.4715, or another run at 0.497 +- 2.6%. System was mostly idle, but did have xmoto in the background using 4% of a core. It didn't have the focus and its window wasn't exposed. No CPU migrations for my tests.Elfin
With a load, mfence always counts. With %rep 4 / mov eax, [rel buf] / mfence / %endrep: total cycles is 36.3c per mfence instead of 33.3c / mfence for just mfence. And mem_inst_retired.all_loads:u stabilized at 2.00 +- 0.01% per mov+mfence pair, so 1.0 per mfence. It seems adjacent fences don't always get counted as loads. But with store+mfence, we have 41.3c/store+mfence, and again have some variability and half the number of load counts: 0.450 +- 3.6% mem_inst_retired.all_loads:u vs. exactly 1 mem_inst_retired.all_stores:u per mov-store, no variability, for %rep 16.Elfin
uops -> ports doesn't seem to change for mfence when alternating with a mov load, though. Still 4 fused-domain / 2 unfused-domain (p23 + p4). With a load in the loop, mem_inst_retired.all_loads ~= dispatched.port2 + dispatched.port3, otherwise not. There are lots of errata for perf counters; I didn't check if SKL has any that affect mem_inst_retired.all_loads, but it might be a mistake to interpret too much meaning out of it.Elfin
@PeterCordes Consider posting all these comments as an answer, either to this question or to a new question posted by either of us. The new question could be something like "What's the impact of mfence on memory performance events counters (retired loads, L1 load hits, and L1 load misses)?". It would be interesting to find out how to adjust these counters, given the number of dynamic mfences in the code.Niacin
The variations you've observed are OK. I mean they are very small (<5%) and can be ignored.Niacin
@PeterCordes By the way, shouldn't the memory-fences and memory-barriers tags be synonyms?Niacin
@HadiBrais: Yes, the tags should be synonyms, but I don't have enough score in either tag to even nominate them for being duplicates. I tagged this question with both to highlight the redundancy so maybe it could get fixed. And BTW, 5% variation is pretty huge over 100M iterations; normally I'd expect at least 2 orders of magnitude better, e.g. like the +- 0.05% I get for cycles:u (user-space clock cycles). If we trusted the counter, it would mean there's something weird / inconsistent about how mfence behaves in different microarchitectural conditions.Elfin
But yeah, the variation is probably just in how mfence is counted by that counter in different microarchitectural conditions. The factor of 4 difference in counts with / without a real load in the loop is interesting, though. So mfence ~= 0.5 mem_inst_retired.all_loads when there aren't any real loads, or 1.0 when mixed with loads.Elfin
@PeterCordes You can "upvote" the suggestion to make them synonyms. If we collect 4 upvotes, they will become synonyms. Anyone reading these comments, consider upvoting the synonym suggestion. Regarding variation, well, many if not most counters exhibit such variations from my experience. I mean in general, we have no choice but to tolerate the variations while interpreting the counters.Niacin
Oh, apparently I could have suggested synonyms, I just didn't find the button all the way at the bottom right. stackoverflow.com/tags/memory-barriers/synonyms. Upvoted yours.Elfin
Re: variations: in micro-benchmarks with just a single small loop that never goes off-core (l1d hits), usually variation in counters is a result of actual variation in execution. I'm using ocperf.py stat on a tiny static executable that literally does that and then sys_exit, with a total of 2 or 3 page faults and nothing other than interrupts interfering. Very high precision / repeatability / signal-to-noise ratios are normal here. e.g. Can x86's MOV really be "free"? Why can't I reproduce this at all? shows better than 1 part in 10k, over 0.5s.Elfin
@hadi - in my experience most counters have no such unexplained variance: most return exactly the number of events you'd expect (sometimes you need to add a few rules or corrections to get an exact count, sometimes off by 1). That comes from using the counters over a smallish section of code. For larger samples you'll see variation but this seems largely caused by external events like interrupts, or variance for eg in kernel call code paths, not counter variability.Phail
@Phail I think I exaggerated when I said "many if not most". It's more like "some events on particular processors". For some counters on particular processors, I've seen variations of up to 40% on a small loop! Making such counters more like random number generators. That's why I think < 5% variation is perhaps OK, it's not perfect, but just OK.Niacin
Update on mfence uops vs. Agner Fog's tables: he maybe tested before the microcode fix for SKL079 made mfence more expensive. See the bottom of Are loads and stores the only instructions that gets reordered?, which I updated recently after stumbling across the explanation of why mfence is so expensive. Maybe it's different uops to implement the brute-force stall-everything solution, but it used to be 4 unfused-domain uops to only block memory reordering.Elfin
@PeterCordes IIRC, the memory-order tag is supposed to be about ISA memory models and the memory-model tag is supposed to be about language memory models. But it seems that memory-order has somehow become a synonym to memory-barriers, but it should not be. The memory-barriers tag should only be a synonym to memory-fences. The memory-order tag may be a synonym to memory-ordering.Niacin
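
A rough reproduction of the loop-based measurements described in the comments above might look like the following sketch (the perf event names are taken from the comments and assumed to be available for Skylake-class CPUs via perf or ocperf.py; this is not the exact static-asm test that was run):

/* mfence_loop.c - build: gcc -O2 mfence_loop.c -o mfence_loop
   measure: perf stat -e uops_issued.any,uops_executed.thread,mem_inst_retired.all_loads ./mfence_loop */
int main(void) {
    for (long i = 0; i < 100000000; i++) {      /* 100M iterations, 4 mfences each */
        asm volatile("mfence\n\t"
                     "mfence\n\t"
                     "mfence\n\t"
                     "mfence" ::: "memory");
    }
    return 0;
}
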
5

mfence is an instruction.

To get it on Linux:

1/ Write a file mfence.c

#include <stdio.h>

int main(){
    printf("Disass me\n");
    asm volatile ("mfence" ::: "memory");
    return 0;
}

2/ Compile

gcc mfence.c -o mfence

3/ Disassemble

objdump -d mfence | grep -A 10 "<main>:"

000000000000063a <main>:
 63a:   55                      push   %rbp
 63b:   48 89 e5                mov    %rsp,%rbp
 63e:   48 8d 3d 9f 00 00 00    lea    0x9f(%rip),%rdi        # 6e4 <_IO_stdin_used+0x4>
 645:   e8 c6 fe ff ff          callq  510 <puts@plt>
 64a:   0f ae f0                mfence 
 64d:   b8 00 00 00 00          mov    $0x0,%eax
 652:   5d                      pop    %rbp
 653:   c3                      retq   
 654:   66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)
 65b:   00 00 00 

4/ Observe that at address 64a, mfence is the 3-byte instruction (0f ae f0)

So it is a CPU instruction (like mov): the processor needs to decode the previous instructions before getting to it, otherwise it couldn't determine its alignment.

For example, the byte sequence 0f ae f0 could appear inside an address or immediate operand of another instruction, so the CPU cannot simply use it as a marker.
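
As a hedged illustration (the constant is made up; mov eax, imm32 encodes as b8 followed by the little-endian immediate):

int main(void) {
    int x;
    /* mov $0x00f0ae0f, %eax should encode as b8 0f ae f0 00: the mfence
       byte sequence 0f ae f0 appears inside the immediate operand, so the
       CPU cannot simply scan memory for those bytes as a "marker". */
    asm volatile("mov $0x00f0ae0f, %0" : "=a"(x));
    return x != 0x00f0ae0f;
}

Disassembling this with objdump -d should show the 0f ae f0 bytes in the middle of the mov, not as an mfence.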

Finally, it is just an old-school instruction: at its execution point in the pipeline, it ensures that the earlier memory accesses have become globally visible before the later memory accesses are allowed to proceed.


Note: on Windows with MSVC, use the _mm_mfence() intrinsic from <intrin.h> to produce an mfence; the _ReadWriteBarrier macro is only a compiler barrier and emits no instruction.
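
A minimal MSVC sketch of the difference (assuming <intrin.h> is available; both symbols are standard MSVC intrinsics):

#include <intrin.h>

void fences(void) {
    _ReadWriteBarrier();  /* compiler barrier only: blocks compile-time reordering, emits no instruction */
    _mm_mfence();         /* emits an actual mfence instruction */
}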

Villose answered 12/5, 2018 at 18:5 Comment(0)
4

Your question rests on a wrong assumption. MFENCE does not prevent the reordering of instructions in general (see the quote below). For example, if there is a stream of 1000 instructions that only operate on registers and an MFENCE instruction is placed in the middle, then it will have no effect on how the CPU reorders those instructions.

The MFENCE instruction is ordered with respect to all load and store instructions, other MFENCE instructions, any LFENCE and SFENCE instructions, and any serializing instructions (such as the CPUID instruction). MFENCE does not serialize the instruction stream.

Instead, the MFENCE instruction prevents the reordering of loads and stores with respect to each other, i.e. it constrains the order in which their effects become visible in the cache and main memory.
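
This is easiest to see in the classic store-buffer litmus test. Below is a minimal sketch using POSIX threads (names invented for illustration; the plain volatile accesses are deliberately racy to expose hardware behavior). Without the mfence lines, x86 permits both r1 and r2 to end up 0, because each store can sit in the store buffer while the following load executes; with the fences, that outcome disappears:

/* litmus.c - build: gcc -O2 -pthread litmus.c -o litmus */
#include <pthread.h>
#include <stdio.h>

volatile int x, y;
int r1, r2;
pthread_barrier_t start;

void *t1(void *arg) {
    (void)arg;
    pthread_barrier_wait(&start);
    x = 1;
    asm volatile("mfence" ::: "memory");  /* remove to observe StoreLoad reordering */
    r1 = y;
    return NULL;
}

void *t2(void *arg) {
    (void)arg;
    pthread_barrier_wait(&start);
    y = 1;
    asm volatile("mfence" ::: "memory");
    r2 = x;
    return NULL;
}

int main(void) {
    int both_zero = 0;
    for (int i = 0; i < 100000; i++) {
        pthread_t a, b;
        x = y = 0;
        pthread_barrier_init(&start, NULL, 2);
        pthread_create(&a, NULL, t1, NULL);
        pthread_create(&b, NULL, t2, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        pthread_barrier_destroy(&start);
        if (r1 == 0 && r2 == 0) both_zero++;
    }
    printf("r1 == r2 == 0 observed %d times\n", both_zero);
    return 0;
}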

Paradisiacal answered 18/5, 2018 at 9:56 Comment(3)
I think x86 can only merge adjacent stores, because it has to commit them to L1d cache in order. (x86's memory model doesn't allow StoreStore reordering). But I guess that works as an example of something other than blocking later loads (but not later ALU instructions) until after the last store before MFENCE becomes globally visible.Elfin
@PeterCordes, you're right. I'll remove that last paragraphParadisiacal
Hrm, that was kind of the interesting part of this answer. We have some evidence that Skylake does merge adjacent stores into the same cache line, but it's hard to measure because store-port throughput is only 1 per clock. This answer doesn't explain in detail what it means to stop memory reordering, or mention the store buffer at all, which is key to some people's misunderstanding here (Does a memory barrier acts both as a marker and as an instruction? and Does an x86 CPU reorder instructions?)Elfin
