x86 mfence and C++ memory barrier

I'm checking which instructions the compiler emits for multi-core memory barriers on x86_64. The code below is what I'm testing, compiled with GCC 8.3 for x86_64.

#include <atomic>
#include <cassert>
#include <thread>

std::atomic<bool> flag {false};
int any_value {0};

void set()
{
  any_value = 10;
  flag.store(true, std::memory_order_release);
}

void get()
{
  while (!flag.load(std::memory_order_acquire));
  assert(any_value == 10);
}

int main()
{
  std::thread a {set};
  get();
  a.join();
}

When I use std::memory_order_seq_cst, I can see that an MFENCE instruction is emitted at every optimization level (-O1, -O2, -O3). This instruction makes sure the store buffer is drained, so its data is committed to the L1D cache (and the MESI protocol makes sure other threads can see the effect).

However, when I use std::memory_order_release/acquire with optimizations disabled, an MFENCE instruction is also emitted, but it is omitted at -O1, -O2 and -O3, and I don't see any other instruction that flushes the store buffer.

In the case where MFENCE is not used, what makes sure the store-buffer data is committed to cache memory so that the memory-ordering semantics hold?

Below is the assembly for the get and set functions at -O3, as produced on the Godbolt compiler explorer:

set():
        mov     DWORD PTR any_value[rip], 10
        mov     BYTE PTR flag[rip], 1
        ret


.LC0:
        .string "/tmp/compiler-explorer-compiler119218-62-hw8j86.n2ft/example.cpp"
.LC1:
        .string "any_value == 10"

get():
.L8:
        movzx   eax, BYTE PTR flag[rip]
        test    al, al
        je      .L8
        cmp     DWORD PTR any_value[rip], 10
        jne     .L15
        ret
.L15:
        push    rax
        mov     ecx, OFFSET FLAT:get()::__PRETTY_FUNCTION__
        mov     edx, 17
        mov     esi, OFFSET FLAT:.LC0
        mov     edi, OFFSET FLAT:.LC1
        call    __assert_fail
Proffer answered 18/3, 2019 at 23:42 Comment(0)

The x86 memory-ordering model provides #StoreStore and #LoadStore barriers for all store instructions (1), which is all that release semantics require. The processor also commits a store as soon as possible: as soon as the store instruction has retired, the store is the oldest one in the store buffer, the core has the target cache line in a writeable coherence state, and a cache port is available to perform the store operation (2). So there is no need for an MFENCE instruction. The flag will become visible to the other thread as soon as possible, and when it does, any_value is guaranteed to be 10.

On the other hand, sequential consistency also requires #StoreLoad and #LoadLoad barriers. MFENCE is required to provide both (3) barriers, so it is emitted at all optimization levels.
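As a hedged illustration (my code, not part of the original post): if the question's set() is changed to request seq_cst, the store now needs that #StoreLoad barrier, which is why GCC 8.3 keeps the MFENCE even at -O3 (most other compilers use xchg for the seq-cst store instead):

void set_seq_cst()
{
  any_value = 10;
  // Only the ordering changed relative to the question's set():
  // GCC emits the same two plain mov stores, followed by mfence.
  flag.store(true, std::memory_order_seq_cst);
}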

Related: Size of store buffers on Intel hardware? What exactly is a store buffer?


Footnotes:

(1) There are exceptions that don't apply here. In particular, non-temporal stores and stores to the uncacheable write-combining (WC) memory type provide only the #LoadStore barrier. In any case, these barriers are provided for stores to the write-back (WB) memory type on both Intel and AMD processors.

(2) This is in contrast to write-combining stores, which are made globally visible only under certain conditions. See Section 11.3.1 of the Intel manual Volume 3.

(3) See the discussion under Peter's answer.

Finicky answered 19/3, 2019 at 0:18 Comment(9)
Thank you for clarifying the question in such a detailed manner! In the answer I posted before, I was compiling with std::memory_order_seq_cst by mistake, so I deleted it. So for x86, as long as the instruction is atomic, any release/acquire memory order will work.Proffer
@Proffer Yes. atomic not only provides ISA-level barriers but also compiler-level barriers.Finicky
@HadiBras could you please explain why #LoadLoad is not needed for acquire? I see how #StoreStore and #LoadStore make sense for release, but acquire seems to need it?Proffer
@Proffer The x86 memory ordering model also provides a #LoadLoad barrier between any two loads that are write-back cacheable. See Section 8.2.2 of the Intel manual Volume 3. So there is no need to explicitly use any fence instructions to order such loads with respect to each other.Finicky
@HadiBras, Got it, thank you again for your support!Proffer
@Proffer Actually I've made a mistake in my previous comment. See the update.Finicky
@HadiBras, Thank you for the update. I followed your comments between Peter and BeeOnRope; do you mind corroborating the following statement: x86 TSO is sequential consistency plus a store buffer (StoreLoad) and store forwarding (LoadLoad); MFENCE is needed for StoreLoad. Within a CPU, its stores are committed to memory in program order, but the execution of stores with respect to stores/loads, and of loads with respect to loads/stores, can happen out of order, before the instructions retire. Continuation below.Proffer
@HadiBras Since loads are executed while in the out-of-order back-end, there is no need to perform them again once they retire, except when the load's cache line is invalidated before the load retires, in which case a rollback occurs. Once the store, the load and the rest of the instructions have retired, they are committed in program order: the load writes its value to the register file and the store commits to cache.Proffer
@Proffer Sequential consistency by definition doesn't allow reordering any accesses, so it offers all of the four barriers. The store buffer (or out-of-order execution) enables loads to be reordered with earlier stores. Store forwarding may violate the LoadLoad barrier. Therefore, the x86 model can be described as sequential-consistency minus StoreLoad minus LoadLoad (when the first load gets its value from store forwarding). It can be stated alternatively as sequential-consistency plus store buffering plus store forwarding. This is very similar to the total store order (TSO) model.Finicky

x86's TSO memory model is sequential-consistency + a store buffer, so only seq-cst stores need any special fencing. (Stalling after a store until the store buffer drains, before later loads, is all we need to recover sequential consistency). The weaker acq/rel model is compatible with the StoreLoad reordering caused by a store buffer.

(See discussion in comments re: whether "allowing StoreLoad reordering" is an accurate and sufficient description of what x86 allows. A core always sees its own stores in program order because loads snoop the store buffer, so you could say that store-forwarding also reorders loads of recently-stored data. Except you can't always: Globally Invisible load instructions)
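As a sketch of that point (my example, not from the original answer), the classic store-buffer litmus test below is allowed to end with r1 == 0 && r2 == 0 on x86 when only release/acquire is used: each store can still be sitting in its core's store buffer while the other core's load executes. Making the stores and loads seq_cst (and thus paying for MFENCE or xchg on the stores) forbids that outcome.

std::atomic<int> x {0}, y {0};
int r1, r2;

void thread1()
{
  x.store(1, std::memory_order_release);   // plain mov on x86
  r1 = y.load(std::memory_order_acquire);  // may effectively happen before the store commits
}

void thread2()
{
  y.store(1, std::memory_order_release);
  r2 = x.load(std::memory_order_acquire);
}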

(And BTW, compilers other than gcc use xchg to do a seq-cst store. This is actually more efficient on current CPUs. GCC's mov+mfence might have been cheaper in the past, but is currently usually worse even if you don't care about the old value. See Why does a std::atomic store with sequential consistency use XCHG? for a comparison between GCC's mov+mfence vs. xchg. Also my answer on Which is a better write barrier on x86: lock+addl or xchgl?)

Fun fact: you can instead achieve sequential consistency by fencing seq-cst loads rather than stores. But cheap loads are much more valuable than cheap stores for most use-cases, so everyone uses ABIs where the full barriers go on the stores.
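A rough sketch of that alternative asm mapping (my illustration; it is only valid if the whole program / ABI agrees on the convention): the seq-cst store becomes a plain mov and the full barrier moves to the load side.

void store_seq_cst_alt(std::atomic<bool>& f)
{
  f.store(true, std::memory_order_release);              // plain mov
}

bool load_seq_cst_alt(std::atomic<bool>& f)
{
  std::atomic_thread_fence(std::memory_order_seq_cst);   // mfence before the load
  return f.load(std::memory_order_acquire);              // plain mov
}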

See https://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html for details of how C++11 atomic ops map to asm instruction sequences for x86, PowerPC, ARMv7, ARMv8, and Itanium. Also When are x86 LFENCE, SFENCE and MFENCE instructions required?


when I use std::memory_order_release/acquire with no optimizations, an MFENCE instruction is also used

That's because flag.store(true, std::memory_order_release); doesn't inline, because you disabled optimization. That includes inlining of very simple member functions like std::atomic::store(T, std::memory_order = std::memory_order_seq_cst).

When the ordering parameter to the __atomic_store_n() GCC builtin is a runtime variable (in the atomic::store() header implementation), GCC plays it conservative and promotes it to seq_cst.
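A hedged illustration of that point (my example, not from the answer): when the order is only known at run time, GCC can no longer prove it isn't seq_cst, so it emits the strongest code.

void store_with(std::atomic<bool>& f, std::memory_order mo)
{
  // mo is not a compile-time constant, so GCC falls back to the seq_cst
  // code path (mov + mfence) regardless of what mo turns out to be.
  f.store(true, mo);
}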

It might actually be worth it for gcc to branch over mfence because it's so expensive, but that's not what we get. (But that would make larger code-size for functions with runtime variable order params, and the code path might not be hot. So branching is probably only a good idea in the libatomic implementation, or with profile-guided optimization for rare cases where a function is large enough to not inline but takes a variable order.)
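A sketch of that branch-over-mfence idea (my illustration of what such code could look like, not what GCC actually emits):

void store_branchy(std::atomic<bool>& f, std::memory_order mo)
{
  f.store(true, std::memory_order_release);                // a plain mov satisfies every order except seq_cst
  if (mo == std::memory_order_seq_cst)
    std::atomic_thread_fence(std::memory_order_seq_cst);   // pay for the mfence only when required
}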

Amboise answered 19/3, 2019 at 1:34 Comment(24)
Thank you for your complete answer as well! The way I understand things so far is like this: the ROB orders register instructions and the MOB (load + store buffers) orders loads and stores to be consistent. These two combined result in ordered code. The load buffer is checked for coherency as well, i.e. when speculative execution runs a load and the value changes before the load retires (e.g. due to false sharing), the pipeline must be flushed. Continued in the next comment.Proffer
For this reason, LoadStore ordering is maintained by the CPU's architecture. However, StoreLoad is not the same, because stores are not checked for coherency and can also sit pending in the store buffer before being flushed to cache.Proffer
@AdvSphere: Yes, the x86 TSO memory model is sequential consistency + a store buffer. A thread/core always sees its own stores in program order (because loads snoop the store buffer), but other threads don't see those stores until they become globally visible: commit to L1d cache sometime after the store becomes non-speculative (retirement from OoO back-end). But yes, out-of-order execution gives the illusion of executing in program order. And yes, re-checking loads to make sure the cache line is still present when it's legal to read allows speculative load reordering for high perf.Amboise
FWIW I've stopped saying "x86 only allows StoreLoad", at least when I have room to write more, because I no longer think it's a complete description of the x86 memory model. IMO there are two key aspects: StoreLoad reordering is allowed (eg the effect of store buffering) and a CPU may see its own stores out of order with respect to the global order (eg the effect of store forwarding). It's the second part that is often missing. The second restriction should actually be a bit more precise: the reordering only allows local stores to appear earlier, not later, in the global order.Shelleyshellfire
Here's a litmus test that is allowed on x86 (and in fact is the "obvious" outcome if you do a good job of starting both threads at the same time), which can't be explained (I think) purely by StoreLoad reordering against a total store order. To remember it, I think of it in terms of the store buffer: x86 is exactly like a sequentially consistent system except with a store buffer - but this has not one but two effects: stores passing later loads, and store forwarding. Their impact on the memory model is distinct.Shelleyshellfire
@BeeOnRope: I guess it comes down to what you mean by StoreLoad. It's StoreLoad when other threads observe your stores vs. your loads, even though you observe your own operations in program order. That part is always assumed in all memory models; the cardinal rule of out-of-order execution is that you don't break single-threaded code. But yes, TSO = seq-cst + a store buffer is a clear way to describe it (like I did in the comment before yours). You've convinced me it's better because it leaves less room for confusion or forgetting that loads snoop the local store buffer.Amboise
@PeterCordes - I think you are missing what I'm saying. I agree CPUs observe their own stuff in program order, that is never up for debate! As far as I know the various barriers like StoreLoad mean that when considering the actual observed behavior across all threads, you can freely reorder loads and stores on every thread as long as you don't, say, move a load past an earlier store separated by StoreLoad. That is, the barriers constrain the possible reorderings of each thread's loads and stores, and then all the allowed reorderings are an allowed execution.Shelleyshellfire
Alternately, take a look at the litmus test I linked and explain how the result can occur with only StoreLoad reordering? I can't. Note that most litmus tests do work with only StoreLoad reordering. I guess what I'm actually saying is that you just cannot explain x86 in terms of the 4 {Store,Load}^2 barrier types: saying it only permits StoreLoad is wrong: that model is stronger than the actual x86 model, but allowing any of the other 3 would be way too weak (weaker than the actual model). So you have to give up on that way of describing it to get it right.Shelleyshellfire
About program order: I think we agree that threads respecting program order is an additional restriction on the possible executions: i.e., it forbids some nonsense results like a load moving above a store to the same location hence not reading the value written (or at least another value written by some other thread). The program order stuff doesn't allow additional reorderings that would be otherwise disallowed. So in that context the litmus test shows two threads disagreeing about the order of writes to [x] and [y] which is not generally possible in x86, and can't ...Shelleyshellfire
... be explained by StoreLoad reordering only (you need LoadLoad reordering to get it with the classic SL^2 barrier model). Indeed, if you move the reads to other threads, the scenario is forbidden. Only on the writing threads can the store appear out of order, a scenario not expressible with the SL^2 barriers. It doesn't come up all that often (frankly the example looks dumb, right?) - which is what leads to the impression that StoreLoad is enough.Shelleyshellfire
@BeeOnRope: other threads don't directly observe your loads, but to actually implement that test you need to collect a,b,c, and d somewhere, so other threads do eventually learn what you loaded. Anyway, you can say that store-forwarding reorders loads, at least as far as when the loaded data becomes globally visible, but even that isn't always true: Globally Invisible load instructions shows a partial-forwarding case where a load can get data that was never globally visible. So yes, TSO = seqcst + store buffer is good.Amboise
@PeterCordes - sure we just assume a-d are registers and you print them out, there is no observation per se, they are simply local state (registers) and they can be exposed without any concurrency concerns. Overall I don't think that's relevant here, I was just confused by your characterization of StoreLoad relating to other threads observing your loads. StoreLoad is usually purely local. TSO = seqcst + store buffer - I don't think it's enough. You need seqcst + store buffer **and** store forwarding, because on a system without store forwarding, the litmus test I showed is forbidden.Shelleyshellfire
@BeeOnRope: Oh, I hadn't considered the flush instead of forward strategy for the store buffer to not break single-threaded programs. Yes, if you want to be rigorous then you'd say that. I think my current answer to this is in good shape. re: printing registers: ok in that case it's not the other thread per-se that observes the loads, but there is an eventual observer of the loads or the thread's behaviour based on the loaded data, otherwise it wouldn't matter what they loaded.Amboise
I mean you can imagine other strategies besides flushing as far as hardware implementations go: one could do this all speculatively, and try to detect order-changing forwardings, but admittedly it all gets messy really quickly (and probably means you can't delay store commit until after retirement). That said, I think focusing on the hardware implementation is a bit risky: I use it to remember the semantics of the formal x86 memory model, but I don't think hardware actually works like that! For example, StoreLoad reordering is allowed "due to the effect of store buffering"...Shelleyshellfire
I think I agree with @BeeOnRope. This is also documented in the Intel manual Section 8.2.3.5. I think that an alternative way of looking at it is as follows: in the x86 memory model, two loads can be reordered if the first one is data-dependent on a previous store. In particular, the second load can be reordered above the store. So even on an x86 processor that doesn't support store forwarding, the test that Bee has linked to still holds. So perhaps the rule about load reordering from Section 8.2.2 can be stated more clearly by mentioning that exception to load reordering.Finicky
... but I'm pretty sure modern x86 actually freely moves loads ahead of stores even aside from buffering: e.g., executing loads even before the stores ever execute (and I'm not talking about disallowed-but-lets-try-it-speculatively orderings detected by the MOB here). So the hardware led to the rules in the formal model, but then later hardware isn't just limited to the behaviors of the original hardware model, it can use the allowed reorderings in other ways too.Shelleyshellfire
Therefore, strictly speaking, MFENCE is needed to prevent StoreLoad reordering and that special case of LoadLoad reordering.Finicky
@Shelleyshellfire But then I think if store forwarding is a form of a combination of LoadLoad reordering and StoreLoad reordering, the SL^2 matrix would still seem to fully describe the x86 memory model.Finicky
@HadiBrais: Bee's point is that LoadLoad isn't allowed in general, so x86's memory model can't be described in terms of StoreLoad + LoadLoad always being allowed all the time. It requires extra conditions for when they're allowed and when LoadLoad is not. And also partial store forwarding (e.g. dword load after a byte store) allows loading of a value that was never globally visible. How do you describe that in terms of StoreLoad + sometimes LoadLoad?Amboise
@PeterCordes To my knowledge, no Intel or AMD processor to date supports this case of forwarding. So far, forwarding may only be supported when the load is fully contained in the previous store. That said, I don't know how the x86 model handles that case. Anyway, since no processor supports that, all current supported forms of store forwarding can be described in terms of StoreLoad+LoadLoad. In case it got supported on some future processor, then Intel may further clarify their model.Finicky
@PeterCordes - well on that last point I usually just live in the world of "let's talk about aligned loads and stores all of the same width", because even that is tricky enough - but yeah, it gets extra special once you consider that. I think the Cambridge guys included that in the later papers? Does it work if you sort of extend the model so that an overlapping store applies the existing rules to each part? I.e., individually select the parts. That should explain your "never in GO" example linked above. I guess it is not enough because you need some constraints on which parts you select.Shelleyshellfire
@HadiBrais - right, but the case Peter linked above shows an interesting case, where it does appear to do some type of store-forwarding for a partially overlapping store. The store forwarding isn't fast (but still faster than a drain) - but the result is very different than if the load simply stalled until the store committed and then read something from L1. Instead the result is as if part of the result came from store forwarding and part came from a GO store of another CPU, and that can be something that never existed at any moment in the GO (i.e., impossible for another CPU to read it).Shelleyshellfire
@HadiBrais: Incorrect. "failed" store forwarding on current CPUs isn't as slow as draining the store buffer. We have absolute proof that some x86 implementations do load from L1d and merge with the pending store in the store buffer. Alex's answer on Can x86 reorder a narrow store with a wider load that fully contains it? has experimental evidence. My answer has some discussion of what that means, and lists some latencies for "failed" store forwarding from Agner Fog. They might be best-case, or if it's really near-constant then it can't wait for flush.Amboise
@PeterCordes I have seen that post before and completely forgot about it. Anyway, I think the SL^2 matrix can still be used to fully describe that behavior as well. We can say that StoreLoad+LoadLoad reordering may occur to the part of data that is overlapping between the first load and the overlapping previous store. The LoadLoad barrier still holds for the non-overlapping part. This example seems to show, however, that different parts of the same load can be observable at different times.Finicky
