x86 fence instructions can be briefly described as follows:
MFENCE prevents any later loads or stores from becoming globally observable before any earlier loads or stores. It drains the store buffer before later loads¹ can execute.
LFENCE blocks instruction dispatch (Intel's terminology) until all earlier instructions retire. This is currently implemented by draining the ROB (ReOrder Buffer) before later instructions can issue into the back-end.
SFENCE only orders stores against other stores, i.e. prevents NT stores from committing from the store buffer ahead of SFENCE itself. But otherwise SFENCE is just like a plain store that moves through the store buffer. Think of it like putting a divider on a grocery-store checkout conveyor belt that stops NT stores from getting grabbed early. It does not necessarily force the store buffer to be drained before it retires from the ROB, so putting LFENCE after it doesn't add up to MFENCE.
A "serializing instruction" like CPUID (and IRET, etc.) drains everything (ROB, store buffer) before later instructions can issue into the back-end, and discards the front-end. MFENCE + LFENCE would also do the back-end part, but true serializing instructions also discard fetched machine code, so they work for cross-modifying code. (e.g. a load sees a flag, you run `cpuid` or the new `serialize`, then jump to a buffer where another thread stored code before a release-store on the flag. Code-fetch is guaranteed to get the new instructions. Unlike data loads, code-fetch doesn't respect x86's usual LoadLoad ordering rule.)
These descriptions are a little ambiguous about exactly what kinds of operations are ordered, and there are some differences across vendors (e.g. SFENCE is stronger on AMD) and even between processors from the same vendor. Refer to Intel's manual and specification updates, and AMD's manual and revision guides, for more information. There are also a lot of other discussions of these instructions on SO and other places, but read the official sources first. The descriptions above are, I think, the minimum specified on-paper behaviour across vendors.
Footnote 1: OoO exec of later stores doesn't need to be blocked by MFENCE; executing them just writes data into the store buffer. In-order commit already orders them after earlier stores, and commit after retirement orders them wrt. loads (because x86 requires loads to complete, not just start, before they can retire, as part of ensuring load ordering).
Remember that x86 hardware is built to disallow reordering other than StoreLoad.
The Intel manual Volume 2 (document 325383-072US) describes SFENCE as an instruction that "ensures that every store prior to SFENCE is globally visible before any store after SFENCE becomes globally visible." Volume 3 Section 11.10 says that the store buffer is drained when using SFENCE. The correct interpretation of this statement is exactly the earlier statement from Volume 2, so SFENCE can be said to drain the store buffer in that sense. There is no guarantee at what point during SFENCE's lifetime earlier stores achieve GO: for any earlier store, it could happen before, at, or after retirement of SFENCE. What the point of GO actually is depends on several factors; that's beyond the scope of the question. See: Why "movnti" followed by an "sfence" guarantees persistent ordering?.
MFENCE does have to prevent NT stores from reordering with other stores, so it has to include whatever SFENCE does, as well as draining the store buffer. It also has to prevent reordering of weakly-ordered SSE4.1 NT loads from WC memory, which is harder because the normal rules that get load ordering for free no longer apply to those. Guaranteeing this is why a Skylake microcode update strengthened (and slowed) MFENCE to also drain the ROB like LFENCE. It might still be possible for MFENCE to be lighter weight than that, with HW support for optionally enforcing ordering of NT loads in the pipeline.
The main reason why SFENCE + LFENCE is not equal to MFENCE is that SFENCE + LFENCE doesn't block StoreLoad reordering, so it's not sufficient for sequential consistency. Only `mfence` (or a `lock`ed operation, or a real serializing instruction like `cpuid`) will do that. See Jeff Preshing's Memory Reordering Caught in the Act for a case where only a full barrier is sufficient.
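The need for a full barrier can be sketched with Preshing's litmus test in C++11 atomics. This is a minimal sketch, assuming a mainstream x86 compiler (gcc/clang compile `std::atomic_thread_fence(seq_cst)` to `mfence` or an equivalent `lock`ed operation); the names `X`, `Y`, `r1`, `r2`, and `storeload_litmus` follow the litmus-test convention and are otherwise illustrative:

```cpp
#include <atomic>
#include <thread>

std::atomic<int> X{0}, Y{0};
int r1, r2;

// Returns how many of `iters` runs ended with the forbidden r1 == 0 && r2 == 0.
// With the seq_cst fences in place, the C++ memory model (and mfence on x86)
// guarantees the count is 0. Weaken the fences to release/acquire (free on
// x86, no StoreLoad blocking) and the forbidden outcome can appear.
int storeload_litmus(int iters) {
    int both_zero = 0;
    for (int i = 0; i < iters; ++i) {
        X.store(0); Y.store(0);
        std::thread a([] {
            X.store(1, std::memory_order_relaxed);
            std::atomic_thread_fence(std::memory_order_seq_cst); // full barrier
            r1 = Y.load(std::memory_order_relaxed);
        });
        std::thread b([] {
            Y.store(1, std::memory_order_relaxed);
            std::atomic_thread_fence(std::memory_order_seq_cst); // full barrier
            r2 = X.load(std::memory_order_relaxed);
        });
        a.join(); b.join();
        if (r1 == 0 && r2 == 0) ++both_zero;
    }
    return both_zero;
}
```

Compile with `-pthread`; on x86 you can check the disassembly to see the fence become `mfence`.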
From Intel's instruction-set reference manual entry for `sfence`:
The processor ensures that every store prior to SFENCE is globally visible before any store after SFENCE becomes globally visible.
but
It is not ordered with respect to memory loads or the LFENCE instruction.
LFENCE forces earlier instructions to "complete locally" (i.e. retire from the out-of-order part of the core), but for a store or SFENCE that just means putting data or a marker in the memory-order buffer, not flushing it so the store becomes globally visible. i.e. SFENCE "completion" (retirement from the ROB) doesn't include flushing the store buffer.
This is like Preshing describes in Memory Barriers Are Like Source Control Operations, where StoreStore barriers aren't "instant". Later in that article, he explains why a #StoreStore + #LoadLoad + #LoadStore barrier doesn't add up to a #StoreLoad barrier. (x86 LFENCE has some extra serialization of the instruction stream, but since it doesn't flush the store buffer the reasoning still holds.)
LFENCE is not fully serializing like `cpuid` (which is as strong a memory barrier as `mfence` or a `lock`ed instruction). It's just a LoadLoad + LoadStore barrier, plus some execution-serialization stuff which maybe started as an implementation detail but is now enshrined as a guarantee, at least on Intel CPUs. It's useful with `rdtsc`, and for avoiding branch speculation to mitigate Spectre.
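The `rdtsc` use case looks like the sketch below: LFENCE around RDTSC so the timestamp isn't taken before earlier instructions retire, nor after the timed region starts executing. This is an x86-only sketch using `__rdtsc()` and `_mm_lfence()` from `<x86intrin.h>` (GCC/Clang); `time_region` and its dummy workload are illustrative names:

```cpp
#include <x86intrin.h>
#include <cstdint>

// Time a dummy loop in TSC reference cycles, fenced so RDTSC can't
// reorder with the work it's supposed to bracket.
uint64_t time_region(int n) {
    _mm_lfence();               // earlier instructions retire before rdtsc issues
    uint64_t start = __rdtsc();
    _mm_lfence();               // rdtsc finishes before the region starts
    volatile long sink = 0;
    for (int i = 0; i < n; ++i) // region under test (dummy work)
        sink += i;
    _mm_lfence();               // region retires before the second rdtsc
    uint64_t stop = __rdtsc();
    return stop - start;
}
```

Without the fences, out-of-order execution could let the loop's instructions execute before the first `rdtsc` or after the second one, skewing short measurements.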
BTW, SFENCE is a no-op for WB (normal) stores.
It orders WC stores (such as `movnt`, or stores to video RAM) with respect to any other stores, but not with respect to loads or LFENCE. Only on a CPU that's normally weakly-ordered does a StoreStore barrier do anything for normal stores. You don't need SFENCE unless you're using NT stores or memory regions mapped WC. If it did guarantee draining the store buffer before it could retire, you could build MFENCE out of SFENCE + LFENCE, but that isn't the case on Intel.
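The one case where you do need SFENCE can be sketched as a producer that fills a buffer with NT stores and then publishes it. This assumes an x86 compiler with `<immintrin.h>`; `buf`, `ready`, and `publish` are illustrative names, and a real consumer thread would acquire-load `ready` before reading `buf`:

```cpp
#include <immintrin.h>
#include <atomic>

int buf[16];
std::atomic<int> ready{0};

void publish() {
    for (int i = 0; i < 16; ++i)
        _mm_stream_si32(&buf[i], i * i);        // movnti: weakly-ordered NT store
    _mm_sfence();            // NT stores become globally visible before...
    ready.store(1, std::memory_order_release);  // ...the flag store publishes them
}
```

With plain (WB) stores the release-store alone would be enough on x86; it's the NT stores that can otherwise become globally visible after the flag.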
The real concern is StoreLoad reordering between a store and a load, not between a store and barriers, so you should look at a case with a store, then a barrier, then a load:

```
mov  [var1], eax
sfence
lfence
mov  eax, [var2]
```

can become globally visible (i.e. commit to L1d cache) in this order:

```
lfence
mov  eax, [var2]    ; load stays after LFENCE
mov  [var1], eax    ; store becomes globally visible before SFENCE
sfence              ; can reorder with LFENCE
```
Comments:

"… memory controller. Fences are used to coordinate system memory and cache memory. And I think this cache coherency is the responsibility of the memory controller." – Weathercock

"L/S/MFENCE are not related to cache coherency, because `SFENCE` flushes the store buffer, which is not related to cache coherency. In some CPUs (not x86) a load fence flushes the invalidate queue, but x86 doesn't have one. On the internet I find claims that LFENCE makes no sense on x86 processors, i.e. it does nothing. Then, is the reordering of `SFENCE` / `MOV reg, [addr]` --> `MOV reg, [addr]` / `SFENCE` possible only in theory, not in reality, is that true?" – Comity