The lfence and sfence asm instructions are effectively no-ops for memory ordering unless you're using NT stores (or NT loads from WC memory, e.g. video RAM). (Actually, movntdqa loads might only be ordered by mfence on paper, not lfence, in which case I don't know when you'd ever use lfence. It was added to the ISA along with sfence + mfence at the same time as NT stores, before movntdqa, possibly just for completeness or in case it was ever needed.)
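To make the NT-store case concrete, here is a minimal sketch of the one situation where sfence does matter: a producer fills a buffer with non-temporal stores and must fence before publishing a flag. The function name fill_and_sum and the macro EXAMPLE_HAS_NT_STORES are made up for this illustration; _mm_stream_si32 and _mm_sfence are the standard SSE/SSE2 intrinsics. On targets without SSE2 the sketch falls back to plain stores, which need no sfence.

```cpp
#include <atomic>
#include <thread>
#include <vector>

#if defined(__SSE2__) || defined(_M_X64)
#include <immintrin.h>
#define EXAMPLE_HAS_NT_STORES 1   // hypothetical macro, only for this sketch
#else
#define EXAMPLE_HAS_NT_STORES 0
#endif

// Hypothetical helper: the producer writes `value` into n slots, the consumer
// sums them after seeing the flag. Returns the sum the consumer observed.
long long fill_and_sum(int n, int value) {
    std::vector<int> buf(n, 0);
    std::atomic<bool> ready{false};
    long long sum = 0;

    std::thread producer([&] {
        for (int i = 0; i < n; ++i) {
#if EXAMPLE_HAS_NT_STORES
            _mm_stream_si32(&buf[i], value);  // NT store: weakly ordered
#else
            buf[i] = value;                   // plain store fallback
#endif
        }
#if EXAMPLE_HAS_NT_STORES
        _mm_sfence();  // order the NT stores before the flag store below
#endif
        ready.store(true, std::memory_order_release);
    });

    std::thread consumer([&] {
        while (!ready.load(std::memory_order_acquire)) {
            std::this_thread::yield();
        }
        for (int v : buf) sum += v;
    });

    producer.join();
    consumer.join();
    return sum;
}
```

With ordinary stores in place of _mm_stream_si32, the release store alone would be enough on x86 and the sfence would be pure overhead.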
There is sometimes confusion around this point, because the C/C++ intrinsics for lfence and sfence are also compiler barriers. A compiler barrier is needed in C/C++, but you can get one more cheaply with GNU C asm("":::"memory"); or (to order relaxed-atomic operations [1]) std::atomic_signal_fence(std::memory_order_acq_rel). Either one restricts compile-time reordering without making the compiler emit any useless asm barrier instructions.
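As a rough sketch of the two compiler-only barriers mentioned above (the helper names compiler_barrier_gnu, compiler_barrier_portable and publish_and_read are invented for this example, not from any library):

```cpp
#include <atomic>

// GNU C/C++ form: an empty asm statement with a "memory" clobber.
// Emits no instructions; the compiler just can't move memory accesses across it.
inline void compiler_barrier_gnu() {
#if defined(__GNUC__)
    asm volatile("" ::: "memory");
#else
    // Assumed fallback for non-GNU compilers in this sketch.
    std::atomic_signal_fence(std::memory_order_seq_cst);
#endif
}

// ISO C++ form: orders accesses at compile time only, no fence instruction.
inline void compiler_barrier_portable() {
    std::atomic_signal_fence(std::memory_order_acq_rel);
}

int payload = 0;
std::atomic<bool> ready{false};

// Single-threaded demo: the barriers keep the compiler from reordering the
// payload store after the relaxed flag store. Returns the payload it reads back.
int publish_and_read(int value) {
    payload = value;
    compiler_barrier_portable();
    ready.store(true, std::memory_order_relaxed);
    compiler_barrier_gnu();
    return ready.load(std::memory_order_relaxed) ? payload : -1;
}
```

Compiling this for x86 and looking at the asm should show no lfence, sfence, or mfence, just the stores and loads in source order.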
Run-time reordering is already blocked by the x86 memory model, except for StoreLoad reordering, which requires mfence to block. lfence + sfence don't add up to mfence. See "Does it make any sense instruction LFENCE in processors x86/x86_64?" and various other SO Q&As about these instructions.
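The StoreLoad case can be illustrated with the classic store-then-load litmus test. This is a sketch with made-up names (Outcome, store_then_load_once, count_both_zero); the seq_cst fence is the portable C++ way to ask for a full barrier, which compilers for x86 typically implement with mfence or an equivalent locked instruction.

```cpp
#include <atomic>
#include <thread>

struct Outcome { int r1; int r2; };

// Each thread stores to one variable, then loads the other.
// Without a full barrier between the store and the load, both loads
// may return 0 (StoreLoad reordering). The seq_cst fences forbid that.
Outcome store_then_load_once() {
    std::atomic<int> x{0}, y{0};
    int r1 = -1, r2 = -1;

    std::thread t1([&] {
        x.store(1, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_seq_cst);  // full barrier
        r1 = y.load(std::memory_order_relaxed);
    });
    std::thread t2([&] {
        y.store(1, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_seq_cst);  // full barrier
        r2 = x.load(std::memory_order_relaxed);
    });

    t1.join();
    t2.join();
    return {r1, r2};
}

// Runs the litmus test repeatedly and counts the forbidden r1 == r2 == 0 result.
int count_both_zero(int iterations) {
    int both_zero = 0;
    for (int i = 0; i < iterations; ++i) {
        Outcome o = store_then_load_once();
        if (o.r1 == 0 && o.r2 == 0) ++both_zero;
    }
    return both_zero;
}
```

Weakening both fences to memory_order_acq_rel makes the r1 == r2 == 0 outcome legal again, although a loop that creates fresh threads each iteration like this one may rarely observe it.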
This is why std::atomic_thread_fence(std::memory_order_acq_rel) also compiles to zero instructions on x86, but to barriers on weakly-ordered architectures.
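For illustration, a minimal message-passing sketch using that fence (message_pass is a hypothetical name). On x86 the fences here should cost nothing at run time and only constrain the compiler; on a weakly-ordered target the same source would get real barrier instructions.

```cpp
#include <atomic>
#include <thread>

// Producer writes a plain int, fences, then sets a relaxed flag.
// Consumer spins on the flag, fences, then reads the plain int.
// Returns the value the consumer saw.
int message_pass(int value) {
    int data = 0;
    std::atomic<bool> flag{false};
    int seen = -1;

    std::thread producer([&] {
        data = value;
        // Acts as a release fence here: data is written before the flag.
        std::atomic_thread_fence(std::memory_order_acq_rel);
        flag.store(true, std::memory_order_relaxed);
    });

    std::thread consumer([&] {
        while (!flag.load(std::memory_order_relaxed)) {
            std::this_thread::yield();
        }
        // Acts as an acquire fence here: data is read after the flag.
        std::atomic_thread_fence(std::memory_order_acq_rel);
        seen = data;
    });

    producer.join();
    consumer.join();
    return seen;
}
```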
lfence is also a serializing instruction on Intel microarchitectures (but maybe not AMD?), in the sense that later instructions don't begin executing until it and everything before it have completed. It has behaved this way all along, but Intel recently made this guarantee official so Spectre mitigation techniques could safely use it instead of a much more inconvenient cpuid.
[1] atomic_signal_fence on gcc may also be a compiler barrier for plain non-atomic variables; it was the last time I checked with gcc (while atomic_thread_fence wasn't), but this is probably just an implementation detail when there aren't any atomic variables involved. When there are atomic variables, the compiler knows that those variables may provide ordering that lets other threads access non-atomic variables without UB, so ordering is needed.
SFENCE, there is no particular guarantee regarding when the processor may decide to make the writes visible. This is mostly not an issue in practice. – Frenchy
lfence and sfence asm instructions are no-ops unless you're using NT stores (or NT loads from WC memory, e.g. video RAM). – Galliwasp
sfence isn't going to make the stores visible sooner, AFAIK. It might make the CPU wait while the store buffer drains (but probably only mfence does that, to stop StoreLoad reordering). Stores that were already in the buffer are already committing as fast as possible. – Galliwasp