Does the Intel Memory Model make SFENCE and LFENCE redundant?

The Intel Memory Model guarantees:

  • Stores will not be re-ordered with other Stores
  • Loads will not be re-ordered with other Loads

http://bartoszmilewski.com/2008/11/05/who-ordered-memory-fences-on-an-x86/

I have seen claims that SFENCE is redundant on x86-64 due to the Intel memory model, but never LFENCE. Do the above memory-model rules make either instruction redundant?

Faina answered 21/9, 2015 at 21:58 Comment(4)
Err, what about Store-Load and Load-Store ordering?Pillbox
@IwillnotexistIdonotexist: MFENCE is a StoreLoad barrier (and all 3 other kinds, too). And yes, you still need it. :P I'm not sure if movNT loads/stores can show LoadStore re-ordering, or if they omitted a separate LoadStore barrier instruction on the assumption that you typically (always?) need a StoreLoad barrier any time you need LoadStore barrier. Since it only affects movnt streaming ops anyway, it's a special case of a special case, and x86 is fine without it. :PDexamyl
@PeterCordes In a previous answer, I cited the full list of permitted reorderings from the Intel SDMs. But what I was driving at with my comment above is that OP pointed out, mostly correctly, that Load-Load and Store-Store reordering doesn't occur. However, those are but two of four possibilities in total (Load-Store and Store-Load are the other two combinations), and those other possibilities can occur, whence the need arises for mfence/sfence/lfence.Pillbox
@IwillnotexistIdonotexist: Oh interesting, so LFENCE is a Load-Store barrier, too, since later stores can't be globally visible before the load / lfence. I assume that normally movnt loads/stores reorder that way, most likely if the load address wasn't available until after the store. I hadn't looked too closely at LFENCE, and assumed it was just a LoadLoad barrier.Dexamyl

Right, LFENCE and SFENCE are not useful in normal code because x86's acquire / release semantics for regular loads and stores make them redundant, unless you're using other special instructions or memory types.

The only fence that matters for normal lockless code is the full barrier (including StoreLoad) from a locked instruction, or the slower MFENCE. Prefer xchg over mov+mfence for sequential-consistency stores because it's faster (see Are loads and stores the only instructions that gets reordered?).

See Does `xchg` encompass `mfence` assuming no non-temporal instructions? (Yes, even with NT instructions, as long as no WC memory is involved.)
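
For concreteness, here's a minimal sketch (the flag name is mine) of what that advice looks like in C++; recent GCC and Clang typically compile the seq_cst store below to an xchg on x86-64 rather than mov + mfence:

```cpp
#include <atomic>

std::atomic<int> flag{0};   // illustrative name

void publish() {
    // A seq_cst store: usually emitted as `xchg` on x86-64,
    // because the implicitly-locked xchg is faster than mov + mfence.
    flag.store(1, std::memory_order_seq_cst);
}
```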


Jeff Preshing's Memory Reordering Caught in the Act article is an easier-to-read description of the same case Bartosz's post talks about, where you need a StoreLoad barrier like MFENCE. Only MFENCE will do; you can't construct MFENCE out of SFENCE + LFENCE. (Why is (or isn't?) SFENCE + LFENCE equivalent to MFENCE?)
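
A minimal sketch of that store-buffer litmus test in C++ (variable names are mine, not from the article). Without the seq_cst fences, which compile to MFENCE or a locked instruction on x86, both r1 and r2 can end up 0:

```cpp
#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> x{0}, y{0};   // illustrative names
int r1, r2;

void t1() {
    x.store(1, std::memory_order_relaxed);                // plain store
    std::atomic_thread_fence(std::memory_order_seq_cst);  // MFENCE (or a locked op) on x86
    r1 = y.load(std::memory_order_relaxed);               // plain load
}

void t2() {
    y.store(1, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);
    r2 = x.load(std::memory_order_relaxed);
}

int main() {
    std::thread a(t1), b(t2);
    a.join(); b.join();
    std::printf("r1=%d r2=%d\n", r1, r2);  // with the fences, r1==0 && r2==0 can't happen
}
```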

If you had questions after reading the link you posted, read Jeff Preshing's other blog posts. They gave me a good understanding of the subject. :) Although I think I found the tidbit about SFENCE/LFENCE normally being a no-op in Doug Lea's page. Jeff's posts didn't consider NT loads/stores.


Related: When should I use _mm_sfence _mm_lfence and _mm_mfence (my answer there and @BeeOnRope's are both good. I wrote this answer much earlier, so parts of it show my inexperience at the time. My answer there covers the C++ intrinsics and C++ compile-time memory ordering, which is not at all the same thing as x86 asm runtime memory ordering. But you still don't want _mm_lfence().)


SFENCE is only relevant when using movnt (Non-Temporal) streaming stores, or when working with memory regions whose type is set to something other than the normal Write-Back. Or with clflushopt, which is kind of like a weakly-ordered store. NT stores bypass the cache as well as being weakly ordered. x86's normal memory model is strongly ordered, apart from NT stores, WC (write-combining) memory, and ERMSB string ops (see below).

LFENCE is only useful for memory ordering with weakly-ordered loads, which are very rare. (Or possibly for LoadStore ordering with regular loads before NT stores?)

NT loads (movntdqa) from WB memory are still strongly ordered, even on a hypothetical future CPU that doesn't ignore the NT hint; the only way to do weakly-ordered loads on x86 is when reading from weakly-ordered memory (WC), and then I think only with movntdqa. This doesn't happen by accident in "normal" programs, so you only have to worry about this if you mmap video RAM or something.
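
If you do hit that rare case (e.g. a buffer mapped as WC), here's a hedged sketch of what the load side might look like with intrinsics (function and variable names are mine; compile with SSE4.1 enabled):

```cpp
#include <cstddef>
#include <emmintrin.h>   // _mm_lfence
#include <smmintrin.h>   // SSE4.1: _mm_stream_load_si128 (movntdqa)

// wc_src is assumed to point into a WC-mapped region (e.g. mmap'd device memory).
void copy_from_wc(__m128i* wc_src, __m128i* dst, std::size_t n_vecs) {
    for (std::size_t i = 0; i < n_vecs; ++i)
        dst[i] = _mm_stream_load_si128(wc_src + i);  // weakly-ordered NT load from WC
    _mm_lfence();  // LoadLoad + LoadStore barrier: keep later ops after these NT loads
}
```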

(The primary use-case for lfence is not memory ordering at all, it's for serializing instruction execution, e.g. for Spectre mitigation, or with RDTSC. See Is LFENCE serializing on AMD processors? and the "linked questions" sidebar for that question.)


Memory ordering in C++, and how it maps to x86 asm

I got curious about this a couple weeks ago, and posted a fairly detailed answer to a recent question: Atomic operations, std::atomic<> and ordering of writes. I included lots of links to stuff about the memory model of C++ vs. hardware memory models.

If you're writing in C++, using std::atomic<> is an excellent way to tell the compiler what ordering requirements you have, so it doesn't reorder your memory operations at compile time. You can and should use weaker release or acquire semantics where appropriate, instead of the default sequential consistency, so the compiler doesn't have to emit any barrier instructions at all on x86. It just has to keep the ops in source order.
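
A small sketch of that (names are illustrative): a release store publishing a payload and an acquire load consuming it, both of which compile to plain mov on x86:

```cpp
#include <atomic>

int payload;                          // illustrative names
std::atomic<bool> ready{false};

void producer() {
    payload = 42;                                  // ordinary store
    ready.store(true, std::memory_order_release);  // plain mov on x86: no barrier needed
}

int consumer() {
    while (!ready.load(std::memory_order_acquire)) // plain mov load on x86
        ;                                          // spin until the flag is set
    return payload;                                // guaranteed to read 42
}
```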


On a weakly ordered architecture like ARM or PPC, or x86 with movnt, you need a StoreStore barrier instruction between writing a buffer and setting a flag to indicate the data is ready. Also, the reader needs a LoadLoad barrier instruction between checking the flag and reading the buffer.

Not counting movnt, x86 already has LoadLoad barriers between every load, and StoreStore barriers between every store. (LoadStore ordering is also guaranteed). MFENCE is all 4 kinds of barriers, including StoreLoad, which is the only barrier x86 doesn't do by default. MFENCE makes sure loads don't use old prefetched values from before other threads saw your stores and potentially did stores of their own. (As well as being a barrier for NT store ordering and load ordering.)

Fun fact: x86 lock-prefixed instructions are also full memory barriers. They can be used as a substitute for MFENCE in old 32bit code that might run on CPUs not supporting it. lock add [esp], 0 is otherwise a no-op, and does the read/modify/write cycle on memory that's very likely hot in L1 cache and already in the M state of the MESI coherency protocol.
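
A minimal sketch of that trick using GNU inline asm (the helper name is mine); the "memory" clobber also stops compile-time reordering:

```cpp
// Locked RMW on a hot stack location: a full barrier (including StoreLoad),
// usable instead of MFENCE on old CPUs that don't support SSE2.
static inline void full_barrier_lock_add() {
#if defined(__x86_64__)
    asm volatile("lock addl $0, (%%rsp)" ::: "memory", "cc");
#elif defined(__i386__)
    asm volatile("lock addl $0, (%%esp)" ::: "memory", "cc");
#endif
}
```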

SFENCE is a StoreStore barrier. It's useful after NT stores to create release semantics for a following store.
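
A hedged sketch of that pattern with intrinsics (names are mine): NT stores, then SFENCE, then an ordinary release store of a flag:

```cpp
#include <atomic>
#include <cstddef>
#include <emmintrin.h>   // _mm_stream_si128 (movntdq), _mm_sfence

std::atomic<bool> data_ready{false};   // illustrative flag name

void fill_buffer_nt(__m128i* dst, __m128i value, std::size_t n_vecs) {
    for (std::size_t i = 0; i < n_vecs; ++i)
        _mm_stream_si128(dst + i, value);            // weakly-ordered, cache-bypassing store
    _mm_sfence();                                    // order the NT stores before the flag store
    data_ready.store(true, std::memory_order_release);
}
```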

LFENCE is almost always irrelevant as a memory barrier, because the only weakly-ordered loads on x86 are NT loads (movntdqa) from WC memory. When it does matter, LFENCE is a LoadLoad and also a LoadStore barrier. (loadNT / LFENCE / storeNT prevents the store from becoming globally visible before the load. I think this could happen in practice if the load address was the result of a long dependency chain, or the result of another load that missed in cache.)


ERMSB string operations

Fun fact #2 (thanks @EOF): The stores from ERMSB (Enhanced rep movsb/rep stosb on IvyBridge and later) are weakly-ordered (but not cache-bypassing). ERMSB builds on regular Fast-String Ops (wide stores from the microcoded implementation of rep stos/movsb that's been around since PPro).

Intel documents the fact that ERMSB stores "may appear to execute out of order" in section 7.3.9.3 of their Software Developer's Manual, vol. 1. They also say:

"Order-dependent code should write to a discrete semaphore variable after any string operations to allow correctly ordered data to be seen by all processors"

They don't mention any barrier instructions being necessary between the rep movsb and the store to a data_ready flag.

The way I read it, there's an implicit SFENCE after rep stosb / rep movsb (at least a fence for the string data, probably not other in-flight weakly ordered NT stores). Anyway, the wording implies that a write to the flag / semaphore becomes globally visible after all the string-move writes, so no SFENCE / LFENCE is needed in code that fills a buffer with a fast-string op and then writes a flag, or in code that reads it.

(LoadLoad ordering always happens, so you always see data in the order that other CPUs made it globally visible. i.e. using weakly-ordered stores to write a buffer doesn't change the fact that loads in other threads are still strongly ordered.)

Summary: use a normal store to write a flag indicating that a buffer is ready. Don't have readers just check the last byte of the block written with memset/memcpy.
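
A sketch of that pattern (buffer and flag names are mine): fill the buffer with memcpy (which may use rep movsb internally), then publish with a separate flag store:

```cpp
#include <atomic>
#include <cstring>

char buf[4096];
std::atomic<bool> buf_ready{false};   // the "discrete semaphore variable"

void writer(const char* src) {
    std::memcpy(buf, src, sizeof(buf));               // possibly weakly-ordered ERMSB stores
    buf_ready.store(true, std::memory_order_release); // publish with a separate flag store
}

bool try_read(char* dst) {
    if (!buf_ready.load(std::memory_order_acquire))   // don't poll the buffer's last byte
        return false;
    std::memcpy(dst, buf, sizeof(buf));               // safe: the flag is ordered after the data
    return true;
}
```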

I also think ERMSB stores prevent any later stores from passing them, so you still only need SFENCE if you're using movNT. i.e. the rep stosb as a whole has release semantics wrt. earlier instructions.

There's an MSR bit that can be cleared to disable ERMSB, for the benefit of new servers that need to run old binaries which write a "data ready" flag as part of a rep stosb or rep movsb or something. (In that case I guess you get the old fast-string microcode, which may use an efficient cache protocol but does make all the stores appear to other cores in order.)

Dexamyl answered 21/9, 2015 at 22:34 Comment(6)
It's not only movnt that has weaker memory ordering. The memcpy/strcpy-instructions (rep[ne] movs[b/w/d/q]) do too.Constantan
@EOF: Thanks, I didn't know that! Strange that the insn ref manual doesn't mention that, only the Vol1 manual. I updated my answer with my interpretation of what the docs say: there's an implicit StoreStore barrier (for the string data) after a rep movsb, so you just need to write your data-ready flag separately (not as the last bytes of the string op).Dexamyl
@EOF: It's not only movnt and rep[ne] movs[b/w/d/q]; but (potentially) every single instruction that accesses memory; given that the memory ordering model can be weakened by configuring either the PAT/page tables or MTRRs for the memory being accessed as "write combining" (rather than "write back").Nara
@Brendan: I was assuming the context of a user process in a normal OS, like Linux. You can assume all your pages are WB unless you took special OS-specific steps to map any other pages. WB memory performs much better than any other type, for most uses. Interesting point, though; it's true that WC memory is weakly ordered.Dexamyl
@PeterCordes: Assumptions are fine if they're correct, but even then it's nice to at least be aware that there's cases where those assumption aren't correct (e.g. device drivers talking to memory mapped IO areas, like video display memory).Nara
@Brendan: I think my answer in its current form mentions memory types clearly enough that it's not misleading. But yes, I guess other people's default assumptions about context when you're using asm could be different. Good point.Dexamyl
