Why is (or isn't?) SFENCE + LFENCE equivalent to MFENCE?

As we know from a previous answer to the question Does it make any sense instruction LFENCE in processors x86/x86_64?, we cannot use SFENCE instead of MFENCE for Sequential Consistency.

An answer there suggests that MFENCE = SFENCE + LFENCE, i.e. that LFENCE does something without which we cannot provide Sequential Consistency.

LFENCE makes the following reordering impossible:

SFENCE
LFENCE
MOV reg, [addr]

-- To -->

MOV reg, [addr]
SFENCE
LFENCE

For example, the reordering MOV [addr], reg LFENCE --> LFENCE MOV [addr], reg is provided by a mechanism, the Store Buffer, which reorders Store-Loads for a performance increase and which LFENCE does not prevent. And SFENCE disables this mechanism.

What mechanism does LFENCE disable to make such reordering impossible (x86 does not have an Invalidate-Queue mechanism)?

And is the reordering SFENCE MOV reg, [addr] --> MOV reg, [addr] SFENCE possible only in theory, or also in reality? And if it is possible in reality, which mechanisms produce it, and how do they work?

Comity answered 23/12, 2014 at 21:4 Comment(2)
I guess L/S/M FENCE are enforced by the memory controller. Fences are used to coordinate system memory and cache memory. And I think this cache coherency is the responsibility of the memory controller.Weathercock
@Peng Zhang Cache coherency is provided automatically by the MOESI/MESIF cc-protocols; more specifically, these protocols provide acquire-release consistency. As far as I know, L/S/MFENCE are not related to cache coherency, because SFENCE flushes the Store Buffer, which is not related to cache coherency. In some CPUs (not x86) a Load FENCE flushes the Invalidate-Queue, but x86 does not have one. On the internet I found that LFENCE makes no sense in x86 processors, i.e. it does nothing. So is the reordering SFENCE MOV reg, [addr] --> MOV reg, [addr] SFENCE possible only in theory and never in reality, is that true?Comity

x86 fence instructions can be briefly described as follows:

  • MFENCE prevents any later loads or stores from becoming globally observable before any earlier loads or stores. It drains the store buffer before later loads (see footnote 1) can execute.

  • LFENCE blocks instruction dispatch (Intel's terminology) until all earlier instructions retire. This is currently implemented by draining the ROB (ReOrder Buffer) before later instructions can issue into the back-end.

  • SFENCE only orders stores against other stores, i.e. prevents NT stores from committing from the store buffer ahead of SFENCE itself. But otherwise SFENCE is just like a plain store that moves through the store buffer. Think of it like putting a divider on a grocery-store checkout conveyor belt that stops NT stores from getting grabbed early. It does not necessarily force the store buffer to be drained before it retires from the ROB, so putting LFENCE after it doesn't add up to MFENCE.

  • A "serializing instruction" like CPUID (and IRET, etc) drains everything (ROB, store buffer) before later instructions can issue into the back-end, and discards the front-end. MFENCE + LFENCE would also do the back-end part, but true serializing instructions also discard fetched machine code, so can work for cross-modifying code. (e.g. a load sees a flag, you run cpuid or the new serialize, then jump to a buffer where another thread stored code before a release-store on the flag. Code-fetch is guaranteed to get the new instructions. Unlike data loads, code-fetch doesn't respect x86's usual LoadLoad ordering rule.)

These descriptions are a little ambiguous in terms of exactly what kinds of operations are ordered, and there are some differences across vendors (e.g. SFENCE is stronger on AMD) and even across processors from the same vendor. Refer to Intel's manuals and specification updates and AMD's manuals and revision guides for more information. There are also a lot of other discussions of these instructions on SO and other places. But read the official sources first. The descriptions above are, I think, the minimum specified on-paper behaviour across vendors.

Footnote 1: OoO exec of later stores doesn't need to be blocked by MFENCE; executing them just writes data into the store buffer. In-order commit already orders them after earlier stores, and commit after retirement orders them wrt. loads (because x86 requires loads to complete, not just to start, before they can retire, as part of ensuring load ordering). Remember that x86 hardware is built to disallow reordering other than StoreLoad.

The Intel manual Volume 2 number 325383-072US describes SFENCE as an instruction that "ensures that every store prior to SFENCE is globally visible before any store after SFENCE becomes globally visible." Volume 3 Section 11.10 says that the store buffer is drained when using SFENCE. The correct interpretation of this statement is exactly the earlier statement from Volume 2. So SFENCE can be said to drain the store buffer in that sense. There is no guarantee at what point during SFENCE's lifetime earlier stores achieve GO. For any earlier store, it could happen before, at, or after retirement of SFENCE. Regarding what the point of GO is, it depends on several factors. This is beyond the scope of the question. See: Why "movnti" followed by an "sfence" guarantees persistent ordering?.

MFENCE does have to prevent NT stores from reordering with other stores, so it has to include whatever SFENCE does, as well as draining the store buffer. It also has to prevent reordering of weakly-ordered SSE4.1 NT loads from WC memory, which is harder because the normal rules that get load ordering for free no longer apply to those. Guaranteeing this is why a Skylake microcode update strengthened (and slowed) MFENCE to also drain the ROB like LFENCE. It might still be possible for MFENCE to be lighter weight than that, with HW support for optionally enforcing ordering of NT loads in the pipeline.


The main reason why SFENCE + LFENCE is not equal to MFENCE is because SFENCE + LFENCE doesn't block StoreLoad reordering, so it's not sufficient for sequential consistency. Only mfence (or a locked operation, or a real serializing instruction like cpuid) will do that. See Jeff Preshing's Memory Reordering Caught in the Act for a case where only a full barrier is sufficient.
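
As a rough illustration of the kind of experiment that article describes (this is my sketch, not Preshing's code; it assumes GCC/Clang inline asm on x86-64 and a -pthread build), each thread stores to one variable and then loads the other. With only sfence; lfence between the store and the load, both threads can read 0 in the same iteration; with mfence in its place, that outcome goes away. Because this crude harness creates fresh threads every iteration, the two threads overlap only occasionally, so you may need many runs (or a tighter semaphore-based harness like Preshing's) to see a nonzero count:

#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> X{0}, Y{0};

int main() {
    int reordered = 0;
    const int iterations = 200000;
    for (int i = 0; i < iterations; ++i) {
        X.store(0, std::memory_order_relaxed);
        Y.store(0, std::memory_order_relaxed);
        int r1 = -1, r2 = -1;

        std::thread t1([&r1] {
            X.store(1, std::memory_order_relaxed);
            __asm__ __volatile__("sfence; lfence" ::: "memory");  // swap in "mfence" to fix it
            r1 = Y.load(std::memory_order_relaxed);
        });
        std::thread t2([&r2] {
            Y.store(1, std::memory_order_relaxed);
            __asm__ __volatile__("sfence; lfence" ::: "memory");  // swap in "mfence" to fix it
            r2 = X.load(std::memory_order_relaxed);
        });
        t1.join();
        t2.join();

        if (r1 == 0 && r2 == 0)
            ++reordered;   // each load ran before the other thread's store was visible
    }
    std::printf("r1 == r2 == 0 observed %d times out of %d\n", reordered, iterations);
}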


From Intel's instruction-set reference manual entry for sfence:

The processor ensures that every store prior to SFENCE is globally visible before any store after SFENCE becomes globally visible.

but

It is not ordered with respect to memory loads or the LFENCE instruction.


LFENCE forces earlier instructions to "complete locally" (i.e. retire from the out-of-order part of the core), but for a store or SFENCE that just means putting data or a marker in the memory-order buffer, not flushing it so the store becomes globally visible. i.e. SFENCE "completion" (retirement from the ROB) doesn't include flushing the store buffer.

This is like Preshing describes in Memory Barriers Are Like Source Control Operations, where StoreStore barriers aren't "instant". Later in that article, he explains why a #StoreStore + #LoadLoad + #LoadStore barrier doesn't add up to a #StoreLoad barrier. (x86 LFENCE has some extra serialization of the instruction stream, but since it doesn't flush the store buffer the reasoning still holds.)

LFENCE is not fully serializing like cpuid (which is as strong a memory barrier as mfence or a locked instruction). It's just LoadLoad + LoadStore barrier, plus some execution serialization stuff which maybe started as an implementation detail but is now enshrined as a guarantee, at least on Intel CPUs. It's useful with rdtsc, and for avoiding branch speculation to mitigate Spectre.
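
As a small illustration of the rdtsc use case (my sketch, assuming GCC/Clang on x86-64; __rdtsc comes from <x86intrin.h> and _mm_lfence from <emmintrin.h>), an lfence on either side of the timestamp read keeps earlier work from still being in flight when the TSC is sampled, and keeps the timed region from starting early:

#include <emmintrin.h>   // _mm_lfence
#include <x86intrin.h>   // __rdtsc
#include <cstdint>
#include <cstdio>

// Read the TSC only after all earlier instructions have completed locally,
// and keep later instructions from starting before the read.
static inline uint64_t fenced_rdtsc() {
    _mm_lfence();
    uint64_t t = __rdtsc();
    _mm_lfence();
    return t;
}

int main() {
    uint64_t start = fenced_rdtsc();
    volatile uint64_t sink = 0;
    for (int i = 0; i < 1000; ++i) sink = sink + i;   // the work being timed
    uint64_t end = fenced_rdtsc();
    std::printf("~%llu reference cycles\n", (unsigned long long)(end - start));
}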


BTW, SFENCE is a no-op for WB (normal) stores.

It orders WC stores (such as movnt, or stores to video RAM) with respect to any stores, but not with respect to loads or LFENCE. Only on a CPU that's normally weakly-ordered does a store-store barrier do anything for normal stores. You don't need SFENCE unless you're using NT stores or memory regions mapped WC. If it did guarantee draining the store buffer before it could retire, you could build MFENCE out of SFENCE+LFENCE, but that isn't the case for Intel.
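
For illustration, here is a minimal sketch of that NT-store case (my code, not from the answer above; buffer and ready are made-up names, and it assumes GCC/Clang with SSE2 and a -pthread build). Without the _mm_sfence(), the weakly-ordered movnti stores could become globally visible after the flag store, and the reader could see the flag set but read stale data:

#include <emmintrin.h>   // _mm_stream_si32 (movnti), _mm_sfence
#include <atomic>
#include <thread>
#include <cstdio>

alignas(64) int buffer[1024];
std::atomic<int> ready{0};

void producer() {
    // Non-temporal stores: bypass the cache and are weakly ordered.
    for (int i = 0; i < 1024; ++i)
        _mm_stream_si32(&buffer[i], i);

    // StoreStore barrier: every NT store above becomes globally visible before
    // the flag store below. A plain release store alone does not order NT stores.
    _mm_sfence();

    ready.store(1, std::memory_order_release);
}

void consumer() {
    while (ready.load(std::memory_order_acquire) == 0) { /* spin */ }
    std::printf("buffer[1023] = %d\n", buffer[1023]);   // safe to read now
}

int main() {
    std::thread t1(producer), t2(consumer);
    t1.join(); t2.join();
}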


The real concern is StoreLoad reordering between a store and a load, not between a store and barriers, so you should look at a case with a store, then a barrier, then a load.

mov  [var1], eax
sfence
lfence
mov   eax, [var2]

can become globally visible (i.e. commit to L1d cache) in this order:

lfence
mov   eax, [var2]     ; load stays after LFENCE

mov  [var1], eax      ; store becomes globally visible before SFENCE
sfence                ; can reorder with LFENCE
Conah answered 14/5, 2018 at 2:26 Comment(21)
Comments are not for extended discussion; this conversation has been moved to chat.Shriek
When a store is issued an entry is allocated in the ROB and SB. And when the store retires, it can be removed from the ROB. And after that it will eventually be removed from the SB once it commits to the L1D. So a store in the SB can outlive the store in the ROB. Is this correct? I'm trying to build up the correct mental model and I have this impression that this is the key to understand the fences better.Osmious
The above must be the case. Otherwise a store could block the whole pipeline and that is the primary reason we have a SB in the first place.Osmious
An MFENCE waits for the SB to be drained. An LFENCE waits for the ROB to be drained. Is my understanding correct?Osmious
@pveentjer: Yes, but you also need to specify what is blocked while waiting. For LFENCE, it's the front-end issue stage. For MFENCE, depending on the implementation it might only be exec of later loads, with OoO exec of ALU work continuing. (Same for the full barrier as part of a locked instruction). Or for other implementations (like Skylake with microcode updates), MFENCE apparently blocks the front-end while draining the SB + ROB, like lock xor + LFENCE. See the end of this answerConah
Excellent. I'll have a closer look. It started to make sense once I realized that waiting for the SB to be drained isn't the same as waiting for the ROB to be drained.Osmious
@pveentjer: Indeed, IDK why I didn't say that in the first place in my answer; perhaps those concepts weren't as clear in my head 2 years ago. Edited to add a new section at the top.Conah
All serializing instructions wait for the ROB to be drained before issuing the next instruction?Osmious
@pveentjer: Yes, block the front end until the ROB and store buffer are drained, hiding all effects of pipelining. That's what "serializing" means as a technical term in x86 manuals. Only a few instructions are guaranteed to be like that, including cpuid and iret.Conah
And that is why LFENCE isn't a fully serializing instruction; it only waits for the ROB to be drained but not the SB.Osmious
The Intel documentation talks about serializing instructions in combination with parallel execution of instructions: xem.github.io/minix86/manual/intel-x86-and-64-manual-vol3/… Are they referring to pipelining or to superscalar execution and out of order execution?Osmious
In case of an MFENCE.. once the load is encountered no further instructions are picked up from the ROB and sent to the RS until the SB has been drained. So once this load is encountered, no later instructions are executed, no matter if they are independent.. it all stops... Is this correct? Sorry for asking these 'basic' questions, but I want to make sure my understanding is solid.Osmious
@pveentjer: Serializing instructions would have to drain the SB on a scalar in-order pipeline, but maybe not actually drain the pipeline itself. Serializing does guarantee that self-modifying code is visible, so a pipeline that doesn't snoop store addresses for being near in-flight instructions would need to discard instruction pre-fetch and not fetch anything past the CPUID or w/e until the SB was drained. I guess you could say that serializing instructions are memory barriers that respect code-fetch as a load. That's one of the guaranteed effects. (Modern CPUs do snoop and might not flush)Conah
@pveentjer: No, MFENCE doesn't have to block OoO exec of independent ALU work after itself or after a load. If the work depends on the load, it will be blocked by dependency rules, but if it only depends on registers that were written before MFENCE then no rules stop a CPU from continuing. Only possible implementation details. (I expect that on CPUs without Skylake's mostly-serializing MFENCE, independent ALU instructions after later loads could still exec).Conah
I need to dig into the execution engine some more. Every answer leads to 2 more questions :)Osmious
I did a bit more thinking about the topic. And it makes sense to keep executing instructions; only loads need to be stalled, and all dependent instructions will automatically get blocked, since that is the whole purpose of the Tomasulo algorithmOsmious
@PeterCordes I'd been reading Preshing's Memory Barriers Are Like Source Control Operations... Any idea why he compares a load-load barrier to a pull that doesn't pull from the latest version of the central repository? Why does he state this? What behaviour of real processors does he want to demonstrate using this?Jubilee
Quite late to the party, but shouldn't the first phrase read: "MFENCE drains the store buffer before later loads and stores can execute."? Also, believing the Intel manual, sfence drains the store buffer, but the marker model makes more sense. I don't remember where I read of it (probably from you); do you know where it comes from?Toady
@MargaretBloom: executing a store just writes into the store buffer, so no, it's fine if mfence doesn't block that. In-order commit already orders later stores after any earlier stores. And commit after retirement orders them after any earlier loads. In practice implementations may make mfence stronger, like Skylake with updated microcode fully blocking issue and dispatch of anything after mfence. Re: the marker model: yes, that's probably from me, based on Is a memory barrier an instruction that the CPU executes, or is it just a marker?Conah
@MargaretBloom: Or actually more likely I picked up that idea while answering Does a memory barrier acts both as a marker and as an instruction?Conah
@MargaretBloom: Updated my answer, I was forgetting about NT stores and NT loads in my prev comment. That's why MFENCE is so slow in Skylake. Re: the "marker" model for SFENCE (or for NT load ordering with MFENCE): I don't know for sure if that's what the internal implementation looks like, but we do know that in practice SFENCE + LFENCE doesn't add up to MFENCE even for plain stores / plain loads (e.g. another answer on this question). But it would if it drained the store buffer before it retired. So I think as a mental model it works well enough, at least for safety analysis.Conah

In general MFENCE != SFENCE + LFENCE. For example, the code below, when compiled with -DBROKEN, fails on some Westmere and Sandy Bridge systems but appears to work on Ryzen. In fact, on AMD systems just an SFENCE seems to be sufficient.

#include <atomic>
#include <cstdint>      // uint64_t
#include <cstdlib>      // atoi
#include <thread>
#include <vector>
#include <iostream>
using namespace std;

#define ITERATIONS (10000000)
class minircu {
        public:
                minircu() : rv_(0), wv_(0) {}
                class lock_guard {
                        minircu& _r;
                        const std::size_t _id;
                        public:
                        lock_guard(minircu& r, std::size_t id) : _r(r), _id(id) { _r.rlock(_id); }
                        ~lock_guard() { _r.runlock(_id); }
                };
                void synchronize() {
                        wv_.store(-1, std::memory_order_seq_cst);
                        while(rv_.load(std::memory_order_relaxed) & wv_.load(std::memory_order_acquire));
                }
        private:
                void rlock(std::size_t id) {
                        rab_[id].store(1, std::memory_order_relaxed);
#ifndef BROKEN
                        __asm__ __volatile__ ("mfence;" : : : "memory");
#else
                        __asm__ __volatile__ ("sfence; lfence;" : : : "memory");
#endif
                }
                void runlock(std::size_t id) {
                        rab_[id].store(0, std::memory_order_release);
                        wab_[id].store(0, std::memory_order_release);
                }
                union alignas(64) {
                        std::atomic<uint64_t>           rv_;
                        std::atomic<unsigned char>      rab_[8];
                };
                union alignas(8) {
                        std::atomic<uint64_t>           wv_;
                        std::atomic<unsigned char>      wab_[8];
                };
};

minircu r;

std::atomic<int> shared_values[2];
std::atomic<std::atomic<int>*> pvalue(shared_values);
std::atomic<uint64_t> total(0);

void r_thread(std::size_t id) {
    uint64_t subtotal = 0;
    for(size_t i = 0; i < ITERATIONS; ++i) {
                minircu::lock_guard l(r, id);
                subtotal += (*pvalue).load(memory_order_acquire);
    }
    total += subtotal;
}

void wr_thread() {
    for (size_t i = 1; i < (ITERATIONS/10); ++i) {
                std::atomic<int>* o = pvalue.load(memory_order_relaxed);
                std::atomic<int>* p = shared_values + i % 2;
                p->store(1, memory_order_release);
                pvalue.store(p, memory_order_release);

                r.synchronize();
                o->store(0, memory_order_relaxed); // should not be visible to readers
    }
}

int main(int argc, char* argv[]) {
    std::vector<std::thread> vec_thread;
    shared_values[0] = shared_values[1] = 1;
    std::size_t readers = (argc > 1) ? ::atoi(argv[1]) : 8;
    if (readers > 8) {
        std::cout << "maximum number of readers is " << 8 << std::endl; return 0;
    } else
        std::cout << readers << " readers" << std::endl;

    vec_thread.emplace_back( [=]() { wr_thread(); } );
    for(size_t i = 0; i < readers; ++i)
        vec_thread.emplace_back( [=]() { r_thread(i); } );
    for(auto &i: vec_thread) i.join();

    std::cout << "total = " << total << ", expecting " << readers * ITERATIONS << std::endl;
    return 0;
}
Peccary answered 14/5, 2018 at 1:57 Comment(3)
For some reason AMD defines sfence as a full barrier, draining the store buffer before later loads can execute. I think this is officially documented for AMD CPUs, not just an implementation detail like sfence happening to drain the SB before it can retire from the ROB.Conah

What mechanism does LFENCE disable to make such reordering impossible (x86 does not have an Invalidate-Queue mechanism)?

From the Intel manuals, volume 2A, page 3-464 documentation for the LFENCE instruction:

LFENCE does not execute until all prior instructions have completed locally, and no later instruction begins execution until LFENCE completes

So yes, your example reordering is explicitly prevented by the LFENCE instruction. Your second example involving only SFENCE instructions IS a valid reordering, since SFENCE has no impact on load operations.
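
As an aside (my sketch, not part of this answer; it assumes GCC/Clang inline asm on x86-64), this "no later instruction begins execution until LFENCE completes" property is exactly what makes lfence usable as a speculation barrier after a bounds check:

#include <cstddef>
#include <cstdio>

// Nothing after the lfence executes, even speculatively, until all earlier
// instructions (including the bounds check) have completed locally, so the
// load cannot run with an out-of-bounds index on a mispredicted path.
int load_checked(const int *array, std::size_t len, std::size_t i) {
    if (i < len) {
        __asm__ __volatile__("lfence" ::: "memory");
        return array[i];
    }
    return 0;
}

int main() {
    int data[4] = {10, 20, 30, 40};
    std::printf("%d %d\n", load_checked(data, 4, 2), load_checked(data, 4, 99));
}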

Lubricious answered 10/4, 2015 at 2:16 Comment(3)
Thank you! But I don't claim that MFENCE = LFENCE + SFENCE; I claim that MFENCE = SFENCE + LFENCE: the order of barriers is important, as you can see in our discussion: stackoverflow.com/questions/20316124/… SFENCE + LFENCE can't be reordered to LFENCE + SFENCE, and so 2 mov [mem], reg can't execute after SFENCE and 3 mov reg, [mem] can't execute before LFENCE; the following can't be reordered: 1 mov reg, [mem] 2 mov [mem], reg SFENCE LFENCE 3 mov reg, [mem] 4 mov [mem], regComity
@Comity You're absolutely right, sorry for the mistake. I've removed that portion of my answer. I would like to investigate the minutiae of this in greater detail; I'll post a link here once I finish my writeup.Lubricious
Ok, do not worry, I made the same mistake too, at the beginning of that discussion on the link :) Maybe it's not a simple question.Comity
