Is LFENCE serializing on AMD processors?

In recent Intel ISA documents the lfence instruction has been defined as serializing the instruction stream (preventing out-of-order execution across it). In particular, the description of the instruction includes this line:

Specifically, LFENCE does not execute until all prior instructions have completed locally, and no later instruction begins execution until LFENCE completes.

Note that this applies to all instructions, not just memory load instructions, making lfence more than just a memory ordering fence.

Although this now appears in the ISA documentation, it isn't clear if it is "architectural", i.e., to be obeyed by all x86 implementations, or if it is Intel specific. In particular, do AMD processors also treat lfence as serializing the instruction stream?
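To make the practical stakes concrete, here is a minimal sketch of the usual reason people care about this behavior: fencing rdtsc so that the timestamp is not taken while neighbouring instructions are still in flight. It assumes GCC or Clang on x86-64 (for `<x86intrin.h>`), and the helper name `fenced_rdtsc` is made up for illustration. Whether the fences actually block out-of-order execution on a given AMD part is exactly what this question asks.

```c
#include <stdint.h>
#include <x86intrin.h>   /* _mm_lfence() and __rdtsc() on GCC/Clang */

/*
 * Hypothetical helper: read the time-stamp counter with an LFENCE on each
 * side. On a CPU where LFENCE serializes the instruction stream, earlier
 * instructions have completed before the read and later instructions do
 * not begin until the second fence completes.
 */
static inline uint64_t fenced_rdtsc(void)
{
    _mm_lfence();
    uint64_t ticks = __rdtsc();
    _mm_lfence();
    return ticks;
}
```

A timed region would be bracketed by two calls to `fenced_rdtsc()`, with the difference taken as the elapsed tick count.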

Cantrell answered 14/8, 2018 at 15:26 Comment(12)
lfence isn't "serializing" on Intel. That term has a technical meaning that includes fully flushing the store buffer. e.g. cpuid and iret are serializing. lfence only serializes the instruction stream / out-of-order core, not the whole pipeline including the store buffer. I usually say it's "partially serializing" or something.Lighter
@PeterCordes - note that I wrote "serializing the instruction stream" at the first use of that term in the question. I disagree that Intel uses serializing consistently in their manuals. They do use serializing instruction fairly consistently for things like cpuid, but they also use serializing alone for other things, including things which are not serializing instructions. The sentence in the lfence section that directly precedes the one I quoted uses the term "serializing operation" in reference to lfence.Cantrell
I suggest removing the generic isa tag and adding the memory-barriers tag, which is more pertinent.Weikert
@HadiBrais: I removed [memory-barriers] because we're not interested in the memory-barrier effect of lfence. We know it does that, and it's a red-herring that distracts from this question about its other effect. I don't insist on removing it again if you and @Bee don't find that argument convincing, though.Lighter
@PeterCordes - yeah but it's just a tag. I don't find it distracting. In fact, I find it at least tangentially relevant: lfence is at least presented as a memory barrier, and is a memory barrier, and this OoO-blocking side effect is actually a result of the implementation design for its original primary function. If you were interested in lfence as a barrier, it is highly likely that you care about performance and also perhaps care about this OoO blocking behavior. Take the contrary position: you mention lfence OoO behavior almost every time the instruction comes up in the context ...Cantrell
... of actual barriers, so why is the reverse so wrong?Cantrell
That's fair. My only counter-argument is that execution serialization is the current primary purpose of lfence, which is why it makes sense to mention it any time it comes up in a memory-barrier context, but arguably not the reverse. i.e. I'm correcting the misconception that lfence is useful as a memory barrier. But I guess you're right, maybe people doing a tag search on [x86] [memory-barriers] would find this question and learn something. I still liked my title edit even though you've convinced me on the tags, but it's your question.Lighter
@PeterCordes - yeah that's why I said original primary. I also tried to put scare quotes around "primary" to reflect that today the most interesting use has nothing much to do with memory ordering, but I was at the character limit for that comment. The thing is that this tag is replacing the "isa" tag which is little used and pretty useless here I think.Cantrell
As far as I know, as part of trying to find solutions to all the Spectre problems, the LFENCE instruction (which was previously only supposed to be a load barrier) got redefined as a "speculative execution barrier" in 2018.Insurer
@Insurer - that only worked 100% for Intel, since it happened to have that behavior on existing chips. For AMD the situation is more complicated, and lfence may or may not be serializing depending on the value set in an MSR.Cantrell
@BeeOnRope: I'm assuming a set of firmware updates would've been created so that the bit is set by default (for those who install updates and newer systems).Insurer
@Insurer - probably. It's interesting how this bit was created long ago, before we knew of Spectre and Meltdown - probably for other reasons.Cantrell

AMD's manual has always described their implementation of LFENCE as a load serializing instruction:

Acts as a barrier to force strong memory ordering (serialization) between load instructions preceding the LFENCE and load instructions that follow the LFENCE.

The original use case for LFENCE was ordering loads from WC memory. However, after the speculative execution vulnerabilities were discovered, AMD released a document in January 2018 entitled "Software techniques for managing speculation on AMD processors". This is the first and only document in which MSR C001_1029[1] is mentioned (other bits of C001_1029 are discussed in some AMD documents, but not bit 1). When C001_1029[1] is set to 1, LFENCE behaves as a dispatch serializing instruction (which is more expensive than merely load serializing). Since this MSR is available on most older AMD processors, it seems that the bit has almost always been supported, perhaps because AMD thought it might someday need to match the behavior of LFENCE on Intel processors.
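As a rough sketch, the state of that bit can be inspected from user space on Linux. This assumes a 64-bit Linux system with the `msr` driver loaded (which exposes `/dev/cpu/N/msr`) and root privileges; the macro and function names below are made up for illustration and do not come from any AMD or kernel header.

```c
#include <fcntl.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical names for the MSR address and bit discussed above. */
#define AMD_MSR_C001_1029         0xc0011029u
#define AMD_LFENCE_SERIALIZE_BIT  1

/* True if bit 1 of a value read from MSR C001_1029 is set. */
static inline bool lfence_is_dispatch_serializing(uint64_t msr_value)
{
    return (msr_value >> AMD_LFENCE_SERIALIZE_BIT) & 1;
}

/*
 * Read one MSR on one CPU through the Linux msr driver: the file offset
 * passed to pread() selects the MSR address and each read returns 8 bytes.
 * Returns 0 on success and -1 on any failure (no driver, no permission,
 * or the MSR does not exist on this CPU).
 */
static inline int read_msr(unsigned cpu, uint32_t msr, uint64_t *value)
{
    char path[64];
    snprintf(path, sizeof path, "/dev/cpu/%u/msr", cpu);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    ssize_t n = pread(fd, value, sizeof *value, (off_t)msr);
    close(fd);
    return n == (ssize_t)sizeof *value ? 0 : -1;
}
```

Calling `read_msr(0, AMD_MSR_C001_1029, &v)` and then `lfence_is_dispatch_serializing(v)` reports the setting for CPU 0 only; the bit is per core, so a thorough check would repeat this for every CPU.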

The ordering rules of fence instructions, serializing instructions, and instructions with serializing properties all have exceptions, and these exceptions differ subtly between Intel and AMD processors. One example that comes to mind is the CLFLUSH instruction. So AMD and Intel mean slightly different things when they talk about instructions with serializing properties.

One thing not clear to me is the following part of the quote from the document:

AMD family 0Fh/11h processors support LFENCE as serializing always but do not support this MSR.

This statement is vague because it doesn't clearly say whether LFENCE on AMD families 0Fh and 11h is fully serializing (in AMD terminology) or dispatch serializing (in AMD terminology). But it's most probably dispatch serializing only. The AMD family-specific manuals don't mention LFENCE or MSR C001_1029.


Since v4.15-rc8, the Linux kernel has relied on the serializing properties of LFENCE on AMD processors. The change consists of two commits, 1 and 2. The following macros were defined in commit 1:

+#define MSR_F10H_DECFG         0xc0011029
+#define MSR_F10H_DECFG_LFENCE_SERIALIZE_BIT    1

The first macro specifies the MSR address and the second specifies the bit offset within that MSR. The following code was added in init_amd (some comments are mine) in commit 2:

/* LFENCE always requires SSE2 */
if (cpu_has(c, X86_FEATURE_XMM2)) {
    unsigned long long val;
    int ret;
    
    /* The AMD CPU supports LFENCE, but there are three cases to consider:
     * 1- The MSR is supported. C001_1029[1] must be set to enable the
     *    dispatch serializing behavior of LFENCE.
     * 2- The MSR is not supported (AMD 0Fh/11h). LFENCE is by 
     *    default at least dispatch serializing. Nothing needs to 
     *    be done.
     * 3- The MSR is supported, but we are running under a hypervisor
     *    that does not support writing that MSR (because perhaps
     *    the hypervisor has not been updated yet). In this case, resort
     *    to the slower MFENCE for serializing RDTSC and use a Spectre
     *    mitigation that does not require LFENCE (i.e., generic retpoline).
     */

    /*
     * A serializing LFENCE has less overhead than MFENCE, so
     * use it for execution serialization.  On families which
     * don't have that MSR, LFENCE is already serializing.
     * msr_set_bit() uses the safe accessors, too, even if the MSR
     * is not present.
     */
    msr_set_bit(MSR_F10H_DECFG,
            MSR_F10H_DECFG_LFENCE_SERIALIZE_BIT);

    /*
     * Verify that the MSR write was successful (could be running
     * under a hypervisor) and only then assume that LFENCE is
     * serializing.
     */
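    /* MSR_F10H_DECFG_LFENCE_SERIALIZE is the mask for bit 1; its
       definition is not shown in the excerpt above. */
    ret = rdmsrl_safe(MSR_F10H_DECFG, &val);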
    if (!ret && (val & MSR_F10H_DECFG_LFENCE_SERIALIZE)) {
        /* A serializing LFENCE stops RDTSC speculation */
        set_cpu_cap(c, X86_FEATURE_LFENCE_RDTSC);
        /* X86_FEATURE_LFENCE_RDTSC is used later to choose a Spectre
           mitigation */
    } else {
        /* MFENCE stops RDTSC speculation */
        set_cpu_cap(c, X86_FEATURE_MFENCE_RDTSC);
    }
}

In v5.4-rc1, the MSR write verification code was removed, so the code became:

    msr_set_bit(MSR_F10H_DECFG,
            MSR_F10H_DECFG_LFENCE_SERIALIZE_BIT);
    set_cpu_cap(c, X86_FEATURE_LFENCE_RDTSC);

The reasoning behind this change is discussed in the commit message. (In summary, it's mostly not needed, and it may not work.)

The AMD document also says:

All AMD family 10h/12h/14h/15h/16h/17h processors support this MSR. LFENCE support is indicated by CPUID function1 EDX bit 26, SSE2. AMD family 0Fh/11h processors support LFENCE as serializing always but do not support this MSR.

But it appears that none of the AMD manuals have been updated yet to mention support for C001_1029[1].

AMD said the following in that document:

AMD plans support for this MSR and access to this bit for all future processors.

This means that C001_1029[1] should be considered architectural on AMD processors released after January 2018.

Weikert answered 14/8, 2018 at 19:33 Comment(0)

There is an MSR that configures that behaviour:

Description: Set an MSR in the processor so that LFENCE is a dispatch serializing instruction and then use LFENCE in code streams to serialize dispatch (LFENCE is faster than RDTSCP which is also dispatch serializing). This mode of LFENCE may be enabled by setting MSR C001_1029[1]=1.

Effect: Upon encountering an LFENCE when the MSR bit is set, dispatch will stop until the LFENCE instruction becomes the oldest instruction in the machine.

Applicability: All AMD family 10h/12h/14h/15h/16h/17h processors support this MSR. LFENCE support is indicated by CPUID function1 EDX bit 26, SSE2. AMD family 0Fh/11h processors support LFENCE as serializing always but do not support this MSR. AMD plans support for this MSR and access to this bit for all future processors.

(source)
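The applicability rules quoted above can be sketched in C roughly as follows. This assumes GCC or Clang on x86 (for `__get_cpuid` from `<cpuid.h>`); all function names are made up for illustration. The family list is copied from the quote, so families newer than 17h are treated as unlisted here even though AMD says it plans to support the MSR on them.

```c
#include <cpuid.h>     /* __get_cpuid() on GCC/Clang */
#include <stdbool.h>
#include <stdint.h>

/*
 * Decode the CPU family from CPUID function 1 EAX: the base family is
 * bits 11:8, and when it is 0xF the extended family (bits 27:20) is added.
 */
static inline unsigned cpu_family_from_eax(uint32_t eax)
{
    unsigned base = (eax >> 8) & 0xf;
    unsigned ext  = (eax >> 20) & 0xff;
    return base == 0xf ? base + ext : base;
}

/* LFENCE support is indicated by SSE2: CPUID function 1, EDX bit 26. */
static inline bool edx_has_sse2(uint32_t edx)
{
    return (edx >> 26) & 1;
}

/* Families the quoted AMD text lists as supporting MSR C001_1029. */
static inline bool amd_family_lists_lfence_msr(unsigned family)
{
    switch (family) {
    case 0x10: case 0x12: case 0x14:
    case 0x15: case 0x16: case 0x17:
        return true;
    default:
        /* Includes 0Fh/11h, where LFENCE is serializing without the MSR,
         * and newer families that the quote does not enumerate. */
        return false;
    }
}

/* Query the running CPU; returns false if CPUID function 1 is unavailable. */
static inline bool query_family_and_sse2(unsigned *family, bool *sse2)
{
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return false;
    *family = cpu_family_from_eax(eax);
    *sse2 = edx_has_sse2(edx);
    return true;
}
```

The family check is only meaningful after confirming that the vendor is AMD, which this sketch leaves out.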

Uncivil answered 14/8, 2018 at 15:26 Comment(5)
Is this register a recent invention, or does it exist on older AMD models too? It would seem weird if it's an old thing, since it was presumably added for Spectre mitigation.Cantrell
@Cantrell AMD family 10h is K10, 0Fh (doesn't have the MSR but had serializing LFENCE) is K8, AFAIK anything older didn't even have SSE2 yet so no LFENCE at allUncivil
And Windows / Linux / *BSD all set that MSR with Spectre mitigation enabled? So it's now mostly safe to portably use lfence; rdtsc, if we can assume updated kernels?Lighter
(Note that Intel's manual doesn't promise that rdtscp stops later instructions from executing, it only promises that the time won't be sampled until all earlier ones are done. So it's not useful at the start of a timed interval. See also this case where lfence;rdtsc;lfence gave more consistent results at the bottom of a timing interval: clflush to invalidate cache line via C function.)Lighter
(update: I'm not sure rdtscp on Intel CPUs really is as weak in practice as Intel's paper spec. It probably does decode to uops basically like lfence;rdtsc plus setting ECX. I seem to recall someone mentioning in comments that they'd never seen a case of later instructions having been able to exec early.)Lighter
