Does it make any sense to use the LFENCE instruction on x86/x86_64 processors?

Asked 1/12, 2013 at 19:19 Answered 22/12, 2014 at 18:59

Solved assembly x86 x86-64 atomic memory-barriers

I often read on the internet that LFENCE makes no sense on x86 processors, i.e. it does nothing, so instead of MFENCE we can painlessly use SFENCE, because MFENCE = SFENCE + LFENCE = SFENCE + NOP = SFENCE.

But if LFENCE makes no sense, then why are there four approaches to achieving sequential consistency on x86/x86_64:

1. LOAD (without fence) and STORE + MFENCE
2. LOAD (without fence) and LOCK XCHG
3. MFENCE + LOAD and STORE (without fence)
4. LOCK XADD ( 0 ) and STORE (without fence)

Taken from here: http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html

As well as from Herb Sutter's presentation, page 34, at the bottom: https://skydrive.live.com/view.aspx?resid=4E86B0CF20EF15AD!24884&app=WordPdf&wdo=2&authkey=!AMtj_EflYn2507c
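For readers who come at this from the C++ side, the mappings listed above can be sketched as follows. This is only an illustration: the names `shared_value`, `publish`, `observe`, `full_fence` and `self_check` are made up here, and the instructions named in the comments are what mainstream x86-64 compilers typically emit, not something the C++ standard guarantees.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical shared variable, used only for illustration.
std::atomic<int> shared_value{0};

// Sequentially consistent store. On x86-64 this is typically compiled to
// XCHG (implicitly locked) or to MOV followed by MFENCE, i.e. the fence
// cost is paid on the store side, as in approaches (1) and (2).
void publish(int v) {
    shared_value.store(v, std::memory_order_seq_cst);
}

// Sequentially consistent load. With the fence on the store side, this is
// typically a plain MOV without any fence instruction.
int observe() {
    return shared_value.load(std::memory_order_seq_cst);
}

// Stand-alone full fence. Compilers usually emit MFENCE or a dummy locked
// read-modify-write for it on x86-64 (an assumption about code generation).
void full_fence() {
    std::atomic_thread_fence(std::memory_order_seq_cst);
}

// Single-threaded sanity check: a thread always observes its own latest store.
void self_check() {
    publish(42);
    assert(observe() == 42);
}
```

Looking at the generated assembly for `publish` and `observe` is a quick way to see which of the four approaches a given compiler picked.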

If LFENCE did nothing, then approach (3) would mean the following: SFENCE + LOAD and STORE (without fence), but there is no point in doing SFENCE before a LOAD. I.e. if LFENCE does nothing, approach (3) makes no sense.

Does the LFENCE instruction make any sense on x86/x86_64 processors?

ANSWER:

1. LFENCE is required in the cases described in the accepted answer below.

2. Approach (3) should be viewed not in isolation, but in combination with the preceding instructions. For example, approach (3):

MFENCE
MOV reg, [addr1]  // LOAD-1
MOV [addr2], reg  //STORE-1

MFENCE
MOV reg, [addr1]  // LOAD-2
MOV [addr2], reg  //STORE-2

We can rewrite the code of approach (3) as follows:

SFENCE
MOV reg, [addr1]  // LOAD-1
MOV [addr2], reg  //STORE-1

SFENCE
MOV reg, [addr1]  // LOAD-2
MOV [addr2], reg  //STORE-2

And here SFENCE makes sense: it prevents reordering of STORE-1 and LOAD-2. To do this, the SFENCE that follows STORE-1 flushes the store buffer.

Bresee answered 1/12, 2013 at 19:19 Comment(1)

There are instructions with a "non-temporal hint" which aren't as strongly ordered as the usual load and store; I imagine those may benefit from fencing. (Edit: This is actually mentioned on the page you linked.) – Agripina 1/12, 2013 at 20:34

Bottom line (TL;DR): LFENCE alone indeed seems useless for memory ordering, however it does not make SFENCE a substitute for MFENCE. The "arithmetic" logic in the question is not applicable.

Here is an excerpt from Intel's Software Developer's Manual, volume 3, section 8.2.2 (edition 325384-052US of September 2014), the same one that I used in another answer:

Reads are not reordered with other reads.

Writes are not reordered with older reads.

Writes to memory are not reordered with other writes, with the following exceptions:

writes executed with the CLFLUSH instruction;

streaming stores (writes) executed with the non-temporal move instructions (MOVNTI, MOVNTQ, MOVNTDQ, MOVNTPS, and MOVNTPD); and

string operations (see Section 8.2.4.1).

Reads may be reordered with older writes to different locations but not with older writes to the same location.

Reads or writes cannot be reordered with I/O instructions, locked instructions, or serializing instructions.

Reads cannot pass earlier LFENCE and MFENCE instructions.

Writes cannot pass earlier LFENCE, SFENCE, and MFENCE instructions.

LFENCE instructions cannot pass earlier reads.

SFENCE instructions cannot pass earlier writes.

MFENCE instructions cannot pass earlier reads or writes.

From here, it follows that:

MFENCE is a full memory fence for all operations on all memory types, whether non-temporal or not.
SFENCE only prevents reordering of writes (in other terminology, it's a StoreStore barrier), and is only useful together with non-temporal stores and other instructions listed as exceptions.
LFENCE prevents reordering of reads with subsequent reads and writes (i.e. it combines LoadLoad and LoadStore barriers). However, the first two bullets say that LoadLoad and LoadStore barriers are always in place, no exceptions. Therefore LFENCE alone is useless for memory ordering.

To support the last claim, I looked at all places where LFENCE is mentioned in all 3 volumes of Intel's manual, and found none which would say that LFENCE is required for memory consistency. Even MOVNTDQA - the only non-temporal load instruction so far - mentions MFENCE but not LFENCE.
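The one case where SFENCE clearly matters, weakly ordered non-temporal stores, can be sketched with compiler intrinsics. This is a minimal sketch under assumptions, not taken from the manual: `payload`, `ready`, `produce` and `consume_sum` are hypothetical names, and the code only builds on x86 targets with SSE2.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <immintrin.h>  // _mm_stream_si32 (MOVNTI), _mm_sfence (SFENCE); x86 only

// Hypothetical buffer and ready flag, used only for illustration.
constexpr std::size_t kCount = 16;
alignas(64) int payload[kCount];
std::atomic<bool> ready{false};

// Producer: fill the buffer with non-temporal stores, which are one of the
// listed exceptions to "writes are not reordered with other writes".
void produce(int base) {
    for (std::size_t i = 0; i < kCount; ++i) {
        _mm_stream_si32(&payload[i], base + static_cast<int>(i));
    }
    // SFENCE: later stores cannot pass it, and it cannot pass earlier
    // stores, so the flag below cannot become visible before the payload.
    _mm_sfence();
    ready.store(true, std::memory_order_release);  // a plain MOV on x86
}

// Consumer: ordinary loads are not reordered with other loads on x86,
// so no LFENCE is placed after the flag is observed.
int consume_sum() {
    while (!ready.load(std::memory_order_acquire)) {
        // spin until the producer publishes the flag
    }
    int sum = 0;
    for (std::size_t i = 0; i < kCount; ++i) {
        sum += payload[i];
    }
    return sum;
}
```

Calling `produce(1)` and then `consume_sum()` yields 136 (the sum 1 + 2 + ... + 16). In a real program the two functions would run on different threads; with ordinary (not non-temporal) stores in `produce`, the SFENCE would be unnecessary.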

Update: see answers on Why is (or isn't?) SFENCE + LFENCE equivalent to MFENCE? for correct answers to the guesswork below

Whether MFENCE is equivalent to a "sum" of the other two fences or not is a tricky question. At first glance, among the three fence instructions only MFENCE provides a StoreLoad barrier, i.e. prevents reordering of reads with earlier writes. However, the correct answer requires knowing more than the above rules; namely, it's important that all fence instructions are ordered with respect to each other. This makes the SFENCE LFENCE sequence more powerful than a mere union of individual effects: this sequence also prevents StoreLoad reordering (because loads cannot pass LFENCE, which cannot pass SFENCE, which cannot pass stores), and thus constitutes a full memory fence (but also see the note (*) below). Note however that order matters here, and the LFENCE SFENCE sequence does not have the same synergy effect.

However, while one can say that MFENCE ~ SFENCE LFENCE and LFENCE ~ NOP, that does not mean MFENCE ~ SFENCE. I deliberately use equivalence (~) and not equality (=) to stress that arithmetic rules do not apply here. The mutual effect of SFENCE followed by LFENCE makes the difference; even though loads are not reordered with each other, LFENCE is required to prevent reordering of loads with SFENCE.

(*) It still might be correct to say that MFENCE is stronger than the combination of the other two fences. In particular, a note on the CLFLUSH instruction in volume 2 of Intel's manual says that "CLFLUSH is only ordered by the MFENCE instruction. It is not guaranteed to be ordered by any other fencing or serializing instructions or by another CLFLUSH instruction."

(Update: clflush is now defined as strongly ordered (like a normal store, so you only need mfence if you want to block later loads), whereas clflushopt is weakly ordered but can be fenced by sfence.)

Pillory answered 22/12, 2014 at 18:59 Comment(9)

Thanks, but I don't agree that "MFENCE is stronger than LFENCE+SFENCE. It's a full memory fence for all operations on all memory types, whether non-temporal or not.". We can always write the sequence MOVNTQ SFENCE LFENCE MOVNTQ or MOV [addr], reg SFENCE LFENCE MOV reg, [addr] for sequential consistency. I.e. LFENCE+SFENCE is likewise a full memory fence for all operations on all memory types, whether non-temporal or not. – Bresee 23/12, 2014 at 12:23

Thank you for the comment. It happens to be more complicated; I rewrote the answer with additional details. – Pillory 23/12, 2014 at 14:10

Thanks for your clarification. Yes, I agree, i.e. if we use SFENCE instead of MFENCE for sequential consistency, then the following can happen: MOV [addr1], reg SFENCE MOV reg, [addr2] --> MOV [addr1], reg MOV reg, [addr2] SFENCE --> MOV reg, [addr2] MOV [addr1], reg SFENCE. I.e. logically, we get Store <-> Load reordering. But can it really happen? If so, then there must be a mechanism that the fence instruction involves; what is this mechanism? Everywhere it is written that LFENCE does not do anything, i.e. there is no such mechanism. There is a store buffer, but no load buffer. – Bresee 23/12, 2014 at 14:40

And yes, thanks for your note. I agree that LFENCE+SFENCE != SFENCE+LFENCE :) Because for the former the following can happen: MOV [addr1], reg LFENCE SFENCE MOV reg, [addr2] --> LFENCE MOV [addr1], reg MOV reg, [addr2] SFENCE --> LFENCE MOV [addr1], reg MOV reg, [addr2] SFENCE --> LFENCE MOV reg, [addr2] MOV [addr1], reg SFENCE. I.e. we get Store <-> Load reordering, i.e. we do not get sequential consistency. I fixed this in my question. – Bresee 23/12, 2014 at 14:44

I.e. the main question now is: is the reordering SFENCE MOV reg, [addr] --> MOV reg, [addr] SFENCE possible only in theory, or also in reality? And if it is possible in reality, by what mechanism, and how does it work? For example, the reordering MOV [addr], reg LFENCE --> LFENCE MOV [addr], reg is produced by a mechanism, the store buffer, which reorders stores with later loads to increase performance, and LFENCE does not prevent it. – Bresee 23/12, 2014 at 15:57

Possibly the new main question might deserve a separate SO question. – Pillory 23/12, 2014 at 18:15

Thank you! Ok, I created new separate question: stackoverflow.com/questions/27627969/… – Bresee 23/12, 2014 at 21:6

It sounds to me that SFENCE/LFENCE pair implements release/acquire C++11 memory ordering semantics. Does it not? – Gezira 23/2, 2017 at 20:57

LFENCE does appear to implement acquire semantics, since it acts as a LoadLoad and a LoadStore barrier. SFENCE is ONLY a StoreStore barrier (and that only for NT and other weaker operations defined by x86, CLFLUSH etc.). Release semantics require the properties of a LoadStore and a StoreStore barrier. See preshing.com/20120913/acquire-and-release-semantics. For release semantics, an LFENCE + SFENCE (in that order) will likely do it. I assume this will only be useful with WC regions and NT memory accesses. – Impenetrable 18/9, 2019 at 17:35

Consider the following scenario - this is the critical case where speculative load execution can theoretically harm sequential consistency:

initially [x]=[y]=0

CPU0:                              CPU1: 
store [x]<--1                      store [y]<--1
load  r1<--[y]                     load r2<--[x]

Since x86 allows loads to be reordered with earlier stores to different addresses, both loads may return 0's. Adding an lfence alone after each store wouldn't prevent that, since they only prevent reordering within the same context, but since stores are dispatched after retirement you can have both lfences and both loads commit before the stores are performed and observed.

An mfence on the other hand would force the stores to perform, and only then allow the loads to be executed, so you'll see the updated data on at least one context.
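The scenario above can be tried out with a small C++ harness. This is only a sketch: `count_both_zero` is a hypothetical helper, and `std::atomic_thread_fence(std::memory_order_seq_cst)` stands in for the mfence (compilers typically emit MFENCE or a locked instruction for it on x86-64, which is an assumption about code generation).

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// x and y play the roles of [x] and [y] in the scenario.
std::atomic<int> x{0}, y{0};

// Runs the two-thread test `iterations` times and returns how many runs
// ended with both loads returning 0. When fence_between is true, a full
// fence separates each store from the load that follows it.
int count_both_zero(int iterations, bool fence_between) {
    assert(iterations >= 0);
    int both_zero = 0;
    for (int i = 0; i < iterations; ++i) {
        x.store(0, std::memory_order_relaxed);
        y.store(0, std::memory_order_relaxed);
        int r1 = -1, r2 = -1;

        std::thread cpu0([&] {
            x.store(1, std::memory_order_relaxed);
            if (fence_between) {
                std::atomic_thread_fence(std::memory_order_seq_cst);
            }
            r1 = y.load(std::memory_order_relaxed);
        });
        std::thread cpu1([&] {
            y.store(1, std::memory_order_relaxed);
            if (fence_between) {
                std::atomic_thread_fence(std::memory_order_seq_cst);
            }
            r2 = x.load(std::memory_order_relaxed);
        });
        cpu0.join();
        cpu1.join();

        if (r1 == 0 && r2 == 0) {
            ++both_zero;
        }
    }
    return both_zero;
}
```

With `fence_between` set to true the function returns 0, because at least one thread must see the other's store. With it set to false the both-zero outcome is allowed, although this simple harness may observe it only rarely, since thread start-up time dominates each run.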

As for sfences - as pointed out in the comment, in theory an sfence is not strong enough to prevent the load from reordering above it, so the load might still read stale data. While this is true as far as the official memory ordering rules go, I believe that current implementations of the x86 microarchitecture make it slightly stronger (while not committing to do so in the future, I guess). According to this description:

Because of the strong x86 ordering model, the load buffer is snooped by coherency traffic. A remote store must invalidate all other copies of a cache line. If a cache line is read by a load, and then invalidated by a remote store, the load must be cancelled, since it potentially read invalid data. The x86 memory model does not require snooping the store buffer.

Therefore, any load not yet committed in the machine should be snoopable by stores from other cores, thereby making the effective observation time of the load at the commit point, and not the execution point (which is indeed out of order and may have been performed much earlier). Commit is done in order, and therefore the load should be observed after previous instructions - making lfences pretty much useless as I said above in the comments, since the consistency can be maintained the same way without them. This is mostly speculation, trying to explain the common conception that lfences are meaningless in x86 - I'm not entirely sure where it originated and if there are other considerations at hand - would be happy for any expert to approve / challenge this theory.

All of the above applies only to WB (write-back) memory types, of course.

Openandshut answered 2/12, 2013 at 13:7 Comment(5)

SFENCE cannot be used to fix this example, because a load can migrate up across SFENCE. The example requires MFENCE to fix. – Piddle 2/12, 2013 at 15:26

@ArchD.Robison, in practice, the load would have to retire after any previous instruction, and during that time a store from the other thread would be able to snoop it - I think this would cause re-execution but I'm not entirely sure, so I fixed the answer as you suggested - thanks. – Openandshut 2/12, 2013 at 16:2

Thanks for the fix. I recommend coding against the architectural guarantees since the micro-architects are always looking for clever and mysterious ways to speed things up. – Piddle 2/12, 2013 at 17:0

@Openandshut What do you mean by "but since stores are dispatched after retirement"? Do you mean the result of the store is saved to the store buffer and then "dispatched" to the L1 cache later? – Merrie 22/12, 2014 at 1:23

@user997112, yes. The result of the store can be calculated speculatively, but the store itself must not be exposed and visible externally until it is committed (i.e. - all instructions up to it are known to be on the correct path and non faulting). Some machines can handle such stores in the cache perhaps using special bits to mark them as speculative (in case a snoop comes along), but the common solution is to keep the store in the buffer until it has committed. Also note that in architectures where stores are ordered like x86, dispatching from the buffer must also be done in-order. – Openandshut 22/12, 2014 at 7:13
