What is the difference in logic and performance between LOCK XCHG and MOV+MFENCE? [duplicate]

What is the difference in logic and performance between the x86 instructions LOCK XCHG and MOV+MFENCE for doing a sequential-consistency store?

(We ignore the load result of the XCHG; compilers other than gcc use XCHG this way just for the store + memory-barrier effect.)

Is it true that, for sequential consistency during the execution of an atomic operation, LOCK XCHG locks only a single cache line, whereas MOV+MFENCE locks the whole L3 cache (LLC)?
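For reference, here is a minimal C++11 sketch of the two code sequences being compared (the variable and function names are only illustrative, and the generated-code comments assume an x86-64 target with the compiler behaviour described above):

    #include <atomic>

    std::atomic<long> shared_value{0};

    void seq_cst_store(long v) {
        // Source-level sequentially-consistent store; the compiler chooses the sequence.
        shared_value.store(v, std::memory_order_seq_cst);
        // GCC 4.8.2-style code generation:
        //   mov   [shared_value], rdi   ; plain aligned store (already atomic)
        //   mfence                      ; full barrier for sequential consistency
        // Other compilers instead emit:
        //   xchg  [shared_value], rdi   ; implicitly LOCKed: store + full barrier;
        //                               ; the value loaded by XCHG is simply ignored
    }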

Hoofer answered 30/9, 2013 at 13:56 Comment(10)
Apples and oranges, MFENCE doesn't provide atomicity.Babbittry
@Hans Passant I didn't say that MFENCE provides atomicity, because MOV is already atomic; we can see this in C11(atomic)/C++11(std::atomic) for every ordering on x86 except SC (sequential consistency): en.cppreference.com/w/cpp/atomic/memory_order But I said that MFENCE provides sequential consistency for atomic variables, as we can see for C11(atomic)/C++11(std::atomic) in GCC 4.8.2: stackoverflow.com/questions/19047327/…Hoofer
mov may be atomic for what it does, but xchg can't be expressed as a single mov.Harter
(I'm not even sure if mov is atomic for unaligned access, by the way.)Harter
@Kerrek SB We can replace MOV+MFENCE (the SC sequence in GCC 4.8.2) with LOCK XCHG for SC, as we can see in the video where, at 0:28:20, it is said that MFENCE is more expensive than XCHG: channel9.msdn.com/Shows/Going+Deep/…Hoofer
@Alex, see also here - stackoverflow.com/questions/19059542/…Fipple
I thought LOCK was implicit with XCHG? Does specifying LOCK XCHG actually do anything different than just an XCHG?Expert
@BrianKnoblauch: Yes, lock is already implicit for xchg [mem], reg. Hopefully when people say LOCK XCHG, they're just talking about the implied behaviour. I'm not sure if any assemblers will omit the lock prefix from the machine code if you write lock xchg, but they could.Inspan
@KerrekSB: This question is asking about 2 methods for doing a seq_cst store, where we ignore the load result of the xchg and just use it to do a store + memory barrier. Turns out it's more efficient to use xchg on Intel Skylake at least, where mfence blocks out-of-order exec of independent non-memory instructions. I'm closing this as a dup for now because I addressed this in an answer on a related question, but maybe this question deserves its own answer. Which is a better write barrier on x86: lock+addl or xchgl? is related.Inspan
@PeterCordes: Sure, makes sense, thanks.Harter

The difference is in the purpose of use.

MFENCE (or SFENCE or LFENCE) is useful when we are locking a region of memory that is accessible from two or more threads. Once we have atomically taken the lock for this memory region, we can use ordinary non-atomic instructions, because they are faster. But we must issue SFENCE (or MFENCE) just before unlocking the memory region, to ensure that the protected memory is correctly visible to all other threads.

If we are changing only a single aligned variable in memory, then we use atomic instructions like LOCK XCHG, so no lock on a whole memory region is needed.
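For example, here is a minimal sketch of that pattern (a simple spinlock around a plain, non-atomic counter; the names are only illustrative and do not come from any particular library):

    #include <atomic>

    std::atomic<int> lock_word{0};   // 0 = unlocked, 1 = locked
    long plain_counter = 0;          // ordinary data; no atomics needed inside the lock

    void locked_increment() {
        // Acquire: atomic exchange (an implicitly LOCKed XCHG on x86).
        while (lock_word.exchange(1, std::memory_order_acquire) != 0) {
            // spin until the current owner releases the lock
        }

        plain_counter += 1;          // non-atomic work inside the critical section

        // Release: make the writes above visible before the lock looks free.
        // (On x86 a plain release store is enough; the explicit SFENCE/MFENCE
        // before unlocking described above serves the same purpose.)
        lock_word.store(0, std::memory_order_release);
    }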

Earthman answered 30/9, 2013 at 16:26 Comment(12)
Do you mean that if we want sequential consistency for a large area of memory (8 bytes to 1 MB and more), the best performance is with MFENCE, and that if we want sequential consistency for a small area of memory such as a single variable (1 byte (char) to 8 bytes (long long)), it is better to use LOCK XCHG? Because LOCK locks only a single cache line, but MFENCE locks the whole L3 cache (LLC).Hoofer
@Alex: Yes, MFENCE only ensures that loads from memory and stores to memory are guaranteed to be correctly visible to all threads after the execution of that instruction. MFENCE has nothing in common with atomic instructions.Earthman
No, an x86 lock is in itself an mfence (it's even said in the video here), so you don't need another one (let alone any one-directional fence at entry/exit of critical sections). Also, there's no such thing as locking the L3; mfence does not lock anything (so it does not ensure any atomicity), it just ensures serialization of all memory operations in the thread that used it.Fipple
@Fipple I know that MFENCE = LFENCE (getting data from the L1/L2 caches of other CPU cores into our own core's L1/L2 for Invalid cache lines) + SFENCE (dissemination of our own core's Modified cache lines to the L1/L2 caches of the other cores). But doesn't MFENCE lock the bus while these Invalid/Modified cache lines are being updated, i.e. block the L3 cache and RAM for the whole duration of MFENCE's execution? Because the ordering of such an exchange is very important for sequential consistency, i.e. while MFENCE is executing on one core, the other cores can't launch MFENCE.Hoofer
@Fipple I'm not saying that MFENCE locks the bus for any other instructions; I'm saying that, for itself, it locks the RAM bus and L3 (LLC). Only LOCK locks for the instruction it prefixes, and it locks only the single cache line for that memory cell, holding the cache line in the Modified state for the duration of execution: LOCK XCHG, LOCK XADD, LOCK CMPXCHG.Hoofer
@Alex, I think you got it mixed up a little - fences are creatures of the ISA, x86 in this case. Caches are implementation details, and are "under the hood" mostly. Any x86 load/store operation will collect coherent data from other cores/sockets thanks to a MESI/snoop protocol. Modified lines in your own core are also maintained by that protocol (although there is an ISA hook to flush them out - but that's wbinvd/clflush, not sfence). Either way, the exact behavior of the HW may differ between products (but most modern CPUs don't have to go with expensive bus locks for these ops).Fipple
@Fipple You are right for old single-core processors, where the cache-coherence protocol was MESI and "snoops" were used for other devices, but for multi-core processors the concurrency problem is solved through the MOESI/MESIF protocols. By using the LOCK prefix, the Owned/Forward/Modified state is held (locked) on the current cache line for the duration of the atomic operation. Similarly (but for the L3 cache shared by all CPU cores, and for RAM) there should be a strict ordering of instructions: MFENCE from CoreX, MFENCE from CoreY, MFENCE from CoreZ... How is that provided?Hoofer
MESIF/MOESI allow some optimization in HW, but are not relevant here - a lock will hold any line in place regardless of state. However, I don't agree with your 2nd part - MFENCE applies only for the program order in a given thread, not others. It may help in some consistency cases (as I wrote here - stackoverflow.com/questions/19059542/…), but only because it serializes each thread internally, not through any atomicity, or "cache locking" as you insinuate. If you think otherwise, please open a question with an example.Fipple
@Fipple If "MFENCE applies only for the program order in a given thread", and thread-1 and thread-2 perform an SFENCE at the same time (simultaneously), then how can this SFENCE ensure that thread-3 will see the data in its L1/L2 cache first from thread-1 (SFENCE) and then from thread-2 (SFENCE), and that thread-4 receives the data in the same order, if the two SFENCEs on the two threads (1 & 2) are performed simultaneously?Hoofer
@Hoofer - well, barriers are one example.Fipple
@Hoofer SFENCE is an ordered flush of local outstanding writes to shared memory. Two cores can do simultaneous SFENCE and a third core will see the writes interleaved. Intel says: "Writes from an individual processor are NOT ordered with respect to the writes from other processors."Megawatt
In addition to being wrong, this doesn't answer the question.Opportunist
