The cost of atomic counters and spinlocks on x86(_64)

Preface

I recently came across some synchronization problems, which led me to spinlocks and atomic counters. Then I searched a bit more into how these work, and found std::memory_order and memory barriers (mfence, lfence and sfence).

So now, it seems that I should use acquire/release for the spinlocks and relaxed for the counters.

Some reference

x86 MFENCE - Memory Fence
x86 LOCK - Assert LOCK# Signal

Question

What is the machine code (edit: see below) for those three operations (lock = test_and_set, unlock = clear, increment = operator++ = fetch_add) with the default (seq_cst) memory order, and with acquire/release/relaxed (in that order for those three operations)? What is the difference (which memory barriers, where) and the cost (how many CPU cycles)?

Purpose

I was just wondering how bad my old code (which does not specify a memory order, so seq_cst is used) really is, and whether I should create some class atomic_counter derived from std::atomic but using relaxed memory ordering (as well as a good spinlock with acquire/release instead of mutexes in some places... or use something from the Boost library - I have avoided Boost so far).
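
Something like this is what I have in mind for the counter (just a sketch; the class name and interface are mine, and I use composition instead of actually deriving from std::atomic):

#include <atomic>

class atomic_counter {
    std::atomic<int> value;
public:
    atomic_counter() : value(0) {}
    int operator++()    { return value.fetch_add(1, std::memory_order_relaxed) + 1; }
    int operator++(int) { return value.fetch_add(1, std::memory_order_relaxed); }
    int get() const     { return value.load(std::memory_order_relaxed); } // informative only, may be a few cycles old
};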

My Knowledge

So far I understand that spinlocks protect more than themselves (they protect some shared resource/memory as well), so there must be something that makes some memory view coherent for multiple threads/cores (that would be those acquire/release semantics and memory fences). An atomic counter just lives for itself and only needs the atomic increment (no other memory is involved, and I do not really care about the value when I read it; it is informative and can be a few cycles old, no problem). There is some LOCK prefix, and some instructions like xchg have it implicitly. Here my knowledge ends; I don't know how the cache and buses really work and what is behind them (but I know that modern CPUs can reorder instructions, execute them in parallel and use memory caches and some synchronization). Thank you for the explanation.

P.S.: I have an old 32-bit PC now and can only see lock addl and a simple xchg, nothing else - all versions look the same (except unlock); memory_order makes no difference on my old PC (except for unlock, where release uses a mov instead of xchg). Will that be true for a 64-bit PC? (edit: see below) Do I have to care about memory order? (answer: no, not much; release on unlock saves a few cycles, that's all.)

The Code:

#include <atomic>
using namespace std;

atomic_flag spinlock;
atomic<int> counter;

void inc1() {
    counter++;
}
void inc2() {
    counter.fetch_add(1, memory_order_relaxed);
}
void lock1() {
    while(spinlock.test_and_set()) ;
}
void lock2() {
    while(spinlock.test_and_set(memory_order_acquire)) ;
}
void unlock1() {
    spinlock.clear();
}
void unlock2() {
    spinlock.clear(memory_order_release);
}

int main() {
    inc1();
    inc2();
    lock1();
    unlock1();
    lock2();
    unlock2();
}

g++ -std=c++11 -O1 -S (32bit Cygwin, shortened output)

__Z4inc1v:
__Z4inc2v:
    lock addl   $1, _counter    ; both seq_cst and relaxed
    ret
__Z5lock1v:
__Z5lock2v:
    movl    $1, %edx
L5:
    movl    %edx, %eax
    xchgb   _spinlock, %al      ; both seq_cst and acquire
    testb   %al, %al
    jne L5
    rep ret
__Z7unlock1v:
    movl    $0, %eax
    xchgb   _spinlock, %al      ; seq_cst
    ret
__Z7unlock2v:
    movb    $0, _spinlock       ; release
    ret

UPDATE for x86_64: (see mfence in unlock1)

_Z4inc1v:
_Z4inc2v:
    lock addl   $1, counter(%rip)   ; both seq_cst and relaxed
    ret
_Z5lock1v:
_Z5lock2v:
    movl    $1, %edx
.L5:
    movl    %edx, %eax
    xchgb   spinlock(%rip), %al     ; both seq_cst and acquire
    testb   %al, %al
    jne .L5
    ret
_Z7unlock1v:
    movb    $0, spinlock(%rip)
    mfence                          ; seq_cst
    ret
_Z7unlock2v:
    movb    $0, spinlock(%rip)      ; release
    ret
Wilden answered 18/8, 2014 at 12:26 Comment(10)
If you want to know what the machine code is, why don't you just look at the result of the compilation??Shantae
@KerrekSB: That is only a small part of the question. I can see some fences and locks, but I want to know what they really do.Wilden
Using g++ -fverbose-asm -O1 -std=c++11 -S might give slightly more readable assembly code....Trona
Perhaps most of the actual cost is cache synchronization.Trona
@BasileStarynkevitch: yes, that was the purpose of the question. I don't know what I will see on a 64-bit PC right now (will probably check when I have the chance), but I want to know whether it is OK to just forget about it or whether I should use these memory_order marks (so the question is about the cost - how long it takes, how it synchronizes, what modern PCs do).Wilden
But then it depends upon the processor model and brand, and what other cores caches contain. In other words, it is not predictable or reproducible!Trona
It should be noted that the difference between memory ordering is mostly about the compiler being able to move around loads and stores (and, of course, also the instructions it uses, though less so on x86). Insofar a question of "how many cycles" is not really answerable. An atomic operation takes "tens to twenties" of cycles as compared to "one or two" for a normal operation, but it may be "zero" or "several hundreds", depending on context.Rapture
I don't care about ARM or whatever architecture now, just normal Intel/AMD PCs (x86 and x86_64, Intel i7, AMD QuadCore and such). I should look again at where I saw that the memory fence / lock is approx. 100 cycles.Wilden
And the context is: some threads on some cores incrementing the counter or locking the spinlock. Worst possible if you care (two threads on two cores accessing same counter / spinlock). So far it looks that there is almost no difference (except using release for the unlock).Wilden
Here is an older thread with a discussion of spinlocks, and where I timed them.Coakley

x86 has a mostly strong memory model: all the usual stores/loads implicitly have release/acquire semantics. The only exception is SSE non-temporal store operations, which require an sfence to be ordered as usual. All read-modify-write (RMW) instructions with the LOCK prefix imply a full memory barrier, i.e. seq_cst.

Thus on x86, we have

  • test_and_set can be coded with lock bts (for bit-wise operations), lock cmpxchg, or lock xchg (or just xchg, which implies the lock). Other spin-lock implementations can use instructions like lock inc (or dec) if they need e.g. fairness. It is not possible to implement try_lock with a release/acquire fence (at least you'd need a standalone memory barrier, mfence, anyway).
  • clear is coded with lock and (for bit-wise) or lock xchg, though more efficient implementations would use a plain write (mov) instead of a locked instruction.
  • fetch_add is coded with lock add.

Removing the lock prefix will not guarantee atomicity for RMW operations, so such operations cannot strictly be interpreted as having memory_order_relaxed in the C++ view. However, in practice you might want to access an atomic variable via a faster non-atomic operation when it is safe (in the constructor, or under a lock).

In our experience, it does not really matter exactly which RMW atomic operation is performed; they take almost the same number of cycles to execute (and mfence costs about 0.5x of a locked operation). You can estimate the performance of synchronization algorithms by counting the number of atomic operations (and mfences), and the number of memory indirections (cache misses).
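
If you want a rough feel for these numbers on your own machine, a small single-threaded micro-benchmark along these lines will do (a sketch, not part of the original answer; the names are made up, and the absolute numbers depend heavily on the CPU and on the cache/store-buffer state):

#include <atomic>
#include <chrono>
#include <cstdio>

static const long N = 10000000;

std::atomic<long> counter(0);
volatile long plain = 0;

template <typename F>
static void time_it(const char* name, F body) {
    auto t0 = std::chrono::steady_clock::now();
    body();
    std::chrono::duration<double, std::nano> dt = std::chrono::steady_clock::now() - t0;
    std::printf("%-10s %8.2f ns/op\n", name, dt.count() / N);   // average cost per operation
}

int main() {
    time_it("plain add", [] { for (long i = 0; i < N; ++i) plain = plain + 1; });          // addl
    time_it("lock add",  [] { for (long i = 0; i < N; ++i) counter.fetch_add(1); });       // lock addl
    time_it("mfence",    [] { for (long i = 0; i < N; ++i)
                                  std::atomic_thread_fence(std::memory_order_seq_cst); }); // mfence
}

Under contention from another core the locked operation becomes considerably more expensive again (cache-line ping-pong), so treat such numbers only as a lower bound.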

Godship answered 18/8, 2014 at 12:56 Comment(10)
Can you cite the reference for this? I've been led to believe that the memory model is not strong; that certain accesses can be reordered.Deicer
I can see that there is almost no difference on my 32-bit x86 (except the release unlock - a simple mov instead of xchg), but will x86_64 be the same? No memory fences and such? So is the answer that I don't really have to care much (especially about those counters)?Wilden
@JamesKanze, yes, it's not totally strong, loads can be reordered before writes, but otherwise it's quite restricted.Godship
So, if I understand it correctly, that xchg (or lock bts/cmpxchg) works as a memory fence (read-modify-write instruction): nothing can be reordered across it. The lock is safe and the unlock is not that important, because nothing between lock-unlock can move outside (the unlock could possibly be seen later, if observing but not trying to lock it, I don't know, but the lock is safe without any other memory fence). Same with those counters: no matter if I use relaxed or default, it is atomic and nothing can be reordered across the line. That's it?Wilden
@firda, you are totally right. Also updated with a little about performance of the atomics.Godship
One more question before I accept the answer: You have mentioned that try_lock with release/acquire is not possible without mfence. Can you explain? I have seen some answer possibly related to this which explained that those memory_order arguments for atomic operations talk about other instructions around that atomic, but the atomic itself is RMW and thus... kind of a fence (not sure exactly how to say that).Wilden
@firda, I meant Peterson's lock which does not use atomics but still requires the memory barrier. It is practical for synchronizing only two threads. For more, it gets more complicated.Godship
"For release/relaxed operations, just remove the lock prefix.". This seems wrong to my understanding. Read/Modify/Write operations, even memory_order_relaxed, need to have "Lock" on them in x86-64. Because internally, the x86 CPU will split the operation into uOps, and those uOps do NOT execute atomically.Algerian
yeah, you are right. I said it probably about R/W ops, not RMW; or not about C++ definition of 'relaxed'. Thanks, will fix.Godship
I think fetch_add is translated to lock add only if the returned value is not used. Otherwise, lock xadd is used (at least what I have seen with GCC).Brawn

I recommend: x86-TSO: A Rigorous and Usable Programmer's Model for x86 Multiprocessors.

Your x86 and x86_64 are indeed pretty "well behaved". In particular, they do not re-order write operations (and any speculative writes are discarded while they are in the cpu/core's write-queue), and they do not re-order read operations. However, they will start read operations as early as they can, which means that reads and writes can be re-ordered. (A read of something sitting in the write-queue reads the queued value, so reads/writes of the same location are not re-ordered.) So:

  • read-modify-write operations require LOCKs which makes them, implicitly, memory_order_seq_cst.

    So for these operations you gain nothing by weakening the memory ordering (on the x86/x86_64). The general advice is to "keep it simple" and stick with memory_order_seq_cst, which happily is not costing anything extra for the x86 and x86_64.

    For anything newer than a Pentium, if the cpu/core already has "exclusive" access to the affected memory, the LOCK does not affect other cpus/cores, and may be a relatively simple operation.

  • memory_order_acquire/_release do not require an mfence or any other overhead.

    So, for atomic load/store, if acquire/release is sufficient, then for the x86/x86_64 those operations are "tax free".

  • memory_order_seq_cst does require mfence...

...which is worth understanding.
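
To see what this means in instruction terms, compiling plain atomic loads/stores with g++ -O1 -S (as in the question) gives roughly the following on x86_64 (a sketch of my own, not part of the original answer; the exact output depends on the compiler version):

#include <atomic>

std::atomic<int> x;

int  load_acquire()       { return x.load(std::memory_order_acquire); }  // plain mov
void store_release(int v) { x.store(v, std::memory_order_release); }     // plain mov
void store_seq_cst(int v) { x.store(v, std::memory_order_seq_cst); }     // mov + mfence (or an xchg)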

(NB: we are here talking about what the processor does with the instructions generated by the compiler. The compiler's re-ordering of operations is a very similar issue, but not addressed here.)

An mfence stalls the cpu/core until all pending writes are cleared out of the write-queue. In particular, any read operations which follow the mfence will not start until the write-queue is empty. Consider two threads:

  initial state: wa = wb = 0

  thread 'A'                    thread 'B'
    wa = 1 ;  (mov [wa] ← 1)      wb = 1 ;   (mov [wb] ← 1)
    a  = wb ; (mov ebx ← [wb])    b  = wa ;  (mov ebx ← [wa])

Left to their own devices, the x86/x86_64 can produce any of (a = 1, b = 1), (a = 0, b = 1), (a = 1, b = 0) and (a = 0, b = 0). The last is invalid if you expect memory_order_seq_cst -- since you cannot get that by any interleaving of the operations. The reason this can happen is that the writes of wa and wb are queued in the respective cpu's/core's queue, and the reads of wa and wb can both be scheduled and can both complete before either write. To achieve memory_order_seq_cst you need an mfence:

  thread 'A'                    thread 'B'
    wa = 1 ;  (mov [wa] ← 1)      wb = 1 ;   (mov [wb] ← 1)
        mfence ;                      mfence
    a  = wb ; (mov ebx ← [wb])    b  = wa ;  (mov ebx ← [wa])

Since there is no synchronization between the threads, the result may be anything except (a = 0, b = 0). Interestingly, the mfence is for the benefit of the thread itself, because it prevents the read operation starting before the write completes. The only thing that other threads care about is the order in which writes occur, and the x86/x86_64 does not re-order those in any case.

So, to implement memory_order_seq_cst atomic_load() and atomic_store(), it is necessary to insert an mfence after one or more stores and before a load. Where these operations are implemented as library functions, the common convention is to add the mfence to all stores, leaving the load "naked". (The logic being that loads are more common than stores, and it seems better to add the overhead to the store.)
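
For reference, the two-thread example above written with C++11 atomics might look like this (a sketch of my own, not part of the original answer; compile with -pthread):

#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int> wa(0), wb(0);
int a, b;

void thread_A() {
    wa.store(1, std::memory_order_seq_cst);  // with memory_order_relaxed the (0, 0) outcome becomes possible
    a = wb.load(std::memory_order_seq_cst);
}

void thread_B() {
    wb.store(1, std::memory_order_seq_cst);
    b = wa.load(std::memory_order_seq_cst);
}

int main() {
    std::thread ta(thread_A), tb(thread_B);
    ta.join();
    tb.join();
    assert(!(a == 0 && b == 0));  // guaranteed to hold with seq_cst; may fire (eventually) with relaxed
}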


For spin-locks, at least, your question seems to boil down to whether a spin-unlock operation requires an mfence, or not, and what difference it makes.

The C11 atomic_flag_clear() is, implicitly, memory_order_seq_cst, for which an mfence is required. The C11 atomic_flag_test_and_set() is not only a read-modify-write operation but is also implicitly memory_order_seq_cst -- and LOCK does that.

C11 does not offer a spin-lock in the threads.h library. But you can use an atomic_flag -- though for your x86/x86_64 you have the PAUSE instruction problem to deal with. The question is, do you need memory_order_seq_cst for this, in particular for the unlock? I think the answer is no, and that the trick is to do: atomic_flag_test_and_set_explicit(xxx, memory_order_acquire) and atomic_flag_clear_explicit(xxx, memory_order_release).
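
In C++11 terms (a sketch of my own, using std::atomic_flag and the _mm_pause() intrinsic for the PAUSE hint; not code from this answer), that spin-lock looks like:

#include <atomic>
#include <immintrin.h>   // _mm_pause()

std::atomic_flag lock_flag = ATOMIC_FLAG_INIT;

void spin_lock() {
    while (lock_flag.test_and_set(std::memory_order_acquire))  // locked RMW; acquire is enough
        _mm_pause();                                            // the PAUSE hint for the busy-wait
}

void spin_unlock() {
    lock_flag.clear(std::memory_order_release);                 // plain mov, no mfence
}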

FWIW, the glibc pthread_spin_unlock() does not have an mfence. Nor does the gcc __sync_lock_release() (which is explicitly a "release" operation). But the gcc __atomic_clear() is aligned with the C11 atomic_flag_clear(), and takes a memory order parameter.

What difference does the mfence make to the unlock ? Clearly it's very disruptive to the pipe-line, and since it's not necessary, there's not much to be gained working out the exact scale of its impact, which will depend on the circumstances.

Shostakovich answered 25/8, 2014 at 14:16 Comment(6)
Thx for more light on the topic. The conclusion seems to be to leave the counters as they are (RMW = LOCK), and use spinlocks from some library (e.g. pthread_spinlock_t).Wilden
You said "For anything newer than a Pentium, if the cpu/core already has "exclusive" access to the affected memory, the LOCK does not affect other cpus/cores, and may be a relatively simple operation" That's not correct. Intel's system manual states that not only are locked operations ordered across all cpus, but no mov from or to memory can be moved across a locked instruction from any cpu as seen by another cpu, therefor locked instructions are much stronger than you said, they wait to drain the store buffer on all cpus, all cpus therefor get a hit. It even says that for "p6 family".Ilanailangilang
@JoshS: locking produces all that wonderful ordering, and when (but only when) more than one CPU is attempting to perform a locked operation, on the same piece of memory, then there is an interaction. See section 8.1.4 of the Intel "System Programming Guide", where it talks about “cache locking”. I note Section 8.2.2 of the same manual -- where it discusses ordering of locked operations -- but I don't believe that does imply that all CPUs are forced to do, effectively, an mfence as you describe.Shostakovich
Looking closer, you may be right. This page which is often pointed to from stackoverflow implies that I'm right, but I don't really know:Ilanailangilang
Stupid rule didn't let me finish the comment, hit return by mistake then ran over 5 minutes. grr!!!! Here's that page ... cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html It says that (intel order instructions) lock xadd [add],0 (for loading) and mov [add],value (storing) form a sequential access pair that's totally ordered across processors. It also claims (even less likely) that mfence; mov reg,[add] can replace the locked xadd. If either of these is true, then you could do WONDERFUL things, create biased locks without the horrors the jvm does to lock from the slow side.Ilanailangilang
@JoshS the x86 memory model is very strong: it doesn't re-order reads or writes (for a CPU), and all writes to "global" memory are visible to all CPUs at the same time. There is a write-queue (in code order) so all writes to "global" memory are from the queue (in code order). There is no read-queue. The process of emptying the write-queue proceeds independently of its CPU, except where an mfence stalls the CPU until the write-queue is empty. As noted above, the mfence ensures the CPU's next read does not occur before all its previous writes are visible to all other CPUs.Shostakovich

A spinlock does not use mfence; mfence only enforces serialization/flushing of the current core's data. The fence itself does not in any way relate to atomic operations.

For a spinlock you need some kind of atomic action to exchange data with a memory location. There are many different implementations, targeted at different requirements: for example, does it work in kernel or user space? Is it a fair lock?

A very simple and dumb spinlock for x86 looks like this (my kernel uses this):

#include <stdint.h>

typedef volatile uint32_t _SPINLOCK __attribute__ ((aligned(16)));
static inline void _SPIN_LOCK(_SPINLOCK* lock) {
__asm__ __volatile__ (
       "cli\n"                 /* kernel-only: mask interrupts on this core    */
       "lock btsl $0, %0\n"    /* atomically test-and-set bit 0; CF = old bit  */
       "jnc 1f\n"              /* bit was clear -> we own the lock             */
       "0:\n"
       "pause\n"               /* spin-wait hint recommended by the CPU vendor */
       "testl $1, %0\n"        /* read-only check while the lock is held       */
       "jnz 0b\n"              /* still taken -> keep spinning                 */
       "lock btsl $0, %0\n"    /* looks free -> try to take it again           */
       "jc 0b\n"               /* lost the race -> back to spinning            */
       "1:\n"
       : "+m"(*lock)
       :
       : "cc", "memory"
       );
}

The logic is simple

  1. Test and set a bit; if it was zero, the lock was not taken and we got it.
  2. If the bit is not zero, the lock is taken by someone else; pause is a hint recommended by the CPU manufacturer so that we don't burn the CPU in a tight loop.
  3. Loop until you get the lock.

Note 1. You may also implement a spinlock with intrinsics and extensions; it should be fairly similar (see the sketch below).
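
For example, a sketch using the gcc builtins instead of inline assembly (my reading of "intrinsics and extensions", not the answer's own code):

#include <stdint.h>

typedef volatile uint32_t _SPINLOCK2;

static inline void _SPIN_LOCK2(_SPINLOCK2* lock) {
    while (__sync_lock_test_and_set(lock, 1))  /* xchg; acquire barrier, returns old value */
        while (*lock)                          /* read-only wait while the lock is taken   */
            __builtin_ia32_pause();            /* the PAUSE hint                           */
}

static inline void _SPIN_UNLOCK2(_SPINLOCK2* lock) {
    __sync_lock_release(lock);                 /* plain store of 0; release barrier        */
}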

Note 2. A spinlock is not judged by cycles; a sane implementation should be quite fast. For instance, with the above implementation you should grab the lock on the first try in a well-designed usage; if not, fix the algorithm or split the lock to prevent/reduce lock contention.

Note 3. You should also consider other things like fairness.

Digressive answered 18/8, 2014 at 12:41 Comment(2)
Is this valid for x86_64? Maybe I should mention that explicitly in my question (I don't know which architecture uses memory fences; I want an answer for all Intel/AMD PCs I can buy now - both 32/64-bit... excluding Itanium, just those I can encounter).Wilden
For this particular implementation, yes, it works for both x86 (32-bit protected flat mode) and x86_64 (long mode) at the source code level; I have not checked 32e (long compatibility) mode yet.Digressive

Re

and the cost (how many CPU cycles)?

On x86 at least, instructions that perform memory synchronization (atomic ops, fences) have very variable CPU-cycle latency. They wait for the processor's store buffers to be flushed to memory, and this varies dramatically depending on the store buffer content.

E.g., if an atomic op is straight after a memcpy() that pushes multiple cache lines out to main memory, the delay may be in the 100's of nanoseconds. The same atomic op, but after a series of register-only arithmetic instructions, may take only a few clock cycles.

Homunculus answered 27/12, 2017 at 9:31 Comment(0)
