Preface
I recently came across some synchronization problems, which led me to spinlocks and atomic counters. Searching a bit further into how these work, I found std::memory_order and memory barriers (mfence, lfence and sfence).
So now, it seems that I should use acquire/release for the spinlocks and relaxed for the counters.
Some references
x86 MFENCE - Memory Fence
x86 LOCK - Assert LOCK# Signal
Question
What is the machine code (edit: see below) for those three operations (lock = test_and_set, unlock = clear, increment = operator++ = fetch_add) with the default (seq_cst) memory order and with acquire/release/relaxed (in that order for those three operations)? What is the difference (which memory barriers, where) and the cost (how many CPU cycles)?
Purpose
I was just wondering how bad my old code (which does not specify a memory order, so seq_cst is used) really is, and whether I should create some class atomic_counter derived from std::atomic but using relaxed memory ordering (as well as a good spinlock with acquire/release instead of mutexes in some places... or use something from the Boost library - I have avoided Boost so far).
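For illustration, here is a minimal sketch of what I have in mind (the class name atomic_counter and its interface are just my assumption, not an existing type): it derives from std::atomic<int> and hard-wires relaxed ordering for the postfix increment.

#include <atomic>

// Illustrative sketch only: wraps std::atomic<int> and forces relaxed ordering
// for the increment; the reading side also uses relaxed, since staleness is fine here.
struct atomic_counter : std::atomic<int> {
    using std::atomic<int>::atomic;          // inherit constructors
    int operator++(int) {                    // counter++ -> relaxed fetch_add
        return fetch_add(1, std::memory_order_relaxed);
    }
    int read() const {                       // informative read, may be a few cycles old
        return load(std::memory_order_relaxed);
    }
};

With such a wrapper, counter++ in inc1() below should generate the same code as the explicit fetch_add(1, memory_order_relaxed) in inc2(); whether deriving or plain composition is nicer is a separate design question.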
My Knowledge
So far I understand that a spinlock protects more than just itself (it also protects some shared resource/memory), so there must be something that makes the memory view coherent across multiple threads/cores (that would be those acquire/release orderings and memory fences). An atomic counter just lives for itself and only needs the atomic increment (no other memory is involved, and I do not really care about the value when I read it; it is informative and may be a few cycles old, no problem). There is the LOCK prefix, and some instructions like xchg have it implicitly. Here my knowledge ends; I don't know how the caches and buses really work or what is behind them (but I do know that modern CPUs can reorder instructions, execute them in parallel, and use memory caches and some synchronization). Thank you for the explanation.
P.S.: I have an old 32-bit PC now and can only see lock addl and a simple xchg, nothing else - all versions look the same (except unlock); memory_order makes no difference on my old PC (except for unlock, where release uses a plain mov instead of xchg). Will that be true for a 64-bit PC? (edit: see below) Do I have to care about memory order? (answer: no, not much; release on unlock saves a few cycles, that's all.)
The Code:
#include <atomic>
using namespace std;

atomic_flag spinlock = ATOMIC_FLAG_INIT;  // explicit init required in C++11
atomic<int> counter;

void inc1() {
    counter++;                                   // default: seq_cst
}
void inc2() {
    counter.fetch_add(1, memory_order_relaxed);  // relaxed
}
void lock1() {
    while (spinlock.test_and_set()) ;            // default: seq_cst
}
void lock2() {
    while (spinlock.test_and_set(memory_order_acquire)) ;  // acquire
}
void unlock1() {
    spinlock.clear();                            // default: seq_cst
}
void unlock2() {
    spinlock.clear(memory_order_release);        // release
}
int main() {
    inc1();
    inc2();
    lock1();
    unlock1();
    lock2();
    unlock2();
}
g++ -std=c++11 -O1 -S (32bit Cygwin, shortened output)
__Z4inc1v:
__Z4inc2v:
lock addl $1, _counter ; both seq_cst and relaxed
ret
__Z5lock1v:
__Z5lock2v:
movl $1, %edx
L5:
movl %edx, %eax
xchgb _spinlock, %al ; both seq_cst and acquire
testb %al, %al
jne L5
rep ret
__Z7unlock1v:
movl $0, %eax
xchgb _spinlock, %al ; seq_cst
ret
__Z7unlock2v:
movb $0, _spinlock ; release
ret
UPDATE for x86_64: (see the mfence in unlock1)
_Z4inc1v:
_Z4inc2v:
lock addl $1, counter(%rip) ; both seq_cst and relaxed
ret
_Z5lock1v:
_Z5lock2v:
movl $1, %edx
.L5:
movl %edx, %eax
xchgb spinlock(%rip), %al ; both seq_cst and acquire
testb %al, %al
jne .L5
ret
_Z7unlock1v:
movb $0, spinlock(%rip)
mfence ; seq_cst
ret
_Z7unlock2v:
movb $0, spinlock(%rip) ; release
ret
g++ -fverbose-asm -O1 -std=c++11 -S might give slightly more readable assembly code. – Trona