The two things `pthread_spin_lock` does differently on contention (sketched in code below) are:

- It runs a `pause` instruction before retrying the atomic RMW.
- In the retry loop, it checks read-only that the lock looks available before attempting another atomic RMW, which needs to take exclusive ownership of the cache line, stopping the other thread from even reading it for a while.
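Putting those two together, the behavior looks roughly like this in C++ terms. This is only a sketch of what's described here and in the asm details below (decrement-to-lock, 1 = unlocked), not glibc's actual implementation:

```cpp
#include <atomic>
#include <immintrin.h>   // _mm_pause

// Sketch only, NOT glibc source: pthread_spinlock_t is a 32-bit int,
// 1 = unlocked, taken by an atomic decrement that succeeds when it hits 0.
void spin_lock(std::atomic<int>& lock)
{
    // fetch_sub returns the old value: old == 1 means we just took it (1 -> 0)
    while (lock.fetch_sub(1, std::memory_order_acquire) != 1) {
        do {
            _mm_pause();                                       // back off before re-checking
        } while (lock.load(std::memory_order_relaxed) != 1);   // read-only wait, no RMW
    }
}

void spin_unlock(std::atomic<int>& lock)
{
    lock.store(1, std::memory_order_release);                  // plain mov store on x86
}
```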
(I'm assuming you're on an x86-64 CPU? You didn't mention it. But I'm guessing Intel, not AMD, based on how much `pause` helps in this weird case where fine-grained multithreading actively hurts.)
Differences that don't matter:

- `pthread_spin_lock` uses `lock dec dword ptr [rdi]` as its atomic RMW. `flag.test_and_set` uses `xchg` to store and get the old value in a register for `test al,al`.
- `pthread_spinlock_t` is a 32-bit type, vs. `atomic_flag` being 8-bit. (`lock dec` requires a wide-enough type to not wrap around back to `1` = unlocked.)
- Non-difference: both unlock with a plain x86 `mov` store, since they only need `release` semantics. (A seq_cst store like `flag.clear(seq_cst)` would be done with `xchg`, since the implicit `lock` prefix makes it a full memory barrier; see the sketch below.)
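To make that last point concrete, here's what the two unlock flavors look like in source (a sketch; the function names are mine, and the expected asm is per the point above):

```cpp
#include <atomic>

std::atomic_flag flag = ATOMIC_FLAG_INIT;

void unlock_release() {
    flag.clear(std::memory_order_release);   // expected: plain mov store on x86
}

void unlock_seq_cst() {
    flag.clear(std::memory_order_seq_cst);   // expected: xchg, a full memory barrier
}
```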
I found what it does by setting a breakpoint before the call to `pthread_spin_lock` and single-stepping the asm in GDB (`layout asm`). The asm is also visible with `objdump -drwC -Mintel /lib/libc.so.6 | less` and searching for `spin_lock`.
Why these matter so much in this case
This case of extreme contention (threads trying to take the lock again right after unlocking, with no useful work in between) magnifies the effect of these differences.
When one thread backs off due to the first attempt failing to get the lock, it gives the other thread time to complete multiple iterations, releasing and re-taking the lock without contention.
In Skylake and later Intel CPUs, `pause` pauses the front-end for about 100 cycles, up from about 5 in Broadwell. AFAIK, current AMD CPUs still use a pretty short `pause`, so I'd expect the effect to be a lot less pronounced on a Ryzen.
(`xchg mem,reg` throughput is one per 18 cycles on current Intel (https://uops.info/) if done back to back with no other memory ops in between to drain from the store buffer. In our case there's a load and store, but those probably hit in cache since they were allocated sequentially, so the load-use latency is pretty short. Normally linked lists suck because they make a long chain of load latencies.)
So one back-off by another thread lets the thread holding the lock probably complete that iteration and then another 2 or 3 without disturbance, keeping exclusive ownership of the cache line (with Intel's `pause` time).
When the waiting thread only checks read-only for availability, it disturbs the lock holder much less, since a read only needs the cache line in Shared state.
Both `flag` and `the_list` are probably in the same cache line. We could try aligning them both by 128 to avoid that, but it makes no measurable difference. (Cache lines are 64 bytes, but the L2 spatial prefetcher likes to complete an aligned pair of cache lines. If you were going to define `std::hardware_destructive_interference_size`, 128 would be a good choice for current x86-64.)
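If you wanted to try that alignment experiment, it could look something like this (a sketch; `Node` here is a hypothetical stand-in for whatever the question's actual list node type is):

```cpp
#include <atomic>

struct Node { int data; Node* next; };   // hypothetical stand-in for the real node type

// 128-byte alignment puts each object in its own aligned pair of cache lines,
// so the lock and the list head can't share (or prefetch-pair with) a line.
alignas(128) std::atomic_flag flag = ATOMIC_FLAG_INIT;
alignas(128) Node* the_list = nullptr;
```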
Those things speed up the `atomic_flag` version to match pthread
Just adding `_mm_pause()` from `<immintrin.h>` into the spin-wait loop, making it `while(flag.TAS()){ _mm_pause(); }`, speeds it up from about 1.13-1.20 sec to about 0.58 sec on my Skylake i7-6700k (Linux 6.5.3, glibc 2.38).
- 1.13 to 1.20 sec - your hand-rolled `atomic_flag` spinlock
- 0.57 to 0.58 sec - that with `_mm_pause()`
- 0.31 sec - spin read-only with `pause` on contention
- 0.33 sec - `pthread_spinlock_t`, which also spins read-only with `pause`, but with function-call overhead
Adding the read-only test before spin-retry speeds it up all the way, making it even faster than the pthread version since there's no function-call overhead.
```cpp
while (flag.test_and_set(std::memory_order_acquire)) {
    do {
        _mm_pause();   // don't retry the RMW until it might succeed
    } while (flag.test(std::memory_order_relaxed));
}
```
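For reference, the same loop packaged as a minimal lock class (my own sketch, not code from the question), with the plain release-store unlock discussed earlier:

```cpp
#include <atomic>
#include <immintrin.h>   // _mm_pause

class Spinlock {
    std::atomic_flag flag = ATOMIC_FLAG_INIT;
public:
    void lock() {
        while (flag.test_and_set(std::memory_order_acquire)) {   // atomic RMW
            do {
                _mm_pause();                                      // back off between checks
            } while (flag.test(std::memory_order_relaxed));       // read-only wait (C++20)
        }
    }
    void unlock() {
        flag.clear(std::memory_order_release);                    // plain mov store on x86
    }
};
```

Since it has `lock()`/`unlock()`, it works with `std::lock_guard<Spinlock>`.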
You could experiment with doing 2 or 4 pauses per check, like `_mm_pause(); _mm_pause();`, to further magnify this effect. Or pin both threads to the same core so they can't contend with each other, like `taskset -c 2 ./spinlock_custom` vs. `-c 1,2` to allow cores #1 and #2. (But that will often mean a context switch while holding the lock, leading to a fully wasted timeslice for the other thread since we don't `sched_yield()` even after hundreds of spin iterations. That's why it's actually slightly slower to run with both threads pinned to a single core.)
4x `_mm_pause();` makes this hand-rolled spinlock microbenchmark complete another 1.5x faster, since we're trading fairness for throughput, and we know there's another thread that will also be hammering on the lock indefinitely. In the normal case, by contrast, we'd be aiming for a backoff time where the other thread will probably be done, or where this burst of contention has ended. But here it's not a burst, it's constant contention. Pausing longer just means taking turns on a coarser time scale, bouncing the cache line back and forth less often. And we have no useful work we could be doing instead of pausing. So the only useful work is serialized, and multithreading + locking makes it much slower; the farther we get away from actual multithreading, the better our throughput.
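That 4x variant is just the inner wait loop with the pauses unrolled, something like:

```cpp
do {
    _mm_pause(); _mm_pause(); _mm_pause(); _mm_pause();   // coarser back-off per read-only check
} while (flag.test(std::memory_order_relaxed));
```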
So a benchmark like this would be a poor choice for making tuning decisions for a general-purpose spinlock. It's totally fine for comparing two different implementations to see how they differ and what effect that has on this situation, though. (pthread's choices are normally good in general; that's why they do it. They also happen to help a lot for this artificial case.)
`flag.test()` is a C++20 feature; I had to compile with `g++ -O2 -std=gnu++20 -pthread`. In earlier C++ revisions, simply use `std::atomic<bool>` with `.exchange(acquire)`, `.load(relaxed)`, and `.store(0, release)`.
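A pre-C++20 version with `std::atomic<bool>` might look like this (a sketch using the operations listed above; the variable and function names are mine):

```cpp
#include <atomic>
#include <immintrin.h>   // _mm_pause

std::atomic<bool> locked{false};

void lock() {
    while (locked.exchange(true, std::memory_order_acquire)) {   // like flag.test_and_set(acquire)
        do {
            _mm_pause();
        } while (locked.load(std::memory_order_relaxed));        // like flag.test(relaxed)
    }
}

void unlock() {
    locked.store(false, std::memory_order_release);              // like flag.clear(release)
}
```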
Some primitive ISAs (or early versions of them) only provide `xchg`/swap instructions, or an actual test-and-set where the value to be swapped is baked in. Either is sufficient for the operations `atomic_flag` provides, hence it being the only guaranteed `always_lock_free` type. But modern versions of all mainstream ISAs have `always_lock_free` for power-of-2 types up to pointer width at least, and some for types two pointers wide.
Terminology: a spinlock can't be lock-free by definition
A spinlock is by definition not lock-free: its whole purpose is to set a variable such that other threads have to wait until we're done before they can do anything.
Using lock-free atomic building blocks only results in a lock-free algorithm if you avoid things like spinning indefinitely waiting to see a value stored by another thread (https://en.wikipedia.org/wiki/Non-blocking_algorithm). There are non-blocking queue implementations, often using a fixed-size circular buffer to avoid the deallocation problem, especially in non-garbage-collected languages like C++.
Rolling your own lock just to see what happens is a valid exercise; just don't call it "lock-free"!