Performance: pthread_spinlock_t is 2x better than my own spinlock implementation with lock-free std::atomic_flag, around std::list
I wanted to replace the pthread_spinlock_t in this example with my own spinlock implementation. However, my implementation performs far worse than pthread_spinlock_t: the pthread_spinlock_t version runs in around 0.9 s, while my own implementation takes around 2.4 s. Can someone explain what is missing in my implementation, or where there is further room for improvement? I believe I am missing something related to memory ordering. Here is my implementation:

#include <stdio.h>
#include <pthread.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <errno.h>
#include <sys/time.h>
#include <atomic>
#include <list>

#define LOOPS 10000000

using namespace std;

list<int> the_list;

//pthread_spinlock_t spinlock;
std::atomic_flag flag = ATOMIC_FLAG_INIT;

// Renamed to avoid clashing with the gettid() that modern glibc (>= 2.30) declares itself
static pid_t my_gettid() { return syscall( __NR_gettid ); }

void *consumer(void *ptr)
{
    printf("Consumer TID %lu\n", (unsigned long)gettid());

    while (1)
    {
        //pthread_spin_lock(&spinlock);
        while (flag.test_and_set(std::memory_order_acquire)); // spin until we own the lock

        if (the_list.empty())
        {
            //pthread_spin_unlock(&spinlock);
            flag.clear(std::memory_order_release);
            break;
        }

        the_list.front();   // read the front element (value intentionally unused)
        the_list.pop_front();

        //pthread_spin_unlock(&spinlock);
        flag.clear(std::memory_order_release);
    }

    return NULL;
}

int main()
{
    int i;
    pthread_t thr1, thr2;
    struct timeval tv1, tv2;

    //pthread_spin_init(&spinlock, 0);

    // Creating the list content...
    for (i = 0; i < LOOPS; i++)
        the_list.push_back(i);

    // Measuring time before starting the threads...
    gettimeofday(&tv1, NULL);

    pthread_create(&thr1, NULL, consumer, NULL);
    pthread_create(&thr2, NULL, consumer, NULL);

    pthread_join(thr1, NULL);
    pthread_join(thr2, NULL);

    // Measuring time after threads finished...
    gettimeofday(&tv2, NULL);

    if (tv1.tv_usec > tv2.tv_usec)
    {
        tv2.tv_sec--;
        tv2.tv_usec += 1000000;
    }

    printf("Result - %ld.%ld\n", tv2.tv_sec - tv1.tv_sec,
        tv2.tv_usec - tv1.tv_usec);

    //pthread_spin_destroy(&spinlock);
    return 0;
}

I was expecting to achieve the performance of pthread_spinlock_t with my own implementation.

Pilcomayo answered 2/10, 2023 at 17:3

The two things pthread_spin_lock does differently on contention are:

  • Runs a pause instruction before retrying the atomic RMW.
  • In the retry loop, checks read-only that the lock looks available before attempting another atomic RMW. (The RMW needs to take exclusive ownership of the cache line, stopping the other thread from even reading it for a while.)

(I'm assuming you're on an x86-64 CPU? You didn't mention it. But I'm guessing Intel, not AMD, based on how much pause helps in this weird case where fine-grained multithreading actively hurts.)


Differences that don't matter:

  • pthread_spin_lock uses lock dec dword ptr [rdi] as its atomic RMW.
    flag.test_and_set uses xchg to store and get the old value in a register for test al,al.
  • pthread_spinlock_t is a 32-bit type, vs. atomic_flag being 8-bit.
    (lock dec requires a wide-enough type to not wrap around back to 1 = unlocked.)
  • non-difference: both unlock with a plain x86 mov store, since they only need release semantics. (A seq_cst store like flag.clear(seq_cst) would be done with xchg, since the implicit lock prefix makes it a full memory barrier.)
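
To illustrate that last point (a sketch, not code from the original post): on x86-64, mainstream compilers emit a plain byte store for the release clear, but an atomic RMW for a seq_cst clear, since the implicitly locked instruction doubles as a full barrier:

#include <atomic>

std::atomic_flag f = ATOMIC_FLAG_INIT;

void unlock_release() {
    f.clear(std::memory_order_release);  // plain mov byte store on x86-64
}

void unlock_seq_cst() {
    f.clear(std::memory_order_seq_cst);  // xchg: atomic RMW, full memory barrier
}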

I found what it does by setting a breakpoint before the call to pthread_spin_lock and single-stepping the asm in GDB (layout asm). The asm is also visible with objdump -drwC -Mintel /lib/libc.so.6 | less, searching for spin_lock.


Why these matter so much in this case

This case of extreme contention (threads trying to take the lock again right after unlocking, with no useful work in between) magnifies the effect of these differences.

When one thread backs off due to the first attempt failing to get the lock, it gives the other thread time to complete multiple iterations, releasing and re-taking the lock without contention.

In Skylake and later Intel CPUs, pause pauses the front-end for about 100 cycles, up from about 5 in Broadwell. AFAIK, current AMD CPUs still use a pretty short pause, so I'd expect the effect to be a lot less pronounced on a Ryzen.

(xchg mem,reg throughput is one per 18 cycles on current Intel (https://uops.info/) if done back to back with no other memory ops in between to drain from the store buffer. In our case there's a load and store, but those probably hit in cache since they were allocated sequentially, so the load-use latency is pretty short. Normally linked lists suck because they make a long chain of load latencies.)

So one back-off by another thread lets the thread holding the lock probably complete that iteration and then another 2 or 3 without disturbance, keeping exclusive ownership of the cache line. (With Intel's pause time).

When the other thread only checks read-only for availability, it doesn't disturb the other thread as much, since it can keep the cache line in Shared state.

Both flag and the_list are probably in the same cache line. We could try aligning them both by 128 to avoid that, but it makes no measurable difference. (Cache lines are 64 bytes, but the L2 spatial prefetcher likes to complete an aligned pair of cache lines. If you were going to define std::hardware_destructive_interference_size, 128 would be a good choice for current x86-64.)
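
If you want to try that experiment, it's just alignas on the globals from the question (a sketch; as noted, it made no measurable difference here):

#include <atomic>
#include <list>

// Give each shared object its own 128-byte-aligned region so the lock and
// the list head can't sit in the same L2-prefetcher pair of cache lines.
alignas(128) std::atomic_flag flag = ATOMIC_FLAG_INIT;
alignas(128) std::list<int> the_list;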


Those things speed up the atomic_flag version to match pthread

Just adding _mm_pause() from <immintrin.h> into the while(flag.TAS()){ _mm_pause(); } spin-wait loop speeds it up from about 1.13-1.20 sec to about 0.58 sec on my Skylake i7-6700k. (Linux 6.5.3, glibc 2.38)

  • 1.13 to 1.20 sec - your hand-rolled atomic_flag spinlock
  • 0.57 to 0.58 sec - that with _mm_pause()
  • 0.31 sec - spin read-only with pause on contention.
  • 0.33 sec - pthread_spinlock_t which also spins read-only with pause, but with function-call overhead.

Adding the read-only test before retrying the RMW speeds it up all the way, making it even faster than pthread_spin_lock since there's no function-call overhead:

  while (flag.test_and_set(std::memory_order_acquire)) {
      do {
          _mm_pause();  // don't retry the RMW until it might succeed
      } while (flag.test(std::memory_order_relaxed));
  }

You could experiment with doing 2 or 4 pauses per check, like _mm_pause(); _mm_pause(); to further magnify this effect. Or pin both threads to the same core so they can't contend with each other, like taskset -c 2 ./spinlock_custom vs. -c 1,2 to allow cores #1 and #2. (But that will often mean a context switch while holding the lock, leading to a fully wasted timeslice for the other thread since we don't sched_yield() even after hundreds of spin iterations. That's why it's actually slightly slower to run with both threads pinned to a single core.)

4x _mm_pause(); makes this hand-rolled spinlock microbenchmark complete another 1.5x faster, since we're trading fairness for throughput. And we know there's another thread that will also be hammering on the lock indefinitely. vs. in the normal case, we'd be aiming for a backoff time where they'll probably be done, or where this burst of contention has ended. But it's not a burst, it's constant contention. Pausing longer just means taking turns with a coarser time scale, bouncing the cache line back and forth less often. And we have no useful work we could be doing instead of pausing. So the only useful work is serialized, and multithreading + locking makes it much slower; the farther we get away from actual multithreading, the better our throughput.
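
As a sketch, the 4x-pause variant of the inner wait loop described above would look like this:

  while (flag.test_and_set(std::memory_order_acquire)) {
      do {
          _mm_pause(); _mm_pause();  // 4 pauses per read-only check:
          _mm_pause(); _mm_pause();  // coarser turn-taking, fewer line bounces
      } while (flag.test(std::memory_order_relaxed));
  }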

So a benchmark like this would be a poor choice for making tuning decisions for a general-purpose spinlock. It's totally fine for comparing two different implementations to see how they differ and what effect that has on this situation, though. (pthread's choices are normally good in general; that's why they do it. They also happen to help a lot for this artificial case.)


flag.test() is a C++20 feature; I had to compile with g++ -O2 -std=gnu++20 -pthread. In earlier C++ revisions, simply use std::atomic<bool> with .exchange(true, acquire), .load(relaxed), and .store(false, release).
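
Spelled out, that pre-C++20 equivalent might look like this (a sketch with the same acquire/release semantics as the atomic_flag version):

#include <atomic>
#include <immintrin.h>  // _mm_pause

std::atomic<bool> locked{false};

void lock() {
    while (locked.exchange(true, std::memory_order_acquire)) {  // atomic RMW
        do {
            _mm_pause();
        } while (locked.load(std::memory_order_relaxed));  // read-only wait
    }
}

void unlock() {
    locked.store(false, std::memory_order_release);  // plain release store
}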

Some primitive ISAs (or early versions of them) only provide xchg/swap instructions, or an actual test-and-set where the value to be swapped is baked in. Either is sufficient for the operations atomic_flag provides, hence it being the only guaranteed always_lock_free type, but modern versions of all mainstream ISAs have always_lock_free for power-of-2 types up to pointer width at least, and some for 2 pointers wide.


Terminology: a spinlock can't be lock-free by definition

A spinlock is by definition not lock-free: its whole purpose is to set a variable such that other threads have to wait until we're done before they can do anything.

Using lock-free atomic building blocks only results in a lock-free algorithm if you avoid things like spinning indefinitely waiting to see a value stored by another thread. (https://en.wikipedia.org/wiki/Non-blocking_algorithm). There are non-blocking queue implementations, often using a fixed-size circular buffer to avoid the deallocation problem, especially in non-garbage-collected languages like C++.
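
For illustration only (a minimal sketch, not any particular library's queue): a single-producer/single-consumer ring buffer is non-blocking because neither side ever spins waiting for a value the other must store; a full or empty queue just makes the operation fail immediately.

#include <atomic>
#include <cstddef>

template <typename T, std::size_t N>   // N must be a power of 2
class SpscQueue {
    T buf[N];
    std::atomic<std::size_t> head{0};  // advanced only by the consumer
    std::atomic<std::size_t> tail{0};  // advanced only by the producer
public:
    bool push(const T& v) {            // call from the producer thread only
        std::size_t t = tail.load(std::memory_order_relaxed);
        if (t - head.load(std::memory_order_acquire) == N)
            return false;              // full: fail instead of blocking
        buf[t % N] = v;
        tail.store(t + 1, std::memory_order_release);  // publish the element
        return true;
    }
    bool pop(T& out) {                 // call from the consumer thread only
        std::size_t h = head.load(std::memory_order_relaxed);
        if (h == tail.load(std::memory_order_acquire))
            return false;              // empty: fail instead of blocking
        out = buf[h % N];
        head.store(h + 1, std::memory_order_release);  // free the slot
        return true;
    }
};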

Rolling your own lock just to see what happens is a valid exercise, just don't call it "lock-free"!

Dendy answered 2/10, 2023 at 19:52
