Why do I have worse performance on my spinlock implementation when I use a non-seq_cst memory order?

I have two versions of a spinlock below. The first uses the default, which is memory_order_seq_cst, while the second uses memory_order_acquire/memory_order_release. Since the latter is more relaxed, I expect it to have better performance. However, that doesn't seem to be the case.

#include <atomic>
#include <cassert>
#include <iostream>
#include <thread>
#include <vector>

using namespace std;

// Version 1: test_and_set/clear with the default ordering (memory_order_seq_cst).
class SimpleSpinLock
{
public:

    inline SimpleSpinLock(): mFlag(ATOMIC_FLAG_INIT) {}

    inline void lock()
    {   
        int backoff = 0;
        while (mFlag.test_and_set()) { DoWaitBackoff(backoff); }
    }   

    inline void unlock()
    {   
        mFlag.clear();
    }   

private:

    std::atomic_flag mFlag = ATOMIC_FLAG_INIT;
};

// Version 2: acquire ordering on lock, release ordering on unlock.
class SimpleSpinLock2
{
public:

    inline SimpleSpinLock2(): mFlag(ATOMIC_FLAG_INIT) {}

    inline void lock()
    {   
        int backoff = 0;
        while (mFlag.test_and_set(std::memory_order_acquire)) { DoWaitBackoff(backoff); }
    }   

    inline void unlock()
    {   
        mFlag.clear(std::memory_order_release);
    }   

private:

    std::atomic_flag mFlag = ATOMIC_FLAG_INIT;
};

const int NUM_THREADS = 8;
const int NUM_ITERS = 5000000;

const int EXPECTED_VAL = NUM_THREADS * NUM_ITERS;

int val = 0;
long j = 0;

SimpleSpinLock spinLock;

void ThreadBody()
{
    for (int i = 0; i < NUM_ITERS; ++i)
    {   
        spinLock.lock();

        ++val; 

        j = i * 3.5 + val;  

        spinLock.unlock();
    }   
}

int main()
{
    vector<thread> threads;

    for (int i = 0; i < NUM_THREADS; ++i)
    {   
        cout << "Creating thread " << i << endl; 
        threads.emplace_back(ThreadBody);
    }   

    for (thread& thr: threads)
    {   
        thr.join();
    }   

    cout << "Final value: " << val << "\t" << j << endl;
    assert(val == EXPECTED_VAL);

    return 0;  
}
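
(DoWaitBackoff is not included in the listing above. For anyone trying to reproduce the test, here is a minimal sketch of the kind of exponential-backoff helper it might be; the body, the _mm_pause() call, and the cap of 1024 are assumptions, not the original code.)

    // Sketch only: not the question's actual DoWaitBackoff.
    // Spins for an exponentially growing number of pause instructions per call.
    #include <immintrin.h>   // _mm_pause

    inline void DoWaitBackoff(int& backoff)
    {
        if (backoff == 0) backoff = 1;
        for (int i = 0; i < backoff; ++i)
        {
            _mm_pause();     // hint to the CPU that this is a spin-wait loop
        }
        const int kMaxBackoff = 1024;
        if (backoff < kMaxBackoff) backoff *= 2;
    }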

I am running on Ubuntu 12.04 with gcc 4.8.2 at optimization level -O3.

-- Spinlock with memory_order_seq_cst:

Run 1:
real    0m1.588s
user    0m4.548s
sys 0m0.052s

Run 2:
real    0m1.577s
user    0m4.580s
sys 0m0.032s

Run 3:
real    0m1.560s
user    0m4.436s
sys 0m0.032s

-- Spinlock with memory_order_acquire/release:

Run 1:

real    0m1.797s
user    0m4.608s
sys 0m0.100s

Run 2:

real    0m1.853s
user    0m4.692s
sys 0m0.164s

Run 3:
real    0m1.784s
user    0m4.552s
sys 0m0.124s

Run 4:
real    0m1.475s
user    0m3.596s
sys 0m0.120s

With the more relaxed ordering, I see a lot more variability. Sometimes it's better, but often it is worse. Does anyone have an explanation for this?

Gillenwater asked 7/5/2014 at 15:28. Comments (4):
What happens if you remove the backoff? (As a general rule, you'll want to spin on read rather than an atomic op; see the sketch after these comments.) – Sullivan
For GCC on Intel, I'd expect these to behave identically, if not generate exactly the same code. Have you compared the assembly output of both versions of ThreadBody? – Duley
@Duley: It turns out that there is an extra fence in the CST model. I'd have to think really hard, though, to have an opinion on whether or not that is really needed. – Sullivan
@Sullivan If I remove the backoff, it takes a bit longer (150-200% longer), probably due to bus contention. – Gillenwater
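
For reference, here is a minimal sketch of the "spin on read" (test-and-test-and-set) idea from the first comment. std::atomic_flag has no plain load operation in C++11, so the sketch assumes the flag is replaced by a std::atomic<bool>; the class name is made up for illustration.

    // Test-and-test-and-set sketch: try the atomic exchange, and if the lock is
    // held, spin on a plain load until it looks free before retrying. Waiters
    // mostly read the cache line instead of repeatedly invalidating it with
    // read-modify-write operations.
    class TtasSpinLock
    {
    public:
        void lock()
        {
            for (;;)
            {
                if (!mLocked.exchange(true, std::memory_order_acquire))
                    return;                                   // got the lock
                while (mLocked.load(std::memory_order_relaxed))
                    ;                                         // spin on read only
            }
        }

        void unlock()
        {
            mLocked.store(false, std::memory_order_release);
        }

    private:
        std::atomic<bool> mLocked{false};
    };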

The generated unlock code is different. The CST memory model (with g++ 4.9.0) generates:

    movb    %sil, spinLock(%rip)
    mfence

for the unlock. The acquire/release generates:

    movb    %sil, spinLock(%rip)

The lock code is the same. Someone else will have to say something about why it's better with the fence, but if I had to guess, I would say it reduces bus/cache-coherence contention, possibly by reducing interference on the bus. Sometimes stricter is more orderly, and thus faster.

ADDENDUM: According to this, mfence costs around 100 cycles. So maybe you are reducing bus contention, because when a thread finishes the loop body, it pauses a bit before trying to reacquire the lock, letting the other thread finish. You could try to do the same thing by putting in a short delay loop after the unlock, though you'd have to make sure that it didn't get optimized out.
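
One way to add such a delay without the compiler removing it (a sketch; the count of 50 is arbitrary) is to spin on a volatile counter, which the optimizer is not allowed to eliminate:

    spinLock.unlock();

    // Artificial post-unlock delay; the volatile loop variable keeps the
    // compiler from optimizing the empty loop away.
    for (volatile int d = 0; d < 50; ++d)
        ;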

ADDENDUM 2: It does seem to be bus interference/contention caused by looping around too fast. I added a short delay loop like this:

    spinLock.unlock();
    for (int i = 0; i < 5; i++) {
        j = i * 3.5 + val;
    }

Now the acquire/release version performs the same.

Sullivan answered 7/5/2014 at 16:09. Comments (2):
I think the answer is the bus contention. That makes sense to me. – Gillenwater
Just for the record, making unlock slower isn't better in general, only in silly benchmarks that do nothing but hammer on one spinlock. Another weird effect in such benchmarks is that the less fairness the better, e.g. if one thread backs off for a really long time on seeing contention, that can let the other thread get many iterations done without contention, keeping ownership of the cache lines. So tuning for throughput in high-contention cases would lead you to something like sleep(1) or 100x _mm_pause() if the lock isn't available right away! – Overcurious
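
As a rough illustration of that last point, a lock() tuned for throughput under heavy contention might look something like the sketch below. The count of 100 comes straight from the comment; the rest is an assumption, and it is written as a drop-in replacement for lock() in the question's SimpleSpinLock2, with <immintrin.h> included for _mm_pause.

    // Throughput-over-fairness sketch: on contention, back off for ~100 pause
    // instructions before retrying, so the current owner can keep the cache
    // line and complete many critical sections in a row.
    void lock()
    {
        while (mFlag.test_and_set(std::memory_order_acquire))
        {
            for (int i = 0; i < 100; ++i)
                _mm_pause();
        }
    }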
