DWCAS alternative with no help from the kernel

I just wanted to test if my compiler recognizes ...

atomic<pair<uintptr_t, uintptr_t>>

... and uses DWCAS on it (like x86-64 lock cmpxchg16b) or whether it supplements the pair with an ordinary lock.

So I first wrote a minimal program with a single noinline function which does a compare-and-swap on an atomic pair. The compiler generated a lot of code for this which I didn't understand, and I didn't see any LOCK-prefixed instructions in it. I was curious whether the implementation places a lock inside the atomic and printed the sizeof of the above atomic pair: 24 on a 64-bit platform, so apparently without a full lock object inside.

Finally I wrote a program in which every thread my system has (64-core Ryzen Threadripper, Win10, SMT off) increments both halves of a single atomic pair a predefined number of times. Then I calculated the time per successful increment in nanoseconds. The time is rather high, about 20,000 ns per increment, so at first it looked as if there was a lock I had overlooked; but that seemed impossible given the sizeof of 24 bytes. And when I looked at the process viewer, all 64 cores were at nearly 100% user CPU time the whole time, so there can't be any kernel intervention.

So, is there anyone here smarter than me who can identify what this DWCAS substitute does from the assembly dump?

Here's my test-code:

#include <iostream>
#include <atomic>
#include <utility>
#include <thread>
#include <mutex>
#include <condition_variable>
#include <chrono>
#include <vector>

using namespace std;
using namespace chrono;

struct uip_pair
{
    uip_pair() = default;
    uip_pair( uintptr_t first, uintptr_t second ) :
        first( first ),
        second( second )
    {
    }
    uintptr_t first, second;
};

using atomic_pair = atomic<uip_pair>;

int main()
{
    cout << "sizeof(atomic<pair<uintptr_t, uintptr_t>>): " << sizeof(atomic_pair) << endl;
    atomic_pair ap( uip_pair( 0, 0 ) );
    cout << "atomic<pair<uintptr_t, uintptr_t>>::is_lock_free: " << ap.is_lock_free() << endl;
    mutex mtx;
    unsigned nThreads = thread::hardware_concurrency();
    unsigned ready = nThreads;
    condition_variable cvReady;
    bool run = false;
    condition_variable cvRun;
    atomic_int64_t sumDur = 0;
    auto theThread = [&]( size_t n )
    {
        unique_lock<mutex> lock( mtx );
        if( !--ready )
            cvReady.notify_one();
        cvRun.wait( lock, [&]() -> bool { return run; } );
        lock.unlock();
        auto start = high_resolution_clock::now();
        uip_pair cmp = ap.load( memory_order_relaxed );
        for( ; n--; )
            while( !ap.compare_exchange_weak( cmp, uip_pair( cmp.first + 1, cmp.second + 1 ), memory_order_relaxed, memory_order_relaxed ) );
        sumDur.fetch_add( duration_cast<nanoseconds>( high_resolution_clock::now() - start ).count(), memory_order_relaxed );
        lock.lock();
    };
    vector<jthread> threads;
    threads.reserve( nThreads );
    static size_t const ROUNDS = 100'000;
    for( unsigned t = nThreads; t--; )
        threads.emplace_back( theThread, ROUNDS );
    unique_lock<mutex> lock( mtx );
    cvReady.wait( lock, [&]() -> bool { return !ready; } );
    run = true;
    cvRun.notify_all();
    lock.unlock();
    for( jthread &thr : threads )
        thr.join();
    cout << (double)sumDur / ((double)nThreads * ROUNDS) << endl;
    uip_pair p = ap.load( memory_order_relaxed );
    cout << "synch: " << (p.first == p.second ? "yes" : "no") << endl;
}

[EDIT]: I've extracted the compare_exchange_weak call into a noinline function and disassembled the code:

#include <atomic>
#include <cstdint>

using namespace std;

struct uip_pair
{
    uip_pair() = default;
    uip_pair( uintptr_t first, uintptr_t second ) :
        first( first ),
        second( second )
    {
    }
    uintptr_t first, second;
};

using atomic_pair = atomic<uip_pair>;

#if defined(_MSC_VER)
    #define NOINLINE __declspec(noinline)
#elif defined(__GNUC__) || defined(__clang__)
    #define NOINLINE __attribute__((noinline))
#endif

NOINLINE
bool cmpXchg( atomic_pair &ap, uip_pair &cmp, uip_pair xchg )
{
    return ap.compare_exchange_weak( cmp, xchg, memory_order_relaxed, memory_order_relaxed );
}

    mov    eax, 1
    mov    r10, rcx
    mov    r9d, eax
    xchg   DWORD PTR [rcx], eax
    test   eax, eax
    je     SHORT label8
label1:
    mov    eax, DWORD PTR [rcx]
    test   eax, eax
    je     SHORT label7
label2:
    mov    eax, r9d
    test   r9d, r9d
    je     SHORT label5
label4:
    pause
    sub    eax, 1
    jne    SHORT label4
    cmp    r9d, 64
    jl     SHORT label5
    lea    r9d, QWORD PTR [rax+64]
    jmp    SHORT label6
label5:
    add    r9d, r9d
label6:
    mov    eax, DWORD PTR [rcx]
    test   eax, eax
    jne    SHORT label2
label7:
    mov    eax, 1
    xchg   DWORD PTR [rcx], eax
    test   eax, eax
    jne    SHORT label1
label8:
    mov    rax, QWORD PTR [rcx+8]
    sub    rax, QWORD PTR [rdx]
    jne    SHORT label9
    mov    rax, QWORD PTR [rcx+16]
    sub    rax, QWORD PTR [rdx+8]
label9:
    test   rax, rax
    sete   al
    test   al, al
    je     SHORT label10
    movups xmm0, XMMWORD PTR [r8]
    movups XMMWORD PTR [rcx+8], xmm0
    xor    ecx, ecx
    xchg   DWORD PTR [r10], ecx
    ret
label10:
    movups xmm0, XMMWORD PTR [rcx+8]
    xor    ecx, ecx
    movups XMMWORD PTR [rdx], xmm0
    xchg   DWORD PTR [r10], ecx
    ret

Maybe someone understands the disassembly. Remember that XCHG is implicitly LOCK'ed on x86. It seems to me that MSVC uses some kind of software transactional memory here. I can extend the structure embedded in the atomic arbitrarily and the size difference stays at 8 bytes, so MSVC apparently always uses the same kind of STM.
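
Here's a quick way to reproduce that size observation (the words struct and report helper are just names made up for this test; with MSVC the overhead prints as 8 for these 8-byte-aligned payloads, while other compilers lay out the atomic differently):

#include <atomic>
#include <cstddef>
#include <cstdint>
#include <iostream>

template<std::size_t N>
struct words                // N * 8 bytes, 8-byte aligned like uip_pair above
{
    std::uintptr_t w[N];
};

template<typename T>
void report( char const *name )
{
    std::cout << name << ": sizeof(T) = " << sizeof(T)
              << ", sizeof(atomic<T>) = " << sizeof(std::atomic<T>)
              << ", overhead = " << sizeof(std::atomic<T>) - sizeof(T) << "\n";
}

int main()
{
    report<words<2>>( "words<2>" );
    report<words<4>>( "words<4>" );
    report<words<8>>( "words<8>" );
}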

Uxorious answered 19/9, 2021 at 16:42 Comment(22)
You might want to look at the result of std::atomic<T>::is_lock_free – Gony
Which compiler are you using? Both gcc and clang fail for me with 'std::atomic requires a trivially copyable type'. – Telesthesia
Same as @Telesthesia - failed to compile - live - godbolt.org/z/bEnhqhfhv – Gony
As @RichardCritten pointed out, use the is_lock_free function to determine if it is using a lock under the hood. That being said, if I were you I'd test the performance difference between a sequentially consistent memory ordering (default for std::atomic) and an std::mutex. I would expect, in low contention situations, that the std::mutex would perform better than sequentially consistent operations on an std::atomic. – Circumpolar
I changed the code so that I use my own pair. I compiled it with g++ 12 and interestingly the compiler complains about missing __atomic_load_16 and __atomic_compare_exchange_16 functions. So at least gcc 12 recognises my pair and uses a DCAS for this. But how do I fix the linker errors? Supplying -mcx16 doesn't work. – Uxorious
@BonitaMontero what standard version are you targeting? – Circumpolar
@Circumpolar -std=c++20 -mcx16 with gcc 11 (not 12 as I said above). – Uxorious
Those functions are in libatomic. So you have to link with -latomic, and look inside its source to see how it's implemented. – Flanker
Ok, now it works! On my older Linux computer, a Phenom X4 945 3GHz, I get about 55 clock cycles with one thread and 727 clock cycles with four threads. So the compiler really uses a DCAS!!! – Uxorious
If I single-step into __atomic_compare_exchange_16, I do indeed encounter a lock cmpxchg16b. I think it's wrapped in a library function so that it can fall back at runtime to another implementation if cmpxchg16b isn't available. – Flanker
Yes, but try it with MSVC - there's a lot of code without any lock prefixes and without any kernel calls! And for the Linux issue: the code on Linux still says that my atomic isn't lock-free, although it uses a single locked instruction according to the timings and what you've been disassembling. – Uxorious
I believe the issue is that is_lock_free is determined at compile time, and only returns true if the type is guaranteed lock-free. Since it might fall back to a non-lock-free implementation at runtime if you run your binary on some other machine, the compiler can't make that promise. – Flanker
As best I can tell from this output, MSVC is putting a spinlock mutex around it. I'm not able to actually test it very easily. – Flanker
@NateEldredge: Of course is_lock_free is evaluated at runtime - because there might be different behaviour depending on misalignment or whatever. That's why there is is_always_lock_free, which is static. – Uxorious
Oh good point. It looks like this is a gcc libatomic decision. github.com/gcc-mirror/gcc/blob/… says "Users likely expect 'lock-free' to also mean 'fast', which is why we do not return true if, for example, we implement loads with this size/alignment using a CAS." Which is exactly the case for 16 bytes on x86-64 - if you step through __atomic_load_16, it does lock cmpxchg16b as well. – Flanker
At a quick glance at the code, I think MSVC is putting a simple spinlock as a member of the atomic<pair<...>>. That's why it's 24 bytes instead of 16, as you'd expect if the compiler were going to handle it with atomic DCAS. And you can see in the disassembly: rcx points to the atomic<pair<...>> to be handled, DWORD PTR [rcx] is the lock, and the data itself is at QWORD PTR [rcx+8] and QWORD PTR [rcx+16]. Calling this "software transactional memory" seems too fancy a description. – Flanker
By the way, are you compiling with optimization? There are some bits of the code that don't look optimized at all, e.g. sete al followed by test al, al. – Flanker
@NateEldredge: With a "simple spinlock" the code wouldn't be so long. And spinlocks are unfeasible in userland because the code could be scheduled away in the middle of its work and other code might wait indefinitely. – Uxorious
@BonitaMontero: I agree the asm is a little overcomplicated (which is why I asked about optimization), and it appears to have something like a linear backoff to not hammer on the cache line too frequently if the lock cannot be taken right away. But it definitely is a spinlock, I'm quite confident of that. You're absolutely right about the risk of a spinlock in case the thread holding the lock is scheduled out - that appears to be a risk that MSVC has decided to take, given that the lock should otherwise only be held for a few cycles. – Flanker
We can walk through the disassembly if you'd like, but also think about what's not there - there's clearly no call to the operating system or any library function, so no way the code can yield its timeslice while waiting. There's no cmpxchg16b, and short of locking, there isn't any way to atomically emulate a DCAS with smaller atomic CAS (else CPU designers would not bother to provide DCAS). And it's clear that some sort of looping is taking place, with the backward jumps. – Flanker
(It's exponential backoff actually, with a cap of 64 pause iterations between lock attempts. I misread add r9d, r9d before.) – Flanker
@BonitaMontero: Hardly any ISAs have DCAS. (en.wikipedia.org/wiki/Double_compare-and-swap). Some m68k had it. What you're looking for is DWCAS, double-word-CAS, i.e. a CAS on a contiguous memory region two pointer-widths wide. x86-64 can do that with lock cmpxchg16b (except in the earliest AMD64 K8 CPUs which omitted that instruction). I edited your question to use that terminology. – Embus
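
A compact way to see the compile-time vs. run-time distinction from the last few comments is something like this (just a sketch; with gcc it needs the -std=c++20 -mcx16 -latomic flags mentioned above, and the printed values depend on the compiler and libatomic version):

#include <atomic>
#include <cstdint>
#include <iostream>

struct uip_pair { std::uintptr_t first, second; };

int main()
{
    std::atomic<uip_pair> ap{};
    // compile-time guarantee: false with gcc, since 16-byte ops go through libatomic
    std::cout << "is_always_lock_free: " << std::atomic<uip_pair>::is_always_lock_free << "\n";
    // runtime answer: libatomic also reports false here even when it uses lock cmpxchg16b,
    // for the "lock-free should also mean fast" reason quoted above
    std::cout << "is_lock_free():      " << ap.is_lock_free() << "\n";
    std::cout << "sizeof(atomic):      " << sizeof ap << "\n";   // 16 with gcc, 24 with MSVC
}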

As Nate pointed out in comments, it is a spinlock.

You can look up the source; it ships with the compiler and is also available on GitHub.

If you build an unoptimized debug configuration, you can step into this source during interactive debugging!

There's a member variable called _Spinlock.

And here's the locking function:

#if 1 // TRANSITION, ABI, GH-1151
inline void _Atomic_lock_acquire(long& _Spinlock) noexcept {
#if defined(_M_IX86) || (defined(_M_X64) && !defined(_M_ARM64EC))
    // Algorithm from Intel(R) 64 and IA-32 Architectures Optimization Reference Manual, May 2020
    // Example 2-4. Contended Locks with Increasing Back-off Example - Improved Version, page 2-22
    // The code in mentioned manual is covered by the 0BSD license.
    int _Current_backoff   = 1;
    const int _Max_backoff = 64;
    while (_InterlockedExchange(&_Spinlock, 1) != 0) {
        while (__iso_volatile_load32(&reinterpret_cast<int&>(_Spinlock)) != 0) {
            for (int _Count_down = _Current_backoff; _Count_down != 0; --_Count_down) {
                _mm_pause();
            }
            _Current_backoff = _Current_backoff < _Max_backoff ? _Current_backoff << 1 : _Max_backoff;
        }
    }
#elif defined(_M_ARM) || defined(_M_ARM64) || defined(_M_ARM64EC)
    while (_InterlockedExchange(&_Spinlock, 1) != 0) { // TRANSITION, GH-1133: _InterlockedExchange_acq
        while (__iso_volatile_load32(&reinterpret_cast<int&>(_Spinlock)) != 0) {
            __yield();
        }
    }
#else // ^^^ defined(_M_ARM) || defined(_M_ARM64) || defined(_M_ARM64EC) ^^^
#error Unsupported hardware
#endif
}

(disclosure: I brought this increasing backoff from the Intel manual into there, it was just an xchg loop before; issue, PR)

Spinlock use is known to be suboptimal; a mutex that does a kernel wait should have been used instead. The problem with a spinlock is that, in the rare case where a context switch happens while a low-priority thread holds the spinlock, it can take a while for that spinlock to be released, because the scheduler is not aware of the high-priority thread waiting on it.

Sure, not using cmpxchg16b is also suboptimal. Still, for bigger atomics a non-lock-free mechanism has to be used. (There's no deliberate decision to avoid cmpxchg16b; it's just a consequence of keeping ABI compatibility all the way back to Visual Studio 2015.)
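
For reference, here is roughly what a lock-free 16-byte CAS looks like when written directly against the compiler intrinsic for lock cmpxchg16b. This is only an illustration, not what the STL ships; pair16 and dwcas are made-up names, and the object has to be 16-byte aligned:

#include <intrin.h>
#include <cstdint>

struct alignas(16) pair16
{
    std::uint64_t first, second;   // first = low qword, second = high qword
};

// On failure, 'expected' is updated with the value actually observed,
// matching compare_exchange semantics (cmpxchg16b writes the old value back).
inline bool dwcas( pair16 &dest, pair16 &expected, pair16 const &desired )
{
    return _InterlockedCompareExchange128(
        reinterpret_cast<long long volatile *>( &dest ),
        static_cast<long long>( desired.second ),   // high qword
        static_cast<long long>( desired.first ),    // low qword
        reinterpret_cast<long long *>( &expected ) ) != 0;
}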

There's an issue about making it better; it will hopefully be addressed with the next ABI break: https://github.com/microsoft/STL/issues/1151

As for transactional memory: it might make sense to use hardware transactional memory there. I can speculate that Intel RTM could possibly be implemented there with intrinsics, or there may be some future OS API for it (like an enhanced SRWLOCK), but it is likely that nobody will want more complexity there, as a non-lock-free atomic is a compatibility facility, not something you would deliberately want to use.
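
Just to sketch the idea (speculative code, not a proposal for the STL; it reuses the hypothetical pair16/dwcas from the sketch above as the fallback, needs a CPU with RTM enabled, and gcc/clang additionally need -mrtm):

#include <immintrin.h>   // _xbegin / _xend / _XBEGIN_STARTED

bool rtm_dwcas( pair16 &dest, pair16 &expected, pair16 const &desired )
{
    if( _xbegin() == _XBEGIN_STARTED )
    {
        if( dest.first == expected.first && dest.second == expected.second )
        {
            dest = desired;        // the whole read-compare-write commits atomically
            _xend();
            return true;
        }
        pair16 observed = dest;    // transactional snapshot of the current value
        _xend();
        expected = observed;
        return false;
    }
    // Transaction aborted (conflict, interrupt, ...): fall back to the locked
    // instruction, which stays compatible with concurrent transactional accesses.
    return dwcas( dest, expected, desired );
}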

Shipman answered 18/11, 2021 at 7:29 Comment(4)
RTM could be used as a drop-in replacement for lock cmpxchg16b, likely making pure-load more efficient. (Especially with multiple concurrent readers, because it would truly be read-only, not needing to get the line into Exclusive or Modified state.) It might not be any faster for pure-store, except for avoiding cx16 retry loops on contended stores. And of course replacing cx16 retry loops for .fetch_add and whatever, if you had a 128-bit integer type that supported any of the other methods. – Embus
RTM would be compatible with other threads using lock cmpxchg16b so it wouldn't be an ABI break; that's the beauty of transactional memory: it's just making an atomic transaction out of multiple instructions, instead of the limits of one like lock xadd. (Of course a transaction that touches multiple memory locations is a fundamentally new capability, like actual DCAS, not just DWCAS for a contiguous pair of pointers.) – Embus
@PeterCordes, the unavailability of an immediately accessible CPU with RTM enabled prevents me from attempting to introduce RTM anywhere; otherwise I might have tried already. – Shipman
I haven't actually played with it myself, I just understand the concept of how it works on a CPU-architecture level. (Or at least I think I do :P based on stuff like realworldtech.com/haswell-tm and understanding how lock cmpxchg16b works internally.) Practical problems in using it include efficiently using the right code on CPUs that have it. And with Intel being like Lucy with a football to our Charlie Brown, it keeps getting disabled on more and more CPUs that once supported it. – Embus
