How do atomics larger than the CPU's native support work

With current C++ compilers, std::atomic works with types larger than anything the CPU can handle atomically. On x64 the hardware supports atomic operations on up to 16 bytes, but std::atomic also accepts larger structs. Look at this code:

#include <iostream>
#include <atomic>

using namespace std;

struct S { size_t a, b, c; };

atomic<S> apss;

int main()
{
    auto ref = apss.load( memory_order_relaxed );
    apss.compare_exchange_weak( ref, { 123, 456, 789 } );
    cout << sizeof ::apss << endl;
}

The cout above always prints 32 on my platform. But how do these operations actually work without a mutex? I don't get any clue from inspecting the disassembly.

If I run the following code with MSVC++:

#include <atomic>
#include <thread>
#include <array>

using namespace std;

struct S { size_t a, b, c, d, e; };

atomic<S> apss;

int main()
{
    array<jthread, 2> threads;
    auto threadFn = []()
    {
        auto ref = apss.load( memory_order_relaxed );
        for( size_t i = 10'000'000; i--; apss.compare_exchange_weak( ref, { } ) );
    };
    threads[0] = jthread( threadFn );
    threads[1] = jthread( threadFn );
}

There's almost no kernel-time consumed by the code. So the contention actually happens completely in user-space. I guess that's some kind of software transactional memory happening here.

Oxymoron answered 14/8, 2023 at 12:20 Comment(27)
The atomic may internally use locking mechanism ... call its is_lock_free() to figure if it is done "without a mutex".Commission
std::atomic::is_lock_freeOrchestra
The above code compiles, but doesn't link with GCC 13.1.0 (MinGW built by Brecht Sanders) for me.Shorthand
@Fureeish: Link with -latomic.Oxymoron
@EdisonvonMyosotis Have you checked the standard library source code, it may well do the locking (it may still use compiler intrinsics which will forward to the OS)Cherice
@PepijnKramer The only external library call for the above two first lines in main() is to memcmp() with MSVC++. I think the code uses sth. like software transactional memory but I don't know how this actually works.Oxymoron
@EdisonvonMyosotis weird, that actually worked, but it didn't require me to manually link against atomic for "simpler" (e.g., atomic<SmallObject>) use-cases. And I have no idea why that was the caseShorthand
std::atomic<T> does not imply that T is atomic on the hardware level. The point of std::atomic<T> is that you need not know if T is atomic on the hardware level. Actually, even if bool is atomic for the hardware it is not for C++, but you need to use std::atomic<bool>Crock
@Shorthand #76854980Crock
FWIW there is no threading going on here so the compiler is within its rights to optimize all the atomic code away. Essentially your program can be optimized to cout << sizeof ::apss << endl;Coercion
What output do you get with cout << (apss.is_lock_free() ? "LOCKFREE" : "MUTEX") << "\n";?Goodoh
@Eljay: The above atomic claims not to be lock-free, but if I constantly do compare_exchange_weak() from two threads I get two loaded cores without any kernel-time consumption. So the whole thing is happening in userspace and there must be some kind of software transactional memory here.Oxymoron
@Coercion The first two lines actually aren't optimized away.Oxymoron
@EdisonvonMyosotis Why would a mutex require kernel memory consumption?Selfsupport
lock doesn't imply a mutex, in some implementations it's implemented with a spin lockPivotal
Relevant question: Where is the lock for a std::atomic?Hannon
Can you show us the disassembly you are looking at?Roentgenogram
AFAIK there isn't any transactional memory mechanism that could feasibly be used here. Intel's TSX exists but isn't widely available; it was disabled by microcode updates on older CPUs due to security bugs, and is not being implemented on newer CPUs. I think you are going to find that a lock of some kind is being used.Roentgenogram
I remember a question some time ago where we worked through a disassembly of MSVC's non-lock-free atomics and found that they added a spinlock as an extra hidden member of the struct. That would be consistent with your observation that both cores run 100% and no kernel resources are used. I can't find it now, unfortunately.Roentgenogram
@NateEldredge This can be easily checked by using sizeof. A hidden member needs to occupy some storage. libstdc++ and libc++ seem to use another solution (hash table of locks indexed by the pointer to an atomic object), as written in the post I linked above.Hannon
@Yakk-AdamNevraumont If there's no contention a mutex is locked entirely in userspace; if there's contention the kernel participates in locking.Oxymoron
@AlanBirtles Mutexes with partial spinning are common, but pure spinlocks don't make sense in user space since a thread holding a spinlock could be scheduled away, thereby keeping contenders spinning.Oxymoron
@NateEldredge Transactional memory is also possible in userspace without hardware support. That's called software transactional memory. STM is much less efficient than hardware transactional memory and because of that not used very often.Oxymoron
@NateEldredge As I described spinlocks don't make sense in userspace.Oxymoron
@EdisonvonMyosotis: You are absolutely right about the problem with spinlocks, but nevertheless that is what that previous disassembly showed. I too thought it was a strange design. I wish I could find it. I'll search some more.Roentgenogram
@EdisonvonMyosotis: Aha, I found it: #69245683. It was for atomic<pair<uintptr_t, uintptr_t>>. Interestingly the OP there also initially guessed that transactional memory was involved.Roentgenogram
@NateEldredge Software transactional memory and hardware transactional memory are very different to program.Oxymoron

If there is no machine code primitive to perform the action without a lock, std::atomic will add the required lock to ensure things are atomic.

There is even a compile-time is_always_lock_free member that can be used to test this.

This is really important in contexts where mutexes cannot be used like signal handlers.

Edit: Worth adding that a good locking mechanism will use atomics in user-space and only defer to the kernel if there is contention. The futex on Linux is one such mechanism. This is used for mutexes on Linux.

Coerce answered 14/8, 2023 at 14:18 Comment(7)
Is there any way to distinguish between situations where an implementation is aware of (and follows) a target platform's convention for locking, thus allowing interop with code outside the implementation, versus those where the implementation is unaware of such a convention and thus has to implement its own locking mechanism which would thus be unsuitable for interop with outside code?Clevelandclevenger
Check the MSVC machine code - there's no locking for the above code. I guess the code uses software transactional memory.Oxymoron
@EdisonvonMyosotis: I would love to check the MSVC machine code, but I don't have MSVC readily available, nor do I know what version or compiler options you used. Would you please post it for us?Roentgenogram
@NateEldredge I use Visual Studio 2022 with the latest updates.Oxymoron
The non-lock-free fallback may just be a simple spinlock, or may use the same locking code as std::mutex (which yes on Linux will use futex if the lock is unavailable after some retries). Depends on the C++ standard library, or on the compiler's internal implementation of GNU C builtins like __atomic_load_n. See Where is the lock for a std::atomic?Harriettharrietta
@EdisonvonMyosotis: godbolt.org/z/n417vbx6W shows MSVC 19.35 inlining a spinlock loop for apss.load(). Note the xchg DWORD PTR std::atomic<S> apss, eax and the branching involving a pause in the spin-wait loop.Harriettharrietta
Not 100% sure of the implementation but I think a Windows CriticalSection will operate all userside if there is no contention.Coerce

TL;DR: it is a userspace spinlock, a bad decision that is locked in for the time being for ABI reasons.


MSVC uses a spinlock for atomic but an SRWLOCK for atomic_ref

See the source:

    // Spinlock integer for non-lock-free atomic. <xthreads.h> mutex pointer for non-lock-free atomic_ref
    mutable typename _Atomic_storage_types<_Ty>::_Spinlock _Spinlock{};

A spinlock is currently considered bad practice, specifically because it never yields to the kernel and can cause a long busy-wait if the lock holder is hit by an unfortunate context switch.

This is acknowledged by the MSVC STL maintainers, but for ABI compatibility reasons it cannot be fixed right now. A couple of years ago a PR was accepted that at least adds a pause instruction to the busy-wait loop, which makes the situation a bit better, but there is still no kernel wait.

atomic_ref, added in C++20, could start from scratch and uses an SRWLOCK, which goes to the kernel after some unspecified amount of unsuccessful spinning.

With the next ABI-breaking version, std::atomic is expected to use SRWLOCK too.

By the way, in MSVC each non-lock-free atomic has its own dedicated spinlock as a member, and is likely to have its own SRWLOCK in the future. (Another possibility is a hash table of such objects, which is effectively the only possibility for atomic_ref.)


No, MSVC does not use transactional memory yet, neither for atomics nor for anything else, except that some intrinsics are available. Using it for atomics looks like a good idea to me, though.

To be clear, I mean hardware transactional memory (at least Intel RTM; I doubt that MSVC ever supported the AMD thing), and in an ABI-breaking version. I don't know much about software transactional memory.

Quar answered 17/8, 2023 at 13:26 Comment(4)
Other implementations, such as GCC and Clang (at least targeting non-Windows) do use a hash table of locks, keyed on the address of the atomic object. Where is the lock for a std::atomic? . So they're not address-free, and won't work across processes in shared memory the way MSVC's will (?) with the lock inside the atomic object. Interesting, godbolt.org/z/ef5ndEsxo shows MSVC inlining the locking code for .load() on the OP's struct.Harriettharrietta
Note the OP said software transactional memory. That would be more expensive than just using a lock per object, since it allows different combinations of things to be read and written as atomic transactions. And it's not ABI-compatible with hardware transactional memory (like Intel TSX / RTM). With the HLE part of TSX disabled in microcode on current CPUs, we can't have nice things. (hardware lock elision made spinlocks work as transactions without actually contending over the spinlock's cache line.)Harriettharrietta
@PeterCordes, yes, I meant hardware transactional memory, specifically RTM, and with an ABI break. I would not rely on MSVC non-is_lock_free std::atomic being address-free, as the ABI-breaking version is likely to use an SRWLOCK, not something custom with RTM or without it, and SRWLOCK isn't address-free (it is like a futex-based nonrecursive lightweight shared mutex).Quar
Right yes, good point that it's not future-proof to rely on MSVC's std::atomic fallback locks being address-free. The ISO C++ standard recommends (with "should" phrasing IIRC) that is_lock_free atomics should be address-free, which is the case on all implementations I'm aware of, so software wanting to do shared memory across processes should be checking for is_always_lock_free for both portability (to non-Windows) and future-proofing. And besides, locking/unlocking every access sucks; it takes more code but a totally different fallback path using your own locking could be much better.Harriettharrietta