How do atomics larger than the CPU's native support work?

Asked 14/8, 2023 at 12:20 Answered 17/8, 2023 at 13:26

With current C++ compilers, std::atomic works with types larger than anything the CPU can handle atomically. On x64 you can have hardware atomics of up to 16 bytes, but std::atomic also works with larger structs. Look at this code:

#include <iostream>
#include <atomic>

using namespace std;

struct S { size_t a, b, c; };

atomic<S> apss;

int main()
{
    auto ref = apss.load( memory_order_relaxed );
    apss.compare_exchange_weak( ref, { 123, 456, 789 } );
    cout << sizeof ::apss << endl;
}

The cout above always prints 32 on my platform. But how do these transactions actually work without a mutex? I don't get any clue from inspecting the disassembly.

If I run the following code with MSVC++:

#include <atomic>
#include <thread>
#include <array>

using namespace std;

struct S { size_t a, b, c, d, e; };

atomic<S> apss;

int main()
{
    array<jthread, 2> threads;
    auto threadFn = []()
    {
        auto ref = apss.load( memory_order_relaxed );
        for( size_t i = 10'000'000; i--; apss.compare_exchange_weak( ref, { } ) );
    };
    threads[0] = jthread( threadFn );
    threads[1] = jthread( threadFn );
}

Almost no kernel time is consumed by this code, so the contention is handled entirely in user space. I guess some kind of software transactional memory is at work here.

Oxymoron answered 14/8, 2023 at 12:20 Comment(27)

The atomic may internally use a locking mechanism ... call its is_lock_free() to find out whether it works "without a mutex". – Commission 14/8, 2023 at 12:27

std::atomic::is_lock_free – Orchestra 14/8, 2023 at 12:27

The above code compiles, but doesn't link with GCC 13.1.0 (MinGW built by Brecht Sanders) for me. – Shorthand 14/8, 2023 at 12:29

@Fureeish: Link with -latomic. – Oxymoron 14/8, 2023 at 12:30

@EdisonvonMyosotis Have you checked the standard library source code, it may well do the locking (it may still use compiler intrinsics which will forward to the OS) – Cherice 14/8, 2023 at 12:30

@PepijnKramer The only external library call for the above two first lines in main() is to memcmp() with MSVC++. I think the code uses sth. like software transactional memory but I don't know how this actually works. – Oxymoron 14/8, 2023 at 12:34

@EdisonvonMyosotis weird, that actually worked, but it didn't require me to manually link against atomic for "simpler" (e.g., atomic<SmallObject>) use-cases. And I have no idea why that was the case – Shorthand 14/8, 2023 at 12:41

std::atomic<T> does not imply that T is atomic on the hardware level. The point of std::atomic<T> is that you need not know if T is atomic on the hardware level. Actually, even if bool is atomic for the hardware it is not for C++, but you need to use std::atomic<bool> – Crock 14/8, 2023 at 12:41

@Shorthand #76854980 – Crock 14/8, 2023 at 12:42

FWIW there is no threading going on here so the compiler is within its rights to optimize all the atomic code away. Essentially your program can be optimized to cout << sizeof ::apss << endl; – Coercion 14/8, 2023 at 12:42

What output do you get with cout << (apss.is_lock_free() ? "LOCKFREE" : "MUTEX") << "\n";? – Goodoh 14/8, 2023 at 12:52

@Eljay: The above atomic claims not to be lock-free, but if I constantly do compare_exchange_weak() from two threads I get two loaded cores without any kernel memory consumption. So the whole thing is happening in userspace and there must be some kind of software transactional memory here. – Oxymoron 14/8, 2023 at 12:58

@Coercion The first two lines actually aren't optimized away. – Oxymoron 14/8, 2023 at 13:0

@EdisonvonMyosotis Why would a mutex require kernel memory consumption? – Selfsupport 14/8, 2023 at 13:20

A lock doesn't imply a mutex; in some implementations it's implemented with a spin lock – Pivotal 14/8, 2023 at 13:29

Relevant question: Where is the lock for a std::atomic? – Hannon 14/8, 2023 at 13:58

Can you show us the disassembly you are looking at? – Roentgenogram 14/8, 2023 at 14:20

AFAIK there isn't any transactional memory mechanism that could feasibly be used here. Intel's TSX exists but isn't widely available; it was disabled by microcode updates on older CPUs due to security bugs, and is not being implemented on newer CPUs. I think you are going to find that a lock of some kind is being used. – Roentgenogram 14/8, 2023 at 14:26

I remember a question some time ago where we worked through a disassembly of MSVC's non-lock-free atomics and found that they added a spinlock as an extra hidden member of the struct. That would be consistent with your observation that both cores run 100% and no kernel resources are used. I can't find it now, unfortunately. – Roentgenogram 14/8, 2023 at 14:29

@NateEldredge This can be easily checked by using sizeof. A hidden member needs to occupy some storage. libstdc++ and libc++ seem to use another solution (hash table of locks indexed by the pointer to an atomic object), as written in the post I linked above. – Hannon 14/8, 2023 at 14:56

@Yakk-AdamNevraumont If there's no contention a mutex is locked completely in userspace; if there's contention the kernel participates in locking. – Oxymoron 15/8, 2023 at 9:18

@AlanBirtles Mutexes with partial spinning are common, but pure spinlocks don't make sense in user space since a thread holding a spinlock could be scheduled away, thereby keeping contenders spinning. – Oxymoron 15/8, 2023 at 9:19

@NateEldredge Transactional memory is also possible in userspace without hardware support. That's called software transactional memory. STM is much less efficient than hardware transactional memory and because of that not used very often. – Oxymoron 15/8, 2023 at 9:20

@NateEldredge As I described spinlocks don't make sense in userspace. – Oxymoron 15/8, 2023 at 9:21

@EdisonvonMyosotis: You are absolutely right about the problem with spinlocks, but nevertheless that is what that previous disassembly showed. I too thought it was a strange design. I wish I could find it. I'll search some more. – Roentgenogram 15/8, 2023 at 15:16

@EdisonvonMyosotis: Aha, I found it: #69245683. It was for atomic<pair<uintptr_t, uintptr_t>>. Interestingly the OP there also initially guessed that transactional memory was involved. – Roentgenogram 15/8, 2023 at 15:26

@NateEldredge Software transactional memory and hardware transactional memory are very different to program. – Oxymoron 16/8, 2023 at 17:21

If there is no machine code primitive to perform the action without a lock, std::atomic will add the required lock to ensure things are atomic.

There is even a compile-time is_always_lock_free member that can be used to test this.

This is really important in contexts where mutexes cannot be used, such as signal handlers.
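A minimal sketch of such a check, assuming a mainstream 64-bit target where 32-bit atomics are native. The type Big and the helper usable_without_locks are made up for illustration and do not come from the question:

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

// Hypothetical 40-byte type (on typical 64-bit targets), shaped like the
// second struct S in the question.
struct Big { std::size_t a, b, c, d, e; };

// Compile-time answers: true only if *every* object of the type is lock-free.
constexpr bool word_always_lock_free = std::atomic<std::uint32_t>::is_always_lock_free;
constexpr bool big_always_lock_free  = std::atomic<Big>::is_always_lock_free;

// Hypothetical helper: code that must never take a lock (for example a
// signal handler) can reject unsuitable types at compile time.
template <class T>
constexpr bool usable_without_locks()
{
    return std::atomic<T>::is_always_lock_free;
}

// Assumption: 32-bit atomics are native on the target being compiled for.
static_assert(usable_without_locks<std::uint32_t>(), "expected native 32-bit atomics");
```

A static_assert on usable_without_locks<Big>() would fail to compile on such a target, which is the point: the locked fallback is detected before the code ever runs.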

Edit: Worth adding that a good locking mechanism will use atomics in user-space and only defer to the kernel if there is contention. The futex on Linux is one such mechanism. This is used for mutexes on Linux.

Coerce answered 14/8, 2023 at 14:18 Comment(7)

Is there any way to distinguish between situations where an implementation is aware of (and follows) a target platform's convention for locking, thus allowing interop with code outside the implementation, versus those where the implementation is unaware of such a convention and thus has to implement its own locking mechanism which would thus be unsuitable for interop with outside code? – Clevelandclevenger 14/8, 2023 at 20:53

Check the MSVC machine code - there's no locking for the above code. I guess the code uses software transactional memory. – Oxymoron 15/8, 2023 at 3:19

@EdisonvonMyosotis: I would love to check the MSVC machine code, but I don't have MSVC readily available, nor do I know what version or compiler options you used. Would you please post it for us? – Roentgenogram 15/8, 2023 at 15:32

@NateEldredge I use Visual Studio 2022 with the latest updates. – Oxymoron 16/8, 2023 at 17:21

The non-lock-free fallback may just be a simple spinlock, or may use the same locking code as std::mutex (which yes on Linux will use futex if the lock is unavailable after some retries). Depends on the C++ standard library, or on the compiler's internal implementation of GNU C builtins like __atomic_load_n. See Where is the lock for a std::atomic? – Harriettharrietta 17/8, 2023 at 17:17

@EdisonvonMyosotis: godbolt.org/z/n417vbx6W shows MSVC 19.35 inlining a spinlock loop for apss.load(). Note the xchg DWORD PTR std::atomic<S> apss, eax and the branching involving a pause in the spin-wait loop. – Harriettharrietta 17/8, 2023 at 17:33

Not 100% sure of the implementation but I think a Windows CriticalSection will operate all userside if there is no contention. – Coerce 18/8, 2023 at 6:59

TL;DR: it is a userspace spinlock, a bad decision that is locked in for the time being for ABI reasons.

MSVC uses a spinlock for atomic but an SRWLOCK for atomic_ref.

See the source:

    // Spinlock integer for non-lock-free atomic. <xthreads.h> mutex pointer for non-lock-free atomic_ref
    mutable typename _Atomic_storage_types<_Ty>::_Spinlock _Spinlock{};

A pure spinlock is currently considered bad practice, specifically because it never yields to the kernel and can provoke a long busy wait if the lock holder is hit by an unfortunate context switch.

This is acknowledged by the MSVC STL maintainers, but for ABI compatibility reasons it cannot be fixed right now. A couple of years ago a PR was accepted that at least adds a pause instruction to the busy-wait loop, which makes the situation a bit better, but there is still no kernel wait.
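The scheme described above can be sketched roughly as follows. This is not MSVC's actual code: SpinlockedValue, Triple and cpu_relax are hypothetical names, and the sketch only assumes what the answer states, namely a lock flag stored inside the object and a pause in the busy-wait loop.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <type_traits>

#if defined(__x86_64__) || defined(_M_X64)
#include <immintrin.h>
inline void cpu_relax() { _mm_pause(); }   // x86 "pause" hint for spin loops
#else
inline void cpu_relax() {}                 // assumption: no hint on other targets
#endif

// Hypothetical stand-in for the question's three-member struct S.
struct Triple { std::size_t a, b, c; };

// Sketch of a non-lock-free "atomic": the value plus a hidden spinlock flag.
template <class T>
class SpinlockedValue {
    static_assert(std::is_trivially_copyable<T>::value, "T must be trivially copyable");
public:
    T load() const {
        lock();
        T copy = value_;
        unlock();
        return copy;
    }

    // Bytewise compare, as std::atomic does; on failure 'expected' is updated.
    bool compare_exchange(T& expected, const T& desired) {
        lock();
        const bool equal = std::memcmp(&value_, &expected, sizeof(T)) == 0;
        if (equal) value_ = desired; else expected = value_;
        unlock();
        return equal;
    }

private:
    void lock() const {
        // Try to grab the flag; while it is taken, spin on plain loads.
        while (locked_.exchange(true, std::memory_order_acquire))
            while (locked_.load(std::memory_order_relaxed))
                cpu_relax();                // never enters the kernel
    }
    void unlock() const {
        assert(locked_.load(std::memory_order_relaxed));
        locked_.store(false, std::memory_order_release);
    }

    T value_{};
    mutable std::atomic<bool> locked_{false};
};
```

Two threads hammering compare_exchange on one SpinlockedValue<Triple> would keep two cores busy with no kernel involvement, which matches what the question observed. The hidden flag also makes the object larger than the payload alone.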

atomic_ref, added in C++20, was able to start from scratch and uses SRWLOCK, which goes to the kernel after some unspecified amount of unsuccessful spinning.

With the next ABI-breaking version, std::atomic is expected to use SRWLOCK too.

By the way, in MSVC each non-lock-free atomic has its own dedicated spinlock as a member, and is likely to have its own SRWLOCK in the future. (Another possibility is a hash table of such objects, which is effectively the only possibility for atomic_ref.)

No, MSVC does not use transactional memory yet, neither for atomics nor for anything else, except that some intrinsics are available. Using it for atomics looks like a good idea to me, though.

To be clear, I mean hardware transactional memory (at least Intel RTM; I doubt that MSVC ever supported the AMD thing), and in an ABI-breaking version. I don't know much about software transactional memory.
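The hash-table alternative can be sketched like this. It is only an illustration of the idea, not the code of any real standard library: the table size, the hash, and the names lock_for, locked_load and locked_compare_exchange are all assumptions.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <mutex>

// Hypothetical 40-byte payload, like the question's five-member struct S.
struct Wide { std::size_t a, b, c, d, e; };

constexpr std::size_t kLockCount = 64;      // assumed table size

// Map an object's address to a slot; unrelated objects may share a slot.
inline std::size_t lock_index_for(const void* p)
{
    const auto addr = reinterpret_cast<std::uintptr_t>(p);
    return (addr >> 6) % kLockCount;        // drop the low bits, then wrap
}

inline std::mutex& lock_for(const void* p)
{
    static std::array<std::mutex, kLockCount> table;   // one global table
    return table[lock_index_for(p)];
}

// The object itself carries no lock, so this also works for a plain T
// that is only referenced, which is the atomic_ref situation.
template <class T>
T locked_load(const T& obj)
{
    std::lock_guard<std::mutex> guard(lock_for(&obj));
    return obj;
}

template <class T>
bool locked_compare_exchange(T& obj, T& expected, const T& desired)
{
    std::lock_guard<std::mutex> guard(lock_for(&obj));
    if (std::memcmp(&obj, &expected, sizeof(T)) == 0) {
        obj = desired;
        return true;
    }
    expected = obj;                         // report the current value
    return false;
}
```

Because the lock is found through the object's address inside one process, a scheme like this would not work for objects shared between processes, whereas a lock stored inside the object is at least located with the data.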

Quar answered 17/8, 2023 at 13:26 Comment(4)

Other implementations, such as GCC and Clang (at least targeting non-Windows) do use a hash table of locks, keyed on the address of the atomic object. Where is the lock for a std::atomic? . So they're not address-free, and won't work across processes in shared memory the way MSVC's will (?) with the lock inside the atomic object. Interesting, godbolt.org/z/ef5ndEsxo shows MSVC inlining the locking code for .load() on the OP's struct. – Harriettharrietta 17/8, 2023 at 17:26

Note the OP said software transactional memory. That would be more expensive than just using a lock per object, since it allows different combinations of things to be read and written as atomic transactions. And it's not ABI-compatible with hardware transactional memory (like Intel TSX / RTM). With the HLE part of TSX disabled in microcode on current CPUs, we can't have nice things. (hardware lock elision made spinlocks work as transactions without actually contending over the spinlock's cache line.) – Harriettharrietta 17/8, 2023 at 17:29

@PeterCordes, yes, I meant hardware transactional memory, specifically RTM, and with ABI break. I would not rely on MSVC non-is_lock_free std::atomic being address-free, as the ABI-breaking version is likely to use an SRWLOCK, not something custom with RTM or without it, and SRWLOCK isn't address-free (it is like a futex-based nonrecursive lightweight shared mutex). – Quar 17/8, 2023 at 18:41

Right yes, good point that it's not future-proof to rely on MSVC's std::atomic fallback locks being address-free. The ISO C++ standard recommends (with "should" phrasing IIRC) that is_lock_free atomics should be address-free, which is the case on all implementations I'm aware of, so software wanting to do shared memory across processes should be checking for is_always_lock_free for both portability (to non-Windows) and future-proofing. And besides, locking/unlocking every access sucks; it takes more code but a totally different fallback path using your own locking could be much better. – Harriettharrietta 17/8, 2023 at 19:2
