Where is the lock for a std::atomic?

If a data structure has multiple elements in it, the atomic version of it cannot (always) be lock-free. I was told that this is true for larger types because the CPU cannot atomically change the data without using some sort of lock.

For example:

#include <iostream>
#include <atomic>

struct foo {
    double a;
    double b;
};

std::atomic<foo> var;

int main()
{
    std::cout << var.is_lock_free() << std::endl;
    std::cout << sizeof(foo) << std::endl;
    std::cout << sizeof(var) << std::endl;
}

The output (Linux/gcc) is:

0
16
16

Since the atomic and foo are the same size, I don't think a lock is stored in the atomic.

My question is:
If an atomic variable uses a lock, where is that lock stored, and what does that mean for multiple instances of that variable?

Philo asked 11/5, 2018 at 18:38 Comment(9)
You need to use an external lock (mutex). – Gulgee
Have you tried with a larger type than just 16 bytes? I'm not super familiar with the x86 architecture, but I wouldn't be surprised if it had a single CAS instruction for 16 bytes, which would preclude the need for an explicit lock. – Shalom
@Shalom In that case I would expect is_lock_free() to be true. – Peppy
@Shalom: gcc / clang on x86-64 Linux do use lock cmpxchg16b if available, but gcc7 and later still return false for is_lock_free even though it's technically lock-free; pure loads and pure stores are slow, though, and pure loads contend with each other. See is_lock_free() returned false after upgrading to MacPorts gcc 7.3 for links to more details about this design decision. – Leandraleandre
@Gulgee No, that defeats the purpose of std::atomic. – Historiated
MSVC and GCC have divergent behavior on this subject. MSVC starts storing a lock as soon as you start dealing with composite types, whereas GCC is happy to defer everything to instruction-level constructs: godbolt.org/g/nJhTn7 – Shalom
@Shalom But where is the state of the lock stored? Say foo's size is 800 bytes; gcc needs to use a mutex or something like that, right? – Historiated
@Historiated If I knew, I'd post as an answer, not as a comment. ;) – Shalom
As per github.com/llvm-mirror/compiler-rt/blob/master/lib/builtins/…, clang (and probably gcc) uses the address of the atomic as an index into a hash map of locks. – Nolen
69

The easiest way to answer such questions is generally to just look at the resulting assembly and take it from there.

Compiling the following (I made your struct larger to dodge crafty compiler shenanigans):

#include <atomic>

struct foo {
    double a;
    double b;
    double c;
    double d;
    double e;
};

std::atomic<foo> var;

void bar()
{
    var.store(foo{1.0,2.0,1.0,2.0,1.0});
}

with clang 5.0.0 under -O3 yields the following (see it on godbolt):

bar(): # @bar()
  sub rsp, 40
  movaps xmm0, xmmword ptr [rip + .LCPI0_0] # xmm0 = [1.000000e+00,2.000000e+00]
  movaps xmmword ptr [rsp], xmm0
  movaps xmmword ptr [rsp + 16], xmm0
  movabs rax, 4607182418800017408
  mov qword ptr [rsp + 32], rax
  mov rdx, rsp
  mov edi, 40
  mov esi, var
  mov ecx, 5
  call __atomic_store

Great, the compiler delegates to a library helper (__atomic_store), so this alone doesn't tell us what's really going on. However, since the runtime is open source, we can easily find the implementation of the helper (I found it in https://github.com/llvm-mirror/compiler-rt/blob/master/lib/builtins/atomic.c):

void __atomic_store_c(int size, void *dest, void *src, int model) {
#define LOCK_FREE_ACTION(type) \
    __c11_atomic_store((_Atomic(type)*)dest, *(type*)src, model);\
    return;
  LOCK_FREE_CASES();  // dispatches to a lock-free store for supported sizes
#undef LOCK_FREE_ACTION
  // Too large for a lock-free store: take the lock for this address instead.
  Lock *l = lock_for_pointer(dest);
  lock(l);
  memcpy(dest, src, size);
  unlock(l);
}

It seems like the magic happens in lock_for_pointer(), so let's have a look at it:

static __inline Lock *lock_for_pointer(void *ptr) {
  intptr_t hash = (intptr_t)ptr;
  // Disregard the lowest 4 bits.  We want all values that may be part of the
  // same memory operation to hash to the same value and therefore use the same
  // lock.  
  hash >>= 4;
  // Use the next bits as the basis for the hash
  intptr_t low = hash & SPINLOCK_MASK;
  // Now use the high(er) set of bits to perturb the hash, so that we don't
  // get collisions from atomic fields in a single object
  hash >>= 16;
  hash ^= low;
  // Return a pointer to the word to use
  return locks + (hash & SPINLOCK_MASK);
}

And here's our explanation: the address of the atomic object is hashed to select a pre-allocated lock from a fixed table.
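
For illustration, here's a minimal C++ sketch of the same scheme (the names and constants are mine, not compiler-rt's): a fixed, process-wide table of spinlocks, selected by hashing the object's address.

#include <atomic>
#include <cstdint>
#include <cstring>

// A fixed table of spinlocks shared by every "locked" atomic in the process.
// Static storage duration zero-initializes the flags to the unlocked state.
static constexpr std::uintptr_t kMask = 1023;
static std::atomic<bool> g_locks[kMask + 1];

static std::atomic<bool>* lock_for(const void* p) {
    std::uintptr_t h = reinterpret_cast<std::uintptr_t>(p) >> 4;  // drop low bits
    h ^= h >> 16;  // fold in higher bits, in the same spirit as compiler-rt
    return &g_locks[h & kMask];
}

// A locked "atomic" store: correct only if *every* access to the object
// goes through the same lock table.
static void locked_store(void* dest, const void* src, std::size_t size) {
    std::atomic<bool>* l = lock_for(dest);
    while (l->exchange(true, std::memory_order_acquire)) { /* spin */ }
    std::memcpy(dest, src, size);
    l->store(false, std::memory_order_release);
}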

Nolen answered 11/5, 2018 at 19:13 Comment(2)
You can use godbolt.org to easily generate and share assembly for various compilers. – Peppy
@FrançoisAndrieux Yes, I know. My personal preference is to use godbolt links in comments, but to copy-paste the results when writing answers so that they stay fully self-contained (if the assembly is short enough). – Nolen
77

The usual implementation is a hash table of mutexes (or even just simple spinlocks, without a fallback to OS-assisted sleep/wakeup), using the address of the atomic object as a key. The hash function might be as simple as using the low bits of the address as an index into a power-of-2-sized array, but @Frank's answer shows that LLVM's std::atomic implementation XORs in some higher bits, so you don't automatically get aliasing when objects are separated by a large power of 2 (which is more common than any other random arrangement).

I think (but I'm not sure) that g++ and clang++ are ABI-compatible, i.e. that they use the same hash function and table, so they agree on which lock serializes access to which object. The locking is all done in libatomic, though, so if you dynamically link libatomic then all code inside the same program that calls __atomic_store_16 will use the same implementation; clang++ and g++ definitely agree on which function names to call, and that's enough. (Note that only lock-free atomic objects work in shared memory between different processes: each process has its own hash table of locks, so locked objects can't synchronize across the process boundary. Lock-free objects are supposed to (and in fact do) Just Work in shared memory on normal CPU architectures, even if the region is mapped to different addresses in each process.)
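
As a Linux-only sketch of that last point (assumes POSIX mmap/fork; error handling omitted), a lock-free atomic keeps working across fork() because all of its state lives in the shared bytes themselves:

#include <atomic>
#include <cassert>
#include <new>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int main() {
    // Anonymous shared mapping, inherited across fork().
    void* mem = mmap(nullptr, sizeof(std::atomic<int>),
                     PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    auto* counter = new (mem) std::atomic<int>(0);
    static_assert(std::atomic<int>::is_always_lock_free, "no hidden lock");  // C++17

    if (fork() == 0) {                        // child increments...
        for (int i = 0; i < 100000; ++i) counter->fetch_add(1);
        _exit(0);
    }
    for (int i = 0; i < 100000; ++i) counter->fetch_add(1);  // ...and so does the parent
    wait(nullptr);
    assert(counter->load() == 200000);        // no lost updates, no tearing
}

A locked std::atomic placed in that mapping would not be safe: the parent and the child would each spin on their own private copy of the lock table.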

Hash collisions mean that two atomic objects might share the same lock. This is not a correctness problem, but it can be a performance problem: instead of two pairs of threads separately contending with each other for two different objects, you could have all four threads contending for access to either object. Presumably that's unusual, and normally you aim for your atomic objects to be lock-free on the platforms you care about; most of the time you don't get really unlucky, and it's basically fine.

Deadlocks aren't possible because there aren't any std::atomic functions that try to take the lock on two objects at once. So the library code that takes the lock never tries to take another lock while holding one of these locks. Extra contention / serialization is not a correctness problem, just performance.


x86-64 16-byte objects with GCC vs. MSVC:

As a hack, compilers can use lock cmpxchg16b to implement 16-byte atomic load/store, as well as actual read-modify-write operations.

This is better than locking, but has bad performance compared to 8-byte atomic objects (e.g. pure loads contend with other loads). It's the only documented safe way to atomically do anything with 16 bytes.¹

AFAIK, MSVC never uses lock cmpxchg16b for 16-byte objects, and they're basically the same as a 24 or 32 byte object.

gcc6 and earlier inlined lock cmpxchg16b when you compile with -mcx16. (cmpxchg16b unfortunately isn't baseline for x86-64; first-gen AMD K8 CPUs are missing it.)
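
For instance (a sketch; codegen depends on compiler version and flags), a 16-byte CAS loop like the following compiles to an inline lock cmpxchg16b with gcc6 -mcx16, and to a libatomic call with gcc7 and later:

#include <atomic>
#include <cstdint>

struct Pair { std::uint64_t lo, hi; };   // 16 bytes, trivially copyable

std::atomic<Pair> p{Pair{0, 0}};

void bump_both() {
    Pair expected = p.load();
    Pair desired;
    do {
        desired = Pair{expected.lo + 1, expected.hi + 1};
        // Bytewise compare-and-swap of the whole 16-byte object.
    } while (!p.compare_exchange_weak(expected, desired));
}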

gcc7 decided to always call libatomic and never report 16-byte objects as lock-free, even though libatomic functions would still use lock cmpxchg16b on machines where the instruction is available. See is_lock_free() returned false after upgrading to MacPorts gcc 7.3. The gcc mailing list message explaining this change is here.
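
You can observe the gcc7 behavior from C++17 (the results described here are what gcc reports; other compilers may differ):

#include <atomic>
#include <cstdio>

struct S16 { long long a, b; };          // 16 bytes on x86-64
std::atomic<S16> x;

int main() {
    // gcc7+ prints 0 for both, even with -mcx16, although libatomic
    // may still use lock cmpxchg16b internally on CPUs that have it.
    std::printf("%d %d\n", (int)std::atomic<S16>::is_always_lock_free,
                (int)x.is_lock_free());
}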

You can use a union hack to get a reasonably cheap ABA pointer+counter on x86-64 with gcc/clang: How can I implement ABA counter with c++11 CAS?. lock cmpxchg16b for updates of both pointer and counter, but simple mov loads of just the pointer. This only works if the 16-byte object is actually lock-free using lock cmpxchg16b, though.


Footnote 1: movdqa 16-byte load/store is atomic in practice on some (but not all) x86 microarchitectures, and there's no reliable or documented way to detect when it's usable. See Why is integer assignment on a naturally aligned variable atomic on x86?, and SSE instructions: which CPUs can do atomic 16B memory operations? for an example where K10 Opteron shows tearing at 8B boundaries only between sockets with HyperTransport.

So compiler writers have to err on the side of caution and can't use movdqa the way they use SSE2 movq for 8-byte atomic load/store in 32-bit code. It would be great if CPU vendors could document some guarantees for some microarchitectures, or add CPUID feature bits for atomic 16-, 32-, and 64-byte aligned vector load/store (with SSE, AVX, and AVX512), which motherboard vendors could then disable in firmware on funky many-socket machines that use special coherency glue chips that don't transfer whole cache lines atomically.

Leandraleandre answered 11/5, 2018 at 19:13 Comment(2)
Nitpick: LLVM's implementation is a fair bit more involved/crafty than just using the low bits as an index. – Nolen
@Frank: Thanks, fixed. I saw a shift and mask, but of course that happens with a more complex hash function too, so I should have kept looking and seen the XORing of some high bits. That makes more sense; large power-of-2 strides are not rare in computer programs, and naive low bits would collide for those. – Leandraleandre
13

From 29.5.9 of the C++ standard:

Note: The representation of an atomic specialization need not have the same size as its corresponding argument type. Specializations should have the same size whenever possible, as this reduces the effort required to port existing code. — end note

It is preferable for an atomic to have the same size as its argument type, though this is not required. The way to achieve that is to either avoid locks entirely or to store the locks in a separate structure. As the other answers have already explained, a hash table is used to hold the locks; this is the most memory-efficient way of storing any number of locks for all the atomic objects in use.
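
A small demonstration of that layout (the behavior described in the comments is what I'd expect from gcc on x86-64 Linux; other targets may differ):

#include <atomic>
#include <cstdio>

struct Big { double d[100]; };   // 800 bytes, far beyond any lock-free width

int main() {
    std::atomic<int> small;
    std::atomic<Big> big;
    // Both lines print equal sizes: the lock is not stored inside the
    // object, it lives in libatomic's shared hash table.
    std::printf("%zu %zu %d\n", sizeof small, sizeof(int), (int)small.is_lock_free());
    std::printf("%zu %zu %d\n", sizeof big, sizeof(Big), (int)big.is_lock_free());
}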

Phalangeal answered 11/5, 2018 at 19:25 Comment(1)
Another reason for not putting a lock inside each object is interoperability with C11 atomics, where static initialization is a problem. The C11 standard defines an ATOMIC_VAR_INIT macro (which doesn't work well for compound types), and also requires that static zero-initialization of an atomic object works. And C11 doesn't provide any destructors to free OS resources held by per-object locks. See also developers.redhat.com/blog/2016/01/14/… for some discussion of the problems discovered in the C11 standard. – Leandraleandre
