Do atomic operations need any hardware capabilities?
Practically speaking, yes. However, the extent of this support varies from architecture to architecture, and possibly for different sizes and types of atomic object1.
std::atomic::is_lock_free
returns whether the atomic object supports lock-free operations. If this member function returns false
, then the atomic is usually implemented using a mutex. Using a mutex is obviously a correct implementation of atomics, however, it often doesn't mean the expectations of the user when it comes to performance. In this scenario, the atomic still needs some hardware support because mutexes require some limited hardware support to be implemented.
std::atomic_flag
is the bare minimum lock-free support that is given. It supports a test_and_set
member function which must be lock-free. It is impossible to implement an atomic test_and_set
without some hardware support, such as having a hardware instruction like lock bts
or lock xchg
that does it directly, or locking the entire memory bus.
Do languages provide APIs for it? if yes, how are atomic APIs implemented?
Your question is tagged c and c++ so I'll focus on those. C++ provides the std::atomic
class template, and C provides the _Atomic
qualifier for objects. There is also an _Atomic()
compatibility macro in C++, or specifier in C. The following code works in both languages:
_Atomic(int) x = 0;
x++; // atomically increment
The ++
is an atomic fetch-add with sequentially consistent memory order. If the atomic is lock-free, this may be implemented as a lock add
x86_64 instruction. Otherwise, it the atomic object may simply be locked with a mutex for the atomic operation.
Are these limited only to kernel space programming, or are they available for user-space programming too?
Atomic operations and their capabilities are not limited to kernel space.
As in the example above, a ++
could compiled to a lock add
instruction, which is just fine in user space.
1 For example, std::atomic<int>
is typically implemented using hardware atomic instructions on x86, with most of the important operations mapping to a single instructions each, although .fetch_or
and other operations other than add/sub (which can use lock xadd
) need a lock cmpxchg
retry loop if the return value is used, otherwise they can use lock and dword [eax], ecx
or equivalent which only sets FLAGS according to the result.
Any atomic RMW on a single object can be synthesized in a lock-free way with a retry loop around a CAS operation (Compare And Swap), as wikipedia explains. So that's an operation ISAs need to make possible, both as a building block and because C++ lock-free objects support compare_exchange_strong
/weak
, allowing programs to build their own atomic RMW operations the standard doesn't already provide.
A CAS retry loop is still lock-free (no thread can block progress by other threads indefinitely no matter where it sleeps), but not wait-free. In practice, if contention is low, retries are rare. And hopefully that's the case, because that's where lock-free atomics are best.
On 32-bit x86, std::atomic<long long>
is still lock-free (unless you compile for i486 or earlier), but it's wider than a single integer register so the atomic primitive operations are only load, store, and lock cmpxchg8b
(8-byte compare_exchange_strong
).
So all the RWM operations on std::atomic<long long>
including .exchange
and .fetch_add
have to be synthesized out of CAS retry loops, except for compare_exchange_strong
and weak
itself of course.
64-bit pure load / pure store in 32-bit mode can only be done using SSE or MMX, or x87 fild
, which are guaranteed atomic for aligned accesses on P5 Pentium and later. (Compilers targeting 32-bit x86 usually default to assuming at least i686 (Pentium Pro) instructions and ISA features, these days often also including SSE2.)
That situation is somewhat like machines with load-linked / store-conditional such as ARM (ldrex
/ strex
), and AArch64 before ARMv8.1, where every atomic RMW (including compare_exchange_strong
but not weak
) requires a loop around LL/SC, (CAS_weak can be a single LL/SC attempt, since spurious failure is allowed.)