Do atomic operations require support from hardware?

Asked 16/11, 2018 at 6:8 Answered 28/9, 2023 at 8:56

Recently when I was reading about lock-free programming, I came across "atomic operations". I started digging deep into it. All links explain how to write atomic operations and their usages.

However, I am looking for some details on atomic operations.

Do atomic operations need any hardware capabilities?
Do languages provide APIs for it? if yes, how are atomic APIs implemented?
Are these limited only to kernel space programming, or are they available for user-space programming too?

Atropine answered 16/11, 2018 at 6:8 Comment(3)

If I'm not mistaken, the short answer is proper locking, so #1 is a no: en.m.wikipedia.org/wiki/Dekker%27s_algorithm, en.m.wikipedia.org/wiki/Peterson%27s_algorithm. Hardware support makes everything easier if course – Disagree 16/11, 2018 at 6:11

Yes. On x86 and x64 this kind of "lockless" code is usually implemented with the LOCK instruction prefix :) Not an api, the compiler's back-end knows about the intrinsics. – Longdistance 16/11, 2018 at 8:41

This does indeed require hardware support (for some definition of "requires hardware support"). The algorithms in the links above assume that memory writes are viewed by other threads in the same order they were written, which is not guaranteed on all CPUs. – Re 28/9, 2023 at 20:47

Do atomic operations need any hardware capabilities?

Sure, CPUs guarantee that some of their instructions are atomic. Some of those instructions are "special", i.e. differ from other instructions (prefixed, or have other mnemonics), but some instructions might be "normal". For example, aligned stores and loads are guaranteed to be atomic on most CPUs.

Do languages provide APIs for it? if yes, how are atomic APIs implemented?

Sure, have a look for example at C++ implementation: https://en.cppreference.com/w/cpp/atomic/atomic

Are these limited only to kernel space programming, or are they available for user-space programming too?

Sure, those instructions do not require any privileges, so they are available for userspace. There is a variety of libraries and data structures which leverage atomic operations.

The keywords for the search are "lockless" or "non-blocking". Here is an example: https://en.wikipedia.org/wiki/Non-blocking_linked_list

Oakes answered 16/11, 2018 at 18:30 Comment(0)

Do atomic operations need any hardware capabilities?

Practically speaking, yes. However, the extent of this support varies from architecture to architecture, and possibly for different sizes and types of atomic object¹.

std::atomic::is_lock_free returns whether the atomic object supports lock-free operations. If this member function returns false, then the atomic is usually implemented using a mutex. Using a mutex is obviously a correct implementation of atomics, however, it often doesn't mean the expectations of the user when it comes to performance. In this scenario, the atomic still needs some hardware support because mutexes require some limited hardware support to be implemented.

std::atomic_flag is the bare minimum lock-free support that is given. It supports a test_and_set member function which must be lock-free. It is impossible to implement an atomic test_and_set without some hardware support, such as having a hardware instruction like lock bts or lock xchg that does it directly, or locking the entire memory bus.

Do languages provide APIs for it? if yes, how are atomic APIs implemented?

Your question is tagged c and c++ so I'll focus on those. C++ provides the std::atomic class template, and C provides the _Atomic qualifier for objects. There is also an _Atomic() compatibility macro in C++, or specifier in C. The following code works in both languages:

_Atomic(int) x = 0;
x++; // atomically increment

The ++ is an atomic fetch-add with sequentially consistent memory order. If the atomic is lock-free, this may be implemented as a lock add x86_64 instruction. Otherwise, it the atomic object may simply be locked with a mutex for the atomic operation.

Are these limited only to kernel space programming, or are they available for user-space programming too?

Atomic operations and their capabilities are not limited to kernel space. As in the example above, a ++ could compiled to a lock add instruction, which is just fine in user space.

¹ For example, std::atomic<int> is typically implemented using hardware atomic instructions on x86, with most of the important operations mapping to a single instructions each, although .fetch_or and other operations other than add/sub (which can use lock xadd) need a lock cmpxchg retry loop if the return value is used, otherwise they can use lock and dword [eax], ecx or equivalent which only sets FLAGS according to the result.

Any atomic RMW on a single object can be synthesized in a lock-free way with a retry loop around a CAS operation (Compare And Swap), as wikipedia explains. So that's an operation ISAs need to make possible, both as a building block and because C++ lock-free objects support compare_exchange_strong/weak, allowing programs to build their own atomic RMW operations the standard doesn't already provide.

A CAS retry loop is still lock-free (no thread can block progress by other threads indefinitely no matter where it sleeps), but not wait-free. In practice, if contention is low, retries are rare. And hopefully that's the case, because that's where lock-free atomics are best.

On 32-bit x86, std::atomic<long long> is still lock-free (unless you compile for i486 or earlier), but it's wider than a single integer register so the atomic primitive operations are only load, store, and lock cmpxchg8b (8-byte compare_exchange_strong).

So all the RWM operations on std::atomic<long long> including .exchange and .fetch_add have to be synthesized out of CAS retry loops, except for compare_exchange_strong and weak itself of course.

64-bit pure load / pure store in 32-bit mode can only be done using SSE or MMX, or x87 fild, which are guaranteed atomic for aligned accesses on P5 Pentium and later. (Compilers targeting 32-bit x86 usually default to assuming at least i686 (Pentium Pro) instructions and ISA features, these days often also including SSE2.)

That situation is somewhat like machines with load-linked / store-conditional such as ARM (ldrex / strex), and AArch64 before ARMv8.1, where every atomic RMW (including compare_exchange_strong but not weak) requires a loop around LL/SC, (CAS_weak can be a single LL/SC attempt, since spurious failure is allowed.)

Airtight answered 28/9, 2023 at 8:56 Comment(4)

All the mainstream compilers for 32-bit x86 have lock-free atomic<long long>, using SSE or x87 fild/fistp for load/store, and lock cmpxchg8b for atomic RMWs. That means fetch_add requires a CAS retry loop instead of a single lock xadd, but it's still better than a mutes. godbolt.org/z/KsYGxYd6K (If compiling for i486 or earlier, then cmpxchg8b isn't available, at least not with the same opcode Pentium and later use, so then it won't be lock-free.) Many ISAs provide a way to do lock-free atomics on a 2-pointer-wide object, although some don't. – Cholecystotomy 28/9, 2023 at 16:17

cmpxchg doesn't implement test_and_set, it implements the much more flexible building block CAS, from which any lock-free RMW can be synthesized with a retry loop (including TAS). TAS is x86 xchg or lock bts, or ARM swp. Some older ISAs had an atomic exchange as their only atomic RMW. (Unlike other RMWs, the store value doesn't depend on processing the load result, so it could be done with a single read+write access to cache in the same clock cycle.) Or some ISAs literally had a TAS instruction that set a byte or bit to a fixed value and read the old value, like x86's lock bts. – Cholecystotomy 28/9, 2023 at 16:22

@PeterCordes thanks for the corrections and additional info. I've included some if it in the answer. Feel free to make additional edits if there's something else to correct or elaborate on. – Airtight 28/9, 2023 at 19:36

atomic<long long> is actually lock-free on 32-bit x86, no mutex. (Again, unless you compile for i486 or earlier). I made an edit; probably some of that could be said in fewer words, I'll have another look later. (Let me know if you want that much detail in your answer, or if you'd rather I posted a separate answer.) – Cholecystotomy 28/9, 2023 at 21:36

-2

Does atomic operation need any hardware capabilities.

In practice, yes. In principle, the C++ (read n3337) or C (read n1570) standards don't even require a computer like the one we are using (you could, in theory and unethically, use a bunch of human slaves instead; a nicer variant of that is a teacher using the students of his classroom to "run" a tiny C or C++ program; this is a very good way of teaching programming).

See also this and that answers of mine (to questions similar to yours).

Thermosiphon answered 16/11, 2018 at 6:12 Comment(8)

Interesting how that was the first alternative to a computer that came to your mind. But on second thought, I couldn't find anything better that isn't essentially derivative :) – Disagree 16/11, 2018 at 6:14

@user4581301 ...trained to meticulously execute machine instructions, sure. I feel like you'd need underpaid humans to teach the monkeys though :) – Disagree 16/11, 2018 at 6:26

But if the monkeys can only carry one byte at a time, they might not be able to move a word atomically. You will need a mechanism to ensure that no monkey carries in a new ls byte, while another monkey is busy moving the ms byte. For example a mechanism the gives the approaching monkey a banana if it patiently waits for the first monkey to carry away both bytes, before dropping a new one. Btw, are monkeys big or little endian? – Cesspool 16/11, 2018 at 9:35

I'm curious why the downvotes. This answers the question...in theory. – Prizefight 28/9, 2023 at 12:57

@MadPhysicist: Turing-complete models of computation that aren't RAM machines include finite automata (Conway's game of life), but IDK how plausible it is to implement a sequential / imperative language like C++ on that, especially not to provide std::atomic operations. – Cholecystotomy 29/9, 2023 at 0:3

One way that wouldn't need special HW support would be a uniprocessor system so only one thread at once can ever be running. Using cooperative multi-tasking with yield points invented by the compiler where needed to satisfy the standard's forward-progress guarantees, but not in the middle of atomic RMWs. Or if the machine is a CISC like x86 that has single-instruction RMWs like xadd [mem], eax for fetch_add, a uniprocessor can just do that to get atomicity wrt. interrupts and thus wrt. other threads. – Cholecystotomy 29/9, 2023 at 0:6

That's still a "hardware capability" though, so it I think only cooperative multi-tasking (I think you could call that turning threads into coroutines) could get away with saying it wasn't relying on any special hardware capability or guarantee that was only relevant for atomic operations, not for single-threaded code. – Cholecystotomy 29/9, 2023 at 0:8

@Matt, if execution is done by a room full of students, that would probably be like a uniprocessor (unless you have multiple parts of the room running different "threads", then you'd definitely need special support.) If std::atomic ops are atomic "for free" in this model, it's because context-switches don't happen in the middle of a student doing an operation. Arguably that's natural for this model of computation and might not count as a "hardware feature", so I'm choosing not to downvote or upvote. (This answer doesn't analyze why students (following one instructor?) would have atomicity.) – Cholecystotomy 29/9, 2023 at 0:12

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Recommended topics

Hot tags