Here is an article over at software.intel.com that sheds some light on user-level locks:
User level locks involve utilizing the atomic instructions of the
processor to atomically update a memory space. The atomic instructions
involve utilizing a lock prefix on the instruction and having the
destination operand assigned to a memory address. The following
instructions can run atomically with a lock prefix on current Intel
processors: ADD, ADC, AND, BTC, BTR, BTS, CMPXCHG, CMPXCHG8B, DEC, INC,
NEG, NOT, OR, SBB, SUB, XOR, XADD, and XCHG. [...] For most instructions
a lock prefix must be explicitly used, except for the xchg instruction,
where the lock prefix is implied if the instruction involves a memory
address.
In the days of Intel 486 processors, the lock prefix used to assert a
lock on the bus, at a large cost in performance. Starting with
the Intel Pentium Pro architecture, the bus lock is transformed into a
cache lock. A lock will still be asserted on the bus in the most
modern architectures if the lock resides in uncacheable memory or if
the lock extends beyond a cache line boundary, splitting cache lines.
Both of these scenarios are unlikely, so most lock prefixes will be
transformed into a cache lock, which is much less expensive.
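
To see one of these lock-prefixed instructions come out of ordinary code, here is a minimal sketch (assuming GCC or Clang on x86-64; the __atomic builtin is theirs, the variable names are my own): the atomic fetch-and-add below compiles to a single lock xadd, one of the instructions from the list above.

    #include <stdio.h>

    static long counter = 0;

    long increment(void)
    {
        /* Atomically adds 1 and returns the previous value; on x86-64
         * this compiles to roughly: lock xadd %rax, counter(%rip) */
        return __atomic_fetch_add(&counter, 1, __ATOMIC_SEQ_CST);
    }

    int main(void)
    {
        increment();
        increment();
        printf("counter = %ld\n", counter); /* prints 2 */
        return 0;
    }
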
So what prevents another core from accessing the memory address? The cache coherency protocol already manages access rights for cache lines. So if a core has (temporarily) exclusive access rights to a cache line, no other core can access that cache line. To access that cache line, the other core has to obtain access rights first, and the protocol for obtaining those rights involves the current owner. In effect, the cache coherency protocol prevents other cores from accessing the cache line silently.
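
As an illustration of how this is used at user level, here is a sketch of a minimal test-and-set spinlock (again assuming GCC/Clang atomic builtins; this is a teaching sketch, not a production lock): the exchange compiles to xchg with a memory operand, whose lock prefix is implied as the quote notes, and every attempt forces the spinning core to obtain the cache line in exclusive state from the current owner.

    typedef struct { volatile int locked; } spinlock_t; /* 0 = free, 1 = held */

    static void spin_lock(spinlock_t *l)
    {
        /* Keep swapping 1 in; when the old value was 0, we took the lock.
         * On x86 this compiles to xchg, so the lock prefix is implied. */
        while (__atomic_exchange_n(&l->locked, 1, __ATOMIC_ACQUIRE) != 0)
            ; /* spin; a real lock would also issue a pause instruction here */
    }

    static void spin_unlock(spinlock_t *l)
    {
        /* A plain store with release ordering is enough to free the lock. */
        __atomic_store_n(&l->locked, 0, __ATOMIC_RELEASE);
    }
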
If the locked access is not confined to a single cache line, things get more complicated. There are all kinds of nasty corner cases, such as locked accesses that cross page boundaries. Intel does not disclose the details, and they probably use all kinds of tricks to make locks faster.
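
To make the cache line boundary case from the quote concrete, here is a sketch (assuming 64-byte cache lines, which current Intel parts use, and GCC/Clang builtins; the buffer and offsets are my own illustration): the first add stays entirely within one line and gets the cheap cache lock, while the second straddles a line boundary and takes the expensive split-lock path.

    #include <stdint.h>

    static char buf[128] __attribute__((aligned(64)));

    int main(void)
    {
        /* Entirely inside the first cache line: handled as a cache lock. */
        int64_t *aligned = (int64_t *)(buf + 0);

        /* Bytes 60..67 cross the boundary between the first and second
         * cache line: a locked access here is a split lock and may be
         * escalated to a bus lock (recent Linux kernels can even be
         * configured to trap such accesses). */
        int64_t *split = (int64_t *)(buf + 60);

        __atomic_fetch_add(aligned, 1, __ATOMIC_RELAXED);
        __atomic_fetch_add(split, 1, __ATOMIC_RELAXED);
        return 0;
    }
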