Atomicity of loads and stores on x86

8.1.2 Bus Locking

Intel 64 and IA-32 processors provide a LOCK# signal that is asserted automatically during certain critical memory operations to lock the system bus or equivalent link. While this output signal is asserted, requests from other processors or bus agents for control of the bus are blocked. Software can specify other occasions when the LOCK semantics are to be followed by prepending the LOCK prefix to an instruction.

This comes from the Intel Manual, Volume 3.

It sounds like the atomic operations on memory will be executed directly on memory (RAM). I am confused because I see "nothing special" when I analyze assembly output.

Basically, the assembly output generated for std::atomic<int> X; X.load() is just a normal mov load. And if I understand properly, X.store(2) is just mov [somewhere], 2, plus mfence if it's seq_cst. But the mfence is responsible for proper memory ordering, not for atomicity. That's all. It seems that it doesn't "skip" the cache.
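
For reference, here is roughly what I'm compiling (a minimal sketch; the exact instructions of course depend on the compiler and flags):

    #include <atomic>

    std::atomic<int> X;

    int  load_it()  { return X.load(); }   // what I see: just a plain  mov eax, [X]
    void store_it() { X.store(2); }        // what I see: mov [X], 2  then  mfence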

I know that moving aligned values, ints for example, to memory is atomic. However, I am still confused.


So, I have presented my doubts, but the main question is:

How does the CPU implement atomic operations internally?

Bonaventura answered 18/7, 2016 at 23:2 Comment(3)
Do you have any external bus (shared bus) in your CPU (i3/i5/i7)? – Homburg
You won't see anything in the assembly for automatic LOCK signal assertions. Which is the whole point, it's automatic, not requested by code... You can use LOCK instructions as well of course if you so desire. – Parquet
The manual also says "The LOCK prefix can be prepended only to the following instructions and only to those forms of the instructions where the destination operand is a memory operand: ADD, ADC, AND, BTC, BTR, BTS, CMPXCHG, CMPXCH8B, DEC, INC, NEG, NOT, OR, SBB, SUB, XOR, XADD, and XCHG" so you should not expect to see a lock mov. – Knopp

It sounds like the atomic operations on memory will be executed directly on memory (RAM).

Nope, as long as every possible observer in the system sees the operation as atomic, the operation can involve cache only. Cache is coherent between cores, and on x86 even DMA is cache-coherent.

Satisfying this requirement is much more difficult for atomic read-modify-write operations (like lock add [mem], eax). On an unaligned address, an atomic RMW might even have to assert the LOCK# signal, which stalls memory ops from other cores. (Otherwise it's still just a cache lock, not replying to MESI requests for that line during the operation; lock-free std::atomic only uses aligned variables.) You still wouldn't see any more than that in the asm: the hardware implements the ISA-required semantics for locked instructions.

Although I doubt that there is a physical external LOCK# pin on modern CPUs, where the memory controller is built into the CPU instead of in a separate northbridge chip.


std::atomic<int> X; X.load() puts only "extra" mfence.

Compilers don't MFENCE for seq_cst loads.

I think I read that old MSVC at one point did emit MFENCE for this (maybe to prevent reordering with unfenced NT stores? Or instead of on stores?). But it doesn't anymore: I tested MSVC 19.00.23026.0. Look for foo and bar in the asm output from this program that dumps its own asm in an online compile&run site.

The reason we don't need a fence here is that the x86 memory model disallows both LoadStore and LoadLoad reordering. Earlier (non seq_cst) stores can still be delayed until after a seq_cst load, so it's different from using a stand-alone std::atomic_thread_fence(mo_seq_cst); before an X.load(mo_acquire);
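
To make that concrete, a rough sketch of what gcc/clang typically emit on x86-64 (exact output depends on compiler version and options):

    #include <atomic>

    std::atomic<int> X;

    int seq_cst_load() {
        return X.load(std::memory_order_seq_cst);             // typically just:  mov eax, [X]
    }

    int fence_then_acquire_load() {
        std::atomic_thread_fence(std::memory_order_seq_cst);  // typically:  mfence
        return X.load(std::memory_order_acquire);             // then:       mov eax, [X]
    }

The stand-alone fence has to block StoreLoad reordering with everything around it, so it costs a real barrier instruction; the seq_cst load on its own doesn't.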

If I understand properly the X.store(2) is just mov [somewhere], 2

That's consistent with your idea that loads needed mfence; one or the other of seq_cst loads or stores needs a full barrier to prevent StoreLoad reordering, which could otherwise happen.

In practice compiler devs picked cheap loads (mov) / expensive stores (mov+mfence) because loads are more common. See the C++11 mappings to processors.

(The x86 memory-ordering model is program order plus a store buffer with store-forwarding (see also). This makes mo_acquire and mo_release free in asm, only need to block compile-time reordering, and lets us choose whether to put the MFENCE full barrier on loads or stores.)

So seq_cst stores are either mov+mfence or xchg. Why does a std::atomic store with sequential consistency use XCHG? discusses the performance advantages of xchg on some CPUs. On AMD, MFENCE is (IIRC) documented to have extra serialize-the-pipeline semantics (for instruction execution, not just memory ordering) that blocks out-of-order exec, and on some Intel CPUs in practice (Skylake) that's also the case.

MSVC's asm for stores is the same as clang's, using xchg to do the store + memory barrier with the same instruction.

Atomic release or relaxed stores can be just mov, with the difference between them being only how much compile-time reordering is allowed.
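
Putting the store side together, roughly (again what gcc/clang typically emit on x86-64, not a guarantee):

    #include <atomic>

    std::atomic<int> Y;

    void store_seq_cst() { Y.store(2, std::memory_order_seq_cst); }  // xchg with [Y]  (or mov + mfence)
    void store_release() { Y.store(2, std::memory_order_release); }  // mov dword ptr [Y], 2
    void store_relaxed() { Y.store(2, std::memory_order_relaxed); }  // mov dword ptr [Y], 2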


This question looks like part 2 of your earlier Memory Model in C++ : sequential consistency and atomicity, where you asked:

How does the CPU implement atomic operations internally?

As you pointed out in the question, atomicity is unrelated to ordering with respect to any other operations. (i.e. memory_order_relaxed). It just means that the operation happens as a single indivisible operation, hence the name, not as multiple parts which can happen partially before and partially after something else.

You get atomicity "for free" with no extra hardware for aligned loads or stores up to the size of the data paths between cores, memory, and I/O busses like PCIe. i.e. between the various levels of cache, and between the caches of separate cores. The memory controllers are part of the CPU in modern designs, so even a PCIe device accessing memory has to go through the CPU's system agent. (This even lets Skylake's eDRAM L4 (not available in any desktop CPUs :( ) work as a memory-side cache (unlike Broadwell, which used it as a victim cache for L3 IIRC), sitting between memory and everything else in the system so it can even cache DMA).

[Skylake system agent diagram, from IDF via ARStechnica]

This means the CPU hardware can do whatever is necessary to make sure a store or load is atomic with respect to anything else in the system which can observe it. This is probably not much, if anything. DDR memory uses a wide enough data bus that a 64bit aligned store really does electrically go over the memory bus to the DRAM all in the same cycle. (fun fact, but not important. A serial bus protocol like PCIe wouldn't stop it from being atomic, as long as a single message is big enough. And since the memory controller is the only thing that can talk to the DRAM directly, it doesn't matter what it does internally, just the size of transfers between it and the rest of the CPU). But anyway, this is the "for free" part: no temporary blocking of other requests is needed to keep an atomic transfer atomic.

x86 guarantees that aligned loads and stores up to 64 bits are atomic, but not wider accesses. Low-power implementations are free to break up vector loads/stores into 64-bit chunks like P6 did from PIII until Pentium M.
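
You can check where that boundary falls for your toolchain from C++ (C++17 for is_always_lock_free; whether the 16-byte case is lock-free is an assumption about your compiler, flags like -mcx16, and library):

    #include <atomic>
    #include <cstdint>

    // Aligned 8-byte objects: lock-free on x86-64.
    static_assert(std::atomic<std::uint64_t>::is_always_lock_free,
                  "8-byte atomics expected to be lock-free");

    // 16 bytes is beyond the baseline ISA guarantee: this may or may not be
    // lock-free, depending on whether the implementation uses cmpxchg16b.
    struct Pair { std::int64_t a, b; };
    std::atomic<Pair> wide;    // wide.is_lock_free() is implementation-dependent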


Atomic ops happen in cache

Remember that atomic just means all observers see it as having happened or not happened, never partially-happened. There's no requirement that it actually reaches main memory right away (or at all, if overwritten soon). Atomically modifying or reading L1 cache is sufficient to ensure that any other core or DMA access will see an aligned store or load happen as a single atomic operation. It's fine if this modification happens long after the store executes (e.g. delayed by out-of-order execution until the store retires).

Modern CPUs like Core2 with 128-bit paths everywhere typically have atomic SSE 128b loads/stores, going beyond what the x86 ISA guarantees. But note the interesting exception on a multi-socket Opteron, probably due to HyperTransport. That's proof that atomically modifying L1 cache isn't sufficient to provide atomicity for stores wider than the narrowest data path (which in this case isn't the path between L1 cache and the execution units).

Alignment is important: A load or store that crosses a cache-line boundary has to be done in two separate accesses. This makes it non-atomic.

x86 guarantees that cached accesses up to 8 bytes are atomic as long as they don't cross an 8B boundary on AMD/Intel. (Or for Intel only on P6 and later, don't cross a cache-line boundary). This implies that whole cache lines (64B on modern CPUs) are transferred around atomically on Intel, even though that's wider than the data paths (32B between L2 and L3 on Haswell/Skylake). This atomicity isn't totally "free" in hardware, and maybe requires some extra logic to prevent a load from reading a cache-line that's only partially transferred. Although cache-line transfers only happen after the old version was invalidated, so a core shouldn't be reading from the old copy while there's a transfer happening. AMD can tear in practice on smaller boundaries, maybe because of using a different extension to MESI that can transfer dirty data between caches.
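
On the C++ side, std::atomic<T> is normally over-aligned so that a lock-free object can't straddle such a boundary; for plain data shared some other way you have to arrange alignment yourself. A sketch (the alignof values are what typical x86-64 ABIs give you):

    #include <atomic>
    #include <cstdint>

    std::atomic<std::int64_t> a;   // alignof == 8: can never cross an 8B boundary
    alignas(64) std::int64_t  b;   // starts at a 64B boundary: can never cross a cache line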

For wider operands, like atomically writing new data into multiple entries of a struct, you need to protect it with a lock which all accesses to it respect. (You may be able to use x86 lock cmpxchg16b with a retry loop to do an atomic 16b store. Note that there's no way to emulate it without a mutex.)
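
A sketch of that retry-loop idea in C++ (whether it really compiles to lock cmpxchg16b rather than a hidden lock depends on the compiler and flags, e.g. -mcx16 for gcc/clang):

    #include <atomic>
    #include <cstdint>

    struct Wide { std::uint64_t lo, hi; };

    // Emulate a plain 16-byte store with a compare-exchange retry loop.
    void wide_store(std::atomic<Wide>& dst, Wide value) {
        Wide expected = dst.load(std::memory_order_relaxed);
        while (!dst.compare_exchange_weak(expected, value,
                                          std::memory_order_release,
                                          std::memory_order_relaxed)) {
            // on failure, expected is reloaded with the current value; retry
        }
    }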


Atomic read-modify-write is where it gets harder

related: my answer on Is incrementing an int effectively atomic in specific cases? goes into more detail about this.

Each core has a private L1 cache which is coherent with all other cores (using the MOESI protocol). Cache-lines are transferred between levels of cache and main memory in chunks ranging in size from 64 bits to 256 bits. (these transfers may actually be atomic on a whole-cache-line granularity?)

To do an atomic RMW, a core can keep a line of L1 cache in Modified state without accepting any external modifications to the affected cache line between the load and the store; the rest of the system will then see the operation as atomic. (And thus it is atomic, because the usual out-of-order execution rules require that the local thread sees its own code as having run in program order.)

It can do this by not processing any cache-coherency messages while the atomic RMW is in-flight (or some more complicated version of this which allows more parallelism for other ops).

Unaligned locked ops are a problem: we need other cores to see modifications to two cache lines happen as a single atomic operation. This may require actually storing to DRAM, and taking a bus lock. (AMD's optimization manual says this is what happens on their CPUs when a cache-lock isn't sufficient.)
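
For a concrete picture of the common aligned case (typical gcc/clang output; the relaxed ordering doesn't change the instruction here because lock is already a full barrier):

    #include <atomic>

    std::atomic<int> counter;

    void bump() {
        counter.fetch_add(1, std::memory_order_relaxed);
        // typically:  lock add dword ptr [counter], 1
        // The core holds the cache line for the whole read-modify-write,
        // so no other observer can see a half-done update.
    }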

Guntar answered 19/7, 2016 at 17:53 Comment(2)
@Gilgamesz: Was this longer / more detailed than you needed, or did it still leave some things out? I had some ideas for rephrasing to say the same thing more clearly and in fewer words, e.g. "atomic just means nothing can observe it as multiple steps. Physically/electrically happening at the same time isn't necessary, but is a convenient way to achieve this." – Guntar
Peter Cordes, it is sufficient. Everything is clear now :). – Bonaventura

The LOCK# signal (a pin of the CPU package/socket) was used on old chips (for LOCK-prefixed atomic operations); now there is a cache lock. And for more complex atomic operations, like .exchange or .fetch_add, you will get an operation with the LOCK prefix, or some other kind of atomic instruction (cmpxchg/8/16?).
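
For example (typical compiler output on x86-64; the exact instructions may differ by compiler):

    #include <atomic>

    std::atomic<int> v;

    int swap_it() { return v.exchange(5); }   // xchg [v], eax   (xchg with memory is implicitly locked)
    int add_it()  { return v.fetch_add(3); }  // lock xadd [v], eax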

Same manual, System Programming Guide part:

In the Pentium 4, Intel Xeon, and P6 family processors, the locking operation is handled with either a cache lock or bus lock. If a memory access is cacheable and affects only a single cache line, a cache lock is invoked and the system bus and the actual memory location in system memory are not locked during the operation

You may check papers and a book by Paul E. McKenney:

* Memory Ordering in Modern Microprocessors, 2007
* Memory Barriers: a Hardware View for Software Hackers, 2010
* perfbook, "Is Parallel Programming Hard, And If So, What Can You Do About It?"

And also:

* Intel 64 Architecture Memory Ordering White Paper, 2007.

A memory barrier is needed on x86/x86_64 to prevent loads from being reordered. From the first paper:

x86 (..AMD64 is compatible with x86..) Since the x86 CPUs provide “process ordering” so that all CPUs agree on the order of a given CPU’s writes to memory, the smp_wmb() primitive is a no-op for the CPU [7]. However, a compiler directive is required to prevent the compiler from performing optimizations that would result in reordering across the smp_wmb() primitive.

On the other hand, x86 CPUs have traditionally given no ordering guarantees for loads, so the smp_mb() and smp_rmb() primitives expand to lock;addl. This atomic instruction acts as a barrier to both loads and stores.
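
The lock;addl mapping can be written as a dummy locked RMW on the stack, for example with GCC-style inline asm (a sketch, assuming a gcc/clang toolchain):

    // A no-op locked add acts as a full barrier (it orders loads and stores,
    // including StoreLoad), which is what smp_mb() expands to in that paper.
    static inline void full_barrier(void) {
        __asm__ __volatile__("lock; addl $0,(%%rsp)" ::: "memory", "cc");
    }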

What a read memory barrier does (from the second paper):

The effect of this is that a read memory barrier orders only loads on the CPU that executes it, so that all loads preceding the read memory barrier will appear to have completed before any load following the read memory barrier.

For example, from "Intel 64 Architecture Memory Ordering White Paper"

Intel 64 memory ordering guarantees that for each of the following memory-access instructions, the constituent memory operation appears to execute as a single memory access regardless of memory type: ... Instructions that read or write a doubleword (4 bytes) whose address is aligned on a 4 byte boundary.

Intel 64 memory ordering obeys the following principles: 1. Loads are not reordered with other loads. ... 5. In a multiprocessor system, memory ordering obeys causality (memory ordering respects transitive visibility). ... Intel 64 memory ordering ensures that loads are seen in program order

Also, definition of mfence: http://www.felixcloutier.com/x86/MFENCE.html

Performs a serializing operation on all load-from-memory and store-to-memory instructions that were issued prior the MFENCE instruction. This serializing operation guarantees that every load and store instruction that precedes the MFENCE instruction in program order becomes globally visible before any load or store instruction that follows the MFENCE instruction.
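
In C++ the usual ways to end up with an mfence are the seq_cst fence or the SSE2 intrinsic (typical codegen; some compilers may substitute a locked RMW instead):

    #include <atomic>
    #include <emmintrin.h>   // _mm_mfence

    void full_barriers() {
        std::atomic_thread_fence(std::memory_order_seq_cst);  // typically: mfence
        _mm_mfence();                                          // intrinsic: mfence
    }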

Homburg answered 19/7, 2016 at 4:46 Comment(6)
gcc/clang don't actually emit any barrier instructions for seq_cst loads. I guess the semantics allow earlier relaxed stores to become globally visible after seq_cst loads? – Guntar
@PeterCordes, it seems that what you wrote is right. It is written in the C++ reference: "Atomic operations tagged memory_order_seq_cst not only order memory the same way as release/acquire ordering (everything that happened-before a store in one thread becomes a visible side effect in the thread that did a load)". So, sequential consistency as defined by C++ just ensures release/acquire semantics. But, as you know, that semantic allows StoreLoad reordering, so it doesn't emit a memory barrier before a load operation. Actually, earlier relaxed stores can be globally visible before the load operation. – Bonaventura
But only earlier stores, because on x86 a load operation is something like a fence. Also, note that the sequential consistency defined by C++ is weaker than that semantic understood from the general point of view. On preshing.com it is said about that semantic: "In a sequentially consistent memory model, there is no memory reordering." preshing.com/20120930/weak-vs-strong-memory-models Am I right? – Bonaventura
@Gilgamesz: Preshing makes no claim about C++'s memory_order_seq_cst being weaker than what's normally meant by sequential consistency. In fact he explicitly says the opposite. What's the rest of the sentence from the C++ reference, which you chopped off before it got to the difference between seq_cst and acq_rel? – Guntar
@PeterCordes, I cannot get it: Preshing said: "In C++11, you can use the default ordering constraint, memory_order_seq_cst, when performing operations on atomic library types. If you do those things, the toolchain will restrict compiler reordering and emit CPU-specific instructions which act as the appropriate memory barrier types". And everything is fine, but why doesn't the compiler emit mfence in the situation from your first comment? – Bonaventura
@Gilgamesz: I wasn't trying to explain that, just correct your incorrect statements. I think I found the answer: seq_cst means acq-load/rel-store (a pure load can't be a release), plus a single total order exists in which all threads observe all modifications in the same order. Apparently you get that without fences before loads, and it's not the same thing as no reordering; just none that matters. See also the C++ rules for The Synchronizes-With Relation. – Guntar
