Memory ordering restrictions on x86 architecture

Asked 10/5, 2012 at 15:54 Answered 12/12, 2019 at 3:44

c++ multithreading architecture c++11 memory-model

In his great book 'C++ Concurrency in Action' Anthony Williams writes the following (page 309):

For example, on x86 and x86-64 architectures, atomic load operations are always the same, whether tagged memory_order_relaxed or memory_order_seq_cst (see section 5.3.3). This means that code written using relaxed memory ordering may work on systems with an x86 architecture, where it would fail on a system with a finer-grained set of memory-ordering instructions such as SPARC.

Do I get this right that on x86 architecture all atomic load operations are memory_order_seq_cst? In addition, the cppreference page on std::memory_order mentions that on x86 release-acquire ordering is automatic.

If this restriction is valid, do the orderings still apply to compiler optimizations?

Apostil answered 10/5, 2012 at 15:54 Comment(1)

"all atomic load operations are memory_order_seq_cst?" is not even wrong, it's a meaningless statement. No operation is, or is not, an ordering. Operations in a program are. – Grouchy 12/12, 2019 at 3:28

Yes, ordering still applies to compiler optimizations.

Also, it is not entirely accurate that on x86 "atomic load operations are always the same".

On x86, all loads done with mov have acquire semantics and all stores done with mov have release semantics. So acq_rel, acq and relaxed loads are simple movs, and similarly acq_rel, rel and relaxed stores (acq stores and rel loads are always equal to relaxed).

This however is not necessarily true for seq_cst: the architecture does not guarantee seq_cst semantics for mov. In fact, the x86 instruction set does not have any specific instruction for sequentially consistent loads and stores. Only atomic read-modify-write operations on x86 will have seq_cst semantics. Hence, you could get seq_cst semantics for loads by doing a fetch_and_add operation (lock xadd instruction) with an argument of 0, and seq_cst semantics for stores by doing a seq_cst exchange operation (xchg instruction) and discarding the previous value.

But you do not need to do both! As long as all seq_cst stores are done with xchg, seq_cst loads can be implemented simply with a mov. Dually, if all loads were done with lock xadd, seq_cst stores could be implemented simply with a mov.

xchg and lock xadd are much slower than mov. Because a program has (usually) more loads than stores, it is convenient to do seq_cst stores with xchg so that the (more frequent) seq_cst loads can simply use a mov. This implementation detail is codified in the x86 Application Binary Interface (ABI). On x86, a compliant compiler must compile seq_cst stores to xchg so that seq_cst loads (which may appear in another translation unit, compiled with a different compiler) can be done with the faster mov instruction.

Thus it is not true in general that seq_cst and acquire loads are done with the same instruction on x86. It is only true because the ABI specifies that seq_cst stores be compiled to an xchg.
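As a minimal sketch of the mapping described above (the variable `flag` and the four wrapper functions are hypothetical names chosen for illustration, and the instructions named in the comments are what mainstream x86-64 compilers typically emit, not something the C++ standard promises):

```cpp
#include <atomic>
#include <cassert>

std::atomic<int> flag{0};  // hypothetical shared variable

// Release store: typically a plain `mov` on x86-64.
void store_release(int v) { flag.store(v, std::memory_order_release); }

// seq_cst store: typically `xchg` (or `mov` followed by `mfence`) on x86-64.
void store_seq_cst(int v) { flag.store(v, std::memory_order_seq_cst); }

// Acquire load and seq_cst load: both typically a plain `mov` on x86-64,
// which is only sound because seq_cst stores pay for the full barrier.
int load_acquire() { return flag.load(std::memory_order_acquire); }
int load_seq_cst() { return flag.load(std::memory_order_seq_cst); }
```

Compiling these functions with optimizations and inspecting the assembly (for example with `-O2 -S`) is a quick way to check what a particular compiler does.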

Taxicab answered 29/8, 2013 at 13:21 Comment(2)

"Only atomic read-modify-write operations on x86 will have seq_cst semantics." No they don't. That's a meaningless claim. Only program execution can be sequential, instruction cannot. – Grouchy 12/12, 2019 at 3:29

@curiousguy: more precisely: only locked read-modify-write operations on x86 will have semantics corresponding to C11 atomic operations with seq_cst ordering. loads will have semantics corresponding to C11 acquire ordering and stores to C11 release ordering. – Taxicab 16/1, 2020 at 18:18

The compiler must of course follow the rules of the language, whatever hardware it runs on.

What he says is that on an x86 you don't have relaxed ordering, so you get a stricter ordering even if you don't ask for it. That also means that such code tested on an x86 might not work properly on a system that does have relaxed ordering.

Cramoisy answered 10/5, 2012 at 16:18 Comment(0)

It is worth keeping in mind that although a relaxed load and a seq_cst load may map to the same instruction on x86, they are not the same. The compiler can freely reorder a relaxed load across memory operations on different memory locations, while it cannot reorder a seq_cst load across other memory operations.
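To illustrate why the ordering argument still matters at the language level, here is a small message-passing sketch. The names `payload`, `ready`, `producer`, `consumer` and `run_message_passing` are made up for this example. With release/acquire the consumer is guaranteed to see the payload. If both orderings were changed to `memory_order_relaxed`, the access to `payload` would become a data race, and the compiler would be allowed to move it across the flag access even though the x86 load instruction itself would look identical.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

int payload = 0;                 // plain, non-atomic data (hypothetical)
std::atomic<bool> ready{false};  // flag that publishes the payload

void producer() {
    payload = 42;
    // Release: the write to payload may not be moved after this store.
    ready.store(true, std::memory_order_release);
}

int consumer() {
    // Acquire: the read of payload may not be moved before this load.
    while (!ready.load(std::memory_order_acquire)) {
        // spin until the flag is published
    }
    return payload;
}

int run_message_passing() {
    payload = 0;
    ready.store(false);
    int seen = -1;
    std::thread t1(producer);
    std::thread t2([&] { seen = consumer(); });
    t1.join();
    t2.join();
    return seen;  // read after join, so no race on `seen`
}
```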

Nicholnichola answered 3/9, 2013 at 5:38 Comment(0)

The sentence from the book is written in a somewhat misleading way. The ordering obtained on an architecture depends not just on how you translate atomic loads, but also on how you translate atomic stores.

The usual way to implement seq_cst on x86 is to flush the store buffer at some point between any seq_cst store and a subsequent seq_cst load from the same thread. The usual way for the compiler to guarantee this is to flush after stores, since there are fewer stores than loads. In this translation, seq_cst loads don't need to flush.

If you program x86 with just plain loads and stores, loads are guaranteed to provide acquire semantics, not seq_cst.
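The difference between acquire and seq_cst shows up in the classic store-buffering litmus test, sketched below under the assumption of two threads and two flags (`both_read_zero` is a hypothetical helper name). With seq_cst on every operation, the outcome where both threads read 0 is forbidden. With only release stores and acquire loads, that outcome is allowed by the language, and x86 hardware can actually produce it because a store may still sit in the store buffer when the following load executes.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Runs the store-buffering test `iterations` times and reports whether
// both threads ever read 0. With seq_cst everywhere this must stay false.
bool both_read_zero(int iterations) {
    for (int i = 0; i < iterations; ++i) {
        std::atomic<int> x{0}, y{0};
        int r1 = -1, r2 = -1;

        std::thread a([&] {
            x.store(1, std::memory_order_seq_cst);
            r1 = y.load(std::memory_order_seq_cst);
        });
        std::thread b([&] {
            y.store(1, std::memory_order_seq_cst);
            r2 = x.load(std::memory_order_seq_cst);
        });
        a.join();
        b.join();

        if (r1 == 0 && r2 == 0) return true;  // forbidden under seq_cst
    }
    return false;
}
```

Replacing the orderings with `memory_order_release` and `memory_order_acquire` makes the `true` result permitted, although any single run may not happen to observe it.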

As for compiler optimization: in C11/C++11, the compiler decides which code movement is allowed based on the semantics of the particular atomics, before considering the underlying hardware. (The hardware might provide stronger ordering, but there's no reason for the compiler to restrict its optimizations because of this.)

Carpometacarpus answered 7/11, 2013 at 13:58 Comment(1)

Note that you only have to flush between a store and a load; several store operations in sequence need no fence between them. If your codegen is for each atomic operation in isolation, you can't accomplish that of course. – Grouchy 12/12, 2019 at 3:46
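One way to express "several stores, then a single flush before the next load" in source code is an explicit `std::atomic_thread_fence`. The sketch below is only an illustration of that idea, not what compilers generate for seq_cst atomics. The names `a`, `b`, `y` and `fenced_both_read_zero` are hypothetical. On x86-64 such a fence typically compiles to `mfence` or a locked instruction.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Store-buffering test where each thread places one seq_cst fence between
// its stores and its load. Both threads reading 0 is still forbidden.
bool fenced_both_read_zero(int iterations) {
    for (int i = 0; i < iterations; ++i) {
        std::atomic<int> a{0}, b{0}, y{0};
        int r1 = -1, r2 = -1;

        std::thread t1([&] {
            a.store(1, std::memory_order_relaxed);
            b.store(1, std::memory_order_relaxed);  // no fence between stores
            std::atomic_thread_fence(std::memory_order_seq_cst);
            r1 = y.load(std::memory_order_relaxed);
        });
        std::thread t2([&] {
            y.store(1, std::memory_order_relaxed);
            std::atomic_thread_fence(std::memory_order_seq_cst);
            r2 = b.load(std::memory_order_relaxed);
        });
        t1.join();
        t2.join();

        if (r1 == 0 && r2 == 0) return true;  // ruled out by the two fences
    }
    return false;
}
```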

Do I get this right that on x86 architecture all atomic load operations are memory_order_seq_cst?

Only executions (of a program, or of some inter-thread-visible operations in a program) can be sequential. A single operation is not in itself sequential.

Asking whether the implementation of a single isolated operation is sequential is a meaningless question.

The translation of all memory operations that need some guarantee must be done following a strategy that enables that guarantee. There can be different strategies that have different compiler complexity costs and runtime costs.

[Just as there are different strategies to implement virtual functions: the only one that is OK (that fits all our expectations of speed, predictability and simplicity) is the use of vtables, so all compilers use vtables, but a virtual function is not defined as going through the vtable.]

In practice, there are no widely different strategies used to implement memory_order_seq_cst operations on a given CPU (that I know of). The differences between compilers are small and do not impede binary compatibility. But differences are possible, and advanced global optimization of multi-threaded programs might open new opportunities for more efficient code generation for atomic operations.

Depending on your compiler, a program that contains only relaxed loads and memory_order_seq_cst modifications of std::atomic<> objects may or may not exhibit only sequential behaviors, even on a strongly ordered CPU.

Grouchy answered 12/12, 2019 at 3:44 Comment(0)
