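Yes, the memory ordering still constrains compiler optimizations: even when two orderings compile to the same instruction, the compiler may move surrounding memory accesses across a relaxed access in ways it may not across an acquire or release one.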
Also, it is not entirely accurate to say that on x86 "atomic load operations are always the same".
On x86, all loads done with mov have acquire semantics and all stores done with mov have release semantics. So relaxed and acquire loads are plain movs, and so are relaxed and release stores. (Acquire or acq_rel on a pure store, and release or acq_rel on a pure load, would add nothing beyond these one-sided semantics; C++ does not even allow those combinations.)
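A minimal sketch of why release stores and acquire loads need nothing more than a plain mov on x86 is the classic message-passing pattern in C++. The function name receive_published_value is hypothetical, and the note that both flag accesses compile to mov describes typical compiler output, not something the code itself checks.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Message passing with release/acquire (a sketch; the function name is hypothetical).
// On x86 both the flag store and the flag load typically compile to a plain mov:
// ordinary x86 stores already have release semantics and ordinary loads acquire semantics.
int receive_published_value() {
    int payload = 0;                 // plain, non-atomic data
    std::atomic<bool> ready{false};  // flag that publishes the data
    int seen = -1;

    std::thread producer([&] {
        payload = 42;                                  // write the data first
        ready.store(true, std::memory_order_release);  // then publish it
    });
    std::thread consumer([&] {
        // Spin until the flag is set. The acquire load synchronizes-with the
        // release store, so the write to payload is guaranteed to be visible.
        while (!ready.load(std::memory_order_acquire)) {
        }
        seen = payload;
    });

    producer.join();
    consumer.join();
    assert(seen == 42 && "acquire load must observe the released write");
    return seen;
}
```

Calling receive_published_value() always returns 42; no fence instruction is needed on x86 for this guarantee.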
A plain mov is not enough for seq_cst, however: the architecture does not give a mov seq_cst semantics, because a store can wait in the store buffer while a later load from a different address has already been performed (store-load reordering). In fact, the x86 instruction set has no dedicated instruction for sequentially consistent loads and stores. Apart from an explicit fence such as mfence, only locked read-modify-write operations, which act as full memory barriers, have seq_cst semantics on x86. Hence, you could get seq_cst semantics for loads by doing a fetch_and_add operation (a lock xadd instruction) with an argument of 0, and seq_cst semantics for stores by doing a seq_cst exchange operation (an xchg instruction, which is implicitly locked when it has a memory operand) and discarding the previous value.
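As a sketch in C++, here is how seq_cst loads and stores can be built purely from read-modify-write operations. The helpers rmw_load, rmw_store and roundtrip are hypothetical names for illustration; this is not how std::atomic implements load and store. The instruction notes are typical, not guaranteed: some compilers lower an idempotent RMW such as fetch_add(0) to a fence plus a mov rather than lock xadd.

```cpp
#include <atomic>
#include <cassert>

// A seq_cst "load" done as fetch_add(0): adds nothing and returns the old value.
// Conceptually a lock xadd on x86 (hypothetical helper).
int rmw_load(std::atomic<int>& a) {
    return a.fetch_add(0, std::memory_order_seq_cst);
}

// A seq_cst "store" done as an exchange whose previous value is discarded.
// Typically an xchg on x86 (hypothetical helper).
void rmw_store(std::atomic<int>& a, int value) {
    (void)a.exchange(value, std::memory_order_seq_cst);
}

// Usage: round-trip a value through the RMW-based store and load.
int roundtrip(int value) {
    std::atomic<int> cell{0};
    rmw_store(cell, value);
    int got = rmw_load(cell);
    assert(got == value);          // fetch_add(0) leaves the value unchanged
    assert(cell.load() == value);  // so a normal load still sees it
    return got;
}
```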
But you do not need to do both! As long as all seq_cst stores are done with xchg, seq_cst loads can be implemented with a plain mov. Dually, if all seq_cst loads were done with lock xadd, seq_cst stores could be implemented with a plain mov.
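To see why a barrier on the store side alone is enough, consider the store-buffering litmus test, sketched below in C++ (count_forbidden_outcomes is a hypothetical name). Each thread writes its own flag and then reads the other's. Because every seq_cst store carries a full barrier, neither thread's load can be performed before its own preceding store is globally visible, so both loads can never return 0, even though the loads themselves are plain movs.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Store-buffering (Dekker-style) litmus test, a sketch. With seq_cst stores
// (xchg or an equivalent full barrier on x86) and seq_cst loads (plain mov),
// the outcome r1 == 0 && r2 == 0 is forbidden by the C++ memory model.
int count_forbidden_outcomes(int iterations) {
    int forbidden = 0;
    for (int i = 0; i < iterations; ++i) {
        std::atomic<int> x{0}, y{0};
        int r1 = -1, r2 = -1;
        std::thread t1([&] {
            x.store(1, std::memory_order_seq_cst);  // barrier-carrying store
            r1 = y.load(std::memory_order_seq_cst); // plain load
        });
        std::thread t2([&] {
            y.store(1, std::memory_order_seq_cst);
            r2 = x.load(std::memory_order_seq_cst);
        });
        t1.join();
        t2.join();
        assert(r1 != -1 && r2 != -1);  // both threads ran their load
        if (r1 == 0 && r2 == 0) {
            ++forbidden;               // never happens under seq_cst
        }
    }
    return forbidden;
}
```

Weakening the stores to memory_order_release and the loads to memory_order_acquire would permit r1 == 0 && r2 == 0, although a loop that starts fresh threads on every iteration will rarely observe it in practice.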
xchg and lock xadd are much slower than mov. Because a program usually performs more loads than stores, it is convenient to do seq_cst stores with xchg so that the more frequent seq_cst loads can use a plain mov. This implementation detail is codified in the x86 Application Binary Interface (ABI): a compliant x86 compiler must compile seq_cst stores to xchg (or an equivalent full-barrier sequence such as mov followed by mfence) so that seq_cst loads, which may appear in another translation unit compiled with a different compiler, can use the faster mov instruction.
Thus it is not true in general that seq_cst and acquire loads use the same instruction on x86. It is only true because the ABI requires seq_cst stores to be compiled to an xchg (or an equivalent full barrier).
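The resulting mapping can be summarized per ordering in a short C++ sketch. The instruction notes in the comments are assumptions about typical output of mainstream x86-64 compilers (such as GCC and Clang) with optimization enabled, not a guarantee; exact sequences vary by compiler and version. The function names are hypothetical.

```cpp
#include <atomic>
#include <cassert>

// Typical x86-64 code generation per ordering, noted in comments.
std::atomic<int> cell{0};

int load_relaxed() { return cell.load(std::memory_order_relaxed); }  // mov
int load_acquire() { return cell.load(std::memory_order_acquire); }  // mov
int load_seq_cst() { return cell.load(std::memory_order_seq_cst); }  // mov (relies on the store convention)

void store_relaxed(int v) { cell.store(v, std::memory_order_relaxed); }  // mov
void store_release(int v) { cell.store(v, std::memory_order_release); }  // mov
void store_seq_cst(int v) { cell.store(v, std::memory_order_seq_cst); }  // xchg, or mov + mfence

// Single-threaded sanity check: every ordering reads back what was written.
bool all_orderings_roundtrip() {
    store_relaxed(1);
    assert(load_relaxed() == 1);
    store_release(2);
    assert(load_acquire() == 2);
    store_seq_cst(3);
    assert(load_seq_cst() == 3);
    return true;
}
```

Only the seq_cst store differs from the others; the seq_cst load can share the acquire load's mov precisely because of that asymmetry.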