If you "relax" some of the ordering requirements of `seq_cst`, the next step down is `mo_acq_rel` (and pure acquire and pure release). Even more relaxed than that is `mo_relaxed`: no ordering wrt. anything else, just atomicity.¹
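As a concrete illustration (a minimal sketch with hypothetical names, not code from any particular project), this is where each of those orders typically fits:

```c++
#include <atomic>

std::atomic<int> counter{0};
std::atomic<bool> ready{false};
int payload;                     // plain non-atomic data

void writer() {
    payload = 42;                                     // non-atomic store
    counter.fetch_add(1, std::memory_order_relaxed);  // atomicity only, no ordering
    ready.store(true, std::memory_order_release);     // orders the payload store before it
}

void reader() {
    if (ready.load(std::memory_order_acquire)) {      // pairs with the release store
        int x = payload;                              // guaranteed to see 42
        (void)x;
    }
}
```

Neither thread needs `seq_cst` here: release/acquire pairing on `ready` is enough to publish `payload`, and the relaxed counter increment promises nothing beyond not tearing or losing counts.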
When compiling for most ISAs, a `seq_cst` load can use the same asm as an acquire load; the usual choice is to make stores expensive, not loads. The C/C++11 mappings to processors (covering ISAs including x86, POWER, ARMv7, and ARMv8) list two alternative strategies for some ISAs. To be compatible with each other, all compilers for the same platform have to pick the same strategy; otherwise a `seq_cst` store compiled by one could reorder with a `seq_cst` load compiled by the other.
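For x86, the two alternatives amount to putting the barrier on the store side or the load side. A hedged sketch (the asm in the comments is the standard mapping; exact instruction selection varies by compiler and version):

```c++
#include <atomic>

std::atomic<int> x{0};

// Strategy A (the usual choice, cheap loads):
//   store: mov [x], v ; mfence    (or a single xchg, which is implicitly locked)
//   load:  mov eax, [x]           (plain load, same asm as acquire)
//
// Strategy B (the alternative, cheap stores):
//   store: mov [x], v             (plain store, same asm as release)
//   load:  mfence ; mov eax, [x]
//
// Each strategy is correct on its own, but mixing them breaks seq_cst:
// a strategy-B store followed by a strategy-A load has no barrier
// anywhere between them, so the pair can reorder.

void seq_cst_store(int v) { x.store(v, std::memory_order_seq_cst); }
int  seq_cst_load()       { return x.load(std::memory_order_seq_cst); }
```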
On a typical CPU whose memory model includes a store buffer and coherent cache, if you store and then reload the same variable in the same thread, `seq_cst` requires that the reload not happen until after the store is globally visible to all threads. This means either a full barrier (including StoreLoad) after `seq_cst` stores, or one before `seq_cst` loads. Since cheap loads are more valuable than cheap stores, the usual mapping makes the store expensive, e.g. x86 `mov` + `mfence`. (The same applies to loading any other location: that can't happen until the store commits either. That's what Jeff Preshing's *Memory Reordering Caught in the Act* is about.)
This is a practical example of creating a global total order of operations on different variables that all threads can agree on. (x86 asm provides acquire for pure loads and release for pure stores; only `lock`-prefixed atomic RMW instructions are seq_cst. So Preshing's x86 asm example corresponds exactly to C++11 `mo_release` stores, not `mo_seq_cst` stores.)
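A related consequence of that x86 property, as a hedged sketch: because an atomic RMW is `lock`-prefixed and a `lock`-prefixed instruction is a full barrier, the `memory_order` argument typically doesn't change the generated code for RMWs at all (true of current mainstream compilers, though not something the standard mandates about codegen):

```c++
#include <atomic>

std::atomic<int> n{0};

int bump() {
    // On x86 this compiles to a lock-prefixed RMW (e.g. lock xadd), which is
    // already a full barrier; even memory_order_relaxed would get the same asm.
    return n.fetch_add(1, std::memory_order_seq_cst);
}
```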
ARMv8 / AArch64 is interesting: it has STLR (sequential-release store) and LDAR (acquire load). Instead of stalling all later loads until the store buffer drains and commits an STLR to L1d cache (global visibility), an implementation can be more efficient: the wait only has to happen before an LDAR executes; other loads can execute, and even later stores can commit to L1d. (A sequential-release store is still at minimum a one-way barrier.) To be this efficient / weak, LDAR has to probe the store buffer to check for STLR stores. But if you can do that, `mo_seq_cst` stores can be significantly cheaper than on x86, as long as you don't do a `seq_cst` load of anything else right away afterwards.
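A sketch of what that looks like in practice (the STLR / LDAR mnemonics in the comments follow the standard C++11-to-AArch64 mapping; whether later work actually overlaps with the store is a microarchitectural detail, not a guarantee):

```c++
#include <atomic>

std::atomic<int> a{0}, b{0};

void aarch64_example() {
    a.store(1, std::memory_order_seq_cst);      // compiles to STLR
    // ...independent work here: plain loads and stores
    // don't have to wait for the STLR to become visible...
    int r = b.load(std::memory_order_seq_cst);  // compiles to LDAR
    // Only this LDAR has to wait for (or probe the store buffer for)
    // the earlier STLR; no separate full-barrier instruction is needed.
    (void)r;
}
```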
On most other ISAs, the only option for recovering sequential consistency is a full barrier instruction after the store. That blocks all later loads and stores until all previous stores have committed to L1d cache. But that's stronger than what ISO C++ `seq_cst` actually implies or requires; AArch64 just happens to be the only mainstream ISA whose instructions are exactly as strong as ISO C++ requires and no stronger.
(Compiling for many other weakly-ordered ISAs also has to promote acquire / release to something significantly stronger than needed; e.g. ARMv7 needs a full barrier for a release store.)
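For comparison with the AArch64 sketch above, here is the ARMv7 case (the `dmb ish` + `str` sequence in the comment follows the standard C/C++11 mapping for that ISA):

```c++
#include <atomic>

std::atomic<int> flag{0};

void release_store() {
    flag.store(1, std::memory_order_release);
    // ARMv7 has no release-store instruction, so this compiles to:
    //   dmb ish        ; full barrier, stronger than release requires
    //   str r1, [flag]
    // whereas AArch64 can use a single stlr.
}
```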
Footnote 1: Like what you get in old pre-C++11 code that rolled its own atomics with `volatile` and no barriers.