How do memory_order_seq_cst and memory_order_acq_rel differ?

For both, stores are release operations and loads are acquire operations. I know that memory_order_seq_cst is meant to impose an additional total ordering on all operations, but I'm failing to build an example where that matters, i.e. where the behavior changes if all the memory_order_seq_cst operations are replaced by memory_order_acq_rel.

Am I missing something, or is the difference just a documentation effect, i.e. one should use memory_order_seq_cst if one does not intend to play with a more relaxed model, and use memory_order_acq_rel when constraining the relaxed model?

Photocopier answered 9/9, 2012 at 16:26 Comment(0)

http://en.cppreference.com/w/cpp/atomic/memory_order has a good example at the bottom that only works with memory_order_seq_cst. Essentially memory_order_acq_rel provides read and write orderings relative to the atomic variable, while memory_order_seq_cst provides read and write ordering globally. That is, the sequentially consistent operations are visible in the same order across all threads.

The example boils down to this:

std::atomic<bool> x{false};
std::atomic<bool> y{false};
std::atomic<int> z{0};

void a() { x = true; }                    // seq_cst store (the default)
void b() { y = true; }
void c() { while (!x); if (y) ++z; }      // seq_cst loads (the default)
void d() { while (!y); if (x) ++z; }

// kick off a, b, c, d, join all threads
assert(z != 0);

Operations on z are guarded by two atomic variables, not one, so you can't use acquire-release semantics to enforce that z is always incremented.
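For reference, here is a minimal runnable sketch of the same example (my own expansion of the snippet above; the thread names are arbitrary, and plain operations on std::atomic default to memory_order_seq_cst):

#include <atomic>
#include <cassert>
#include <thread>

std::atomic<bool> x{false}, y{false};
std::atomic<int> z{0};

int main() {
    std::thread ta([] { x.store(true, std::memory_order_seq_cst); });
    std::thread tb([] { y.store(true, std::memory_order_seq_cst); });
    std::thread tc([] {
        while (!x.load(std::memory_order_seq_cst)) { }   // wait until x is set
        if (y.load(std::memory_order_seq_cst)) ++z;
    });
    std::thread td([] {
        while (!y.load(std::memory_order_seq_cst)) { }   // wait until y is set
        if (x.load(std::memory_order_seq_cst)) ++z;
    });
    ta.join(); tb.join(); tc.join(); td.join();
    assert(z.load() != 0);  // guaranteed to hold with seq_cst; allowed to fail if the
                            // stores were release and the loads acquire instead
}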

Nemato answered 9/9, 2012 at 16:45 Comment(15)
I don't understand why x=true;y=true;c();d() isn't possible? That should cause it to be 0. Also I don't know why I get 2 a lot as the result.Nunciature
@acidzombie24, even in that case, z will be 2.Nemato
I messed up, i misread the code. That makes perfect sense nowNunciature
@Nemato I don't understand this example. 1. Don't while(!x) and while(!y) guarantee that either if(y) or if(x) returns true, even if we use per-atomic-variable ordering? 2. How does global ordering help? (This may become clear once I understand 1.) Thanks.Atrium
@CandyChiu With acq_rel, c() can perceive that x=true; in a() happens before y=true; in b() while at the same time d() can perceive that y=true; happens before x=true; (due to the lack of "global ordering".) In particular c() can perceive x==true and y==false at the same time d() can perceive y==true and x==false. So z might not be incremented by either of c() or d(). With seq_cst, if c() perceives x=true; happens before y=true;, so does d().Toxin
@Nemato You meant int z=0, not bool z=0Toxin
@nodakai, Your explanation is accurate but I think the phrase "happens before" can be misleading since the crux of the issue with acquire-release is that neither of the writes happens-before the other.Digitize
This example is using pure loads and pure stores, not any actual RMW operations that could use std::memory_order_acq_rel. In an atomic read-modify-write, the load and store are tied together because they're an atomic. I'm not sure when if ever acq_rel can differ from seq_cst for something like .fetch_add or .compare_exchange_weakLaboratory
@PeterCordes If you only use RMW in a program, then obviously acq + rel => seq_cst.Abolition
@curiousguy: I'm not convinced that's true. seq_cst requires that all threads agree on a total order; hardware that isn't multi-copy-atomic (e.g. POWER) can use a weaker barrier in acq_rel RMWs vs. seq_cst RMWs (godbolt.org/z/C4MuhU lwsync instead of sync before the RMW), possibly leading to some threads disagreeing about the relative order of things. i.e. the IRIW experiment but with RMWs instead of pure stores. I think acq_rel RMWs could still work as observers without making the delay between 1st and 2nd load so long that you can't see the effect. Or failed-CAS is just a load :PLaboratory
@Nemato does sequential consistency force threads to start in that specific order (that is a, b, c, d)?Emu
@Emu Sequential consistency means that all the operations that have that guarantee (seq_cst ops) are done as if in some global order, as if the operations by different threads were interleaved. On a single-CPU system, asm-level operations are always like that.Abolition
Why can't I trigger the assert(z!=0) on my Intel x86 platform?Apomict
@Abolition Are you saying a seq_cst operation guarantees that the compiler and CPU cannot reorder any store or load across it, regardless of whether those stores and loads are related to or depend on the seq_cst operation? And that an acq_rel operation can synchronize with any acquire operation on the same memory address, but nothing more?Blintze
@JayZhuang See bartoszmilewski.com/2008/11/05/… . On Intel x86: "In a multiprocessor system, memory ordering obeys causality (memory ordering respects transitive visibility)." In other words, on Intel it is not possible that the cores running threads c() and d() see a different order of the writes.Banger

On ISAs like x86 where atomics map to barriers, and the actual machine model includes a store buffer:

  • seq_cst stores require flushing the store buffer so this thread's later reads are delayed until after the store is globally visible. (At least before the next seq_cst load, not necessarily before non-atomic or non-SC atomic loads or stores.)

    See below re: AArch64's hardware support for this; most other ISAs do just wait for the store buffer to drain after SC stores and RMWs. An alternate mapping from C++ to asm would be to drain the store buffer before each seq_cst load instead of after stores, but we want cheap loads even more than cheap stores. (Multiple cores can be loading the same object in parallel; stores necessarily cause contention to get exclusive ownership of the cache line.)

  • acquire or release do not have to flush the store buffer. Normal x86 loads and stores have essentially acq and rel semantics, making it free in x86 asm. (Each core's accesses to coherent cache happen in program order, except for a store buffer with store forwarding.)

    But x86 atomic RMW operations always get "promoted" to seq_cst because the x86 asm lock prefix is a full memory barrier. Other ISAs can do relaxed or acq_rel RMWs in asm, with the store side being able to do limited reordering with later stores or loads on other objects. (But not in ways that would make the RMW appear non-atomic: For purposes of ordering, is atomic read-modify-write one operation or two?)
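To illustrate that RMW point, a sketch of my own (not from the original answer): on x86-64, mainstream compilers typically emit the same lock-prefixed instruction for an atomic increment whatever memory order you ask for, because the lock prefix already acts as a full barrier.

#include <atomic>

std::atomic<int> counter{0};

// On x86-64, GCC/Clang typically compile all three of these to the same
// lock add (or lock xadd if the result is used); the lock prefix is
// already a full memory barrier, so the weaker orders buy nothing here.
void add_relaxed() { counter.fetch_add(1, std::memory_order_relaxed); }
void add_acq_rel() { counter.fetch_add(1, std::memory_order_acq_rel); }
void add_seq_cst() { counter.fetch_add(1, std::memory_order_seq_cst); }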


https://preshing.com/20120515/memory-reordering-caught-in-the-act is an instructive example of the difference between a seq_cst store and a plain release store. (It's actually mov + mfence vs. plain mov in x86 asm. In practice xchg is a more efficient way to do a seq_cst store on most x86 CPUs, but GCC does use mov+mfence)
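In C++ source, the difference that example exercises is just the store's memory order; a small sketch (the asm notes in the comments describe typical codegen and are an assumption, exact instruction selection depends on the compiler and version):

#include <atomic>

std::atomic<int> g{0};

void store_seq_cst(int v) { g.store(v, std::memory_order_seq_cst); } // x86-64: mov + mfence (GCC) or xchg (Clang)
void store_release(int v) { g.store(v, std::memory_order_release); } // x86-64: plain mov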


Fun fact: AArch64's LDAR acquire-load instruction is actually a sequential-acquire, having a special interaction with STLR. Not until ARMv8.3 LDAPR can arm64 do plain acquire operations that can reorder with earlier release and seq_cst stores (STLR). (seq_cst loads still use LDAR because they need that interaction with STLR to recover sequential consistency; seq_cst and release stores both use STLR).
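The same kind of sketch for AArch64 (my own illustration; the mapping in the comments follows the description above, and actual codegen depends on the compiler and -march level):

#include <atomic>

std::atomic<int> g{0};

void store_release(int v) { g.store(v, std::memory_order_release); }    // stlr
void store_seq_cst(int v) { g.store(v, std::memory_order_seq_cst); }    // also stlr
int  load_acquire()       { return g.load(std::memory_order_acquire); } // ldar (ldapr with ARMv8.3+)
int  load_seq_cst()       { return g.load(std::memory_order_seq_cst); } // ldar, for the interaction with stlr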

With STLR / LDAR you get sequential consistency, but only having to drain the store buffer before the next LDAR, not right away after each seq_cst store before other operations. I think real AArch64 HW does implement it this way, rather than simply draining the store buffer before committing an STLR.

Strengthening rel or acq_rel to seq_cst by using LDAR / STLR doesn't need to be expensive, unless you seq_cst store something, and then seq_cst load something else. Then it's just as bad as x86.

Some other ISAs (like PowerPC) have more choices of barriers and can strengthen up to mo_rel or mo_acq_rel more cheaply than mo_seq_cst, but their seq_cst can't be as cheap as on AArch64; seq_cst stores need a full barrier.

So AArch64 is an exception to the rule that seq_cst stores drain the store buffer on the spot, either with a special instruction or a barrier instruction after. It's not a coincidence that ARMv8 was designed after C++11 / Java / etc. basically settled on seq_cst being the default for lockless atomic operations, so making them efficient was important. And after CPU architects had a few years to think about alternatives to providing barrier instructions or just acquire/release vs. relaxed load/store instructions.

Laboratory answered 21/9, 2019 at 20:39 Comment(18)
"But x86 atomic RMW operations always get promoted to seq_cst because the x86 asm lock prefix is a full memory barrier." What makes you say they are "promoted"? Also the exec could well speculatively load the value (normally) and do the computation as long as it reloads it safely (locked load) later; if the computation is fast that's probably uninteresting but still possible. (I suppose that these things are documented in a purely descriptive way by Intel for existing designs and not for future ones.)Abolition
@curiousguy: the full-memory-barrier nature of the x86 lock prefix is carefully documented by Intel and AMD in their x86 ISA manuals. (Does lock xchg have the same behavior as mfence?). It's definitely guaranteed for future x86 CPUs; how else could compilers make safe future-proof asm? This is what I mean by compilers having to strengthen all RMW operations to seq_cst in the asm, draining the store buffer before the RMW does its thing.Laboratory
What is guaranteed exactly? That the CPU will not try to get the value already loaded and the computation ready in memory in advance, so speed up a costly RMW, says xdiv (or xcos if the FPU decides to support RMW)?Abolition
@curiousguy: what? There are no "expensive" atomic RMW instructions. Not even integer multiply. Only cheap basic ALU stuff like lock add and lock or, and bitstring instructions like lock bts (bit test-and-set). And special instructions intended for use with lock, like cmpxchg and xadd. But they're still all easy for the ALU to implement the computation part in a single cycle.Laboratory
@curiousguy: But anyway, if a hypothetical implementation wanted to try loading early to set up for a cheaper atomic exchange to actually implement the RMW, it could only do that speculatively and roll back on mis-speculation (if the line changed before the load was architecturally allowed). Regular loads already work this way, to get performance while preserving strong load ordering. (See the machine_clears.memory_ordering performance counter: Why flush the pipeline for Memory Order Violation caused by other logical processors?)Laboratory
@curiousguy: See also What are the latency and throughput costs of producer-consumer sharing of a memory location between hyper-siblings versus non-hyper siblings?, especially comments on the top answer exploring the machine_clears.memory_ordering explanation for the observed performance.Laboratory
"it could only do that speculatively and roll back (...). Regular loads already work this way" So it would only have to do what it already does on regular operations in order to get acq and rel semantics. So it's doable in principle and guarantees acq + rel. And if in another universe Intel has the same CPU but w/ support of xdiv, they can start doing the (potentially useful) optimization w/o breaking code. The only reason why these locked instr don't do it is because a trivial operation is done after the load.Abolition
The point being: xdiv can be implemented like that because it's the normal acq + rel semantics that are automatically enforced and it would even be easy to translate it to uops. And there is no "RMW operations always get promoted to seq_cst" because we just have done an atomic mutation w/ acq + rel as usual. Also the concept of a CPU instr being C/C++ seq_cst isn't a concept at all. It's nonsense: seq_cst doesn't qualify any operation in particular unlike separately compiled acq and rel.Abolition
@PeterCordes - I don't even think it's hypothetical: I think that's how atomic operations are (sometimes) implemented on current Intel x86. That is, that they load the cache line in an optimistic locked state, do the "front end" of the RMW (including the ALU op), and then in the "back end" of the RMW they verify everything was OK in the execute-at-retire op that ensures all the ordering. This works great when the location is not contended. If this fails a lot, a predictor will switch modes to doing the whole thing at retire, which causes a bigger bubble in the pipeline (hence "sometimes").Deuno
Not that this contradicts anything you are saying: this is purely an optimization of the implementation, and the documented semantics are (mostly) maintained. I say "mostly" because I think as part of this overall locked-insn optimization process, they broke some ordering stuff like the weakly ordered loads thing, and while they reverted the mfence behavior at a cost in performance (so it now fences them), they didn't for locked. It's not clear if they'll basically just change the rules so that reordering is allowed or what...Deuno
@BeeOnRope: Oh interesting, I hadn't realized they could be that efficient in the uncontended case.Laboratory
@curiousguy: When we say that an asm sequence provides seq_cst, we mean the sequence used in asm to implement C++ mo_relaxed CAS is the same sequence used to implement mo_seq_cst CAS. So if you do that everywhere, you get sequentially-consistent execution. We also mean that if programming by hand in asm, using full barriers that way is strong enough to recover sequential consistency on top of a weaker machine model. C++ has a handy abbreviation for that so we use it. But anyway, yes, speculative execution allows tricks.Laboratory
@PeterCordes Note that mutexes are only acq-rel.Abolition
So a standalone atomic_thread_fence(memory_order_acq_rel) on x86 doesn't prevent StoreLoad reordering?Filiform
@ilstam: Interesting question, I wasn't sure so I checked: godbolt.org/z/mRCZme GCC compiles it to no asm, so that's correct; it doesn't include a StoreLoad barrier. Same for PowerPC where it compiles to lwsync (light weight sync, not full sync).Laboratory
"Other ISAs can do relaxed or acq_rel RMWs in asm, with the store side being able to do limited reordering with later stores" - shouldn't it rather be "with later loads"? Because the only ordering not enforced by acq_rel is StoreLoad.Articulate
@DanielNitzan: An RMW is actually a load and a store; no operation to the same location can come between them (giving RMW atomicity), but ops on other locations can actually overlap. The store half has release semantics, and in some cases actually can reorder with a later store to another location. (Or with a later load, but the Q&A I linked shows an experiment using two stores.)Laboratory
@DanielNitzan: x86 doesn't have this effect; its RMWs are full memory barriers, so the load+store remain fully stuck together in the globally-visible order of the core's operations. (This is somewhat required by the LoadLoad and StoreStore ordering guarantees. If acq_rel RMWs existed on x86, the store side could reorder with later loads but not stores. But that's due to later stores being release ops themselves, the fact that x86 in general only allows StoreLoad reordering. Same for earlier stores but not loads reordering with the load side of a hypothetical acq_rel RMW on x86.)Laboratory

Try to build Dekker's or Peterson's algorithm with just acquire/release semantics.

That won't work, because acquire/release semantics don't provide a [StoreLoad] fence.

In the case of Dekker's algorithm:

flag[self] = 1;                  // <-- STORE
while (true) {
    if (flag[other] == 0) {      // <--- LOAD
        break;
    }
    flag[self] = 0;
    while (turn == other) { }    // back off until it's our turn
    flag[self] = 1;
}

Without a [StoreLoad] fence the store could jump in front of the load, and then the algorithm would break: two threads could simultaneously see that the other's lock is free, set their own lock, and continue. And now you have two threads inside the critical section.
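A minimal sketch of that failure mode (my own illustration, the classic store-buffer litmus test, not code from this answer): with release stores and acquire loads, both threads are allowed to read 0, which corresponds to both seeing the other's flag as clear and entering the critical section; making all four operations seq_cst forbids that outcome.

#include <atomic>
#include <thread>

std::atomic<int> flag0{0}, flag1{0};
int r0, r1;

void t0() { flag0.store(1, std::memory_order_release); r0 = flag1.load(std::memory_order_acquire); }
void t1() { flag1.store(1, std::memory_order_release); r1 = flag0.load(std::memory_order_acquire); }

int main() {
    std::thread a(t0), b(t1);
    a.join(); b.join();
    // With acquire/release, r0 == 0 && r1 == 0 is a permitted outcome.
    // With memory_order_seq_cst on all four operations, it is forbidden.
}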

Gory answered 19/5, 2020 at 14:28 Comment(2)
"Without [StoreLoad] fence the store could jump in front of the load" In your code, you've marked a STORE that is already in front of a LOAD. Did you mean to say that the LOAD might be executed before the STORE?Interpose
You are right. The load can jump in front of the store.Gory

Let's still use the definition and example from memory_order, but replace memory_order_seq_cst with memory_order_release in the stores and memory_order_acquire in the loads.

Release-Acquire ordering guarantees that everything that happened before a store in one thread becomes a visible side effect in the thread that performs the matching load. But in our example, nothing happens before the stores in either thread0 or thread1.

x.store(true, std::memory_order_release); // thread0

y.store(true, std::memory_order_release); // thread1

Furthermore, without memory_order_seq_cst, the sequential ordering of the loads in thread2 and thread3 is not guaranteed. You can imagine they become:

if (y.load(std::memory_order_acquire)) { ++z; } // thread2, load y first
while (!x.load(std::memory_order_acquire)); // and then, load x

if (x.load(std::memory_order_acquire)) { ++z; } // thread3, load x first
while (!y.load(std::memory_order_acquire)); // and then, load y

So, if the loads in thread2 and thread3 run before the stores in thread0 and thread1, both x and y are still seen as false, ++z is never reached, z stays 0, and the assert fires.

However, if memory_order_seq_cst enters the picture, it establishes a single total order over all atomic operations that are so tagged. Thus, in thread2 the order x.load then y.load, and in thread3 the order y.load then x.load, are guaranteed.

Broadminded answered 8/5, 2018 at 10:4 Comment(0)
