What is the (slight) difference in the relaxed atomic rules?

After seeing Herb Sutter's excellent talk about "atomic weapons", I got a bit confused about the Relaxed Atomics examples.

I took away from it that an atomic in the C++ Memory Model (SC-DRF = Sequentially Consistent for Data Race Free) does an "acquire" on a load/read.

I understand that for a load [and a store] the default is std::memory_order_seq_cst and therefore the two are the same:

myatomic.load();                          // (1)
myatomic.load(std::memory_order_seq_cst); // (2)

So far so good, no Relaxed Atomics involved (and after hearing the talk I will never use the relaxed ones. Ever. Promise. But when someone asks me, I might have to explain...).

But why is it the "relaxed" semantics when I use

myatomic.load(std::memory_order_acquire);   // (3)

Since load is acquiring and not releasing, why is this different from (1) and (2)? What actually is relaxed here?

The only thing I can think of is that I misunderstood that load means acquire. And if that is true, and the default seq_cst means both, doesn't that mean a full fence -- nothing can pass that instruction, neither up nor down? I must have misunderstood that part.

[and symmetrically for store and release].

Proportionate asked 9/6, 2013 at 21:29 Comment(1)
"a full fence -- nothing can pass up that instruction, nor down?" There is no such thing as a "fence" that guarantees that all operations (even local thread computations) appearing in program code before an atomic operation are done in the binary code in exactly that order. You would need lots of volatile to ensure that.Coagulant

It can be a bit confusing to call myatomic.load(std::memory_order_acquire); a "relaxed atomic" load, since there is a std::memory_order_relaxed. Some people describe any order weaker than seq_cst as "relaxed".

You're right to note that a sequentially-consistent load is an acquire load, but it has an additional requirement: a sequentially-consistent load is also a part of the total global order for all seq_cst operations.

It comes into play when you're dealing with more than one atomic variable: the individual modification orders of two atomics may appear in different relative orders to different threads, unless sequential consistency is imposed.
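
A minimal sketch of that situation, the classic IRIW ("independent reads of independent writes") litmus test (my own illustration, not from the answer; all names are invented): two writers each set one flag, and two readers read both flags in opposite orders. With seq_cst everywhere, both readers must agree on which store happened first, so r1==1 && r2==0 && r3==1 && r4==0 is impossible; with only acquire loads and release stores, that outcome is allowed on some hardware.

#include <atomic>
#include <thread>

std::atomic<int> x{0}, y{0};   // two independent atomic flags
int r1, r2, r3, r4;            // what the two readers observed

void writer_x() { x.store(1, std::memory_order_seq_cst); }
void writer_y() { y.store(1, std::memory_order_seq_cst); }

void reader_xy() {             // reads x first, then y
    r1 = x.load(std::memory_order_seq_cst);
    r2 = y.load(std::memory_order_seq_cst);
}

void reader_yx() {             // reads y first, then x
    r3 = y.load(std::memory_order_seq_cst);
    r4 = x.load(std::memory_order_seq_cst);
}

int main() {
    std::thread a(writer_x), b(writer_y), c(reader_xy), d(reader_yx);
    a.join(); b.join(); c.join(); d.join();
    // seq_cst: r1==1 && r2==0 && r3==1 && r4==0 can never be observed,
    // because that would mean the readers disagree on the order of the two stores.
}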

Flyblow answered 9/6, 2013 at 21:52 Comment(2)
Oh, right! "sequentially-consistent load is also a part of the total global order for all seq_cst operations" -- the guarantee that all the critical loads [stores] do not "cross" each other, in addition to being "crossed" by other operations. I remember it now. I called them "Relaxed Atomics" because Herb categorizes them all this way. He was scornfully joking about std::memory_order_relaxed because "it can go all over the place"...Proportionate
@Proportionate Actually the C++ standard doesn't describe "crossing".Coagulant

If you "relax" some ordering requirements of seq_cst, there's mo_acq_rel (and pure acquire and pure release).

Even more relaxed than that is mo_relaxed; no ordering wrt. anything else, just atomicity¹.
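
As a minimal sketch of "just atomicity" (my own illustration, not from the answer; the counter name is invented): two threads bump a shared counter with relaxed increments. The final value is exact, because each fetch_add is atomic, but the increments impose no ordering on any surrounding loads or stores.

#include <atomic>
#include <thread>

std::atomic<long> hits{0};   // shared counter

void worker() {
    for (int i = 0; i < 100000; ++i)
        hits.fetch_add(1, std::memory_order_relaxed);  // atomic, but orders nothing else
}

int main() {
    std::thread t1(worker), t2(worker);
    t1.join(); t2.join();
    // hits is exactly 200000 here: atomicity alone rules out torn or lost increments,
    // but says nothing about when other threads see surrounding non-atomic stores.
}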

When compiling for most ISAs, a seq_cst load can use the same asm as acquire loads; we choose to make stores expensive, not loads. The C/C++11 mappings to processors for ISAs including x86, POWER, ARMv7, and ARMv8 include two alternatives for some ISAs. To be compatible with each other, compilers for the same platform have to pick the same strategy; otherwise a seq_cst store in one function could reorder with a seq_cst load in another function.

On a typical CPU where the memory model includes a store buffer and coherent cache, if you store and then reload in the same thread, seq_cst requires that you don't let the reload happen until after the store is globally visible to all threads. This means either a full barrier (including StoreLoad) after seq_cst stores or before seq_cst loads. Since cheap loads are more valuable than cheap stores, the usual mapping picks x86 mov + mfence for stores, for example. (The same applies to loading any other location: that load can't happen until the store commits. That's what Jeff Preshing's Memory Reordering Caught in the Act is about.)

This is a practical example of creating a global total order of operations on different variables that all threads can agree on. (x86 asm provides acquire for pure-load / release for pure-store, or seq_cst for lock-prefixed atomic RMW instructions. So Preshing's x86 asm example corresponds exactly to C++11 mo_release stores instead of mo_seq_cst.)
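
For concreteness, a sketch of the store-buffer litmus test that article demonstrates (my own illustration; the thread and variable names are invented): each thread stores to its own flag and then loads the other one. Compiled with plain x86 mov (i.e. release stores / acquire loads), r1 == 0 && r2 == 0 is a possible outcome, because each load can execute before the other thread's store has drained from its store buffer; with seq_cst the compiler emits the extra StoreLoad barrier and that outcome is forbidden.

#include <atomic>
#include <thread>

std::atomic<int> x{0}, y{0};
int r1, r2;

void thread1() {
    x.store(1, std::memory_order_seq_cst);  // with mo_release here instead, r1==r2==0 is allowed
    r1 = y.load(std::memory_order_seq_cst);
}

void thread2() {
    y.store(1, std::memory_order_seq_cst);
    r2 = x.load(std::memory_order_seq_cst);
}

int main() {
    std::thread a(thread1), b(thread2);
    a.join(); b.join();
    // With seq_cst on all four operations, at least one of r1, r2 must be 1.
}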


ARMv8 / AArch64 is interesting: it has STLR (sequential-release store) and LDAR (acquire load). Instead of stalling all later loads until the store buffer drains and commits an STLR to L1d cache (global visibility), an implementation can be more efficient.

Waiting for the flush only has to happen before an LDAR executes; other loads can execute, and even later stores can commit to L1d. (A sequential-release store is still at minimum a one-way barrier.) To be this efficient / weak, LDAR has to probe the store buffer to check for STLR stores. But if you can do that, mo_seq_cst stores can be significantly cheaper than on x86, as long as you don't do a seq_cst load of anything else right away afterwards.

On most other ISAs, the only option to recover sequential consistency is a full barrier instruction (after a store). This blocks all later loads and stores from happening until after all previous stores commit to L1d cache. But that's not what ISO C++ seq_cst implies or requires; it's just that only AArch64 has the capability to be as strong as ISO C++ requires but no stronger.

(Compiling for many other weakly-ordered ISAs needs to promote acq / release to significantly stronger than needed, e.g. ARMv7 needs a full barrier for release stores.)


Footnote 1: (Like what you get in old pre-C++11 code that rolls its own atomics using volatile without any barriers.)

Ivelisseivens answered 3/12, 2019 at 3:16 Comment(7)
"But the store-forwarding letting you see your own stores before they become globally visible is what's happening in Jeff Preshing's Memory Reordering Caught in the Act" -- my understanding is that this example demonstrates StoreLoad reordering, caused by the store buffer, and not total store (re)ordering caused by local store-to-load forwarding.Muoimuon
@DanielNitzan: Oh yes, you're right, Preshing's memory reordering is store X / load Y or vice versa, no store-forwarding involved. Thanks, fixed.Ivelisseivens
"Like what you get in old pre-C++11 code using roll-your-own atomics using volatile without any barriers" -- volatile has stronger guarantees than memory_order_relaxed atomics; I'd say it's the only way you could achieve compiler-level visibility between threads pre-C++11 (bar compiler barriers), but it was overkill nevertheless.Muoimuon
@DanielNitzan: You mean that compile-time reordering of volatile isn't allowed? There are minor differences, esp. if we only talk about x86 where run-time reordering isn't allowed except for StoreLoad. But in general there's no real ordering guarantee you can count on. Also note that current compilers never optimize atomics even if you don't use volatile atomic<>, so it's actually like volatile in that sense. And yes, volatile is essential for hand-rolled atomics before C++11, or like the Linux kernel still does (lwn.net/Articles/793253), without using builtins like __atomic_load_nIvelisseivens
Compile-time reordering of volatile is allowed of course; I meant that in theory optimizers are allowed to e.g. optimize away back-to-back stores to the same atomic var (but as you've mentioned, they don't do it in practice).Muoimuon
@DanielNitzan: That's true for atomics, but not for volatile. Every volatile access is considered an observable side-effect that has to happen, and happen in source order. That's the whole point of volatile, and why it can work as a poor-man's atomic for stores and loads. And also why it works for MMIO accesses.Ivelisseivens
You're right, volatiles can't be reordered wrt one anotherMuoimuon

And if that is true, and the default seq_cst means both, doesn't that mean a full fence

It absolutely does not mean both, or a "full fence", whatever that is.

seq_cst implies

  • acquire only on load operations
  • and release only on store operations.

So it implies both only on the operations that combine both: the RMW atomic operations.
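
For instance (a sketch of my own, not from the answer; the lock/unlock names are invented): in a trivial spinlock, the exchange in lock() is an RMW, so seq_cst (or acq_rel) gives it both the acquire and the release half, while the pure store in unlock() only gets release semantics from seq_cst.

#include <atomic>

std::atomic<bool> locked{false};

void lock() {
    // RMW: the read half acts as acquire, the write half as release
    while (locked.exchange(true, std::memory_order_seq_cst)) {
        // spin until the previous value was false
    }
}

void unlock() {
    // pure store: seq_cst gives it release semantics only
    locked.store(false, std::memory_order_seq_cst);
}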

Sequential consistency also means that these operations are globally ordered, that is: all operations marked seq_cst in the whole program run in some single sequential order, an order compatible with the sequencing (program order) of operations within each thread. It says nothing about the order of other atomic operations with respect to these "sequential" operations.

The intent of a seq_cst operation on an atomic object is not to provide a "fence" that would make all other memory operations sequential.

Coagulant answered 2/12, 2019 at 5:42 Comment(0)
