Atomic operations, std::atomic<> and ordering of writes

GCC compiles this:

#include <atomic>
std::atomic<int> a; 
int b(0);

void func()
{
  b = 2; 
  a = 1;
}

to this:

func():
    mov DWORD PTR b[rip], 2
    mov DWORD PTR a[rip], 1
    mfence
    ret

So, to clarify things for me:

  • Is any other thread reading ‘a’ as 1 guaranteed to read ‘b’ as 2?
  • Why does the MFENCE happen after the write to ‘a’, not before?
  • Is the write to ‘a’ guaranteed to be an atomic (in the narrow, non-C++ sense) operation anyway, and does that apply to all Intel processors? I assume so from this output code.

Also, clang (v3.5.1, -O3) does this:

mov     dword ptr [rip + b], 2
mov     eax, 1
xchg    dword ptr [rip + a], eax
ret

That appears more straightforward to my little mind. Why the different approach, and what's the advantage of each?

Underbid asked 3/9, 2015 at 20:30. Comments (8):
Answer to your mfence question: x86.renejeschke.de/html/file_module_x86_id_170.html – Malatya
I was just poking around in the Intel manuals myself. I'll go and have a look at the C++ specification's explanation of what std::atomic is actually promising as well, and see if I can reconcile everything in my mind... – Underbid
Ah - AMD clearly states: "The rules governing write ordering are restrictive: • Generally, out-of-order writes are not allowed. Write instructions executed out-of-order cannot commit (write) their result to memory until all previous instructions have completed in program order. The processor can, however, hold the result of an out-of-order write instruction in a private buffer (not visible to software) until that result can be committed to memory." – Underbid
And likewise, Intel: "Writes to memory are not reordered with other writes [followed by a list of exceptions]". So I think that answers my first two bullet points. – Underbid
I believe the default memory ordering in C++11 is sequential consistency, which is more than the release consistency your first question hints at. The memory fence is only required for sequential consistency. – Orthohydrogen
Ref: sequential consistency - yes. And I've just checked with the compiler output as well: the MFENCE isn't required for write-release consistency at all. That's guaranteed by the processor. It's all starting to make sense. – Underbid
@EOF: Yes, the default model with C++11 atomics is sequential consistency. The default model for other operations is as weak as DEC Alpha. – Elytron
@PeterCordes: Yeah, my comment was supposed to be read in the context of the question. It's only a comment anyway, and now there's your nice and comprehensive answer, so I don't think my brevity is causing any harm. – Orthohydrogen

I put your example on the Godbolt compiler explorer, and added some functions to read, increment, or combine (a+=b) two atomic variables. I also used a.store(1, memory_order_release); instead of a = 1; to avoid getting more ordering than needed, so it's just a simple store on x86.

See below for (hopefully correct) explanations. Update: I had "release" semantics confused with just a StoreStore barrier. I think I fixed all the mistakes, but may have left some.


The easy question first:

Is the write to ‘a’ guaranteed to be atomic?

Yes, any thread reading a will get either the old or the new value, never a half-written value. This happens for free on x86 and most other architectures for any aligned type that fits in a register (e.g. not int64_t on a 32-bit machine). Thus, on many systems it happens to be true for b as well, the way most compilers generate code.

Some kinds of store are not atomic even on x86, notably unaligned stores that cross a cache-line boundary. But std::atomic of course guarantees whatever alignment is necessary.
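
You can also just ask std::atomic whether a given specialization is lock-free, i.e. implemented with plain atomic loads/stores rather than a hidden lock. A quick (untested) sketch:

#include <atomic>
#include <cstdint>
#include <cstdio>

int main()
{
  std::atomic<int32_t> a32;
  std::atomic<int64_t> a64;
  // On a 64-bit build both are normally lock-free. On a 32-bit x86 build,
  // atomic<int64_t> may still be lock-free (via cmpxchg8b), but a *plain*
  // int64_t store compiled from non-atomic code can tear.
  std::printf("atomic<int32_t> lock-free: %d\n", (int)a32.is_lock_free());
  std::printf("atomic<int64_t> lock-free: %d\n", (int)a64.is_lock_free());
}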

Read-modify-write operations are where this gets interesting. 1000 evaluations of a += 3 done across multiple threads at once will always increase a by exactly 3000. You'd potentially get less if a weren't atomic, because one thread's increment could overwrite another's.
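
A minimal demonstration of that claim (my own little harness, with made-up counts: 4 threads doing 250 increments each, so the atomic total is always exactly 3000):

#include <atomic>
#include <thread>
#include <vector>
#include <cstdio>

std::atomic<int> a(0);
int plain = 0;  // for contrast: a data race, increments can be lost

void adder()
{
  for (int i = 0; i < 250; ++i) {
    a += 3;      // atomic RMW: no increment is ever lost
    plain += 3;  // undefined behaviour (race); in practice updates get lost
  }
}

int main()
{
  std::vector<std::thread> threads;
  for (int i = 0; i < 4; ++i)
    threads.emplace_back(adder);
  for (auto &t : threads)
    t.join();
  std::printf("a = %d (always 3000), plain = %d (3000 or less)\n",
              a.load(), plain);
}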

Fun fact: signed atomic types guarantee two's complement wraparound, unlike normal signed types. C and C++ still cling to the idea of leaving signed integer overflow undefined in other cases. Some CPUs don't have arithmetic right shift, so leaving right-shift of negative numbers undefined makes some sense, but otherwise it just feels like a ridiculous hoop to jump through now that all CPUs use two's complement and 8-bit bytes. </rant>
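
To make the wraparound guarantee concrete (my example, not from the question):

#include <atomic>
#include <climits>
#include <cstdio>

int main()
{
  std::atomic<int> a(INT_MAX);
  a.fetch_add(1);            // well-defined: wraps to INT_MIN
  std::printf("%d\n", a.load());

  // int x = INT_MAX;
  // x += 1;                 // plain signed overflow: undefined behaviour
}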


Is any other thread reading ‘a’ as 1 guaranteed to read ‘b’ as 2?

Yes, because of the guarantees provided by std::atomic.

Now we're getting into the memory model of the language, and the hardware it runs on.

C11 and C++11 have a very weak memory ordering model, which means the compiler is allowed to reorder memory operations unless you tell it not to. (source: Jeff Preshing's Weak vs. Strong Memory Models). Even if x86 is your target machine, you have to stop the compiler from re-ordering stores at compile time. (e.g. normally you'd want the compiler to hoist a = 1 out of a loop that also writes to b.)

Using C++11 atomic types gives you full sequential-consistency ordering of operations on them with respect to the rest of the program, by default. This means they're a lot more than just atomic. See below for relaxing the ordering to just what's needed, which avoids expensive fence operations.
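
To make the default visible in source (my sketch; the asm comments describe gcc's x86 strategy, matching the output in the question):

#include <atomic>

std::atomic<int> a;
int b;

void func_seq_cst()
{
  b = 2;
  a = 1;  // same as a.store(1, std::memory_order_seq_cst):
          // gcc emits mov + mfence (clang uses xchg instead)
}

void func_release()
{
  b = 2;
  a.store(1, std::memory_order_release);  // just a plain mov on x86
}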


Why does the MFENCE happen after the write to ‘a’, not before?

StoreStore fences are a no-op with x86's strong memory model, so the compiler just has to put the store to b before the store to a to implement the source code ordering.

Full sequential consistency also requires that the store be globally ordered / globally visible before any later loads in program order.

x86 can reorder stores with later loads (StoreLoad reordering). In practice, out-of-order execution sees an independent load in the instruction stream and executes it ahead of a store that was still waiting for its data to be ready. Sequential consistency forbids this, so gcc uses MFENCE, which is a full barrier, including StoreLoad (the only kind x86 doesn't give you for free; LFENCE/SFENCE are only useful for weakly-ordered operations like movnt).

Another way to put it, the way the C++ docs do: sequential consistency guarantees that all threads see all changes in the same order. The MFENCE after every atomic store guarantees that this thread's later loads see other threads' stores first. Otherwise, our loads could see our own stores before other threads' loads saw them. A StoreLoad barrier (MFENCE) delays our loads until after the stores that need to happen first.
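
This is the classic store-buffer litmus test (my sketch, variable names made up). With seq_cst, the outcome r1 == 0 && r2 == 0 is impossible; without the StoreLoad barrier, it isn't:

#include <atomic>

std::atomic<int> x(0), y(0);
int r1, r2;

void thread1()
{
  x.store(1);     // seq_cst store: mov + mfence on x86 (gcc)
  r1 = y.load();  // can't be reordered before the store
}

void thread2()
{
  y.store(1);
  r2 = x.load();
}

// Run thread1 and thread2 concurrently from x = y = 0: with seq_cst, at
// least one of r1, r2 must end up 1. With plain x86 stores and no MFENCE,
// the store buffer can let each load complete before the other thread's
// store becomes visible, so both threads could read 0.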

The ARM32 asm for b=2; a=1; is:

# get pointers and constants into registers
str r1, [r3]     # store b=2
dmb sy           # Data Memory Barrier: full memory barrier to order the stores.
   #  I think just a StoreStore barrier here (dmb st) would be sufficient, but gcc doesn't do that.  Maybe later versions have that optimization, or maybe I'm wrong.
str r2, [r3, #4] # store a=1  (a is 4 bytes after b)
dmb sy           # full memory barrier to order this store wrt. all following loads and stores.

I don't know ARM asm, but what I've figured out so far is that normally it's op dest, src1 [,src2], but loads and stores always have the register operand first and the memory operand 2nd. This is really weird if you're used to x86, where a memory operand can be the source or dest for most non-vector instructions. Loading immediate constants also takes a lot of instructions, because the fixed instruction length only leaves room for 16b of payload for movw (move word) / movt (move top).


Release / Acquire

The release and acquire naming for one-way memory barriers comes from locks:

  • One thread modifies a shared data structure, then releases a lock. The unlock has to be globally visible after all the loads/stores to data it's protecting. (StoreStore + LoadStore)
  • Another thread acquires the lock (a load, or an RMW with an acquire load), and must do all its loads/stores to the shared data structure after the acquire becomes globally visible. (LoadLoad + LoadStore)

Note that std::atomic uses these names even for standalone fences, which are slightly different from load-acquire or store-release operations. (See atomic_thread_fence, below.)

Release/Acquire semantics are stronger than what producer-consumer requires. That just requires one-way StoreStore (producer) and one-way LoadLoad (consumer), without LoadStore ordering.

A shared hash table protected by a readers/writers lock (for example) requires an atomic read-modify-write with acquire semantics to take the lock, and a release store to drop it. x86 lock xadd is a full barrier (including StoreLoad), but ARM64 has load-acquire/store-release versions of load-linked/store-conditional for doing atomic read-modify-writes. As I understand it, this avoids the need for a StoreLoad barrier even for locking.
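
As a concrete illustration, a minimal spinlock uses exactly the one-way barriers described above (my untested sketch; std::atomic_flag would be the more idiomatic building block):

#include <atomic>

class SpinLock {
  std::atomic<bool> locked{false};
public:
  void lock()
  {
    // acquire: loads/stores in the critical section can't move above this
    while (locked.exchange(true, std::memory_order_acquire)) {
      // spin; a real implementation would add a pause/yield hint
    }
  }
  void unlock()
  {
    // release: loads/stores in the critical section can't move below this
    locked.store(false, std::memory_order_release);
  }
};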


Using weaker but still sufficient ordering

Writes to std::atomic types are ordered with respect to every other memory access in source code (both loads and stores), by default. You can control what ordering is imposed with std::memory_order.

In your case, you only need your producer to make sure stores become globally visible in the correct order, i.e. a StoreStore barrier before the store to a. store(memory_order_release) includes this and more. std::atomic_thread_fence(memory_order_release) is just a 1-way StoreStore barrier for all stores. x86 does StoreStore for free, so all the compiler has to do is put the stores in source order.

Release instead of seq_cst will be a big performance win, esp. on architectures like x86 where release is cheap/free. This is even more true if the no-contention case is common.

Reading atomic variables also imposes full sequential consistency of the load with respect to all other loads and stores. On x86, this is free. LoadLoad and LoadStore barriers are no-ops and implicit in every memory op. You can make your code more efficient on weakly-ordered ISAs by using a.load(std::memory_order_acquire).
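
Putting those pieces together for the flag-plus-payload pattern in your question (my sketch; names are made up):

#include <atomic>

int payload;               // the "b" from the question: plain, non-atomic data
std::atomic<int> flag(0);  // the "a": signals that the payload is ready

void producer()
{
  payload = 42;
  // release: all earlier stores are visible before flag reads as 1
  flag.store(1, std::memory_order_release);
}

int consumer()
{
  // acquire: once we see flag == 1, we're guaranteed to see payload == 42
  while (flag.load(std::memory_order_acquire) != 1) {
    // spin
  }
  return payload;
}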

Note that the std::atomic standalone fence functions confusingly reuse the "acquire" and "release" names for StoreStore and LoadLoad fences that order all stores (or all loads) in at least the desired direction. In practice, they will usually emit HW instructions that are 2-way StoreStore or LoadLoad barriers. This doc is the proposal for what became the current standard. You can see how memory_order_release maps to a #LoadStore | #StoreStore barrier on SPARC RMO, which I assume was included partly because it has all the barrier types separately. (hmm, the cppref web page only mentions ordering stores, not the LoadStore component. It's not the C++ standard, though, so maybe the full standard says more.)
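
A sketch of the standalone-fence version of the same producer/consumer pattern (my example; the relaxed-op-plus-fence pairing is what the standard specifies for fence synchronization):

#include <atomic>

int data1, data2;
std::atomic<int> ready(0);

void producer_with_fences()
{
  data1 = 1;
  data2 = 2;
  // orders ALL earlier stores before all later stores, not just one variable
  std::atomic_thread_fence(std::memory_order_release);
  ready.store(1, std::memory_order_relaxed);
}

int consumer_with_fences()
{
  while (ready.load(std::memory_order_relaxed) != 1) { /* spin */ }
  // orders ALL later loads after all earlier loads
  std::atomic_thread_fence(std::memory_order_acquire);
  return data1 + data2;  // guaranteed to see 1 and 2
}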


memory_order_consume isn't strong enough for this use-case. This post talks about your case of using a flag to indicate that other data is ready, and talks about memory_order_consume.

consume would be enough if your flag were a pointer to b, or even a pointer to a struct or array. However, no compiler knows how to do the dependency tracking to make sure it puts things in the proper order in the asm, so current implementations always treat consume as acquire. This is too bad, because every architecture except DEC Alpha (and C++11's software model) provides this ordering for free. According to Linus Torvalds, only a few Alpha hardware implementations actually could have this kind of reordering, so the expensive barrier instructions needed all over the place were pure downside for most Alphas.

The producer still needs to use release semantics (a StoreStore barrier), to make sure the new payload is visible when the pointer is updated.

It's not a bad idea to write code using consume, if you're sure you understand the implications and don't depend on anything that consume doesn't guarantee. In the future, once compilers are smarter, your code will compile without barrier instructions even on ARM/PPC. The actual data movement still has to happen between caches on different CPUs, but on weak memory model machines, you can avoid waiting for any unrelated writes to be visible (e.g. scratch buffers in the producer).
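
For reference, here is what the pointer-payload version looks like (my sketch; as noted above, current compilers will treat the consume load as acquire anyway):

#include <atomic>

struct Payload { int x, y; };
std::atomic<Payload*> ptr(nullptr);

void producer()
{
  Payload* p = new Payload{1, 2};
  ptr.store(p, std::memory_order_release);  // the producer still needs release
}

Payload* consumer()
{
  Payload* p;
  // the load carries a data dependency into the dereference, which is all
  // consume needs; every ISA except Alpha orders this for free
  while (!(p = ptr.load(std::memory_order_consume))) { /* spin */ }
  return p;  // p->x and p->y are guaranteed to be 1 and 2
}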

Just keep in mind that you can't actually test memory_order_consume code experimentally, because current compilers are giving you stronger ordering than the code requests.

It's really hard to test any of this experimentally anyway, because it's timing-sensitive. Also, unless the compiler re-orders operations (because you failed to tell it not to), producer-consumer threads will never have a problem on x86. You'd need to test on an ARM or PowerPC or something to even try to look for ordering problems happening in practice.



Elytron answered 4/9, 2015 at 9:20. Comments (4):
@preshing: I linked to a bunch of your blog posts for this answer. I found them really helpful; preshing.com/20120710/… especially was great for me, since I already knew some stuff but was fuzzy on terminology and the differences between the x86 memory model, the ARM/PPC memory model, and the C++11 memory model. – Elytron
@JCx: I'd been picking up bits of stuff about memory models and so on, but hadn't looked into how C++11 did things. I'd wanted to, and your question got me started digging into things and putting together the pieces of what I'd already picked up. – Elytron
Re: unaligned stores, what do you make of "8.1.1 Guaranteed Atomic Operations" in the Intel Systems Programming manual? It says 'fits in a cache line', but I believe cache lines are themselves aligned. Do they mean 'does not cross a cache line boundary', or are cache lines more flexible than I'm thinking? – Underbid
@JCx: I'm sure they mean "doesn't cross a cache line boundary". Cache lines are indeed aligned. All recent Intel and AMD x86 CPUs use 64B cache lines; Intel switched sometime around P4, and earlier Intel CPUs up to early P6 (PII and PIII) used 32B lines. As long as your data is aligned to a multiple of its size, you can load/store it atomically. Atomic RMW is available up to 16B (in 64-bit mode) with CMPXCHG16B. Apparently SSE/AVX stores aren't guaranteed to be atomic, even when aligned. :/ This makes sense: Sandybridge does 256b stores in two 128b cycles, and Pentium M split 128b ops. – Elytron
