C11 Standalone memory barriers LoadLoad StoreStore LoadStore StoreLoad

I want to use standalone memory barriers between atomic and non-atomic operations (I think it shouldn't matter at all anyway). I think I understand what a store barrier and a load barrier mean, and also the 4 types of possible memory reorderings: LoadLoad, StoreStore, LoadStore, StoreLoad.

However, I always find the acquire/release concepts confusing. When reading the documentation, acquire doesn't speak only about loads but also about stores, and release doesn't speak only about stores but also about loads. On the other hand, plain load barriers only give you guarantees on loads, and plain store barriers only give you guarantees on stores.

My question is the following. In C11/C++11 is it safe to consider a standalone atomic_thread_fence(memory_order_acquire) as a load barrier (preventing LoadLoad reorderings) and an atomic_thread_fence(memory_order_release) as a store barrier (preventing StoreStore reorderings)?

And if the above is correct what can I use to prevent LoadStore and StoreLoad reorderings?

Of course I am interested in portability, and I don't care what the above produces on a specific platform.

Parabasis answered 10/5, 2020 at 10:55 Comment(6)
Don't forget that atomic_thread_fence is only defined WRT atomic operations. – Search
@Search But atomic_thread_fence serves as a standalone barrier. I don't see how atomicity of types and atomic operations plays any role here. A memory barrier just gives you ordering guarantees for everything before and/or after the barrier. – Parabasis
What guarantees does it give if you don't use atomics? – Search
@Search Ordering guarantees? That certain operations complete before others? – Parabasis
@Search Even the documentation says "ordering of non-atomic and relaxed atomic accesses": en.cppreference.com/w/c/atomic/atomic_thread_fence – Parabasis
Just show me some code that usefully uses a fence that doesn't involve an atomic. – Search

No. An acquire fence after a relaxed load can turn that combination into an acquire load (though less efficiently on some ISAs than just using an acquire load), so it has to block LoadStore as well as LoadLoad.

See https://preshing.com/20120913/acquire-and-release-semantics/ for a couple of very helpful diagrams of the orderings, showing that, and showing that release stores need to make sure all previous loads and stores are "visible", and thus need to block StoreStore and LoadStore (reorderings where the store part is second). See especially the diagram there summarizing it: acquire semantics prevent LoadLoad and LoadStore reordering; release semantics prevent LoadStore and StoreStore reordering.

Also https://preshing.com/20130922/acquire-and-release-fences/

https://preshing.com/20131125/acquire-and-release-fences-dont-work-the-way-youd-expect/ explains the two-way nature of acquire and release fences vs. the one-way nature of an acquire or release operation like a load or store. Apparently some people had misconceptions about what atomic_thread_fence() guarantees, thinking it was too weak.

And just for completeness, remember that these ordering rules have to be enforced by the compiler against compile-time reordering, not just runtime.
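To make the compile-time side concrete, here is a minimal sketch (the variable names are mine, not from the thread): std::atomic_signal_fence is a compiler-only barrier, so it blocks compile-time reordering without emitting any barrier instruction.

```cpp
#include <atomic>

int a = 0, b = 0;

// atomic_signal_fence blocks compile-time reordering of memory
// accesses across it, but emits no hardware barrier instruction.
// (Its intended use is ordering relative to a signal handler
// running on the same thread.)
void compile_time_order() {
    a = 1;
    std::atomic_signal_fence(std::memory_order_seq_cst);
    b = 2;  // the compiler may not move this store above the fence
}
```

A full atomic_thread_fence implies the same compile-time restriction plus whatever hardware barrier the target needs.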

It may mostly work to think of barriers acting on C++ loads / stores in the C++ abstract machine, regardless of how that's implemented in asm. But there are corner cases like PowerPC where that mental model doesn't cover everything (IRIW reordering, see below).

I do recommend trying to think in terms of acquire and release operations ensuring visibility of other operations to each other, and definitely don't write code that just uses relaxed ops and separate barriers. That can be safe but is often less efficient.


Everything about ISO C/C++ memory / inter-thread ordering is officially defined in terms of an acquire load seeing the value from a release store, and thus creating a "synchronizes with" relationship, not about fences to control local reordering.
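That pairing looks like this in practice (a minimal sketch, with names of my own invention): a release store publishes data, and an acquire load that reads the stored value synchronizes-with it, making the plain non-atomic write visible.

```cpp
#include <atomic>
#include <thread>

std::atomic<bool> ready{false};
int payload = 0;  // non-atomic; protected by the synchronizes-with edge

void producer() {
    payload = 42;                                  // plain store
    ready.store(true, std::memory_order_release);  // release store
}

int consumer() {
    // Spin until the acquire load sees the release store's value.
    // That read creates a synchronizes-with relationship, so the
    // write to payload happens-before the return below.
    while (!ready.load(std::memory_order_acquire)) {}
    return payload;  // guaranteed to be 42, with no data race
}

int run_demo() {
    std::thread t(producer);
    int seen = consumer();
    t.join();
    return seen;
}
```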

std::atomic does not explicitly guarantee the existence of a coherent shared-memory state where all threads see a change at the same time. In the mental model you're using, with local reordering when reading/writing a single shared state, IRIW reordering can happen when one thread makes its stores visible to some other threads before they become globally visible to all threads (as can happen in practice on some SMT PowerPC CPUs).

In practice all C/C++ implementations run threads across cores that do have a cache-coherent view of shared memory, so the mental model in terms of reads/writes to coherent shared memory, with barriers to control local reordering, works. But keep in mind that the C++ standard doesn't talk about reordering, just about whether any order is guaranteed in the first place.
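The IRIW (Independent Reads of Independent Writes) litmus test looks like this as C++ (a sketch with my own names; a single run can't demonstrate the reordering, it only shows the shape):

```cpp
#include <atomic>
#include <thread>

std::atomic<int> x{0}, y{0};
int r1, r2, r3, r4;

// Two independent writers; two readers load both variables in
// opposite orders. With only acquire loads, ISO C++ (and real
// POWER hardware) permits r1==1, r2==0, r3==1, r4==0: the readers
// disagree about which store happened first. With seq_cst
// everywhere, as below, that outcome is forbidden: all threads
// agree on a single total order of the stores.
void writer_x() { x.store(1, std::memory_order_seq_cst); }
void writer_y() { y.store(1, std::memory_order_seq_cst); }
void reader_xy() {
    r1 = x.load(std::memory_order_seq_cst);
    r2 = y.load(std::memory_order_seq_cst);
}
void reader_yx() {
    r3 = y.load(std::memory_order_seq_cst);
    r4 = x.load(std::memory_order_seq_cst);
}

bool iriw_forbidden_outcome() {
    std::thread a(writer_x), b(writer_y), c(reader_xy), d(reader_yx);
    a.join(); b.join(); c.join(); d.join();
    return r1 == 1 && r2 == 0 && r3 == 1 && r4 == 0;
}
```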


For another in-depth look at the divide between how C++ describes memory models, vs. how asm memory models for real architectures are described, see also How to achieve a StoreLoad barrier in C++11? (including my answer there). Also Does atomic_thread_fence(memory_order_seq_cst) have the semantics of a full memory barrier? is related.

fence(seq_cst) includes StoreLoad (if that concept even applies to a given C++ implementation). I think reasoning in terms of local barriers and then transforming that to C++ mostly works, but remember that it doesn't model the possibility of IRIW reordering which C++ allows, and which happens in real life on some POWER hardware.
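Concretely, the store-buffer (Dekker-style) litmus test is where StoreLoad ordering matters. A sketch with my own names, assuming fence(seq_cst) compiles to a full barrier on the target:

```cpp
#include <atomic>
#include <thread>

std::atomic<int> X{0}, Y{0};
int rA, rB;

// Each thread stores to one variable, then loads the other.
// With only relaxed operations, both threads could read 0
// (StoreLoad reordering: each load completes before the other
// thread's store is visible). A seq_cst fence between the store
// and the load forbids the rA == 0 && rB == 0 outcome.
void thread_a() {
    X.store(1, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);
    rA = Y.load(std::memory_order_relaxed);
}
void thread_b() {
    Y.store(1, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);
    rB = X.load(std::memory_order_relaxed);
}

bool both_loads_zero() {
    std::thread a(thread_a), b(thread_b);
    a.join(); b.join();
    return rA == 0 && rB == 0;  // forbidden with the fences
}
```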

Also keep in mind that var.load(acquire) can be much more efficient than var.load(relaxed); fence(acquire); on some ISAs, notably ARMv8.

e.g. this example on Godbolt, compiled for ARMv8 (32-bit mode, hence the r registers) by GCC 8.2 with -O2 -mcpu=cortex-a53:

#include <atomic>
int bad_acquire_load(std::atomic<int> &var){
    int ret = var.load(std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_acquire);
    return ret;
}

bad_acquire_load(std::atomic<int>&):
        ldr     r0, [r0]          // plain load
        dmb     ish               // FULL BARRIER
        bx      lr

int normal_acquire_load(std::atomic<int> &var){
    int ret = var.load(std::memory_order_acquire);
    return ret;
}

normal_acquire_load(std::atomic<int>&):
        lda     r0, [r0]            // acquire load
        bx      lr
Tatar answered 10/5, 2020 at 11:18 Comment(9)
So, to recap: in practice, is it safe to assume that an acquire fence will always act at least as a load barrier, and a release fence at least as a store barrier? And is there anything else available in modern C/C++ that will act only as a LoadLoad or StoreStore barrier? – Parabasis
@ilstam: Yes, at least when compiling for an ISA like x86 or ARM so the entire discussion can have any meaning, an acquire fence must always include a LoadLoad barrier. And no, there's nothing in portable ISO C++ that's only LoadLoad without LoadStore. Most CPUs don't have such a barrier anyway; e.g. PowerPC provides lwsync which is all 3 cheap ones (not StoreLoad). – Tatar
For some given implementation you might be able to use inline asm. e.g. x86 sfence is StoreStore only, and affects NT stores as well as plain stores. (Plain stores are strongly ordered, so sfence is useless outside of NT stores.) ARM dmb may actually have ways to do only LoadLoad without LoadStore (which might be why compilers have to use a full barrier dmb ish even for atomic_thread_fence(mo_acquire)). godbolt.org/z/jAwEKX – Tatar
And finally, would atomic_thread_fence(memory_order_seq_cst) prevent StoreLoad in a portable way? Or does this not make sense in the C/C++ standard's parlance? – Parabasis
@ilstam: Yes, fence(seq_cst) includes StoreLoad when compiling for normal ISAs with a memory model that works in terms of local fences. ISO C++ doesn't state it that way, but in practice that's what happens on the ISAs where the whole model of reordering of access to shared memory applies. How to achieve a StoreLoad barrier in C++11? has some good answers about this, and so does Does atomic_thread_fence(memory_order_seq_cst) have the semantics of a full memory barrier? – Tatar
Ok, thank you. Maybe you could incorporate this in your answer as well, since it was kind of included in the question, so I can accept it as a (super) complete answer. – Parabasis
@ilstam: ok, yeah, that comment is worth incorporating for the benefit of future readers. Done. – Tatar
Could you also please link to something that supports the claim that using relaxed ops and separate barriers can be less efficient? I would be interested to know more about it. And it makes me wonder why the API even offers the standalone barriers then. – Parabasis
@ilstam: added an example on Godbolt where the bad way to do an acquire load fails to use lda, and in fact has a full barrier when compiled for ARM32. The API includes barriers because they're useful sometimes, e.g. to build a SeqLock (Does this envelope implementation correctly use C++11 atomics?), or to make sure a load happens after something else, where that "something else" is not itself an acquire load. An acquire load only makes sure that it happens before later stuff. – Tatar