C11 Standalone memory barriers LoadLoad StoreStore LoadStore StoreLoad

I want to use standalone memory barriers between atomic and non-atomic operations (I think it shouldn't matter at all anyway). I think I understand what a store barrier and a load barrier mean, and also the 4 types of possible memory reorderings: LoadLoad, StoreStore, LoadStore, StoreLoad.

However, I always find the acquire/release concepts confusing. When reading the documentation, acquire doesn't speak only about loads but also about stores, and release doesn't speak only about stores but also about loads. On the other hand, plain load barriers only give you guarantees on loads, and plain store barriers only give you guarantees on stores.

My question is the following. In C11/C++11 is it safe to consider a standalone atomic_thread_fence(memory_order_acquire) as a load barrier (preventing LoadLoad reorderings) and an atomic_thread_fence(memory_order_release) as a store barrier (preventing StoreStore reorderings)?

And if the above is correct what can I use to prevent LoadStore and StoreLoad reorderings?

Of course I am interested in portability, and I don't care what the above produces on a specific platform.

Parabasis answered 10/5, 2020 at 10:55 Comment(6)
Don't forget that atomic_thread_fence is only defined WRT atomic operations. – Search
@Search But atomic_thread_fence serves as a standalone barrier. I don't see how atomicity of types and atomic operations plays any role here. A memory barrier just gives you ordering guarantees for everything before and/or after the barrier. – Parabasis
What guarantees does it give if you don't use atomics? – Search
@Search Ordering guarantees? That certain operations complete before others? – Parabasis
@Search Even the documentation says "ordering of non-atomic and relaxed atomic accesses": en.cppreference.com/w/c/atomic/atomic_thread_fence – Parabasis
Just show me some code that usefully uses a fence that doesn't involve an atomic. – Search

No. An acquire fence after a relaxed load can turn that combination into an acquire load (though less efficiently on some ISAs than just using an acquire load), so it has to block LoadStore as well as LoadLoad.

See https://preshing.com/20120913/acquire-and-release-semantics/ for a couple of very helpful diagrams of the orderings, showing that, and showing that release stores need to make sure all previous loads and stores are "visible", and thus need to block StoreStore and LoadStore (reorderings where the store part is second). See especially the diagram there summarizing it: acquire semantics prevent LoadLoad and LoadStore reordering; release semantics prevent LoadStore and StoreStore reordering.

Also https://preshing.com/20130922/acquire-and-release-fences/

https://preshing.com/20131125/acquire-and-release-fences-dont-work-the-way-youd-expect/ explains the two-way nature of acquire and release fences vs. the one-way nature of an acquire or release operation like a load or store. Apparently some people had misconceptions about what atomic_thread_fence() guarantees, thinking it was too weak.

And just for completeness, remember that these ordering rules have to be enforced by the compiler against compile-time reordering, not just runtime.
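To make the compile-time side concrete, here is a minimal sketch (the variable names are mine, not from the thread): std::atomic_signal_fence is a compiler-only barrier, so it blocks compile-time reordering without emitting any barrier instruction.

```cpp
#include <atomic>

int a = 0, b = 0;

// atomic_signal_fence blocks compile-time reordering of memory
// accesses across it, but emits no hardware barrier instruction.
// (Its intended use is ordering relative to a signal handler
// running on the same thread.)
void compile_time_order() {
    a = 1;
    std::atomic_signal_fence(std::memory_order_seq_cst);
    b = 2;  // the compiler may not move this store above the fence
}
```

A full atomic_thread_fence implies the same compile-time restriction plus whatever hardware barrier the target needs.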

It may mostly work to think of barriers acting on C++ loads / stores in the C++ abstract machine, regardless of how that's implemented in asm. But there are corner cases like PowerPC where that mental model doesn't cover everything (IRIW reordering, see below).

I do recommend trying to think in terms of acquire and release operations ensuring visibility of other operations to each other, and definitely don't write code that just uses relaxed ops and separate barriers. That can be safe but is often less efficient.


Everything about ISO C/C++ memory / inter-thread ordering is officially defined in terms of an acquire load seeing the value from a release store, and thus creating a "synchronizes with" relationship, not about fences to control local reordering.
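That pairing looks like this in practice (a minimal sketch, with names of my own invention): a release store publishes data, and an acquire load that reads the stored value synchronizes-with it, making the plain non-atomic write visible.

```cpp
#include <atomic>
#include <thread>

std::atomic<bool> ready{false};
int payload = 0;  // non-atomic; protected by the synchronizes-with edge

void producer() {
    payload = 42;                                  // plain store
    ready.store(true, std::memory_order_release);  // release store
}

int consumer() {
    // Spin until the acquire load sees the release store's value.
    // That read creates a synchronizes-with relationship, so the
    // write to payload happens-before the return below.
    while (!ready.load(std::memory_order_acquire)) {}
    return payload;  // guaranteed to be 42, with no data race
}

int run_demo() {
    std::thread t(producer);
    int seen = consumer();
    t.join();
    return seen;
}
```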

std::atomic does not explicitly guarantee the existence of a coherent shared-memory state where all threads see a change at the same time. In the mental model you're using, with local reordering when reading/writing a single shared state, IRIW reordering can happen when one thread makes its stores visible to some other threads before they become globally visible to all threads (as can happen in practice on some SMT PowerPC CPUs).

In practice all C/C++ implementations run threads across cores that do have a cache-coherent view of shared memory, so the mental model in terms of reads/writes to coherent shared memory, with barriers to control local reordering, works. But keep in mind that the C++ standard doesn't talk about reordering, just about whether any order is guaranteed in the first place.
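The IRIW (Independent Reads of Independent Writes) litmus test looks like this as C++ (a sketch with my own names; a single run can't demonstrate the reordering, it only shows the shape):

```cpp
#include <atomic>
#include <thread>

std::atomic<int> x{0}, y{0};
int r1, r2, r3, r4;

// Two independent writers; two readers load both variables in
// opposite orders. With only acquire loads, ISO C++ (and real
// POWER hardware) permits r1==1, r2==0, r3==1, r4==0: the readers
// disagree about which store happened first. With seq_cst
// everywhere, as below, that outcome is forbidden: all threads
// agree on a single total order of the stores.
void writer_x() { x.store(1, std::memory_order_seq_cst); }
void writer_y() { y.store(1, std::memory_order_seq_cst); }
void reader_xy() {
    r1 = x.load(std::memory_order_seq_cst);
    r2 = y.load(std::memory_order_seq_cst);
}
void reader_yx() {
    r3 = y.load(std::memory_order_seq_cst);
    r4 = x.load(std::memory_order_seq_cst);
}

bool iriw_forbidden_outcome() {
    std::thread a(writer_x), b(writer_y), c(reader_xy), d(reader_yx);
    a.join(); b.join(); c.join(); d.join();
    return r1 == 1 && r2 == 0 && r3 == 1 && r4 == 0;
}
```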


For another in-depth look at the divide between how C++ describes memory models, vs. how asm memory models for real architectures are described, see also How to achieve a StoreLoad barrier in C++11? (including my answer there). Also Does atomic_thread_fence(memory_order_seq_cst) have the semantics of a full memory barrier? is related.

fence(seq_cst) includes StoreLoad (if that concept even applies to a given C++ implementation). I think reasoning in terms of local barriers and then transforming that to C++ mostly works, but remember that it doesn't model the possibility of IRIW reordering which C++ allows, and which happens in real life on some POWER hardware.
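Concretely, the store-buffer (Dekker-style) litmus test is where StoreLoad ordering matters. A sketch with my own names, assuming fence(seq_cst) compiles to a full barrier on the target:

```cpp
#include <atomic>
#include <thread>

std::atomic<int> X{0}, Y{0};
int rA, rB;

// Each thread stores to one variable, then loads the other.
// With only relaxed operations, both threads could read 0
// (StoreLoad reordering: each load completes before the other
// thread's store is visible). A seq_cst fence between the store
// and the load forbids the rA == 0 && rB == 0 outcome.
void thread_a() {
    X.store(1, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);
    rA = Y.load(std::memory_order_relaxed);
}
void thread_b() {
    Y.store(1, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);
    rB = X.load(std::memory_order_relaxed);
}

bool both_loads_zero() {
    std::thread a(thread_a), b(thread_b);
    a.join(); b.join();
    return rA == 0 && rB == 0;  // forbidden with the fences
}
```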

Also keep in mind that var.load(acquire) can be much more efficient than var.load(relaxed); fence(acquire); on some ISAs, notably ARMv8.

e.g. this example on Godbolt, compiled for ARMv8 (32-bit mode, hence the r registers) by GCC 8.2 with -O2 -mcpu=cortex-a53:

#include <atomic>
int bad_acquire_load(std::atomic<int> &var){
    int ret = var.load(std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_acquire);
    return ret;
}

bad_acquire_load(std::atomic<int>&):
        ldr     r0, [r0]          // plain load
        dmb     ish               // FULL BARRIER
        bx      lr

int normal_acquire_load(std::atomic<int> &var){
    int ret = var.load(std::memory_order_acquire);
    return ret;
}

normal_acquire_load(std::atomic<int>&):
        lda     r0, [r0]            // acquire load
        bx      lr
Tatar answered 10/5, 2020 at 11:18 Comment(9)
So, to recap: in practice, is it safe to assume that an acquire fence will always act at least as a load barrier, and a release fence at least as a store barrier? And is there anything else available in modern C/C++ that will act only as a LoadLoad or StoreStore barrier? – Parabasis
@ilstam: Yes, at least when compiling for an ISA like x86 or ARM so the entire discussion can have any meaning, an acquire fence must always include a LoadLoad barrier. And no, there's nothing in portable ISO C++ that's only LoadLoad without LoadStore. Most CPUs don't have such a barrier anyway; e.g. PowerPC provides lwsync which is all 3 cheap ones (not StoreLoad). – Tatar
For some given implementation you might be able to use inline asm. e.g. x86 sfence is StoreStore only, and affects NT stores as well as plain stores. (Plain stores are strongly ordered, so sfence is useless outside of NT stores.) ARM dmb may actually have ways to do only LoadLoad without LoadStore (which might be why compilers have to use a full barrier dmb ish even for atomic_thread_fence(mo_acquire)). godbolt.org/z/jAwEKX – Tatar
And finally, would atomic_thread_fence(memory_order_seq_cst) prevent StoreLoad in a portable way? Or does this not make sense in the C/C++ standard's parlance? – Parabasis
@ilstam: Yes, fence(seq_cst) includes StoreLoad when compiling for normal ISAs with a memory model that works in terms of local fences. ISO C++ doesn't state it that way, but in practice that's what happens on the ISAs where the whole model of reordering of access to shared memory applies. How to achieve a StoreLoad barrier in C++11? has some good answers about this, and so does Does atomic_thread_fence(memory_order_seq_cst) have the semantics of a full memory barrier? – Tatar
Ok, thank you. Maybe you could incorporate this in your answer as well, since it was kind of included in the question, so I can accept it as a (super) complete answer. – Parabasis
@ilstam: ok, yeah, that comment is worth incorporating for the benefit of future readers. Done. – Tatar
Could you also please link to something that supports the claim that using relaxed ops and separate barriers can be less efficient? I would be interested to know more about it. And it makes me wonder why the API even offers the standalone barriers then. – Parabasis
@ilstam: added an example on Godbolt where the bad way to do an acquire load fails to use lda, and in fact has a full barrier when compiled for ARM32. The API includes barriers because they're useful sometimes, e.g. to build a SeqLock (Does this envelope implementation correctly use C++11 atomics?), or to make sure a load happens after something else, where that "something else" is not itself an acquire load. An acquire load only makes sure that it happens before later stuff. – Tatar