Does it make sense to use a relaxed load followed by a conditional fence, if I don't always need acquire semantics?
Consider following toy example, especially the result function:

#include <atomic>
#include <chrono>
#include <iostream>
#include <thread>

class Worker
{
    std::thread th;
    std::atomic_bool done = false;

    int value = 0;

  public:
    Worker()
        : th([&]
    {
        std::this_thread::sleep_for(std::chrono::seconds(1));
        value = 42;
        done.store(true, std::memory_order_release);
    }) {}

    int result() const
    {
        return done.load(std::memory_order_acquire) ? value : -1;
    }

    Worker(const Worker &) = delete;
    Worker &operator=(const Worker &) = delete;

    ~Worker()
    {
        th.join();
    }
};

int main()
{
    Worker w;
    while (true)
    {
        int r = w.result();
        if (r != -1)
        {
            std::cout << r << '\n';
            break;
        }
    }
}

I reckon that I need acquire semantics only if done.load() returns true, so I could rewrite it like this:

int result() const
{
    if (done.load(std::memory_order_relaxed))
    {
        std::atomic_thread_fence(std::memory_order_acquire);
        return value;
    }
    else
    {
        return -1;
    }
}

It seems to be a legal thing to do, but I lack experience to tell if this change makes sense or not (whether it's more optimized or not).

Which of the two forms should I prefer?

Gley answered 13/6, 2022 at 17:32 Comment(22)
That would potentially never finish because you never force the CPU to refresh its cached value of done. So it never sees that some other thread has written to it.Cutlerr
std::atomic_thread_fence(std::memory_order_acquire) is a stricter memory fence than std::atomic::load(std::memory_order_acquire) so there may be some speculation as to which approach is more optimized. It may depend on external factors, such as target CPU.Schweinfurt
@GoswinvonBrederlow This thread seems to claim otherwise. Did that happen to you in practice?Gley
@DrewDormann Yep, that's why I'm asking.Gley
@DrewDormann Looked up the libstdc++ implementation. They have some wonky code there, but they seem to use acq-rel operations directly, no fences. Same for libc++.Gley
Could be useful if the common case is !done, and there's other useful work for this thread to be doing in that case, not about to sleep and wait or something. But otherwise worse on some ISAs, especially 32-bit ARM with ARMv8 instructions where fence(acquire) is a full memory barrier including draining the store buffer, but load(acquire) is just ldarb.Conformity
(If memory_order_consume worked, you could get the best of both worlds, with no barriers even when loading value, except on DEC Alpha.)Conformity
@Gley It's one of the fundamental examples for the need for memory barriers in the ARM documentation. Two threads write a value to separate variables and then wait for the other thread's variable to change using while (!other_var) { }. Both cores will put the write in their write-back buffer and, with no other memory access happening, neither will write back the value, so neither will see a change and you have a deadlock. In complex code you are often lucky that the amount of other memory traffic will flush things out, but in small loops you deadlock without barriers.Cutlerr
@GoswinvonBrederlow: Compilers assume that C++ threads will run in the same inner-shareable domain, so they share a coherent view of cache. The store buffer is not a like a write-back cache, it drains itself to L1d ASAP, making the store globally visible. If your C++ implementation can go indefinitely with std::memory_order_relaxed stores not being visible to other threads, it's broken and violates some fairly strong "should" notes. (eel.is/c++draft/intro.progress#18) I highly doubt that's the case for GCC or clang, but they don't use extra barriers on relaxed loads/stores.Conformity
@GoswinvonBrederlow: What exact ARM docs are you talking about? Are you sure it's talking about inner-shareable coherency domains, and other things that C++ implementations do on CPUs they run threads across? There are hybrid ARM boards with a DSP + microcontroller that aren't cache-coherent with each other, but real systems don't run a single OS (or threads of the same freestanding program) across those cores.Conformity
@PeterCordes I'm talking about ARM boards like the Raspberry Pi. Just look at what the compiler will generate for barriers for the different memory orders and you will see much more happens than on x86 for example.Cutlerr
@GoswinvonBrederlow: I have looked, maybe you need to look again: godbolt.org/z/3ocv46E5z shows a memory_order_release generating a dmb ish before the store, but nothing after it. (Or with -mcpu=cortex-a53 or any other ARMv8, it uses stl, a release-store). If your claim were correct, that would mean release-stores could be invisible indefinitely, too. That would obviously be unacceptable for most real use-cases, so I'm sure it's not correct. And of course relaxed load/store are just plain ldr/str with no barriers, because they don't need any ordering wrt. other stores.Conformity
@GoswinvonBrederlow Do you have a link to the part of docs that warns against this?Gley
Sorry, I have an old link from years ago but ARM restructured their docs and broke all links.Cutlerr
@GoswinvonBrederlow: So you're still claiming that GCC and clang ignore ISO C++ eel.is/c++draft/intro.progress#18 (An implementation should ensure that the last value (in modification order) assigned by an atomic or synchronization operation will become visible to all other threads in a finite period of time.) and eel.is/c++draft/atomics.order#11 (Implementations should make atomic stores visible to atomic loads within a reasonable amount of time). I'm saying microseconds at most. You're saying you could create a real-world demo on an RPi. Feel free to prove it.Conformity
@PeterCordes No, I'm saying: "Relaxed operation: there are no synchronization or ordering constraints imposed on other reads or writes, only this operation's atomicity is guaranteed (see Relaxed ordering below)." I'm not sure how your quotes apply to a situation where you explicitly tell the compiler to not impose any constraints. If they do apply then the compiler will have to insert a barrier on the store and load even for relaxed.Cutlerr
@GoswinvonBrederlow: The stores to that object are still visible to other threads promptly, they just aren't ordered wrt. operations on other objects. Ordering and inter-thread latency are totally separate things. When C++ says "synchronization", they mean syncs-with relationships that create happens-before ordering. You don't get that with relaxed, but you still get visibility for the atomic object itself. There is no correctness problem with the idea proposed in this question on any mainstream C++ implementation, only hypothetical ones that barely satisfy the multithread progress reqs.Conformity
@PeterCordes It's still just a "should" and not a "must" so I would be careful. I know the ARM hardware needs special care for changes like in the above code to become visible to other cores. std::atomic might always do that. I haven't checked with regards to relaxed loads. But you better check your implementation does or you can end up with deadlocks.Cutlerr
@GoswinvonBrederlow: Again, please prove it. You're making a very surprising claim, that relaxed stores can be invisible to relaxed loads indefinitely on an ARM system. I'm pretty confident I would have heard about it before now if that were true. It's totally contrary to what you'd expect for a system with coherent cache and a normal store buffer (which commits stores ASAP). As I've shown you in the Godbolt link, relaxed load and store don't use any extra barriers with GCC or clang. I'm saying that's because they're not needed for prompt visibility, you're saying it allows huge delayConformity
@GoswinvonBrederlow: What you're describing sounds totally plausible for threads not in the same inner-shareable coherence domain (or whatever the exact criterion is for ARM CPUs to not have coherent cache). Or you're remembering something about invalidating code caches, and requiring manual flush of data caches back to a point of unification; that's true for self-modifying code / JIT, but not for data. What you're describing makes no sense for normal data load/store. That would just be bad design, cumbersome to use. Perhaps some ancient ARM version way before ARMv6 or 7?Conformity
Definitely post-ARMv6. And it affects the inner shareable domain because it's a quirk of the write-back buffer and write combining, IIRC. The architecture doesn't enforce any time limit on the write-back, but normally the atomic will trigger it explicitly. Thinking about it, the store uses "release" order, so that should still force the write-back. So never mind.Cutlerr
Let us continue this discussion in chat.Conformity

If most checks of done find it not yet done, and the check happens in a throughput-sensitive part of your program, then yes, this could make sense, even on ISAs where a separate barrier costs more. Perhaps a use-case like an exit-now flag that also signals some data or a pointer the thread will want. You check often, but the great majority of the time you don't exit and don't need later operations to wait for this load to complete.


This is a win on some ISAs (where a load(acquire) is already a load+barrier), but on others it's usually worse, especially if the case we care about most (the "fast path") is the one that loads value. (On ISAs where a fence(acquire) is more expensive than a load(acquire), especially 32-bit ARM with ARMv8 new instructions: lda is just an acquire load, but a fence is still a dmb ish full barrier.)

If the !done case is common and there's other work to do, then it's maybe worth considering the tradeoff, since std::memory_order_consume is not currently usable for its intended purpose. (See below re: memory dependency ordering solving this specific case without any barrier.)

For other common ISAs, no, it wouldn't make sense because it would make the "success" case slower, maybe much slower if it ended up with a full barrier. If that's the normal fast-path through the function, that would obviously be terrible.


On x86 there's no difference: fence(acquire) is a no-op, and load(acquire) uses the same asm as load(relaxed). That's why we say x86's hardware memory model is "strongly ordered". Most other mainstream ISAs aren't like this.
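You can see this by writing both forms side by side; on x86-64, mainstream compilers typically emit the same plain load for each, with the fence compiling to no instruction at all. (The function names here are just for illustration.)

```cpp
#include <atomic>

std::atomic<bool> done{false};

bool with_acquire_load()
{
    // On x86, every load already has acquire semantics in hardware,
    // so this is a plain mov.
    return done.load(std::memory_order_acquire);
}

bool with_relaxed_load_then_fence()
{
    bool d = done.load(std::memory_order_relaxed);
    if (d)
        std::atomic_thread_fence(std::memory_order_acquire);  // no-op on x86
    return d;
}
```

On a weakly-ordered ISA the two functions would differ exactly as described above: the first places the ordering cost on the load itself, the second only on the taken branch.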

For some ISAs this is a pure win: those that implement done.load(acquire) as a plain load followed by the same barrier instruction fence(acquire) would use (like RISC-V, or 32-bit ARM without ARMv8 instructions). They have to branch anyway, so it's just a question of where we place the barrier relative to the branch. (Unless they choose to unconditionally load value and branchlessly select, like MIPS movn, which is allowed because they already load another member of the same class Worker object, so it's known to be a valid pointer to a full object.)


AArch64 can do acquire loads quite cheaply, but an acquire barrier would be more expensive. (And it would land on what would normally be the fast path; speeding up the "failure" path is normally not important.)

Instead of a barrier, a second load of the flag, this time with acquire, could well be better. If the flag can only change from 0 to 1, you don't even need to re-check its value; accesses to the same atomic object are ordered within the same thread.
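A sketch of that second-load idea, applied to the question's Worker (assuming the flag only ever goes from false to true; the 1-second sleep is dropped, and the data members are declared before the thread so they're constructed before it starts):

```cpp
#include <atomic>
#include <thread>

class Worker
{
    // Declared before th so both are fully initialized before the thread runs.
    std::atomic_bool done{false};
    int value = 0;
    std::thread th;

  public:
    Worker()
        : th([&]
    {
        value = 42;
        done.store(true, std::memory_order_release);
    }) {}

    int result() const
    {
        // Fast path: relaxed, no ordering cost while the flag is still false.
        if (done.load(std::memory_order_relaxed))
        {
            // Re-load the same object with acquire instead of a fence.
            // Coherence guarantees this load also sees true, and it
            // synchronizes-with the release store, so reading value is safe.
            (void)done.load(std::memory_order_acquire);
            return value;
        }
        return -1;
    }

    Worker(const Worker &) = delete;
    Worker &operator=(const Worker &) = delete;
    ~Worker() { th.join(); }
};
```

On AArch64 the extra load is just another ldarb on the success path, rather than a dmb ish that the fence version would cost.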

(I had a Godbolt link with some examples for many ISAs, but a browser restart ate it.)


Memory dependency order could solve this problem with no barriers

Unfortunately std::memory_order_consume is temporarily deprecated, otherwise you could have the best of both worlds for this case, by creating an &value pointer with a data-dependency on done.load(consume). So the load of value (if done at all) would be dependency-ordered after the load from done, but other independent later loads wouldn't have to wait.

e.g. if ( (tmp = done.load(consume)) ) and return (&value)[tmp-1]. This is easy in asm, but without fully working consume support, compilers would optimize out the use of tmp on the side of the branch that can only be reached with tmp == true.

So the only ISA that actually needs to make this barrier tradeoff in asm is Alpha, but due to C++ limitations we can't easily take advantage of the hardware support that other ISAs offer.

If you're willing to use something that will work in practice despite not having guarantees, use std::atomic<int *> done = nullptr; and do a release-store of &value instead of =true. Then in the reader, do a relaxed load, and if(tmp) { return *tmp; } else { return -1; }. If the compiler can't prove that the only non-null pointer value is &value, it will need to keep the data dependency on the pointer load. (To stop it from proving that, perhaps include a set member function that stores an arbitrary pointer in done, which you never call.)
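A sketch of that pointer-based workaround, with the caveat from above that it relies on hardware dependency ordering rather than any ISO C++ guarantee, and on the compiler not proving which pointer was stored (the set member is the never-called decoy suggested above):

```cpp
#include <atomic>
#include <thread>

class Worker
{
    // Declared before th so both are fully initialized before the thread runs.
    std::atomic<int *> done{nullptr};  // release-store &value instead of true
    int value = 0;
    std::thread th;

  public:
    Worker()
        : th([&]
    {
        value = 42;
        done.store(&value, std::memory_order_release);
    }) {}

    int result() const
    {
        // Relaxed load; the dereference carries a data dependency on it,
        // which every mainstream ISA except Alpha orders in hardware.
        // Works in practice, but is not guaranteed by the standard.
        int *p = done.load(std::memory_order_relaxed);
        return p ? *p : -1;
    }

    // Never called; exists only so the compiler can't prove the sole
    // non-null value of done is &value and optimize away the dependency.
    void set(int *p) { done.store(p, std::memory_order_release); }

    Worker(const Worker &) = delete;
    Worker &operator=(const Worker &) = delete;
    ~Worker() { th.join(); }
};
```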

See C++11: the difference between memory_order_relaxed and memory_order_consume for details, and a link to Paul E. McKenney's CppCon 2016 talk where he explains what consume was supposed to be for, and how Linux RCU does use the kind of thing I suggested, with effectively relaxed loads and depending on the compiler to make asm with data dependencies. (Which requires being careful not to write things where it can optimize away the data dependency.)

Conformity answered 18/7, 2022 at 21:8 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.