If most checks of `done` find it not-done, and this happens in a throughput-sensitive part of your program, then yes, this could make sense, even on ISAs where a separate barrier costs more. Perhaps a use-case like an exit-now flag that also signals that some data or a pointer the thread will want is ready. You check often, but the great majority of the time you don't exit, so you don't need later operations to wait for this load to complete.
This is a win on some ISAs (where a load(acquire) is already a load+barrier), but on others it's usually worse, especially if the case we care about most (the "fast path") is the one that loads `value`. (On ISAs where a fence(acquire) is more expensive than a load(acquire), notably 32-bit ARM with the ARMv8 new instructions: `lda` is just an acquire load, but a fence is still a `dmb ish` full barrier.)
If the `!done` case is common and there's other work to do, then it's maybe worth considering the tradeoff, since `std::memory_order_consume` is not currently usable for its intended purpose. (See below re: memory dependency ordering solving this specific case without any barrier.)
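A minimal sketch of the pattern under discussion, assuming the `done` flag and `value` names from the question, with the acquire fence placed on the rare taken-branch path:

```cpp
#include <atomic>

std::atomic<bool> done{false};
int value = 0;   // written by another thread before its release-store to done

int check_exit() {
    if (done.load(std::memory_order_relaxed)) {            // cheap check; usually false
        // Only the rare path pays for ordering: the fence (combined with the
        // writer's release-store) makes the later read of value safe.
        std::atomic_thread_fence(std::memory_order_acquire);
        return value;
    }
    return -1;   // fast path: no barrier, later independent work isn't delayed
}
```

Single-threaded this is trivially correct; the interesting property (no barrier on the not-done path) shows up in the generated asm on weakly-ordered ISAs.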
For other common ISAs, no, it wouldn't make sense because it would make the "success" case slower, maybe much slower if it ended up with a full barrier. If that's the normal fast-path through the function, that would obviously be terrible.
On x86 there's no difference: fence(acquire) is a no-op, and load(acquire) uses the same asm as load(relaxed). That's why we say x86's hardware memory model is "strongly ordered". Most other mainstream ISAs aren't like this.
For some ISAs this is a pure win in this case: the ones that implement `done.load(acquire)` as a plain load followed by the same barrier instruction `fence(acquire)` would use (like RISC-V, or 32-bit ARM without ARMv8 instructions). They have to branch anyway, so it's just a question of where we place the barrier relative to the branch. (Unless they choose to unconditionally load `value` and branchlessly select, like MIPS `movn`, which is allowed because they already load another member of that `class Worker` object, so it's known to be a valid pointer to a full object.)
AArch64 can do acquire loads quite cheaply, but an acquire barrier would be more expensive. (And it would happen on what would normally be the fast path; speeding up the "failure" path is normally not important.)
Instead of a barrier, a second load, this time with acquire, could possibly be better. If the flag can only change from 0 to 1, you don't even need to re-check its value: accesses to the same atomic object are ordered within the same thread.
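A sketch of that variant, again assuming the question's `done`/`value` names: the second load of the same atomic object replaces the fence, and since same-object accesses are ordered within a thread, its result doesn't even need to be re-checked.

```cpp
#include <atomic>

std::atomic<bool> done{false};
int value = 0;

int check_exit() {
    if (done.load(std::memory_order_relaxed)) {
        // Re-load with acquire instead of a standalone fence. On AArch64 this
        // is a cheap ldar(b) rather than a dmb ish. If the flag only ever goes
        // 0 -> 1, the result must still be true, so we can discard it.
        (void)done.load(std::memory_order_acquire);
        return value;
    }
    return -1;
}
```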
(I had a Godbolt link with some examples for many ISAs, but a browser restart ate it.)
**Memory dependency ordering could solve this problem with no barriers**
Unfortunately `std::memory_order_consume` is temporarily deprecated; otherwise you could have the best of both worlds for this case, by creating an `&value` pointer with a data dependency on `done.load(consume)`. So the load of `value` (if done at all) would be dependency-ordered after the load from `done`, but other independent later loads wouldn't have to wait.
e.g. `if ( (tmp = done.load(consume)) )` and `return (&value)[tmp-1]`. This is easy in asm, but without fully working `consume` support, compilers would optimize out the use of `tmp` in the side of the branch that can only be reached with `tmp == true`.
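For concreteness, a sketch of what that would look like. This only achieves the intended effect if `consume` worked as designed; current compilers simply promote `memory_order_consume` to `acquire`, so today this compiles but gains nothing over an acquire load.

```cpp
#include <atomic>

std::atomic<bool> done{false};
int value = 0;

int check_exit() {
    // Hypothetical: with a working consume, only loads that carry a data
    // dependency on tmp would be ordered after the load of done.
    bool tmp = done.load(std::memory_order_consume);
    if (tmp)
        return (&value)[tmp - 1];   // index is 0, but computed from tmp, so the
                                    // load of value depends on the flag load
    return -1;
}
```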
So the only ISA that actually needs to make this barrier tradeoff in asm is Alpha, but due to C++ limitations we can't easily take advantage of the hardware support that other ISAs offer.
If you're willing to use something that will work in practice despite not having guarantees, use `std::atomic<int *> done = nullptr;` and do a release-store of `&value` instead of `= true`. Then in the reader, do a `relaxed` load, and `if (tmp) { return *tmp; } else { return -1; }`. If the compiler can't prove that the only non-null pointer value is `&value`, it will need to keep the data dependency on the pointer load. (To stop it from proving that, perhaps include a `set` member function that stores an arbitrary pointer in `done`, which you never call.)
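A sketch of that workaround: publish a pointer instead of a bool, so the reader's dereference carries a data dependency on the relaxed load. (A works-in-practice idiom on non-Alpha hardware, not something the standard guarantees.)

```cpp
#include <atomic>

std::atomic<int *> done{nullptr};   // nullptr = "not done"
int value = 0;

// Writer side, after filling in value:
//   done.store(&value, std::memory_order_release);

int check_exit() {
    int *tmp = done.load(std::memory_order_relaxed);  // no barrier
    if (tmp)
        return *tmp;   // dereference depends on the loaded pointer, so the
                       // hardware orders it after the flag load (except Alpha)
    return -1;
}
```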
See C++11: the difference between memory_order_relaxed and memory_order_consume for details, and a link to Paul E. McKenney's CppCon 2016 talk where he explains what `consume` was supposed to be for, and how Linux RCU does use the kind of thing I suggested, with effectively relaxed loads, depending on the compiler to make asm with data dependencies. (Which requires being careful not to write things where it can optimize away the data dependency.)
Comments:

`std::atomic_thread_fence(std::memory_order_acquire)` is a stricter memory fence than `std::atomic::load(std::memory_order_acquire)`, so there may be some speculation as to which approach is more optimized. It may depend on external factors, such as target CPU. – Schweinfurt

[…] `!done`, and there's other useful work for this thread to be doing in that case, not about to sleep and wait or something. But otherwise worse on some ISAs, especially 32-bit ARM with ARMv8 instructions, where fence(acquire) is a full memory barrier including draining the store buffer, but load(acquire) is just `ldarb`. – Conformity

[…] `memory_order_consume` worked, you could get the best of both worlds, with no barriers even when loading `value`, except on DEC Alpha.) – Conformity

[…] `while (!other_var) { }`. Both cores will put the write in their write-back buffer, and due to no other memory access happening, neither will write back the value, so neither will see a change and you have a deadlock. In complex code you are often lucky that the amount of other memory traffic will flush things out, but in small loops you deadlock without barriers. – Cutlerr

[…] `std::memory_order_relaxed` stores not being visible to other threads, it's broken and violates some fairly strong "should" notes. (eel.is/c++draft/intro.progress#18) I highly doubt that's the case for GCC or clang, but they don't use extra barriers on relaxed loads/stores. – Conformity

[…] `dmb ish` before the store, but nothing after it. (Or with `-mcpu=cortex-a53` or any other ARMv8, it uses `stl`, a release-store.) If your claim were correct, that would mean release-stores could be invisible indefinitely, too. That would obviously be unacceptable for most real use-cases, so I'm sure it's not correct. And of course relaxed load/store are just plain ldr/str with no barriers, because they don't need any ordering wrt. other stores. – Conformity

[…] `relaxed`, but you still get visibility for the atomic object itself. There is no correctness problem with the idea proposed in this question on any mainstream C++ implementation, only hypothetical ones that barely satisfy the multithread progress requirements. – Conformity

[…] `std::atomic` might always do that. I haven't checked with regards to relaxed loads. But you'd better check your implementation does, or you can end up with deadlocks. – Cutlerr

[…] `relaxed` load and store don't use any extra barriers with GCC or clang. I'm saying that's because they're not needed for prompt visibility; you're saying it allows huge delay. – Conformity