Why set the stop flag using `memory_order_seq_cst`, if you check it with `memory_order_relaxed`?
Herb Sutter, in his "atomic<> weapons" talk, shows several example uses of atomics, and one of them boils down to the following: (video link, timestamped)

  • A main thread launches several worker threads.

  • Workers check the stop flag:

    while (!stop.load(std::memory_order_relaxed))
    {
        // Do stuff.
    }
    
  • The main thread eventually does stop = true; (note, using order=seq_cst), then joins the workers.

Sutter explains that checking the flag with order=relaxed is ok, because who cares if the thread stops with a slightly bigger delay.

But why does stop = true; in the main thread use seq_cst? The slide says that it's purposefully not relaxed, but doesn't explain why.

It looks like it would work, possibly with a larger stopping delay.

Is it a compromise between performance and how fast other threads see the flag? I.e. since the main thread only sets the flag once, we might as well use the strongest ordering, to get the message across as quickly as possible?

Atrice answered 4/1, 2022 at 16:2 Comment(4)
Related to whether a stronger order for stop = true will make the thread stop slower or faster -- Does hardware memory barrier make visibility of atomic operations faster in addition to providing necessary guarantees?Spokesman
To me it looks like a mistake, and stop only needs relaxed. I'm pretty sure I've done that on weakly ordered platforms and it works fine. I'd like an expert to confirm this too.Nanete
How about "...and an atomic load in thread B from the same variable is tagged memory_order_acquire, all memory writes (non-atomic and relaxed atomic) that happened-before the atomic store from the point of view of thread A, become visible side-effects in thread B. ..." en.cppreference.com/w/cpp/atomic/memory_order ?Grassofparnassus
@RichardCritten Can you elaborate? That doesn't happen in this case, since nobody performs an acquire (and stop = true; performs a seq-cst release, or no release at all if the relaxed order is used), but it's not a problem, since workers will see the updated value sooner or later anyway.Atrice
mo_relaxed is fine for both load and store of a stop flag

There's also no meaningful latency benefit to stronger memory orders, even if latency of seeing a change to a keep_running or exit_now flag was important.

IDK why Herb thinks stop.store shouldn't be relaxed; in his talk, his slides have a comment that says // not relaxed on the assignment, but he doesn't say anything about the store side before moving on to "is it worth it".

Of course, the load runs inside the worker loop, but the store runs only once, and Herb really likes to recommend sticking with SC unless you have a performance reason that truly justifies using something else. I hope that wasn't his only reason; I find that unhelpful when trying to understand what memory order would actually be necessary and why. But anyway, I think it's either that or a mistake on his part.


The ISO C++ standard doesn't say anything about how soon stores become visible or what might influence that. The following passages apply to all atomic operations, including relaxed. They're normative text, not just notes, but only a should, not a must.

ISO C++ section 6.9.2.3 Forward progress

18. An implementation should ensure that the last value (in modification order) assigned by an atomic or synchronization operation will become visible to all other threads in a finite period of time.

And 33.5.4 Order and consistency [atomics.order] covering only atomics, not mutexes etc.:

11. Implementations should make atomic stores visible to atomic loads within a reasonable amount of time.

Inter-thread latency is primarily a quality-of-implementation thing, with the standard leaving things wide open. Normal C++ implementations that work by compiling to asm for some architecture effectively just expose the hardware's cache-coherence properties, so typically tens of nanoseconds best case, sub-microsecond near-worst case if both threads are currently running on different cores. (Otherwise scheduler timeslice...)


Another thread can loop arbitrarily many times before its load actually sees this store value, even if they're both seq_cst, assuming there's no other synchronization of any kind between them. Low inter-thread latency is a performance issue, not correctness / formal guarantee.

And non-infinite inter-thread latency is apparently only a "should" QOI (quality of implementation) issue. :P Nothing in the standard suggests that seq_cst would help on a hypothetical implementation where store visibility could be delayed indefinitely, although one might guess that could be the case, e.g. on a hypothetical implementation with explicit cache flushes instead of cache coherency. (Although such an implementation is probably not practically usable in terms of performance with CPUs anything like what we have now; every release and/or acquire operation would have to flush the whole cache.)

On real hardware (which uses some form of MESI cache coherency), different memory orders for store or load don't make stores visible sooner in real time, they just control whether later operations can become globally visible while still waiting for the store to commit from the store buffer to L1d cache. (After invalidating any other copies of the line.)

Stronger orders, and barriers, don't make things happen sooner in an absolute sense, they just delay other things until they're allowed to happen relative to the store or load. (This is the case on all real-world CPUs AFAIK; they always try to make stores visible to other cores ASAP anyway, so the store buffer doesn't fill up.)

See also (my similar answers on):

The second Q&A is about x86 where commit from the store buffer to L1d cache is in program order. That limits how far past a cache-miss store execution can get, and also any possible benefit of putting a release or seq_cst fence after the store to prevent later stores (and loads) from maybe competing for resources. (x86 microarchitectures will do RFO (read for ownership) before stores reach the head of the store buffer, and plain loads normally compete for resources to track RFOs we're waiting for a response to.) But these effects are extremely minor in terms of something like exiting another thread; only very small scale reordering.


because who cares if the thread stops with a slightly bigger delay.

More like, who cares if the thread gets more work done by not making loads/stores after the load wait for the check to complete. (Of course, this work will get discarded if it's in the shadow of a mis-speculated branch on the load result when we eventually load true.) The cost of rolling back to a consistent state after a branch mispredict is more or less independent of how much already-executed work had happened beyond the mispredicted branch. And it's a stop flag which presumably doesn't get set very often, so the total amount of wasted work costing cache/memory bandwidth for other CPUs is pretty minimal.

That phrasing makes it sound like an acquire load or release store would actually get the store seen sooner in absolute real time, rather than just relative to other code in this thread. (Which is not the case).

The benefit is more instruction-level and memory-level parallelism across loop iterations when the load produces false. And simply avoiding running extra instructions on ISAs where an acquire or especially an SC load needs extra instructions, especially expensive 2-way barrier instructions (like PowerPC isync/sync or especially ARMv7 dmb ish full barrier even for acquire), unlike ARMv8 ldapr or x86 mov acquire-load instructions. (Godbolt)


BTW, Herb is right that the dirty flag can also be relaxed, but only because of the thread.join sync between the reader and any possible writer. Otherwise yeah, release / acquire.

But in this case, dirty only needs to be atomic<> at all because of possible simultaneous writers all storing the same value, which ISO C++ still deems data-race UB. e.g. because of the theoretical possibility of hardware race-detection that traps on conflicting non-atomic accesses. (Or a software implementation like clang -fsanitize=thread)


Fun fact: C++20 introduced std::stop_token for use as a stop or keep_running flag.

Maremma answered 5/1, 2022 at 13:22 Comment(1)

First of all, stop.store(true, mo_relaxed) would be enough in this context.

launch_workers();
stop = true;  // not relaxed
join_workers();

why does stop = true; in the main thread use seq_cst?

Herb does not mention the reason why he uses mo_seq_cst, but let's look at a few possibilities.

  • Based on the "not relaxed" comment, he is worried that stop.store(true, mo_relaxed) can be re-ordered with launch_workers() or join_workers().
    Since launch_workers() is a release operation and join_workers() is an acquire operation, the ordering constraints for both will not prevent the store from moving in either direction.
    However, it is important to notice that for this scenario, it does not really matter whether the store to stop uses mo_relaxed or mo_seq_cst. Even with the strongest ordering, mo_seq_cst (which, in the absence of other SC operations, is no stronger than mo_release), the ordering rules still allow the re-ordering with join_workers().
    Of course this reordering isn't going to happen, but my point is that stronger ordering constraints on the store aren't going to make a difference.

  • He could make the argument that a sequentially consistent (SC) store is an advantage since the thread performing the relaxed load will pick up on the new value sooner (an SC store flushes the store buffer).
    But this seems hardly relevant because the store is in between creating and joining threads, which is not in a tight loop, or as Herb puts it: "..is it in a performance-critical region of code where this overhead matters?.."
    He also says about the load: "..you don't care when it arrives.."

We don't know the real reason, but it is possibly based on the programming convention that you don't use explicit ordering parameters (which means mo_seq_cst), unless it makes a difference, and in this case, as Herb explains, only the relaxed load makes a difference.

For example, on the weakly ordered PowerPC platform, a load(mo_seq_cst) uses both the (expensive) sync and (less expensive) isync instructions, a load(mo_acquire) still uses isync and a load(mo_relaxed) uses none of them. In a tight loop, that is a good optimization.
Also worth mentioning is that on the mainstream x86 platform, there is no real difference in performance between load(mo_seq_cst) and load(mo_relaxed).

Personally I favor this programming style where ordering parameters are omitted when they don't matter and used when they make a difference.

stop.store(true); // ordering irrelevant, but uses SC
stop.store(true, memory_order_seq_cst); // store requires SC ordering (which is rare)

It's only a matter of style.. for both stores, the compiler will generate the same assembly.

Clan answered 5/1, 2022 at 15:38 Comment(4)
That's an interesting observation - formally, it seems the code is wrong even with seq_cst, as the compiler could move stop = true after the join, which would be catastrophic. And under the literal language of the standard, I don't see how it can be fixed. If join() were really an acquire operation, putting a seq_cst fence between stop=true and the join would fix it - but join is not an "atomic operation" so it does not participate in the seq_cst ordering.Bealle
@NateEldredge I did not really want to make the statement that it is wrong, only that the existing ordering rules might allow it. But there are other rules that will prevent it from happening because it is (nearly) impossible that the store can be executed after the join, I'm just not sure which rules. Jeff Preshing has written about a similar caseClan
Yeah, I was thinking some more and came to the same conclusion. Since any store, even a relaxed one, should become visible in finite time, the compiler cannot reorder it past the join which may potentially take forever. There is the "should" which I guess makes it QoI, but even if they had said "must", a DeathStation 9000 could say "fine, stores become visible after 5000 years" and we'd be no better off.Bealle
@NateEldredge: Yup, exactly. Compile-time reordering that creates a deadlock would violate the finite-time requirement (suggestion?) in the standard. How C++ Standard prevents deadlock in spinlock mutex with memory_order_acquire and memory_order_release? - runtime reordering is allowed, but compile-time reordering that nails that down into the only possible ordering would not be following the as-if rule. Of course some practical considerations make this reordering completely impossible in practice, like .join involving a non-inline call.Maremma
