Is it possible that a store with memory_order_relaxed never reaches other threads?
Suppose I have a thread A that writes to an atomic_int x = 0; using x.store(1, std::memory_order_relaxed);. Without any other synchronization, how long would it take before other threads can see this via x.load(std::memory_order_relaxed);? Is it possible that the value written to x stays entirely thread-local, given the current definition of the C/C++ memory model in the standard?

The practical case that I have at hand is where a thread B frequently reads an atomic_bool to check whether it has to quit; another thread, at some point, writes true to this bool and then calls join() on thread B. Clearly I do not mind calling join() before thread B can even see that the atomic_bool was set, nor do I mind if thread B already saw the change and exited before I call join(). But I am wondering: using memory_order_relaxed on both sides, is it possible to call join() and block "forever" because the change is never propagated to thread B?

Edit

I contacted Mark Batty (the brain behind mathematically verifying, and subsequently fixing, the C++ memory model requirements). Originally it was about something else (which turned out to be a known bug in cppmem and his thesis, so fortunately I didn't make a complete fool of myself), and I took the opportunity to ask him about this too. His answer was:

Q: Can it theoretically be that such a store [memory_order_relaxed without (any following) release operation] never reaches the other thread?
Mark: Theoretically, yes, but I don't think that has been observed.
Q: In other words, do relaxed stores make no sense whatsoever unless you combine them with some release operation (and acquire on the other thread), assuming you want another thread to see it?
Mark: Nearly all of the use cases for them do use release and acquire, yes.

Schwenk answered 3/5, 2017 at 2:15 Comment(5)
The Edit is more like an answer; but since it isn't my answer I decided to add it as an edit rather than as an answer. I hope some might find the opinion of this expert useful.Schwenk
Is the question specifically about C++11?Stridulate
It is about the C++ memory model that was introduced in C++11. In practice any write to memory is going to be visible to all other threads within a few microseconds, and probably much faster, even if you don't include assembly instructions that flush the cache to memory. Most notably, on Intel there isn't a difference at all between a relaxed store and a release store (with regard to assembly and hardware; compiler reordering not included in this remark).Schwenk
Which implementations generate instructions that "flush the cache to memory"? In which cases?Stridulate
Nothing, as far as I know. It wouldn't make sense. All you can do is add a memory fence (or any other memory_order_release operation), which would at least ensure that everything is flushed to memory before any subsequent writes to memory are.Schwenk
This is what the standard says in 29.3.12:

Implementations should make atomic stores visible to atomic loads within a reasonable amount of time.

There is no guarantee a store will become visible in another thread, there is no guaranteed timing and there is no formal relationship with memory order.

Of course, on every mainstream architecture a store will become visible, but on rare platforms that do not support cache coherency, it may never become visible to a load.
In that case, you would have to reach for an atomic read-modify-write operation to get the latest value in the modification order.

Atp answered 3/5, 2017 at 5:9 Comment(7)
Are you sure this is said about std::memory_order_relaxed (too)? I can imagine that this remark is necessary even for just release/acquire stores/loads, because in that case we know about ordering, but still nothing is said about timing; i.e., put two single-core PCs next to each other and they would obey the standard if it wasn't for this remark ;).Schwenk
@CarloWood Absolutely. It's a common misconception that memory ordering is related to visibility of the atomic variable itself; it is not (what would be the use of relaxed atomics if they never became visible to other cores?). Acquire/release semantics specify ordering (and thus visibility) of other memory operations with respect to an atomic operation. If an atomic variable does not become visible in another thread, neither do the memory operations it orders.Atp
"but on rare platforms" Could you give examples?Stridulate
@Stridulate I cannot give you an example, but cache-coherency is an optional feature. You might find a non-cache-coherent architecture in the embedded world.Atp
Doesn't this rule guarantee a store will become visible in another thread, since a reasonable amount of time certainly excludes infinite time?Flambeau
If std::thread starts your threads across non-cache-coherent cores, and std::atomic<T> doesn't manually flush that line for relaxed, or everything for release, then your C++ implementation is almost certainly broken. Remember that for each atomic variable separately, a single modification order (that all threads can agree on) must exist, even with mo_relaxed stores. (This doesn't include observers that see their own store-forwarding early, though). I don't think letting an atomic stay thread-local could be considered valid, at least not by the spirit of the standard.Latinist
And yes there are embedded CPUs with both a microcontroller and DSP on chip that aren't cache-coherent with each other, but std::thread won't start threads across cores on both. See When to use volatile with multi threading? - e.g. "This architecture (ARMv7) is written with an expectation that all processors using the same operating system or hypervisor are in the same Inner Shareable shareability domain". If you take pointers to shared non-coherent memory and cast them to std::atomic<int>* without also using manual flushing, UB is your own fault.Latinist
This is all the standard has to say on the matter, I believe:

[intro.multithread]/25 An implementation should ensure that the last value (in modification order) assigned by an atomic or synchronization operation will become visible to all other threads in a finite period of time.

Anomaly answered 3/5, 2017 at 4:51 Comment(2)
And in practice hardware that std::thread starts threads on has coherent caches, not requiring software flushing, so visibility time = time for the store buffer to commit your store. When that happens, other cores will see a MESI invalidate/RFO from the storing thread, then have to do a share request themselves to get a copy of the new value. See When to use volatile with multi threading? for more details about the fact that ISO C++ is written to run on cache-coherent hardware, and running without that is barely plausible.Latinist
My answer on Why set the stop flag using `memory_order_seq_cst`, if you check it with `memory_order_relaxed`? also quotes 33.5.4 Order and consistency [atomics.order] - 11. Implementations should make atomic stores visible to atomic loads within a reasonable amount of time. So that's two should requirements, one with "finite period" and one with "reasonable amount of time". The standard leaves it as basically a quality-of-implementation factor; real hardware is what gives us low latency.Latinist
In practice

Without any other synchronization methods, how long would it take before other threads can see this, using x.load(std::memory_order_relaxed);?

Practically no time. It's a normal write: it goes to the store buffer, so it will be available in the L1d cache in less time than a blink. But that only holds once the assembly instruction is actually run.

Instructions can be reordered by the compiler, but no reasonable compiler would reorder an atomic operation across an arbitrarily long loop.

In theory

Q: Can it theoretically be that such a store [memory_order_relaxed without (any following) release operation] never reaches the other thread?

Mark: Theoretically, yes,

You should have asked him what would happen if the "following release operation" was added back, or if the store itself was a release store.

Why wouldn't those be reordered and delayed a very long time too? (So long that it seems like an eternity in practice.)

Is it possible that the value written to x stays entirely thread-local given the current definition of the C/C++ memory model that the standard gives?

If an imaginary and especially perverse implementation wanted to delay the visibility of an atomic operation, why would it do that only for relaxed operations? It could just as well do it for all atomic operations.

Or never run some threads.

Or run some threads so slowly that you would believe they aren't running.

Stridulate answered 13/12, 2019 at 5:29 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.