MESI Protocol & std::atomic - Does it ensure all writes are immediately visible to other threads?

Regarding std::atomic, the C++11 standard states that stores to an atomic variable will become visible to loads of that variable within a "reasonable amount of time".

From 29.3p13:

Implementations should make atomic stores visible to atomic loads within a reasonable amount of time.

However, I was curious to know what actually happens on specific CPU architectures based on the MESI cache-coherency protocol (x86, x86-64, ARM, etc.).

If my understanding of the MESI protocol is correct, a core will always immediately read the value previously written (or being written) by another core, possibly by snooping it, because writing a value means issuing an RFO request, which in turn invalidates that cache line in the other cores' caches.

Does this mean that when a thread A stores a value into an std::atomic, another thread B which subsequently does a load on that atomic will in fact always observe the new value written by A on MESI architectures? (Assuming no other threads are doing operations on that atomic.)

By "subsequently" I mean after thread A has issued the atomic store (i.e. the modification order has been updated).
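
To make the scenario concrete, here is a minimal sketch of what I mean (the variable x and the value 42 are just illustrative):

    #include <atomic>
    #include <cstdio>
    #include <thread>

    std::atomic<int> x{0};  // the shared atomic in question

    int main() {
        std::thread a([] { x.store(42); });  // thread A: seq_cst store
        std::thread b([] {
            // Thread B: if this load executes after A's store has entered
            // the modification order, does coherence guarantee it reads 42?
            std::printf("B observed %d\n", x.load());  // prints 0 or 42
        });
        a.join();
        b.join();
    }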

Stormy answered 19/2, 2020 at 2:22 Comment(1)
"specific cpu architectures (...)" would actually be absolutely all and every implementation where C/C++ MT is expected to run; there is no reasonable efficient and predictable implementation strategy for Java/C/C++ threads on multi CPU targets w/o consistent caches.Chagres

I'll answer for what happens on real implementations on real CPUs, because an answer based only on the standard can barely say anything useful about time or "immediacy".

MESI is just an implementation detail that ISO C++ doesn't have anything to say about. The guarantees provided by ISO C++ only involve order, not actual time. ISO C++ is intentionally non-specific to avoid assuming that it will execute on a "normal" CPU. An implementation on a non-coherent machine that required explicit flushes for store visibility might be theoretically possible (although probably horrible for performance of release/acquire and seq_cst operations).

C++ is non-specific enough about timing to even allow an implementation on a single-core cooperative multi-tasking system (no pre-emption), with the compiler inserting voluntary yields occasionally. (Infinite loops without any volatile accesses or I/O are UB). C++ on a system where only one thread can actually be executing at once is totally fine and possible, assuming you consider a scheduler timeslice to still be a "reasonable" amount of time. (Or less if you yield or otherwise block.)

Even the formalism ISO C++ uses to give the ordering guarantees it does is very different from the way hardware ISAs define their memory models. C++'s formal guarantees are purely in terms of happens-before and synchronizes-with, not "re"-ordering litmus tests or anything like that. e.g. How to achieve a StoreLoad barrier in C++11? is impossible to answer in pure ISO C++ formalism. The "option C" in that Q&A serves to show just how weak the C++ guarantees are: a store followed by a load of two different SC variables is not sufficient to imply happens-before between the threads, according to the C++ formalism, even though there has to be a total order of all SC operations. But it is sufficient in real life on systems with coherent cache and only local (within each CPU core) memory reordering, even AArch64, where the SC load right after the SC store does still essentially give us a StoreLoad barrier.
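
For concreteness, the shape of that store-then-load pattern looks like this (a sketch in the spirit of the linked Q&A, not its exact code):

    #include <atomic>

    std::atomic<int> x{0}, y{0};   // two independent SC variables
    int r1, r2;

    void thread1() {
        x.store(1, std::memory_order_seq_cst);   // SC store...
        r1 = y.load(std::memory_order_seq_cst);  // ...then SC load of another var
    }

    void thread2() {
        y.store(1, std::memory_order_seq_cst);
        r2 = x.load(std::memory_order_seq_cst);
    }

    // The single total order of all SC operations forbids r1 == 0 && r2 == 0,
    // but the store/load pair in each thread doesn't by itself establish a
    // happens-before edge with the other thread: the guarantee comes from
    // the SC total order, not from happens-before.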

when a thread A stores a value into an std::atomic

It depends what you mean by "doing" a store.

If you mean committing from the store buffer into L1d cache, then yes, that's the moment when a store becomes globally visible, on a normal machine that uses MESI to give all CPU cores a coherent view of memory.

Although note that on some ISAs, some other threads are allowed to see stores before they become globally visible via cache. (i.e. the hardware memory model may not be "multi-copy atomic", and allow IRIW reordering. POWER is the only example I know of that does this in real life. See Will two atomic writes to different locations in different threads always be seen in the same order by other threads? for details on the HW mechanism: Store forwarding for retired aka graduated stores between SMT threads.)
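
For reference, the IRIW litmus test looks like this in C++. With seq_cst on every operation the "disagreeing" outcome is forbidden; with acquire loads and release stores instead, C++ (and POWER hardware) allows it:

    #include <atomic>

    // IRIW: two independent writers, two readers.
    std::atomic<int> x{0}, y{0};
    int r1, r2, r3, r4;

    void writer1() { x.store(1, std::memory_order_seq_cst); }
    void writer2() { y.store(1, std::memory_order_seq_cst); }

    void reader1() {
        r1 = x.load(std::memory_order_seq_cst);
        r2 = y.load(std::memory_order_seq_cst);
    }
    void reader2() {
        r3 = y.load(std::memory_order_seq_cst);
        r4 = x.load(std::memory_order_seq_cst);
    }

    // IRIW reordering is the outcome r1==1, r2==0, r3==1, r4==0:
    // the two readers disagree about which store happened first.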


If you mean executing locally so later loads in this thread can see it, then no. std::atomic can use a memory_order weaker than seq_cst.

All mainstream ISAs have memory-ordering rules weak enough to allow for a store buffer to decouple instruction execution from commit to cache. This also allows speculative out-of-order execution by giving stores somewhere private to live after execution, before we're sure that they were on the correct path of execution. (Stores can't commit to L1d until after the store instruction retires from the out-of-order part of the back end, and thus is known to be non-speculative.)
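
A sketch of what that decoupling permits with orders weaker than seq_cst (variable names are mine):

    #include <atomic>

    std::atomic<int> a{0}, b{0};

    void thread1() {
        a.store(1, std::memory_order_release);  // may still sit in the store buffer

        // Store-to-load forwarding: this thread sees its own store right away,
        // even before it becomes globally visible.
        int own = a.load(std::memory_order_relaxed);     // == 1

        // StoreLoad reordering: this load may execute before the store to `a`
        // commits to L1d cache.
        int other = b.load(std::memory_order_relaxed);

        (void)own; (void)other;
    }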

If you want to wait for your store to be visible to other threads before doing any later loads, use atomic_thread_fence(memory_order_seq_cst);. (Which on "normal" ISAs with standard choice of C++ -> asm mappings will compile to a full barrier).
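
For example (a sketch; the variable names are placeholders):

    #include <atomic>

    std::atomic<int> data{0}, other{0};

    void producer() {
        data.store(1, std::memory_order_relaxed);

        // With the standard C++ -> asm mappings this compiles to a full
        // barrier (e.g. MFENCE or a locked RMW on x86): later loads can't
        // execute until the store above has drained from the store buffer.
        std::atomic_thread_fence(std::memory_order_seq_cst);

        int seen = other.load(std::memory_order_relaxed);
        (void)seen;
    }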

On most ISAs, a seq_cst store (the default) will also stall all later loads (and stores) in this thread until the store is globally visible. But on AArch64, STLR is a sequential-release store and execution of later loads/stores doesn't have to stall unless / until a LDAR (acquire load) is about to execute while the STLR is still in the store buffer. This implements SC semantics as weakly as possible, assuming AArch64 hardware actually works that way instead of just treating it as a store + full barrier.

But note that only later loads/stores need to be blocked; out-of-order execution of ALU instructions on registers can still continue. And if you were expecting some kind of timing effect due to dependency chains of FP operations, for example, that's not something you can depend on in C++.


Even if you do use seq_cst so nothing happens in this thread before the store is visible to others, that's still not instant. Inter-core latency on real hardware can be on the order of maybe 40ns on mainstream modern Intel x86, for example. (This thread doesn't have to stall that long on a memory barrier instruction; some of that time is the cache miss on the other thread trying to read the line that was invalidated by this core's RFO to get exclusive ownership.) Or of course much cheaper for logical cores that share the L1d cache of a physical core: What are the latency and throughput costs of producer-consumer sharing of a memory location between hyper-siblings versus non-hyper siblings?
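
If you want a feel for that number on your own machine, a crude ping-pong sketch like this gives an order-of-magnitude estimate (not a rigorous benchmark; results vary a lot with core placement and frequency scaling):

    #include <atomic>
    #include <chrono>
    #include <cstdio>
    #include <thread>

    std::atomic<int> ball{0};
    constexpr int kIters = 100000;

    int main() {
        std::thread pong([] {
            for (int i = 0; i < kIters; ++i) {
                while (ball.load(std::memory_order_acquire) != 2 * i + 1) {}
                ball.store(2 * i + 2, std::memory_order_release);
            }
        });
        auto t0 = std::chrono::steady_clock::now();
        for (int i = 0; i < kIters; ++i) {
            ball.store(2 * i + 1, std::memory_order_release);             // ping
            while (ball.load(std::memory_order_acquire) != 2 * i + 2) {}  // wait for pong
        }
        auto t1 = std::chrono::steady_clock::now();
        pong.join();
        auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
        std::printf("round trip: ~%lld ns\n", (long long)(ns / kIters));  // two inter-core hops
    }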

Ineffective answered 19/2, 2020 at 18:52 Comment(11)
Thanks, that explained everything nicely! "If you mean executing locally so later loads in this thread can see it, then no". I believe these are loads of other atomic variables, not the same one as the store? (To preserve single-threaded program behaviour)Stormy
@A.S.: I did mean reloads of the same atomic variable (store -> load forwarding from the store buffer: the thread seeing its own stores before other threads can. Globally Invisible load instructions). But yes, perhaps a better example would have been that loads of other vars right after a store will create StoreLoad reordering (preshing.com/20120710/…); allowed by acq_rel and weaker, but not seq_cst.Ineffective
Thank you! That makes sense, if I got it correctly: single-thread behaviour is preserved by forwarding from the store buffer (same variable), however loads of other variables can be reordered with the store (because it may still be sitting in the store buffer, not committed to L1d cache yet)Stormy
@A.S. yes, exactly. Note that seq_cst atomics forbid that, at least wrt. other SC atomics. But for acq_rel, SC + a store buffer is a useful mental model (and x86's real hardware memory model). And note that mutex lock/unlock operations are formally only acquire / release operations, not SC, in ISO C++. (In practice at least taking a lock is an atomic RMW, and often implies a full barrier).Ineffective
Forgot to mention: ISO C++ acq_rel is weaker than x86's memory model. It allows IRIW reordering (google it): threads don't all have to agree on a total order for independent stores by 2 other threads.Ineffective
Thanks for the helpful info! This question has spawned two other doubts, which I asked as separate questions as they are not strictly related to this one.Stormy
"(Infinite loops without any volatile accesses or I/O are UB)." It means that w/ that model a thread with very long pure computational loops would stall all reactive I/O bound threads. Sad!Chagres
I didn't say that such a cooperative-multitasking C++ implementation would be good, especially not for systems that get interactive use! For that you'd need manual yields in some loops. Or an implementation that's smarter about detecting long-running loops and placing manual scheduling points. But it's worth keeping in mind that a C++ implementation like that is possible, and could I think be standards-compliant. That is a very low bar to clear as far as quality of implementation, of course.Ineffective
"the existence of a total order of all seq_cst operations is not sufficient to imply happens-before based on it" why not? If anything seq_cst provides a stronger guarantee: strongly happens before.Anuska
@ixSci: Not if you mis-use them, like in that linked question's "option C". But you're right, that's not a useful way for this answer to describe things. A better example is AArch64's implementation of seq_cst, only preventing StoreLoad reordering between stlr (seq_cst store) and ldar (seq_cst load), not between other later loads, so an SC store doesn't necessarily involve a full memory barrier, unlike an SC fence. Hmm, I did already mention that in this answer. I'll think about a rewrite of that bit.Ineffective
@ixSci: Ok, expanded the text more so it doesn't sound like I'm saying mo_seq_cst doesn't give ordering in general. I didn't fully re-read my reasoning in my answer on the linked question from 2 years ago (or mpoeter's) about why C++ does allow something that real HW wouldn't, but if you see a problem with it please let me know. Thanks again for the feedback of fresh eyes finding places where an answer says something misleading.Ineffective

From 29.3p13:

Implementations should make atomic stores visible to atomic loads within a reasonable amount of time.

The C and C++ standards are all over the place on threads, and hence not usable as formal specifications. They use the concept of time, and somewhat imply that everything runs step by step, sequentially (if not, you wouldn't have sound program semantics), and then say that some constructs can see effects out of order, without ever specifying which is which.

When effects are seen out of order, "thread time" is ill-defined, because you don't have a chronometer that would itself also be out of order: you wouldn't time a sport with out-of-order execution of actions!

Even "out of order" suggests that some things are purely sequential and some other operations can be "out of order" with respect to the firsts. That is not how std::atomic is defined.

What the standards try to say is that there is a notion of progress for each thread, with a CPU-time or cost index that increases as more work is done, and that things can only be slightly reordered by the implementation: now reordering is well defined, not in terms of other sequential instructions, but in terms of cost/cycles/CPU time.

So if two instructions are close to each other in the sequential intra-thread execution, they will be close in CPU time too. A reasonable compiler shouldn't move a volatile operation, a file output, or an atomic operation past a very costly "pure" computation (one that has no externally visible side effect).
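
A sketch of the kind of situation being described (the names are illustrative, and whether the store may legally be sunk past the call is exactly the quality-of-implementation point above):

    #include <atomic>

    std::atomic<int> done{0};
    double g;

    double expensive_pure(double x) {  // long computation, no visible side effects
        for (long i = 0; i < 100000000; ++i)
            x = x * 1.0000001 + 1e-9;
        return x;
    }

    void worker() {
        done.store(1, std::memory_order_relaxed);
        // The formal model doesn't forbid sinking the relaxed store above
        // past this call; only the non-normative "reasonable amount of time"
        // wording discourages it. A reasonable compiler keeps the store early.
        g = expensive_pure(1.0);
    }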

A basic idea that many committee members sadly couldn't even spell out!

Chagres answered 19/2, 2020 at 2:41 Comment(6)
I understand the above reasoning, however in my understanding atomic operations must ultimately have some order even after instruction reordering (otherwise they wouldn't be atomic). Here we are assuming write A at assembly level happens before load B.Stormy
@A.S. A (reasonable) atomic operation translates (in practice) to an asm instruction that operates on a properly aligned memory unit which is subject to consistency guarantees at the cache level. Consistent caches inherently order modifications of these indivisible caching units.Chagres
"asm instruction that operates on a properly aligned memory unit which is subject to consistency guarantee at the cache level", so is it safe to say that an atomic load operation (asm instruction) which comes after that will observe the new value?Stormy
In other words at C++ level we don't know when that store will actually happen (because it may be reordered with a costly operation as stated in your answer), however when the store does occur at assembly level, it will be immediately visible to loads?Stormy
No. The "immediately" in your question isn't even strictly defined as all modern CPU execute memory access instr out of order for efficiency, w/ diff rules for inter-core consistency. Some will start loads in advance but only keep the result if it isn't invalided by another core at the time of use (which can easily be detected via cache invalidation), some don't care and keep the result no matter what, so you may see a stale value. Only by using the (default) mo_seq_cst "ordering" mode (more a "visibility" thing than an ordering thing BTW) on all operations can you (easily) avoid that.Chagres
To get an idea of what harmful issues are avoided in any area, you can look at what measures are taken to avoid them and try to understand what they are for. That works for theorems (what if I remove that hypothesis... oh, I see, the conclusion fails with that counterexample... and now I understand the mechanism of the theorem better) and it works for atomics: you can look up what special instructions are needed to implement an atomic op on a given CPU with weak consistency. The barrier (or special load or store) instruction gives a hint about the reordering being avoided.Chagres
