Will two atomic writes to different locations in different threads always be seen in the same order by other threads?
Asked Answered
C

4

16

Similar to my previous question, consider this code

-- Initially --
std::atomic<int> x{0};
std::atomic<int> y{0};

-- Thread 1 --
x.store(1, std::memory_order_release);

-- Thread 2 --
y.store(2, std::memory_order_release);

-- Thread 3 --
int r1 = x.load(std::memory_order_acquire);   // x first
int r2 = y.load(std::memory_order_acquire);

-- Thread 4 --
int r3 = y.load(std::memory_order_acquire);   // y first
int r4 = x.load(std::memory_order_acquire);

Is the weird outcome r1==1, r2==0 and r3==2, r4==0 possible in this case under the C++11 memory model? What if I were to replace all std::memory_order_acq_rel by std::memory_order_relaxed?

On x86 such an outcome seems to be forbidden, see this SO question but I am asking about the C++11 memory-model in general.

Bonus question:

We all agree, that with std::memory_order_seq_cst the weird outcome would not be allowed in C++11. Now, Herb Sutter said in his famous atomic<>-weapons talk @ 42:30 that std::memory_order_seq_cst is just like std::memory_order_acq_rel but std::memory_order_acquire-loads may not move before std::memory_order_release-writes. I cannot see how this additional constraint in the above example would prevent the weird outcome. Can anyone explain?

Cowfish answered 6/1, 2015 at 21:1 Comment(14)
Changing all std::memory_order_acq_rel won't make any difference if you don't have any std::memory_order_acq_rel in your code. Did you leave something relevant out of your question?Hardener
@hvd I mean std::memory_order_acq_rel to represent both the std::memory_order_acquire's and the std::memory_order_release's. Maybe I shall change this...Cowfish
The outcome is certainly allowed according to the C++ memory model. There's no ordering between threads 1 and 2. You can imagine the memory changes propagating differently fast to different cores. Synchronisation is only about what happens if you read the new value. There's no guarantee that you will read the new value.Expression
@KerrekSB I had the same thought. Note though that on x86 it is not allowed, and Intel is mighty. :)Cowfish
@TobiasBrüll Surely that depends on what assembly winds up getting generated, which is certainly not guaranteed by any standard.Cyler
@DavidSchwartz I meant: if all stores and loads generate simple MOV instructions which virtually all compilers should end up doing, cf. the linked SO question.Cowfish
@TobiasBrüll Right, you have to make assumptions, it's not guaranteed.Cyler
I've swapped the read order around in thread 4, since your original question didn't make much sense: both threads were reading the x and y in the same order so they couldn't detect writes occurring in the opposite order: you need to swap the read order to do that. As the accepted answer points out, there is trivially a seq cst order that allows the values you put with the original form of the question.Skin
@Skin Shouldn't you then also edit the answer accordingly? Otherwise it may be a bit confusing.Cowfish
@TobiasBrüll - yes, I probably should have, since now the first part of the answer isn't relevant any more, but Peter went ahead and did it. Do you confirm that the updated question is what you were originally getting at?Skin
@Skin Yeah, absolutely! It's probably better now. (thumbsup)Cowfish
Maybe a duplicate of Acquire/release semantics with 4 threadsBreedlove
@TobyBrull: As I understand it, all seq_cst accesses to all variables form a single total order for any given run of the application (ref: thecodingforums.com/threads/…), is that why you mention "We all agree, that with std::memory_order_seq_cst the weird outcome would not be allowed in C++11." in your bonus question?Jamboree
@Jamboree Yes, that was the reason.Cowfish
P
8

The updated1 code in the question (with loads of x and y swapped in Thread 4) does actually test that all threads agree on a global store order.

Under the C++11 memory model, the outcome r1==1, r2==0, r3==2, r4==0 is allowed and in fact observable on POWER.

On x86 this outcome is not possible, because there "stores are seen in a consistent order by other processors". This outcome is also not allowed in a sequential consistent execution.


Footnote 1: The question originally had both readers read x then y. A sequentially consistent execution of that is:

-- Initially --
std::atomic<int> x{0};
std::atomic<int> y{0};

-- Thread 4 --
int r3 = x.load(std::memory_order_acquire);

-- Thread 1 --
x.store(1, std::memory_order_release);

-- Thread 3 --
int r1 = x.load(std::memory_order_acquire);
int r2 = y.load(std::memory_order_acquire);

-- Thread 2 --
y.store(2, std::memory_order_release);

-- Thread 4 --
int r4 = y.load(std::memory_order_acquire);

This results in r1==1, r2==0, r3==0, r4==2. Hence, this is not a weird outcome at all.

To be able to say that each reader saw a different store order, we need them to read in opposite orders to rule out the last store simply being delayed.

Para answered 8/1, 2015 at 18:19 Comment(3)
Wow. That was super helpful for me. Many thanks. Because, I can now conclude that the additional constraint mentioned in the Bonus question is by itself indeed not sufficient to enforce sequential consistency. In the words of @yohjp: "[it is] one aspect of constraints of sequential consistency".Cowfish
What about changing it to std::memory_order_seq_cst? Would it still be allowed?Jamboree
@Jamboree No, it could not happen with std::memory_order_seq_cst. The answer says so as well.Cowfish
B
17

This kind of reordering test is called IRIW (Independent Readers, Independent Writers), where we're checking if two readers can see the same pair of stores appear in different orders. Related, maybe a duplicate: Acquire/release semantics with 4 threads


The very weak C++11 memory model does not require that all threads agree on a global order for stores, as @MWid's answer says.

This answer will explain one possible hardware mechanism that can lead to threads disagreeing about the global order of stores, which may be relevant when setting up tests for lockless code. And just because it's interesting if you like cpu-architecture1.

See A Tutorial Introduction to the ARM and POWER Relaxed Memory Models for an abstract model of what those ISAs: Neither ARM nor POWER guarantee of a consistent global store order seen by all threads. Actually observing this is possible in practice on POWER chips, and maybe possible in theory on ARM but maybe not on any actual implementations.

(Other weakly-ordered ISAs like Alpha also allow this reordering, I think. ARM used to allow it on-paper, but probably no real implementations did this reordering. ARMv8 even strengthened their on-paper model to disallow this even for future hardware.)

In computer science, the term for a machine where stores become visible to all other threads at the same time (and thus there is a single global order of stores) is "multiple-copy atomic" or "multi-copy atomic". x86 and SPARC's TSO memory models have that property, but ARM and POWER don't require it.


Current SMP machines use MESI to maintain a single coherent cache domain so that all cores have the same view of memory. Stores become globally visible when they commit from the store buffer into L1d cache. At that point a load from any other core will see that store. There is a single order of all stores committing to cache, because MESI maintains a single coherency domain. With sufficient barriers to stop local reordering, sequential consistency can be recovered.

A store can become visible to some but not all other cores before it becomes globally visible.

POWER CPUs use Simultaneous MultiThreading (SMT) (the generic term for hyperthreading) to run multiple logical cores on one physical core. The memory-ordering rules we care about are for logical cores that threads run on, not physical cores.

We normally think of loads as taking their value from L1d, but that's not the case when reloading a recent store from the same core and data is forwarded directly from the store buffer. (Store-to-load forwarding, or SLF). It's even possible for a load to get a value that was never present in L1d and never will be, even on strongly-ordered x86, with partial SLF. (See my answer on Globally Invisible load instructions).

The store buffer tracks speculative stores before the store instruction has retired, but also buffers non-speculative stores after they retire from the out-of-order-execution part of the core (the ROB / ReOrder Buffer).

The logical cores on the same physical core share a store buffer. Speculative (not-yet-retired) stores must stay private to each logical core. (Otherwise that would couple their speculation together and require both to roll-back if a mis-speculation were detected. That would defeat part of the purpose of SMT, of keeping the core busy while one thread is stalled or recovering from a branch mispredict).

But we can let other logical cores snoop the store buffer for non-speculative stores that will definitely commit to L1d cache eventually. Until they do, threads on other physical cores can't see them, but logical cores sharing the same physical core can.

(I'm not sure this is exactly the HW mechanism that allows this weirdness on POWER, but it's plausible).

This mechanism makes stores visible to SMT sibling cores before they're globally visible to all cores. But it's still local within the core, so this reordering can be cheaply avoided with barriers that just affect the store buffer, without actually forcing any cache interactions between cores.

(The abstract memory model proposed in the ARM/POWER paper models this as each core having its own cached view of memory, with links between caches that let them sync. But in typical physical modern hardware, I think the only mechanism is between SMT siblings, not between separate cores.)


Note that x86 can't allow other logical cores to snoop the store buffer at all because that would violate x86's TSO memory model (by allowing this weird reordering). As my answer on What will be used for data exchange between threads are executing on one Core with HT? explains, Intel CPUs with SMT (which Intel calls Hyperthreading) statically partition the store buffer between logical cores.


Footnote 1: An abstract model for C++, or for asm on a particular ISA, is all you really need to know to reason about memory ordering.

Understanding the hardware details isn't necessary (and can lead you into a trap of thinking something's impossible just because you can't imagine a mechanism for it).

Breedlove answered 4/6, 2018 at 11:11 Comment(5)
ARM has decided to go mulitcopy atomic in ARMv8, and presumably also "in practice" pre v8 architectures since I don't think non-multicopy atomic behavior ever occurred. See Simplifying ARM Concurrency: Multicopy-AtomicAxiomatic and Operational Models for ARMv8. I don't know if it's official yet but it seems like it is going to happen.Skin
Great answer! I've been curious about how ARM doesn't have a global store order, given that it has a coherent cache. Now this answer gave a reasonable explanation.Brodeur
@zanmato: Yeah, sometimes ISAs leave guarantees weaker on paper than real hardware to leave room for future designs to do interesting things. (Don't forget to upvote, if you haven't already used up your daily vote limit. That way you can let future readers know there's something worth reading here if they're sorting by votes in a search across questions.)Breedlove
I gave my upvote several days ago when I first hit this answer. And this answer is surely worths it!Brodeur
Fascinating in-depth answer! I have a related but dissimilar question regarding hardware details in cache coherence protocol after reading Paul McKenny's "Memory Barriers: a Hardware View for Software Hackers". @Peter-cordes could you take a look as well? stackoverflow.com/questions/74698284/…Pernell
P
8

The updated1 code in the question (with loads of x and y swapped in Thread 4) does actually test that all threads agree on a global store order.

Under the C++11 memory model, the outcome r1==1, r2==0, r3==2, r4==0 is allowed and in fact observable on POWER.

On x86 this outcome is not possible, because there "stores are seen in a consistent order by other processors". This outcome is also not allowed in a sequential consistent execution.


Footnote 1: The question originally had both readers read x then y. A sequentially consistent execution of that is:

-- Initially --
std::atomic<int> x{0};
std::atomic<int> y{0};

-- Thread 4 --
int r3 = x.load(std::memory_order_acquire);

-- Thread 1 --
x.store(1, std::memory_order_release);

-- Thread 3 --
int r1 = x.load(std::memory_order_acquire);
int r2 = y.load(std::memory_order_acquire);

-- Thread 2 --
y.store(2, std::memory_order_release);

-- Thread 4 --
int r4 = y.load(std::memory_order_acquire);

This results in r1==1, r2==0, r3==0, r4==2. Hence, this is not a weird outcome at all.

To be able to say that each reader saw a different store order, we need them to read in opposite orders to rule out the last store simply being delayed.

Para answered 8/1, 2015 at 18:19 Comment(3)
Wow. That was super helpful for me. Many thanks. Because, I can now conclude that the additional constraint mentioned in the Bonus question is by itself indeed not sufficient to enforce sequential consistency. In the words of @yohjp: "[it is] one aspect of constraints of sequential consistency".Cowfish
What about changing it to std::memory_order_seq_cst? Would it still be allowed?Jamboree
@Jamboree No, it could not happen with std::memory_order_seq_cst. The answer says so as well.Cowfish
C
5

The short answer is no. The standard doesn't say they must be, and therefore they don't have to be. It doesn't matter whether you can or can't imagine a specific way for this to happen.

Cyler answered 6/1, 2015 at 21:28 Comment(0)
R
1

Is the weird outcome r1==1, r2==0 and r3==0, r4==2 possible in this case under the C++11 memory model?

Yes. C++ memory model allows such weird outcome.

What if I were to replace all std::memory_order_acq_rel by std::memory_order_relaxed?

If you replace all memory_order_acquire and memory_order_release by memory_order_relaxed, nothing changed for your code.

std::memory_order_seq_cst is just like std::memory_order_acq_rel but std::memory_order_acquire-loads may not move before std::memory_order_release-writes. I cannot see how this additional constraint in the above example would prevent the weird outcome.

"acquire-loads may not move before release-writes." shows one aspect of constraints of sequential consistency (memory_order_seq_cst).

In C++ memory model, it only guarantees that seq_cst has acq_rel semantics and all seq_cst atomic access has some "total order" no more and no less. When such "total order" is exist, we can't get weird outcome because all seq_cst atomic access are executed as if in any interleaved order on single thread.

Your previous question treats "coherency" of single atomic variable, and this question asks "consistency" of all atomic variables. C++ memory model guarantees intuitive coherency for single atomic variable even weakest ordering (relaxed), and "sequential consistency" for different atomic variables as long as default ordering (seq_cst). When you use explicitly non-seq_cst ordering atomic access, it's may be weird outcome as you pointed out.

Rewire answered 7/1, 2015 at 6:31 Comment(2)
Thanks for the clarification. But I am a bit confused by your statement "no more and no less". seq_cst-load-and-stores still have all the guarantees of acq_rel-load-and-stores, right?Cowfish
You quoted this: "When such "total order" is exist, we can't get weird outcome because all seq_cst atomic access are executed as if in any interleaved order on single thread", but then why do you say "If you replace all memory_order_acquire and memory_order_release by memory_order_relaxed, nothing changed for your code."? If there is some total order, then the weird outcome should not happen?Jamboree

© 2022 - 2024 — McMap. All rights reserved.