Is memory ordering in C++11 about main memory flush ordering?
I'm not sure I fully understand (and I may have it all wrong) the concepts of atomicity and memory ordering in C++11. Let's take this simple single-threaded example:

#include <atomic>

int main()
{
    std::atomic<int> a(0);
    std::atomic<int> b(0);
    a.store(16);
    b.store(10);

    return 0;
}

In this single-threaded code, if a and b were not atomic types, the compiler could have reordered the instructions so that, in the assembly, the move instruction assigning 10 to 'b' comes before the move instruction assigning 16 to 'a'. So my understanding is that, these being atomic variables, I'm guaranteed to get the "a move instruction" before the "b move instruction", as written in my source code. After that, there is the processor, with its execution units, instruction prefetching, and out-of-order engine. And this processor can process the "b instruction" before the "a instruction", whatever the instruction order in the assembly code. So 10 could end up in a register, in the processor's store buffer, or in cache memory before 16 does.

And in my understanding, this is where the memory ordering model comes in. From that point on, if I keep the default model, sequentially consistent, I'm guaranteed that flushing these values (10 and 16) to main memory will respect the order of the stores in my source code: the processor will first flush the register or cache line holding 16 to main memory to update 'a', and only after that flush 10 to main memory for 'b'.

That understanding would also mean that if I use the relaxed memory model, only this last part is no longer guaranteed, so the flushes to main memory can happen in any order.

Sorry if this is hard to read, my English is still poor. Thank you guys for your time.

Jilolo answered 16/4, 2015 at 8:48 Comment(1)
In your code the variables are provably not visible to any other thread (or even any other function), so they can be compiled exactly like non-atomic variables.Tiros
The C++ memory model is about the abstract machine and value visibility, not about concrete things like "main memory", "write queues" or "flushing".

In your example, the memory model states that since the write to a happens-before the write to b, any thread that reads the 10 from b must, on subsequent reads from a, see 16 (unless this has since been overwritten, of course).

The important thing here is establishing happens-before relationships and value visibility. How this maps to caches and memory is up to the compiler. In my opinion, it's better to stay on that abstract level instead of trying to map the model to your understanding of the hardware, because

  • Your understanding of the hardware might be wrong. Hardware is even more complicated than the C++ memory model.
  • Even if your understanding is correct now, a later version of the hardware might have a different model, at least in subsystems.
  • By mapping to a hardware model, you might then make wrong assumptions about the implications for a different hardware model. E.g. if you understand how the memory model maps to x86 hardware, you will not understand the subtle difference between consume and acquire on PowerPC.
  • The C++ model is very well suited for reasoning about correctness.
Yasukoyataghan answered 16/4, 2015 at 9:39 Comment(4)
Thanks for answering. You're right, I should not think from a hardware point of view. And about atomicity, am I right to think that atomicity prevents the compiler from reordering instructions? Is it as if it puts up some barriers (fences)?Jilolo
Reordering is prevented to some extent. Some reorderings are fine, just as it is valid to move a load from before a mutex lock to inside the critical section.Yasukoyataghan
You raise good points, but it's always good to know what your HW is doing, even for the sake of optimization. As long as you let the compiler translate the memory ordering model to the HW you should be safe.Nemato
The wording of standardese is very hard to parse if you don't already understand the kinds of things that it's trying to allow and forbid. Understanding acq/rel in terms of one-way barriers is so much easier than what the standard says about establishing synchronizes-with relationships. That makes sense once you grok it, but IMO a HW-centric understanding is a useful building block for reading the standard. Jeff Preshing's articles are a good mix of abstract C++-level with HW-centric thinking, IMO.Daph

You didn't specify which architecture you work with, but basically each has its own memory ordering model (sometimes more than one that you can choose from), and that serves as a "contract". The compiler should be aware of that and use lightweight or heavyweight instructions accordingly, to guarantee what it needs in order to provide the memory model of the language.

The HW implementation under the hood can be quite complicated, but in a nutshell - you don't need to flush in order to get global visibility. Modern cache systems provide snooping capabilities, so that a value can be globally visible and globally ordered while still residing in some private core cache (and having stale copies in lower cache levels), the MESI protocols control how this is handled correctly.

The life cycle of a write begins in the out-of-order engine, where it is still speculative (i.e. it can be squashed due to an older branch misprediction or fault). Naturally, during that time the write cannot be seen from the outside, so out-of-order execution here is not relevant. Once it commits, if the system guarantees store ordering (like x86), it still has to wait in line for its turn to become visible, so it is buffered. Other cores can't see it since its observation time hasn't arrived yet (although local loads in that core might see it in some implementations of x86; that's one of the differences between TSO and true sequential consistency). Once the older stores are done, the store may become globally visible. It doesn't have to go anywhere outside the core for that; it can remain cached internally. In fact, some CPUs may even make it observable while still in the store buffer, or write it to the cache speculatively; the actual decision point is when to make it respond to external snoops, the rest is implementation detail. Architectures with more relaxed ordering may change the order unless explicitly blocked by a fence/barrier.

Based on that, your code snippet cannot have its stores reordered on x86, since stores don't reorder with each other there, but it may on ARM, for example. If the language requires strong ordering in that case, the compiler has to decide whether it can rely on the HW or must add a fence. Either way, anyone reading these values from another thread (or socket) will have to snoop for them, and can only see the writes that respond.

Nemato answered 17/5, 2015 at 23:58 Comment(1)
Fun fact: some ISAs in theory, and POWER in practice, allow a store to become visible to some other threads before committing to coherent L1d cache and becoming globally visible. The mechanism is store-forwarding of retired (i.e. non-speculative) stores between logical cores on the same physical SMT core. See: Will two atomic writes to different locations in different threads always be seen in the same order by other threads?. ISO C++ allows this IRIW reordering/inconsistency for mo_relaxed but not mo_seq_cst.Daph
