How to understand RELAXED ORDERING in std::memory_order (C++)

I have read a lot of posts and watched several YouTube videos on C++ atomics and the memory model (CppCon 2017, 2014).

When I read section 5.3.3, Relaxed ordering, of the book C++ Concurrency in Action, I still could not understand the example provided by the author under his assumptions.

Assumptions by the author

It’s not just that the compiler can reorder the instructions. Even if the threads are running the same bit of code, they can disagree on the order of events because of operations in other threads in the absence of explicit ordering constraints, because the different CPU caches and internal buffers can hold different values for the same memory. It’s so important I’ll say it again: threads don’t have to agree on the order of events. Not only do you have to throw out mental models based on interleaving operations, you also have to throw out mental models based on the idea of the compiler or processor reordering the instructions.

Suppose that the code we see is not reordered.

The example code:

#include <atomic>
#include <thread>
#include <assert.h>

std::atomic<bool> x,y;
std::atomic<int> z;

void write_x_then_y()
{
    x.store(true,std::memory_order_relaxed); // 1
    y.store(true,std::memory_order_relaxed); // 2
}

void read_y_then_x()
{
    while(!y.load(std::memory_order_relaxed)); // 3
    if(x.load(std::memory_order_relaxed))      // 4
        ++z;
}

int main() {
    x=false;
    y=false;
    z=0;

    std::thread a(write_x_then_y);
    std::thread b(read_y_then_x);
    a.join();
    b.join();

    assert(z.load()!=0); // 5
}

from this link: https://www.developerfusion.com/article/138018/memory-ordering-for-atomic-operations-in-c0x/

[Figure from the linked article: the two threads' relaxed operations on x and y, with the load of y returning true while the load of x returns false.]

Why does x.load(relaxed) return false while y.load(relaxed) returns true?

The conclusion by the author

This time the assert (5) can fire, because the load of x (4) can read false, even though the load of y (3) reads true and the store of x (1) happens-before the store of y (2). x and y are different variables, so there are no ordering guarantees relating to the visibility of values arising from operations on each.

Q. Why can the load of x be false?

The author concludes that the assert can fire, so z can be 0, which means that x.load(std::memory_order_relaxed) in if(x.load(std::memory_order_relaxed)) evaluated to false.

But the loop while(!y.load(std::memory_order_relaxed)); only exits once the load of y returns true.

If the code sequence of (1) and (2) is not reordered, how is it possible that y is true but x has still not been stored?

How to understand the figure provided by the author?

The store of x (1) happens-before the store of y (2); if x.store(relaxed) happens-before y.store(relaxed), x should be true by now. So why is x still false even though y is true?

Kaule answered 14/4, 2019 at 22:39 Comment(6)
You don't appear to have an actual question here. If you do, make it into the question title or at least the first paragraph. Because to me it seems as if you explained relaxed ordering by copying someone's book.Summer
A store can be asynchronous, for example. The CPU can begin the store to x and, without waiting for it to finish, begin the store to y. The order in which these two stores finish is undefined, so the store to y can finish first and become visible to another CPU while the store to x has not yet finished.Basaltware
So relaxed does not guarantee any order in which the results of operations become visible to other CPUs; it only guarantees the atomicity of each operation (see the sketch after this comment thread).Basaltware
@Basaltware I think the book is wrong. The order is changed in thread a. It becomes y.store(...), and then x.store(...) so that thread b can derive that conclusion. According to "there are no synchronization or ordering constraints imposed on other reads or writes, only this operation's atomicity is guaranteed" and the reasoning of the example provided here: en.cppreference.com/w/cpp/atomic/memory_order#Relaxed_orderingKaule
@Kaule - no, the stores do not need to be reordered. The CPU begins the store to x, then begins the store to y ("begin" can mean writing to the CPU cache), but the stores can end (be written to shared memory) in any order. y can end before x.Basaltware
@Basaltware "begin" means resolve the address, check permissions, and send it to the cache?Clinker
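
To make the atomicity-versus-ordering distinction from the comments above concrete, here is a minimal sketch (an editorial addition, not code from any commenter): several threads increment one counter with memory_order_relaxed. The final count is always exact, because each fetch_add is atomic, even though relaxed says nothing about how operations on other variables become visible across threads. The thread and iteration counts are arbitrary.

#include <atomic>
#include <thread>
#include <vector>
#include <cassert>

int main()
{
    std::atomic<int> counter{0};
    std::vector<std::thread> workers;

    // Four threads, each performing 10000 relaxed increments.
    // Relaxed ordering still guarantees that every fetch_add is atomic
    // (no lost updates); it only drops cross-variable visibility ordering.
    for (int i = 0; i < 4; ++i) {
        workers.emplace_back([&counter] {
            for (int j = 0; j < 10000; ++j)
                counter.fetch_add(1, std::memory_order_relaxed);
        });
    }
    for (auto& t : workers)
        t.join();

    assert(counter.load() == 4 * 10000); // always holds: atomicity is guaranteed
}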

You and a friend both agree that x=false and y=false. One day, you send him a letter telling him that x=true. The next day you send him a letter telling him that y=true. You definitely send him the letters in the right order.

Sometime later, your friend receives a letter from you saying that y=true. Now what does your friend know about x? He probably already received the letter that told him x=true. But maybe the postal system temporarily lost it and he'll receive it tomorrow. So for him, x=false and x=true are both valid possibilities when he receives the y=true letter.

So, back to the silicon world. Memory shared between threads comes with no guarantee at all that writes from other threads turn up in any particular order, so the 'delayed x' is entirely possible. All that making the variables atomic and using relaxed does is stop two threads racing on a single variable from being undefined behaviour. It makes no guarantees at all about ordering. That's what the stronger orderings are for.
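
As a sketch of what "the stronger orderings" buy you (my variant of the question's example, not code quoted from the answer or the book): if the store to y uses release and the load of y uses acquire, then once the reader has seen y == true it is also guaranteed to see the earlier store to x, so the assert can no longer fire.

#include <atomic>
#include <thread>
#include <assert.h>

std::atomic<bool> x,y;
std::atomic<int> z;

void write_x_then_y()
{
    x.store(true,std::memory_order_relaxed); // 1
    y.store(true,std::memory_order_release); // 2: release "publishes" the store to x
}

void read_y_then_x()
{
    while(!y.load(std::memory_order_acquire)); // 3: acquire pairs with the release on y
    if(x.load(std::memory_order_relaxed))      // 4: now guaranteed to read true
        ++z;
}

int main() {
    x=false;
    y=false;
    z=0;

    std::thread a(write_x_then_y);
    std::thread b(read_y_then_x);
    a.join();
    b.join();

    assert(z.load()!=0); // 5: cannot fire any more
}

The release store to y and the acquire load of y form a synchronizes-with relationship between the two threads, which is exactly the cross-variable ordering guarantee the all-relaxed version lacks.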

Or, in a slightly more crude way, behold my MSPaint skills:

[Hand-drawn figure: the two threads, with the value of x travelling from the first thread to the second (purple arrow) arriving late, and the value of y (green arrow) arriving quickly.]

In this case, the purple arrow which is the flow of 'x' from the first thread to the second thread comes too late, whereas the green arrow (y crossing over) happens fast.

Menides answered 14/4, 2019 at 23:7 Comment(8)
Do you think the possible hardware design is like this: drive.google.com/file/d/1FZoCFS_XYAuSLAlYZ7vWDPOVB6eUHi03/… (behold my Paint skills too..) In this figure, the x.load is delayed because of the read buffer.Kaule
If you are going into why this happens in hardware, there are plenty of reasons. The main one is that almost none of the reads/writes happening on different threads have any reason at all to have any kind of ordering (I could have Photoshop running on one thread and Excel on another, say, and therefore there'd be no point at all in synchronisation). Because of this, anything at all that would make the CPU slower by handling ordering by default is a bad thing. So we're in a world where silicon should be as fast as possible, even at the cost of breaking 'expected' ordering.Menides
However, the one place I saw this happen is in this case: generally CPUs have cache lines of data, and whole lines are flushed (or fetched) from RAM at once. If you've already got a 'dirty' line which is at the top of the flush queue and then you write some other data to another part of that line, the new data goes to the 'top' of the flush queue for free, as that line was already going to be written. Hence it might overtake something else in the 'middle' of the queue.Menides
As for your diagram, I have no idea as I'm not a CPU engineer sorry :-)Menides
Would recommend preshing.com/archives for his blog on memory ordering - there's 5-6 articles there on ordering which he's written over a few years. May take time to track them down but well worth it. Search for 'order' on that page and read from oldest first.Menides
I think the book is wrong. The order is changed in thread a. It becomes y.store(...), and then x.store(...) so that thread b can derive that conclusion. According to "there are no synchronization or ordering constraints imposed on other reads or writes, only this operation's atomicity is guaranteed" and the reasoning of the example provided here: en.cppreference.com/w/cpp/atomic/memory_order#Relaxed_orderingKaule
@Kaule There is no global order on modification of different variables.Clinker
@MikeVine "(I could have photoshop running on one thread and excel on another say" These two would be two independent processes (with probably many threads in each), not two threads of one process. Of course that doesn't change the conclusion that Photoshop threads and Excel threads don't need synchronization.Clinker

If the code sequence of (1) and (2) is not reordered, how is it possible that y is true but x has still not been stored?

The answer is partially in the first quote:

because the different CPU caches and internal buffers can hold different values for the same memory.

In other words, other threads can see stale values. So x and y may both have been stored, but the stores have not yet propagated to the other threads. And the cache propagation order may differ from the store order.

For example, with three threads, the first one modifies x and y, and the cache updates propagate to the different threads in different orders (a C++ sketch of this scenario follows the diagram):

 x == 0          x == 0          x == 0 
 y == 0          y == 0          y == 0
+-------+       +-------+
| x = 1 | ----> | x = 1 |
+-------+       +-------+
+-------+                      +-------+
| y = 1 | -------------------> | y = 1 |
+-------+                      +-------+
  x == 1         x == 1          x == 0
  y == 1         y == 0          y == 1
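
A C++ rendering of that diagram, as a sketch only (an editorial addition, not the answerer's code): one writer performs two relaxed stores, and two readers each take a relaxed snapshot of both variables. Under relaxed ordering the readers are allowed to disagree, one seeing x == 1, y == 0 and the other x == 0, y == 1; whether any particular run actually shows this depends on the hardware.

#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> x{0}, y{0};

// The writer stores x first, then y, both relaxed.
void writer()
{
    x.store(1, std::memory_order_relaxed);
    y.store(1, std::memory_order_relaxed);
}

// Each reader takes one relaxed snapshot of both variables and prints it.
void reader(const char* name)
{
    int rx = x.load(std::memory_order_relaxed);
    int ry = y.load(std::memory_order_relaxed);
    std::printf("%s saw x==%d y==%d\n", name, rx, ry);
}

int main()
{
    std::thread t1(writer);
    std::thread t2(reader, "thread 2");
    std::thread t3(reader, "thread 3");
    t1.join();
    t2.join();
    t3.join();
    // An outcome the C++ memory model allows (though not guaranteed to appear
    // on any given run): thread 2 prints x==1 y==0 while thread 3 prints
    // x==0 y==1, matching the diagram above.
}
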
Authorized answered 23/1, 2021 at 10:14 Comment(0)
