StoreStore reordering happens when compiling C++ for x86

while(true) {
    int x(0), y(0);

    std::thread t0([&x, &y]() {
        x=1;
        y=3;
    });
    std::thread t1([&x, &y]() {
        std::cout << "(" << y << ", " <<x <<")" << std::endl;

    });


    t0.join();
    t1.join();
}

Firstly, I know that it is UB because of the data race. But, I expected only the following outputs:

(3,1), (0,1), (0,0)

I was convinced that it was not possible to get (3,0), but I did. So I am confused- after all x86 doesn't allow StoreStore reordering.

So x = 1 should be globally visible before y = 3

I suppose that from theoretical point of view the output (3,0) is impossible because of the x86 memory model. I suppose that it appeared because of the UB. But I am not sure. Please explain.

What else besides StoreStore reordering could explain getting (3,0)?

You're writing in C++, which has a weak memory model. You didn't do anything to prevent reordering at compile-time.

If you look at the asm, you'll probably find that the stores happen in the opposite order from the source, and/or that the loads happen in the opposite order from what you expect.

The loads don't have any ordering in the source: the compiler can load x before y if it wants to, even if they were std::atomic types:

            t2 <- x(0)
t1 -> x(1)
t1 -> y(3)
            t2 <- y(3)

This isn't even "re"ordering, since there was no defined order in the first place:

std::cout << "(" << y << ", " <<x <<")" << std::endl; doesn't necessarily evaluate y before x. The << operator has left-to-right associativity, and operator overloading is just syntactic sugar for

op<<( op<<(op<<(y),x), endl);  // omitting the string constants.

Since the order of evaluation of function arguments is undefined (even if we're talking about nested function calls), the compiler is free to evaluate x before evaluating op<<(y). IIRC, gcc often just evaluates right to left, matching the order of pushing args onto the stack if necessary. The answers on the linked question indicate that that's often the case. But of course that behaviour is in no way guaranteed by anything.

The order they're loaded is undefined even if they were std::atomic. I'm not sure if there's a sequence point between the evaluation of x and y. If not, then it would be the same as if you evaluated x+y: The compiler is free to evaluate the operands in any order because they're unsequenced. If there is a sequence point, then there is an order but it's undefined which order (i.e. they're indeterminately sequenced).

Slightly related: gcc doesn't reorder non-inline function calls in expression evaluation, to take advantage of the fact that C leaves the order of evaluation unspecified. I assume after inlining it does optimize better, but in this case you haven't given it any reason to favour loading y before x.

How to do it correctly

The key point is that it doesn't matter exactly why the compiler decided to reorder, just that it's allowed to. If you don't impose all the necessary ordering requirements, your code is buggy, full-stop. It doesn't matter if it happens to work with some compilers with some specific surrounding code; that just means it's a latent bug.

See http://en.cppreference.com/w/cpp/atomic/atomic for docs on how/why this works:

// Safe version, which should compile to the asm you expected.

while(true) {
    int x(0);                  // should be atomic, too, because it can be read+written at the same time.  You can use memory_order_relaxed, though.
    std::atomic<int> y(0);

    std::thread t0([&x, &y]() {
        x=1;
        // std::atomic_thread_fence(std::memory_order_release);  // A StoreStore fence is an alternative to using a release-store
        y.store(3, std::memory_order_release);
    });
    std::thread t1([&x, &y]() {
        int tx, ty;
        ty = y.load(std::memory_order_acquire);
        // std::atomic_thread_fence(std::memory_order_acquire);  // A LoadLoad fence is an alternative to using an acquire-load
        tx = x;
        std::cout << ty + tx << "\n";   // Don't use endl, we don't need to force a buffer flush here.
    });

    t0.join();
    t1.join();
}

For Acquire/Release semantics to give you the ordering you want, the last store has to be the release-store, and the acquire-load has to be the first load. That's why I made y a std::atomic, even though you're setting x to 0 or 1 more like a flag.

If you don't want to use release/acquire, you could put a StoreStore fence between the stores and a LoadLoad fence between the loads. On x86, this would just prevent compile-time reordering, but on ARM you'd get a memory-barrier instruction. (Note that y still technically needs to be atomic to obey C's data-race rules, but you can use std::memory_order_relaxed on it.)

Actually, even with Release/Acquire ordering for y, x should be atomic as well. The load of x still happens even if we see y==0. So reading x in thread 2 is not synchronized with writing y in thread 1, so it's UB. In practice, int loads/stores on x86 (and most other architectures) are atomic. But remember that std::atomic implies other semantics, like the fact that the value can be changed asynchronously by other threads.

The hardware-reordering test could run a lot faster if you looped inside one thread storing i and -i or something, and looped inside the other thread checking that abs(y) is always >= abs(x). Creating and destroying two threads per test is a lot of overhead.

Of course, to get this right, you have to know how to use C to generate the asm you want (or write in asm directly).

How to do it correctly

Recommended topics

Hot tags