How to write observable example for instruction reorder?

#include <thread> #include <iostream> int main() { int a = 0; volatile int flag = 0; std::thread t1([&]() { while (flag != 1); int b = a; std::cout << "b = " << b << std::endl; }); std::thread t2([&]() { a = 5; flag = 1; }); t1.join(); t2.join(); return 0; }

First of all: You are in the land of UB because there is a race condition: Both flag and a are written and read from different threads without proper synchronization - this is always a data race. The C++ standard does not impose any requirements on implementations when you give them such a program.

There is therefore no way to "guarantee" a specific behavior.

However, we can look at assembly output to determine what a given compiled program can or cannot do. I was not successful at using reordering alone to show the problem with volatile as synchronization mechanism, but below is a demonstration using a related optimization.

Here is an example of a program that has no data races:

std::atomic<int> a = 0;
std::atomic<int> flag = 0;

std::thread t1([&]() {
    while (flag != 1);

    int b = a;
    std::cout << "b = " << b << std::endl;
});

std::thread t2([&]() {
    a = 5;
    int x = 1000000;
    while (x-- > 1) flag = 0;
    flag = 1;
    x = 1000000;
    while (x-- > 1) flag = 1;
    flag = 0;
    a = 0;
});

t1.join();
t2.join();

https://wandbox.org/permlink/J1aw4rJP7P9o1h7h

Indeed, the usual output of this program is b = 5 (other outputs are possible, or the program might not terminate at all with "unlucky" scheduling, but there is no UB).

If we use improper synchronization instead, we can see in the assembly that this output is no longer in the realm of possibility (given the guarantees of the x86 platform):

int a = 0;
volatile int flag = 0;

std::thread t1([&]() {
    while (flag != 1);

    int b = a;
    std::cout << "b = " << b << std::endl;
});

std::thread t2([&]() {
    a = 5;
    int x = 1000000;
    while (x-- > 1) flag = 0;
    flag = 1;
    x = 1000000;
    while (x-- > 1) flag = 1;
    flag = 0;
    a = 0;
});

t1.join();
t2.join();

The assembly for the second thread body, as per https://godbolt.org/z/qsjca1:

std::thread::_State_impl<std::thread::_Invoker<std::tuple<main::{lambda()#2}> > >::_M_run():
        mov     rcx, QWORD PTR [rdi+8]
        mov     rdx, QWORD PTR [rdi+16]
        mov     eax, 999999
.L4:
        mov     DWORD PTR [rdx], 0
        sub     eax, 1
        jne     .L4
        mov     DWORD PTR [rdx], 1
        mov     eax, 999999
.L5:
        mov     DWORD PTR [rdx], 1
        sub     eax, 1
        jne     .L5
        mov     DWORD PTR [rdx], 0
        mov     DWORD PTR [rcx], 0
        ret

Notice how a = 5; has been completely optimized away. Nowhere in the compiled program does a get a chance to take the value 5.

As you can see in https://wandbox.org/permlink/Pnbh38QpyqKzIClY, the program will always output 0 (or not terminate), even though the original C++ code for thread 2 would - in a "naive" interpretation - always have a == 5 while flag == 1.

The while loops are of course to "burn time" and give the other thread a chance to interleave - sleep or other system calls would generally constitute a memory barrier for the compiler and might break the effect of the second snippet.

Recommended topics

Hot tags