Multithreading program stuck in optimized mode but runs normally in -O0

I wrote a simple multithreaded program as follows:

#include <iostream>
#include <future>
#include <thread>
#include <chrono>

static bool finished = false;

int func()
{
    size_t i = 0;
    while (!finished)
        ++i;
    return i;
}

int main()
{
    auto result=std::async(std::launch::async, func);
    std::this_thread::sleep_for(std::chrono::seconds(1));
    finished=true;
    std::cout<<"result ="<<result.get();
    std::cout<<"\nmain thread id="<<std::this_thread::get_id()<<std::endl;
}

It behaves normally in debug mode in Visual Studio, or with -O0 in gcc, and prints the result after 1 second. But in Release mode, or with -O1/-O2/-O3, it gets stuck and does not print anything.

Rillis answered 23/10, 2019 at 5:31 Comment(1)
Comments are not for extended discussion; this conversation has been moved to chat.Dyna

Two threads accessing a non-atomic, non-guarded variable, where at least one of them writes, is undefined behavior (U.B.). This concerns finished. You could make finished of type std::atomic<bool> to fix this.

My fix:

#include <iostream>
#include <future>
#include <atomic>

static std::atomic<bool> finished = false;

int func()
{
    size_t i = 0;
    while (!finished)
        ++i;
    return i;
}

int main()
{
    auto result=std::async(std::launch::async, func);
    std::this_thread::sleep_for(std::chrono::seconds(1));
    finished=true;
    std::cout<<"result ="<<result.get();
    std::cout<<"\nmain thread id="<<std::this_thread::get_id()<<std::endl;
}

Output:

result =1023045342
main thread id=140147660588864

Live Demo on coliru


Somebody may think 'It's a bool – probably one bit. How can this be non-atomic?' (I did when I started with multi-threading myself.)

But note that lack-of-tearing is not the only thing that std::atomic gives you. It also makes concurrent read+write access from multiple threads well-defined, stopping the compiler from assuming that re-reading the variable will always see the same value.

Making a bool unguarded and non-atomic can cause additional issues:

  • The compiler might decide to optimize the variable into a register, or even CSE multiple accesses into one and hoist a load out of a loop.
  • The variable might be cached for a CPU core. (In real life, CPUs have coherent caches. This is not a real problem, but the C++ standard is loose enough to cover hypothetical C++ implementations on non-coherent shared memory where atomic<bool> with memory_order_relaxed store/load would work, but where volatile wouldn't. Using volatile for this would be UB, even though it works in practice on real C++ implementations.)

To prevent this from happening, the compiler must be told explicitly not to perform such optimizations.


I'm a little bit surprised about the evolving discussion concerning the potential relation of volatile to this issue. Thus, I'd like to spend my two cents:

Veracity answered 23/10, 2019 at 5:39 Comment(26)
Thanks, I understand that might be the reason why it works only in Debug mode. Another question on which I need some clarification: does a resource need to be protected even when other threads are only reading it, like the global bool I have here?Rillis
@szppeter It need not be guarded if all involved threads are reading only. In your case, one thread (the main) writes. The other thread reads but doesn't see the modification. That it works in debug mode is just an unhappy accident. U.B. is U.B. "It works" is one kind of U.B. (the most underhanded). ;-)Professorate
@szppeter For a better learning effect, I would suggest you have a deep look into mutexes, locks, and so on... Atomics do quite a few things that are key for multithreading and that you should understand.Lucchesi
I took one look at func() and thought "I could optimise that away". The optimiser doesn't care about threads at all; it will detect the infinite loop and happily turn it into a "while(true)". If we look at godbolt.org/z/Tl44iN we can see this. If finished is true, it returns. If not, it goes into an unconditional jump back to itself (an infinite loop) at label .L5Stamin
You probably should never actually do this, but would making the variable volatile fix this too at less cost?Plaintive
@val No. SO: When to use volatile with multi threading?Professorate
The variable might be cached for a CPU core. Possible on a hypothetical system with non-coherent shared memory, but in real life nobody runs multi-threaded C++ programs across non-coherent shared memory. (You use MPI or whatever to use it for low-overhead message passing with explicit flushes of specific regions). See software.rajivprab.com/2018/04/29/…. It's not actual CPU caches that are the problem, it's that assuming no UB allows the compiler to hoist the load out of the loop and make asm like if(!finished) { infinite_loop() }Bevy
In other words, it's not atomicity per-se (lack of tearing) that's the problem here, it's that (assumption of no) data-race UB allows the compiler to hoist and CSE loads of globals. So yes, @val, the part of std::atomic<T> that we need here is exactly the same as what volatile gives us in practice on real implementations, and how code like this was written before C++11, when threads weren't part of the language. In C++11, volatile doesn't prevent data-race UB, but on real implementations it's pretty much atomic<T> with memory_order_relaxed, and operations like ++ aren't atomic-RMWBevy
@val: there's basically no reason to abuse volatile in C++11 because you can get identical asm with atomic<T> and std::memory_order_relaxed. It does work though on real hardware: caches are coherent so a load instruction can't keep reading a stale value once a store on another core commits to cache there. (MESI)Bevy
@PeterCordes Using volatile is still UB though. You really should never assume something that is definitely and clearly UB is safe just because you can't think of a way it could go wrong and it worked when you tried it. That has gotten people burned over and over.Vacation
@DavidSchwartz: Yes, of course it is. I thought my comments were pretty clearly saying that you shouldn't actually use volatile this way, just that this part of the explanation is misleading for people's understanding of how CPUs work. There is a widespread misconception that reading stale data from cache is a real problem on mainstream CPUs. It's not, only when the compiler caches a value in a register (per-thread private) or CSEs away multiple accesses entirely. Anyway, I added an answer on When to use volatile with multi threading?.Bevy
@DavidSchwartz: In C, the semantics of volatile have always been implementation-defined, and at the time the Standard was written the simplest (and thus most common) treatment was to regard a volatile load or store as though it might affect any object that wasn't qualified register. While C++ is a different language, it inherited many concepts from C without bothering to fully re-specify all of them.Penland
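The relaxed-atomic equivalent mentioned in these comments can be sketched as follows (a minimal sketch, not from the thread: the names `done`, `spin`, and `run_relaxed_demo` and the shortened sleep are illustrative):

```cpp
#include <atomic>
#include <chrono>
#include <future>
#include <thread>

// Sketch: the same spin-wait as in the question, but with explicit
// relaxed loads/stores. On mainstream hardware this compiles to plain
// loads/stores, much like volatile would in practice, but without UB.
static std::atomic<bool> done{false};

long spin()
{
    long i = 0;
    while (!done.load(std::memory_order_relaxed))  // compiler must re-read
        ++i;
    return i;
}

long run_relaxed_demo()
{
    auto result = std::async(std::launch::async, spin);
    std::this_thread::sleep_for(std::chrono::milliseconds(50));
    done.store(true, std::memory_order_relaxed);   // becomes visible to the spinner
    return result.get();
}
```

No ordering with respect to other variables is needed here, which is why `memory_order_relaxed` suffices for a lone stop flag.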
While this answer is good insofar as it makes perfectly correct claims (data race / UB, 4.7.1/20.1) and provides a correct, working solution, I believe that the real reason why it works in debug and doesn't work in optimized build is a different one. The reason is that both setting finished to whatever value within main and checking for it in func is entirely inconsequential on a per-thread scope, so the compiler simply doesn't do it at all.Misinterpret
@Misinterpret Why U.B. does work under certain conditions is interesting and exciting but IMHO less relevant than how are things done right. There are a lot of cases where U.B. is simply unnoticed by causing no visual effect on your common platform (sometimes, until the one day before release when it suddenly starts to pain you). ;-) I was under impression that the other answer covered that part, didn't it? (The 2nd link, I just added at the bottom, might provide deeper insights concerning this as well.)Professorate
That's right, what I want to point out (and maybe failed to do) is that UB sounds like the well-known "demons fly out of your nose" thing. Whereas in this case (and many cases) UB is rather a harmless "unspecified, don't care". Although UB, it would actually "work" (...and fail sporadically, but not systematically). It does however not matter here whether the bool is being update atomically or whether the happens-before guarantees are strictly met or not. Inside func there is no way the value could change, and inside main there is no consequence of setting it. If a mutex was used...Misinterpret
... (cont) then the UB would be gone (without using an atomic variable), but it would still not work. Because the UB per se isn't the reason it doesn't work, it's because the optimizer is (correctly) stripping out stuff that is entirely irrelevant.Misinterpret
@Misinterpret There is the distinction between Undefined Behavior vs. Implementation-Defined Behavior. That U.B. results in implementation defined behavior on a specific real system doesn't mean that this might be true in general as well as this might not change in the future. IMHO, this can be reduced to "Is this a data race?" (I think so, Race Conditions versus Data Races) and "Is a data race U.B?" (I found Wait, data races are undefined behavior?)Professorate
@Misinterpret Mutexes have release/acquire semantics. The compiler is not allowed to optimize the read away if a mutex was locked before, so protecting finished with a std::mutex works (without volatile or atomic). In fact, you can replace all atomics with a "simple" value + mutex scheme; it would still work and just be slower. atomic<T> is allowed to use an internal mutex; only atomic_flag is guaranteed lock-free.Deandreadeane
Here is an example of using a mutex instead of an atomic that I think is correct and does work with GCC and Clang at -O2 (but is probably slower).Deandreadeane
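A sketch of that idea (reconstructed, since the linked demo isn't reproduced here): guard a plain bool with a std::mutex, locking around every read and write. The names `finished_flag`, `is_finished`, and `run_mutex_demo` and the shortened sleep are illustrative, not from the thread.

```cpp
#include <chrono>
#include <future>
#include <mutex>
#include <thread>

// Sketch: a plain bool protected by a mutex instead of std::atomic.
// Every access, reads included, must hold the lock; this removes the
// data race, so the compiler may not hoist the read out of the loop.
static bool finished_flag = false;
static std::mutex flag_mutex;

static bool is_finished()
{
    std::lock_guard<std::mutex> lock(flag_mutex);
    return finished_flag;
}

long run_mutex_demo()
{
    auto result = std::async(std::launch::async, [] {
        long i = 0;
        while (!is_finished())   // each check takes and releases the lock
            ++i;
        return i;
    });
    std::this_thread::sleep_for(std::chrono::milliseconds(50));
    {
        std::lock_guard<std::mutex> lock(flag_mutex);
        finished_flag = true;
    }
    return result.get();
}
```

Locking on every loop iteration is the cost the comment alludes to; an atomic flag avoids it.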
@Erlkoenig: I'm pretty sure that isn't true, though it might accidentially "work", of course. That's the funny thing with UB, it can actually "work" despite being broken. Acquire/release semantics are a barrier which do not allow a read or write operation to be moved across the barrier. Still, they do not require an operation to happen at all, if -- within the thread's scope -- it's an useless operation, such as reading a value which is "obviously" constant. Same for writing a value which is "obviously" never used.Misinterpret
@Misinterpret I know about UB, but using (only) mutexes with shared data isn't it. Any access to data shared between threads might be "obviously" useless and only become relevant in cooperation with other thread. If this were UB, then you'd have to use atomics all the time, and all the classic patterns of protecting shared data with mutexes wouldn't work. That can't be right; atomics "just" make things faster, but they are not required. Replace the bool finished with a std::string finished that is either "yes" or "no" - would you suggest you'd have to use std::atomic<std::string>?Deandreadeane
@Misinterpret What is an obvious constant value?Detritus
@curiousguy: A value for which no codepath exists that modifies it (such as in the question) is "obviously" constant. There is no way finished could possibly ever have any value other than false. At least, not in any constellation where func is actually being called (and that's all that matters). The compiler couldn't possibly divine that some external magic alters the value, without being explicitly hinted to it (and it probably shouldn't even care, since enforcing global consistency would be a very serious pessimization for 99.999% of all programs).Misinterpret
@Misinterpret No path where? In 1 function? 1 TU? the program?Detritus
What is a "non-guarded" variable? One that's not inside some sort scope that's protected by a mutex/lock?Coen
@Coen Yesss. For non-atomic variables which are shared between multiple threads, the read/write accesses have to be mutually exclusive. (If at least one thread writes, this affects all read and write accesses. For read-only accesses no locking would be needed.) This is achieved by a lock (e.g. a mutex). Locking starts a critical section which ends with the unlock. In C++, there is RAII, which makes it as easy as it is safe. Hence, the std::lock_guard constructor takes the lock while the unlock happens in its destructor. This is where the scope comes into play, which is otherwise not related to this.Professorate
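The RAII locking pattern described in that comment can be sketched like this (the counter, thread count, and function names are illustrative):

```cpp
#include <mutex>
#include <thread>
#include <vector>

// Sketch of lock_guard-based guarding: the constructor locks the mutex,
// and the destructor unlocks it when `lock` leaves scope, even if an
// exception is thrown inside the critical section.
static long counter = 0;
static std::mutex counter_mutex;

void add_n(int n)
{
    for (int i = 0; i < n; ++i) {
        std::lock_guard<std::mutex> lock(counter_mutex);  // critical section begins
        ++counter;
    }  // unlock happens here, in ~lock_guard
}

long run_guard_demo()
{
    std::vector<std::thread> threads;
    for (int t = 0; t < 4; ++t)
        threads.emplace_back(add_n, 10000);
    for (auto& th : threads)
        th.join();
    return counter;  // 4 * 10000 with correct locking, no lost updates
}
```

Without the lock_guard, the four threads' increments would race and the final count would typically fall short of 40000.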

Scheff's answer describes how to fix your code. I thought I would add a little information on what is actually happening in this case.

I compiled your code at godbolt using optimisation level 1 (-O1). Your function compiles like so:

func():
  cmp BYTE PTR finished[rip], 0
  jne .L4
.L5:
  jmp .L5
.L4:
  mov eax, 0
  ret

So, what is happening here? First, we have a comparison: cmp BYTE PTR finished[rip], 0 - this checks to see if finished is false or not.

If it is not false (i.e. true), we should exit the loop on the first run. This is accomplished by jne .L4, which jumps when not equal to label .L4, where the value of i (0) is stored in a register for later use and the function returns.

If it is false however, we move to

.L5:
  jmp .L5

This is an unconditional jump, to label .L5 which just so happens to be the jump command itself.

In other words, the thread is put into an infinite busy loop.

So why has this happened?

As far as the optimiser is concerned, threads are outside of its purview. It assumes other threads aren't reading or writing variables simultaneously (because that would be data-race UB). You need to tell it that it cannot optimise accesses away. This is where Scheff's answer comes in. I won't bother to repeat him.

Because the optimiser is not told that the finished variable may potentially change during execution of the function, it sees that finished is not modified by the function itself and assumes that it is constant.

The optimised code provides the two code paths that will result from entering the function with a constant bool value; either it runs the loop infinitely, or the loop is never run.
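In source terms, the optimised assembly above behaves roughly as if the compiler had rewritten the function like this (a hand-written sketch of the "as-if" transformation; the name `func_as_if` and the parameter standing in for the single load are illustrative, not compiler output):

```cpp
// Sketch: what the -O1 assembly is equivalent to in source form.
// `finished` is loaded exactly once, before the loop; the loop body
// (and `i` itself) are gone, because with a constant condition `i`
// can only ever be observed as 0 after the loop exits.
int func_as_if(bool finished_once)   // stands in for the single load of `finished`
{
    if (!finished_once) {
        for (;;) { }                 // .L5: jmp .L5 - infinite busy loop
    }
    return 0;                        // mov eax, 0; ret
}
```

Calling it with `true` returns 0 immediately; calling it with `false` spins forever, which is exactly the hang the question observes.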

At -O0 the compiler (as expected) does not optimise the loop body and comparison away:

func():
  push rbp
  mov rbp, rsp
  mov QWORD PTR [rbp-8], 0
.L148:
  movzx eax, BYTE PTR finished[rip]
  test al, al
  jne .L147
  add QWORD PTR [rbp-8], 1
  jmp .L148
.L147:
  mov rax, QWORD PTR [rbp-8]
  pop rbp
  ret

Therefore the function, when unoptimised, does work. The lack of atomicity here is typically not a problem, because the code and data type are simple. Probably the worst we could run into is a value of i that is off by one from what it should be.

A more complex system with data structures is far more likely to result in corrupted data or improper execution.

Stamin answered 23/10, 2019 at 16:33 Comment(11)
C++11 does make threads and a thread-aware memory model part of the language itself. This means compilers can't invent writes even to non-atomic variables in code that doesn't write those variables. e.g. if (cond) foo=1; can't be transformed to asm that's like foo = cond ? 1 : foo; because that load+store (not an atomic RMW) could step on a write from another thread. Compilers were already avoiding stuff like that because they wanted to be useful for writing multi-threaded programs, but C++11 made it official that compilers had to not break code where 2 threads write a[1] and a[2]Bevy
But yes, other than that overstatement about how compilers aren't aware of threads at all, your answer is correct. Data-race UB is what allows hoisting loads of non-atomic variables including globals, and the other aggressive optimizations we want for single-threaded code. MCU programming - C++ O2 optimization breaks while loop on electronics.SE is my version of this explanation.Bevy
@PeterCordes: Who's "we"? There are many forms of hoisting optimizations that are useful, whose effects may be observable in weird corner cases, but there is also value in allowing certain forms of benign data race. Consider, for example, the behavior of string hashing in Java. If a string has never been hashed and Thread1 tries to hash it, it will compute the hash value and store it in the string object. If Thread2 then asks for the hash, it may or may not see the value; if it doesn't, it will compute the hash and store it. The reads and writes in Thread2 have data races...Penland
...but the cost of possible redundant hash computation will almost always be less than the complexity needed to avoid it. A behavioral model that allows certain transformations in the absence of constructs that would block them could maximize the efficiency and safety with which various useful tasks could be accomplished, compared with requiring programmers to jump through hoops to accommodate a needlessly-inconvenient execution model.Penland
@supercat: That's a good example. Standard library implementers on a platform where it's safe in practice could still do that with acquire/release semantics for an atomic already-hashed flag. But yes the hash data itself could still be written the same way twice, which is UB. Although that. Similar problem as a SeqLock: no UB-safe way to express a pattern where we can detect possible tearing and ignore. Even if the has is small enough to be lockfree as a relaxed atomic, that's still optimization defeating. And so is an atomic flag, maybe; maybe fake it with memory barriers?Bevy
@PeterCordes: Is there any zero-overhead way of creating an object in C++ with relaxed atomic semantics, whose sole effect would be to require that compilers not add reads or writes? From what I understand, on some platforms, constructing and destructing objects of an atomic integer type may require memory barriers, even if the object itself is never used for anything other than relaxed semantics.Penland
@PeterCordes: One advantage of Java using a GC is that memory for objects won't be recycled without an intervening global memory barrier between the old and new usage, which means that any core that examines an object will always see some value that it has held at some time after the reference was first published. While global memory barriers can be very expensive if they're used frequently, they can greatly reduce the need for memory barriers elsewhere even when used sparingly.Penland
@PeterCordes i didn't say the compiler doesn't know about threads. just that the optimiser doesn't account for possible thread interactions unless we specifically tell it.Stamin
Yes, I knew that's what you were trying to say, but I don't think your wording 100% means that. Saying the optimizer "completely ignores them." isn't quite right: it's well known that truly ignoring threading when optimizing can involve things like word load / modify a byte in the word / word store, which in practice has caused bugs where one thread's access to a char or bitfield steps on a write to an adjacent struct member. See lwn.net/Articles/478657 for the full story, and how only the C11 / C++11 memory model makes such an optimization illegal, not just undesired in practice.Bevy
"effectively" was a good change, but I think there's room to be more explicit about how the C++ rules work in that sentence without confusing or distracting readers. Feel free to roll back if you liked your wording better, I offer my edit merely as a suggestion.Bevy
No, that's good.. Thanks @PeterCordes. I appreciate the improvement.Stamin

For the sake of completeness of the learning curve: you should avoid using global variables. You did a good job, though, by making it static, so it is local to the translation unit.

Here is an example:

#include <iostream>
#include <future>
#include <atomic>
#include <thread>
#include <chrono>

class ST {
public:
    int func()
    {
        size_t i = 0;
        while (!finished)
            ++i;
        return i;
    }
    void setFinished(bool val)
    {
        finished = val;
    }
private:
    std::atomic<bool> finished = false;
};

int main()
{
    ST st;
    auto result=std::async(std::launch::async, &ST::func, std::ref(st));
    std::this_thread::sleep_for(std::chrono::seconds(1));
    st.setFinished(true);
    std::cout<<"result ="<<result.get();
    std::cout<<"\nmain thread id="<<std::this_thread::get_id()<<std::endl;
}

Live on wandbox

Kaitlinkaitlyn answered 23/10, 2019 at 17:18 Comment(3)
Could also declare finished as static within the function block. It will still be initialized only once, and if it’s initialized to a constant, this does not require locking.Raimund
The accesses to finished could also use cheaper std::memory_order_relaxed load and stores; there's no required ordering wrt. other variables in either thread. I'm not sure @Davislor's suggestion of static makes sense, though; if you had multiple spin-count threads you wouldn't necessary want to stop them all with the same flag. You do want to write the initialization of finished in a way that compiles to just initialization, not an atomic store, though. (Like you're doing with the finished = false; default initializer C++17 syntax. godbolt.org/z/EjoKgq).Bevy
@PeterCordes Putting the flag in an object does allow there to be more than one, for different thread pools, as you say. The original design had a single flag for all threads, however.Raimund

© 2022 - 2024 — McMap. All rights reserved.