All 3 major compilers (msvc, gcc, clang) optimize tls accesses like in your example, based on the assumption that the executing thread never changes.
It is even worse than it looks - tls accesses can also be optimized across function call boundaries thanks to inlining and common subexpression elimination (CSE).
What you would need for this to work is fiber-safe thread-local storage.
(i.e. every tls access has to re-evaluate the variable's address instead of reusing one that was computed earlier)
Unfortunately MSVC is the only compiler that currently provides an official way to do that, with the /GT compiler switch.
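To illustrate why a reused tls address goes stale once execution moves to another thread, here is a minimal sketch. It uses plain std::thread rather than fibers, and the names counter, counter_on_new_thread, new_thread_has_own_instance and demo are made up for this illustration. It only shows that every thread owns a separate instance of a thread_local variable at a different address; it does not reproduce the optimizer bug itself.

```cpp
#include <cassert>
#include <cstdint>
#include <thread>

thread_local int counter = 0;

// Returns the value of `counter` as seen by a freshly started thread.
int counter_on_new_thread() {
    int seen = -1;
    std::thread worker([&seen] { seen = counter; });
    worker.join();
    return seen;
}

// Returns true if a new thread's instance of `counter` lives at a different
// address than the calling thread's instance.
bool new_thread_has_own_instance() {
    const auto here = reinterpret_cast<std::uintptr_t>(&counter);
    std::uintptr_t there = 0;
    std::thread worker([&there] {
        there = reinterpret_cast<std::uintptr_t>(&counter);
    });
    worker.join();
    return here != there;
}

void demo() {
    counter = 42;                           // modifies the caller's instance only
    assert(counter_on_new_thread() == 0);   // the new thread sees its own zero
    assert(new_thread_has_own_instance());  // and a different address
    assert(counter == 42);                  // the caller's value is untouched
}
```

A fiber that resumes on a different thread but keeps using the address computed on the first thread would read and write the first thread's instance, which is exactly the situation the compilers don't account for.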
gcc and clang don't offer any official way to get that behaviour, and don't plan to do so either judging from their issues:
You aren't the first one to run into those problems either; lots of other projects that use coroutines / fibers that can switch between threads encountered the same problem.
Just to name a few:
gcc & clang workaround
The suggested workaround for gcc & clang is to use noinline-functions that wrap access to the thread-local variable, e.g.:
godbolt
thread_local int* tls = nullptr;

[[gnu::noinline]] int* getTls() {
    asm volatile("");
    return tls;
}

[[gnu::noinline]] void setTls(int* val) {
    asm volatile("");
    tls = val;
}
noinline prevents the compiler from inlining the function directly.
asm volatile(""); is needed because optimizations other than inlining can still remove or merge calls to a function that appears to have no side effects; the empty asm statement serves as a special side effect that keeps every call in place. (see gcc noinline docs)
This will obviously slow down your tls accesses quite a bit (each access now requires an extra function call and needs to re-evaluate the tls index each time) - but at least it'll work correctly.
(qemu has a neat macro for this)
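As a rough idea of what such a macro could look like, here is a minimal sketch for gcc and clang. DEFINE_FIBER_SAFE_TLS, request_id and the generated get_/set_ names are hypothetical and made up for this example; this is not qemu's actual macro, just the same noinline + asm volatile("") pattern from above wrapped up for reuse.

```cpp
#include <cassert>
#include <thread>

// Hypothetical helper (not qemu's macro): defines a thread_local variable
// together with non-inlinable accessors get_<name>() and set_<name>().
#define DEFINE_FIBER_SAFE_TLS(type, name)                 \
    static thread_local type tls_##name{};                \
    [[gnu::noinline]] type get_##name() {                 \
        asm volatile("");                                 \
        return tls_##name;                                \
    }                                                     \
    [[gnu::noinline]] void set_##name(type value) {       \
        asm volatile("");                                 \
        tls_##name = value;                               \
    }

DEFINE_FIBER_SAFE_TLS(int, request_id)

// Reads request_id from a fresh thread, which owns a separate,
// value-initialized (zero) instance.
int request_id_on_new_thread() {
    int seen = -1;
    std::thread worker([&seen] { seen = get_request_id(); });
    worker.join();
    return seen;
}

void example_usage() {
    set_request_id(7);
    assert(get_request_id() == 7);            // same thread: value is visible
    assert(request_id_on_new_thread() == 0);  // other thread: untouched instance
}
```

All code that might run on a fiber would then go through get_request_id() / set_request_id() and never touch the underlying variable directly.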
Note though that this'll only fix the issue for your own thread-local variables.
Most implementations also use thread-local variables internally (for example for errno, pthread_self(), std::this_thread::get_id(), etc.); those will experience the same tls caching issue.
(This can also result in race conditions, e.g. if one thread writes to another thread's errno through a stale cached address.)
Unfortunately there's no workaround for those thread locals, because they're hidden within library code, so you're on your own for those (at least on clang & gcc).
the future
With C++20 we got native coroutine support, which also makes switching between threads straightforward.
So a lot more users had this exact issue with native C++ coroutines - for those clang implemented a fix in trunk:
However this fix only applies to native C++ coroutines; it doesn't apply to libcontext, boost.context, etc... (at least for now; maybe we'll get some function-attributes to handle this in the future)
So if you're able to switch to native C++ coroutines then this could be a potential solution.
Small coroutine example:
godbolt
#include <chrono>
#include <coroutine>
#include <iostream>
#include <thread>

// Awaitable that resumes the coroutine on a newly spawned thread.
auto switch_to_new_thread()
{
    struct awaitable
    {
        bool await_ready() {
            return false;
        }
        void await_suspend(std::coroutine_handle<> h) {
            std::thread([h] { h.resume(); }).detach();
        }
        void await_resume() {
        }
    };
    return awaitable{};
}

// Minimal fire-and-forget coroutine type.
struct task
{
    struct promise_type
    {
        task get_return_object() { return {}; }
        std::suspend_never initial_suspend() { return {}; }
        std::suspend_never final_suspend() noexcept { return {}; }
        void return_void() {}
        void unhandled_exception() {}
    };
};

task my_coroutine() {
    std::cout << "Running on thread "
              << std::this_thread::get_id()
              << std::endl;
    co_await switch_to_new_thread();
    std::cout << "Running on thread "
              << std::this_thread::get_id()
              << std::endl;
    co_await switch_to_new_thread();
    std::cout << "Running on thread "
              << std::this_thread::get_id()
              << std::endl;
}

int main() {
    my_coroutine();
    // Give the detached threads time to finish before the process exits.
    std::this_thread::sleep_for(std::chrono::seconds(1));
    return 0;
}
- When compiled with clang 15 with -O0: godbolt (correct output - 3 different thread ids):
Running on thread 139806754031424
Running on thread 139806754027264
Running on thread 139806745634560
- With clang 15 -O2 we see the original bug: godbolt (wrong output - three times the same thread id):
Running on thread 140037315024704
Running on thread 140037315024704
Running on thread 140037315024704
- With clang trunk -O2 the fix is working: godbolt (correct output - 3 different thread ids):
Running on thread 140633090672448
Running on thread 140633090668288
Running on thread 140633082275584