How to disable clang expression elimination for thread_local variable

thread_local int* tls = nullptr;
// using libcontext to jump stack.
void jump_stack();
void* test() {
    // before jump_stack, assume we are at thread 1.
    int *cur_tls = tls;
    jump_stack();
    // after jump stack, we are at thread 2.
    // we need to reload tls.
    cur_tls = tls;
}

OS: macOS, Darwin Kernel Version 22.1.0 (Apple M1 chip)
Clang: Apple clang version 14.0.0 (clang-1400.0.29.202)

clang++ -c test.cpp --std=c++11 -g -O0
; void* test() {
       0: ff c3 00 d1   sub     sp, sp, #48
       4: fd 7b 02 a9   stp     x29, x30, [sp, #32]
       8: fd 83 00 91   add     x29, sp, #32
       c: 00 00 00 90   adrp    x0, 0x0 <ltmp0+0xc>
      10: 00 00 40 f9   ldr     x0, [x0]
      14: 08 00 40 f9   ldr     x8, [x0]
      18: 00 01 3f d6   blr     x8
      1c: e0 07 00 f9   str     x0, [sp, #8]
;       int *cur_tls = tls;
      20: 08 00 40 f9   ldr     x8, [x0]
      24: e8 0b 00 f9   str     x8, [sp, #16]
;       jump_stack();
      28: 00 00 00 94   bl      0x28 <ltmp0+0x28>
      2c: e0 07 40 f9   ldr     x0, [sp, #8]
;       cur_tls = tls;
      30: 08 00 40 f9   ldr     x8, [x0]
      34: e8 0b 00 f9   str     x8, [sp, #16]
; }
      38: a0 83 5f f8   ldur    x0, [x29, #-8]
      3c: fd 7b 42 a9   ldp     x29, x30, [sp, #32]
      40: ff c3 00 91   add     sp, sp, #48
      44: c0 03 5f d6   ret

Before jump_stack, the address of tls is cached in [sp, #8]. After jump_stack, that cached address is reloaded from [sp, #8] and dereferenced, so cur_tls receives the value of tls that belongs to thread 1, not thread 2.

Are there any clang options that disable this optimization, so that a thread_local variable is always reloaded for the current thread?

Hiss answered 28/2, 2023 at 12:24 Comment(0)

All 3 major compilers (msvc, gcc, clang) optimize tls accesses like in your example, based on the assumption that the executing thread never changes.
It is even worse than it looks: tls accesses can also be optimized across function call boundaries thanks to inlining and common subexpression elimination (CSE).
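To make that assumption concrete: the address of a thread_local variable is itself per-thread, so an address resolved on one thread and reused after a switch refers to the wrong thread's copy. Below is a minimal sketch of this. The helper name is hypothetical, and a plain std::thread stands in for the stack switch.

```cpp
#include <cassert>  // only needed for the usage check mentioned below
#include <thread>

thread_local int* tls = nullptr;

// Hypothetical demo helper, not part of the question's code.
// Returns true when a second thread resolves `tls` to a different address
// than the calling thread did.
bool tls_address_differs_across_threads() {
    int** cached = &tls;  // resolved once, on the calling thread
    bool differs = false;
    // Resolve the address again on another thread while the first is alive
    // and compare it with the cached one.
    std::thread other([cached, &differs] { differs = (&tls != cached); });
    other.join();
    return differs;
}
```

A check such as `assert(tls_address_differs_across_threads());` should hold, which is why a cached address becomes stale once execution continues on another thread.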

What you would need for this to work is fiber-safe thread-local storage.
(i.e. tls accesses need to re-evaluate the index each time they are accessed)

Unfortunately MSVC is the only compiler that currently provides an official way to do that with the /GT compiler switch.

gcc and clang don't offer any official way to get that behaviour, and judging from their issue trackers they don't plan to add one either.


You aren't the first one to run into this problem either; many other projects that use coroutines / fibers that can switch between threads have encountered it.


gcc & clang workaround

The suggested workaround for gcc & clang is to use noinline-functions that wrap access to the thread-local variable, e.g.:


thread_local int* tls = nullptr;

[[gnu::noinline]] int* getTls() {
    asm volatile("");
    return tls;
}

[[gnu::noinline]] void setTls(int* val) {
    asm volatile("");
    tls = val;
}
  • noinline prevents the compiler from inlining the function into its callers.
  • asm volatile(""); acts as an artificial side effect. Without it, calls to a noinline function that the compiler can prove has no other side effects may still be optimized away. (see gcc noinline docs)

This will obviously slow down your tls accesses quite a bit (each access now requires an extra function call and needs to re-evaluate the tls index each time) - but at least it'll work correctly.
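Applied to the function from the question, the wrappers would be used roughly as follows. This is only a sketch: jump_stack() is stubbed out as an empty placeholder so that the snippet compiles on its own, whereas in the real program it would be the libcontext switch. The return type is changed to int* purely so the result can be inspected.

```cpp
#include <cassert>  // only needed for the usage check mentioned below

thread_local int* tls = nullptr;

[[gnu::noinline]] int* getTls() {
    asm volatile("");
    return tls;
}

[[gnu::noinline]] void setTls(int* val) {
    asm volatile("");
    tls = val;
}

// Placeholder so the sketch is self-contained; the real jump_stack()
// would switch stacks (and possibly threads) via libcontext.
void jump_stack() {}

int* test() {
    int* cur_tls = getTls();  // read on the thread we start on
    jump_stack();
    // Read again through the wrapper instead of reusing anything derived
    // from the earlier access.
    cur_tls = getTls();
    return cur_tls;
}
```

On a single thread, `setTls(&value); assert(test() == &value);` should hold. The intent is that every access after the switch goes through a fresh call, so the compiler has no earlier address computation in test() that it could reuse.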

(qemu has a neat macro for this)
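If there are many such variables, the accessor pair can be generated by a macro. The following is a hypothetical sketch of that idea; it is not QEMU's actual macro, and the names DEFINE_FIBER_SAFE_TLS, fiber_tls_* and current_ctx are made up for illustration.

```cpp
#include <cassert>  // only needed for the usage check mentioned below
#include <thread>

// Hypothetical macro (NOT QEMU's): defines a thread-local variable that is
// only reachable through noinline get_/set_ accessors.
#define DEFINE_FIBER_SAFE_TLS(type, name)              \
    static thread_local type fiber_tls_##name{};       \
    [[gnu::noinline]] type get_##name() {              \
        asm volatile("");                              \
        return fiber_tls_##name;                       \
    }                                                  \
    [[gnu::noinline]] void set_##name(type val) {      \
        asm volatile("");                              \
        fiber_tls_##name = val;                        \
    }

DEFINE_FIBER_SAFE_TLS(int*, current_ctx)

// Each thread starts with its own value-initialized (null) copy,
// regardless of what other threads have stored.
bool new_thread_sees_null() {
    bool is_null = false;
    std::thread t([&is_null] { is_null = (get_current_ctx() == nullptr); });
    t.join();
    return is_null;
}
```

After `set_current_ctx(&v)` on one thread, `get_current_ctx()` returns `&v` on that thread, while a newly started thread still sees nullptr. Keeping the variable static and reachable only through the accessors makes it harder to bypass them by accident.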


Note though that this'll only fix the issue for your own thread-local variables.
Most implementations also use thread-local variables internally (for example errno, pthread_self(), std::this_thread::get_id(), etc...), and those will experience the same tls caching issue.
(which can also result in race conditions, e.g. if one thread attempts to write into the tls index of errno of another thread...)

There's no workaround for those thread locals (they are hidden within library code), so unfortunately you're on your own for those (at least on clang & gcc).


the future

With C++20 we got native coroutine support, which also makes switching between threads straightforward.

As a result, many more users ran into this exact issue with native C++ coroutines, and clang implemented a fix for them in trunk.

However this fix only applies to native C++ coroutines; it doesn't apply to libcontext, boost.context, etc... (at least for now; maybe we'll get some function-attributes to handle this in the future)

So if you're able to switch to native C++ coroutines then this could be a potential solution.

Small coroutine example: godbolt

#include <chrono>
#include <coroutine>
#include <iostream>
#include <thread>
 
auto switch_to_new_thread()
{
    struct awaitable
    {
        bool await_ready() {
            return false;
        }
        void await_suspend(std::coroutine_handle<> h) {
            std::thread([h] { h.resume(); }).detach();
        }
        void await_resume() {
        }
    };

    return awaitable{};
}
 
struct task
{
    struct promise_type
    {
        task get_return_object() { return {}; }
        std::suspend_never initial_suspend() { return {}; }
        std::suspend_never final_suspend() noexcept { return {}; }
        void return_void() {}
        void unhandled_exception() {}
    };
};


task my_coroutine() {
    std::cout << "Running on thread "
              << std::this_thread::get_id()
              << std::endl;

    co_await switch_to_new_thread();

    std::cout << "Running on thread "
              << std::this_thread::get_id()
              << std::endl;

    co_await switch_to_new_thread();

    std::cout << "Running on thread "
              << std::this_thread::get_id()
              << std::endl;
}

int main() {
    my_coroutine();
    
    std::this_thread::sleep_for(std::chrono::seconds(1));
    return 0;
}
  • When compiled with clang 15 with -O0: godbolt
    (correct output - 3 different thread ids):
    Running on thread 139806754031424
    Running on thread 139806754027264
    Running on thread 139806745634560
    
  • With clang 15 -O2 we see the original bug: godbolt
    (wrong output - three times the same thread id):
    Running on thread 140037315024704
    Running on thread 140037315024704
    Running on thread 140037315024704
    
  • With clang trunk -O2 the fix is working: godbolt
    (correct output - 3 different thread ids):
    Running on thread 140633090672448
    Running on thread 140633090668288
    Running on thread 140633082275584
    
Quatrefoil answered 3/3, 2023 at 2:46 Comment(1)
For anyone else still experiencing this in clang, the bug was migrated to: github.com/llvm/llvm-project/issues/19551 (Guthrun)
