In Visual Studio, `thread_local` variables' destructor not called when used with std::async, is this a bug?

Asked 17/6, 2018 at 14:53 Answered 18/6, 2018 at 2:18

Solved c++multithreading visual-studio memory-leaks

The following code

#include <iostream>
#include <future>
#include <thread>
#include <mutex>

std::mutex m;

struct Foo {
    Foo() {
        std::unique_lock<std::mutex> lock{m};
        std::cout <<"Foo Created in thread " <<std::this_thread::get_id() <<"\n";
    }

    ~Foo() {
        std::unique_lock<std::mutex> lock{m};
        std::cout <<"Foo Deleted in thread " <<std::this_thread::get_id() <<"\n";
    }

    void proveMyExistance() {
        std::unique_lock<std::mutex> lock{m};
        std::cout <<"Foo this = " << this <<"\n";
    }
};

int threadFunc() {
    static thread_local Foo some_thread_var;

    // Prove the variable initialized
    some_thread_var.proveMyExistance();

    // The thread runs for some time
    std::this_thread::sleep_for(std::chrono::milliseconds{100}); 

    return 1;
}

int main() {
    auto a1 = std::async(std::launch::async, threadFunc);
    auto a2 = std::async(std::launch::async, threadFunc);
    auto a3 = std::async(std::launch::async, threadFunc);

    a1.wait();
    a2.wait();
    a3.wait();

    std::this_thread::sleep_for(std::chrono::milliseconds{1000});        

    return 0;
}

Compiled and run width clang in macOS:

clang++ test.cpp -std=c++14 -pthread
./a.out

Got result

Foo Created in thread 0x70000d9f2000
Foo Created in thread 0x70000daf8000
Foo Created in thread 0x70000da75000
Foo this = 0x7fd871d00000
Foo this = 0x7fd871c02af0
Foo this = 0x7fd871e00000
Foo Deleted in thread 0x70000daf8000
Foo Deleted in thread 0x70000da75000
Foo Deleted in thread 0x70000d9f2000

Compiled and run in Visual Studio 2015 Update 3:

Foo Created in thread 7180
Foo this = 00000223B3344120
Foo Created in thread 8712
Foo this = 00000223B3346750
Foo Created in thread 11220
Foo this = 00000223B3347E60

Destructor are not called.

Is this a bug or some undefined grey zone?

P.S.

If the sleep std::this_thread::sleep_for(std::chrono::milliseconds{1000}); at the end is not long enough, you may not see all 3 "Delete" messages sometimes.

When using std::thread instead of std::async, the destructors get called on both platform, and all 3 "Delete" messages will always be printed.

Pesce answered 17/6, 2018 at 14:53 Comment(6)

IIRC MSVC uses a thread pool for std::async. That could mean that the thread the function was run on is still alive after the future is available. (Only speculating here) – Lanate 17/6, 2018 at 15:54

The issue is discussed here – Pomace 18/6, 2018 at 3:28

@Pomace Yes, thanks, I saw that yesterday but didn't have time to read it properly. I think it might be a bit out of date now. – Beamer 18/6, 2018 at 4:38

@PaulSanders It seems to me that the issues raised there have not been resolved (as evinced by this question) – Pomace 18/6, 2018 at 4:39

@Pomace Yes I agree, it's a complex issue and the standard doesn't really address it properly (it makes too many rather blasé promises which are not a good fit for thread pools). I plan to investigate the MSVC implementation in a bit more detail later today and will post my findings back to my answer. I think Linux ducks the issue entirely by just wrapping std::thread - maybe MS tried a bit too hard here. – Beamer 18/6, 2018 at 4:54

Tracking bug in msvc STL implementationd github.com/microsoft/STL/issues/949 – Euphonic 3/7, 2020 at 8:29

Introductory Note: I have now learned a lot more about this and have therefore re-written my answer. Thanks to @super, @M.M and (latterly) @DavidHaim and @NoSenseEtAl for putting me on the right track.

tl;dr Microsoft's implementation of std::async is non-conformant, but they have their reasons and what they have done can actually be useful, once you understand it properly.

For those who don't want that, it is not too difficult to code up a drop-in replacement replacement for std::async which works the same way on all platforms. I have posted one here.

Edit: Wow, how open MS are being these days, I like it, see: https://github.com/MicrosoftDocs/cpp-docs/issues/308

Let's being at the beginning. cppreference has this to say (emphasis and strikethrough mine):

The template function async runs the function f asynchronously (~~potentially~~ optionally in a separate thread which may be part of a thread pool).

However, the C++ standard says this:

If launch::async is set in policy, [std::async] calls [the function f] as if in a new thread of execution ...

So which is correct? The two statements have very different semantics as the OP has discovered. Well of course the standard is correct, as both clang and gcc show, so why does the Windows implementation differ? And like so many things, it comes down to history.

The (oldish) link that M.M dredged up has this to say, amongst other things:

... Microsoft has its implementation of [std::async] in the form of PPL (Parallel Pattern Library) ... [and] I can understand the eagerness of those companies to bend the rules and make these libraries accessible through std::async, especially if they can dramatically improve performance...

... Microsoft wanted to change the semantics of std::async when called with launch_policy::async. I think this was pretty much ruled out in the ensuing discussion ... (rationale follows, if you want to know more then read the link, it's well worth it).

And PPL is based on Windows' built-in support for ThreadPools, so @super was right.

So what does the Windows thread pool do and what is it good for? Well, it's intended to manage frequently-sheduled, short-running tasks in an efficient way so point 1 is don't abuse it, but my simple tests show that if this is your use-case then it can offer significant efficiencies. It does, essentially, two things

It recycles threads, rather than having to always start a new one for each asynchronous task you launch.
It limits the total number of background threads it uses, after which a call to std::async will block until a thread becomes free. On my machine, this number is 768.

So knowing all that, we can now explain the OP's observations:

A new thread is created for each of the three tasks started by main() (because none of them terminates immediately).
Each of these three threads creates a new thread-local variable Foo some_thread_var.
These three tasks all run to completion but the threads they are running on remain in existence (sleeping).
The program then sleeps for a short while and then exits, leaving the 3 thread-local variables un-destructed.

I ran a number of tests and in addition to this I found a few key things:

When a thread is recycled, the thread-local variables are re-used. Specifically, they are not destroyed and then re-created (you have been warned!).
If all the asynchonous tasks complete and you wait long enough, the thread pool terminates all the associated threads and the thread-local variables are then destroyed. (No doubt the actual rules are more complex than that but that's what I observed).
As new asynchonous tasks are submitted, the thread pool limits the rate at which new threads are created, in the hope that one will become free before it needs to perform all that work (creating new threads is expensive). A call to std::async might therefore take a while to return (up to 300ms in my tests). In the meantime, it's just hanging around, hoping that its ship will come in. This behaviour is documented but I call it out here in case it takes you by surprise.

Conclusions:

Microsoft's implementation of std::async is non-conformant but it is clearly designed with a specific purpose, and that purpose is to make good use of the Win32 ThreadPool API. You can beat them up for blantantly flouting the standard but it's been this way for a long time and they probably have (important!) customers who rely on it. I will ask them to call this out in their documentation. Not doing that is criminal.
It is not safe to use thread_local variables in std::async tasks on Windows. Just don't do it, it will end in tears.

Beamer answered 17/6, 2018 at 16:39 Comment(8)

For my case, I can't replace thread_local with a local variable. The code in the question is just an example that reproduces the issue. Well... now, this issue, if the thread never got destroyed before the process exits, and so thread_local variables never destructs... Without a destructor, it's not C++ anymore. I'd probably have to throw std::async into the deepest abyss as long as I'm targeting Windows. : ( So sad. – Pesce 17/6, 2018 at 19:12

@Rnmss using a thread pool for async is very common in most other languages because it's such an obvious win for the usual scenario (async under Linux cannot be used in many situations due to the high associated cost). You shouldn't rely on other implementations not doing the same - the reason is simply that right now the infrastructure does not exist under Linux. There are solutions for this, from explicitly passing state to explicitly handling the lifetime. – Canon 17/6, 2018 at 19:25

@Pesce The thing is, you can't have it both ways. Either, you accept and work with the semantics of std::async to get the efficiencies that [we hope] it brings or you use std::thread to ensure that your thread_local objects go away when you want them to. thread_local is absolutely not intended to be used with std::async for the reasons I gave. If you're doing that then you're sitting on a time bomb - on any platform. I think that's worth an upvote, personally, at least you now know where you stand. – Beamer 17/6, 2018 at 19:51

@voo I see that Linux has some sort of thread pool support, see : linux.die.net/man/3/cp_thread_pool. Is it any good, would you know? I see that the man page rather coyly says simply "a thread pool implementation". Could mean anything. – Beamer 17/6, 2018 at 19:55

@Paul First time I hear about it. Last time I checked std::async with gcc didn't use it at least, which made the whole idea of std::async rather pointless - if you don't have threadpool support, std::async is basically nothing but std::thread with some functions tacked on (which also meant I couldn't use it sadly). – Canon 17/6, 2018 at 20:30

@PaulSanders It seems the problem is not about thread_local any more... If I want a thread pool, I use a thread pool. But now if I use std::async (in Windows) , maybe just for initialization, it creates a thread pool and possess my RAM forever. (omg! I'm in rage! It didn't ask me! Efficiency does not work this way. It didn't even ask me the size of the thread pool. How could it be so sure the optimized pool size is the number, maybe, the number of cores my CPU has? CPU is often not the bottleneck.) – Pesce 18/6, 2018 at 4:19

@Pesce Some good points there, yes, but I just updated my post so please take a look, let me know what you think. I can't comment on the efficiency of Microsoft's implementation or whether it's suitable for your use case but if you want more control then wrapping the Win32 ThreadPool API in some abstraction of your own might indeed be the way to go. For Linux, see also this (no idea if it's actually any good). – Beamer 18/6, 2018 at 4:33

While I can follow your defense of Microsoft's implementation, it appears to me that the better solution would have been to add a launch::__pool policy, as an MSVC extension.The std::async design is solid enough to allow such extensions. – Wanids 6/9, 2019 at 8:28

Looks like just another of many bugs in VC++. Consider this quote from n4750

All variables declared with the thread_local keyword have thread storage duration . The storage for these entities shall last for the duration of the thread in which they are created. There is a distinct object or reference per thread, and use of the declared name refers to the entity associated with the current thread. 2 A variable with thread storage duration shall be initialized before its first odr-use (6.2) and, if constructed, shall be destroyed on thread exit.

+this

If the implementation chooses the launch::async policy, — (5.3) a call to a waiting function on an asynchronous return object that shares the shared state created by this async call shall block until the associated thread has completed, as if joined, or else time out (33.3.2.5);

I could be wrong("thread exit" vs "thread completed", but I feel this means that thread_local variables need to be destroyed before .wait() call unblocks.

Symonds answered 18/6, 2018 at 2:18 Comment(3)

Once you read that I think you will see that to characterise this as "just another of many bugs in VC++" is an oversimplification and somewhat insulting to MS, who, a bit of 'spin' aside, are making a big effort to become standards compliant (which is no small task). Thing is, the current definition of std::async is actually a bit of a mess so it's not surprising that people like us get led astray. – Beamer 18/6, 2018 at 5:16

I know about that link. Beside it being blatant PR lie I did not learn anything from it that I did not know. MSVC is still years behind clang and gcc in standards conformance. – Symonds 18/6, 2018 at 14:14

I snuffled around here and there and learned a lot more. I updated my answer, in case you're interested. And yes, you're right, basically, but there are mitigating factors as my answer explains. – Beamer 21/6, 2018 at 12:1

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Recommended topics

Hot tags