Windows SuspendThread doesn't? (GetThreadContext fails)
R

5

5

We have a Win32 application in which one thread can stop another to inspect its state [PC, etc.] by calling SuspendThread/GetThreadContext/ResumeThread.

if (SuspendThread((HANDLE)hComputeThread[threadId])<0)  // freeze thread
   ThreadOperationFault("SuspendThread","InterruptGranule");
CONTEXT Context, *pContext;
Context.ContextFlags = (CONTEXT_INTEGER | CONTEXT_CONTROL);
if (!GetThreadContext((HANDLE)hComputeThread[threadId],&Context))
   ThreadOperationFault("GetThreadContext","InterruptGranule");

Extremely rarely, on a multicore system, GetThreadContext returns error code 5 (Windows system error code "Access Denied").

The SuspendThread documentation seems to clearly indicate that the targeted thread is suspended, if no error is returned. We are checking the return status of SuspendThread and ResumeThread; they aren't complaining, ever.
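
One detail in the snippet above may be worth a second look. SuspendThread returns a DWORD, which is unsigned, so a `< 0` test can never fire; the documented failure value is (DWORD)-1. A minimal sketch of a check that can actually trigger is shown below. The helper name `suspend_failed` is made up for illustration, and the `unsigned long` typedef is only a stand-in so the fragment compiles off Windows.

```c
#ifdef _WIN32
#include <windows.h>
#else
typedef unsigned long DWORD;  /* stand-in for the Win32 type (assumption for non-Windows builds) */
#endif

/* SuspendThread returns the previous suspend count on success and
   (DWORD)-1 on failure. Because DWORD is unsigned, `result < 0` is
   always false, so the failure value has to be compared explicitly. */
static int suspend_failed(DWORD result)
{
    return result == (DWORD)-1;
}

/* Hypothetical use in the question's code:
 *
 *   DWORD prev = SuspendThread((HANDLE)hComputeThread[threadId]);
 *   if (suspend_failed(prev))
 *       ThreadOperationFault("SuspendThread", "InterruptGranule");
 */
```

This does not explain the Access Denied result by itself, but it would at least rule out a silently failed suspend.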

How can it be the case that I can suspend a thread, but can't access its context?

This blog post (http://www.dcl.hpi.uni-potsdam.de/research/WRK/2009/01/what-does-suspendthread-really-do/) suggests that SuspendThread, when it returns, may only have started the suspension of the other thread, and that thread may not yet actually be suspended. In this case, I can kind of see how GetThreadContext would be problematic, but this seems like a stupid way to define SuspendThread. (How would the caller of SuspendThread know when the target thread was actually suspended?)

EDIT: I lied. I said this was for Windows.

Well, the strange truth is that I don't see this behavior under Windows XP 64 (at least not in the last week and I don't really know what happened before that)... but we have been testing this Windows application under Wine on Ubuntu 10.x. The Wine source for the guts of GetThreadContext contains an Access Denied return response on line 819 when an attempt to grab the thread state fails for some reason. I'm guessing, but it appears that Wine GetThreadStatus believes that a thread just might not be accessible repeatedly. Why that would be true after a SuspendThread is beyond me, but there's the code. Thoughts?

EDIT2: I lied again. I said we only saw the behavior on Wine. Nope... we have now found a Vista Ultimate system that seems to produce the same error (again, rarely). So, it appears that Wine and Windows agree on an obscure case. It also appears that the mere enabling of the Sysinternals Process monitor program aggravates the situation and causes the problem to appear on Windows XP 64; I suspect a Heisenbug. (The Process Monitor doesn't even exist on the Wine-tasting (:-) machine or the XP 64 system I use for development).

What on earth is it?

EDIT3: Sept 15 2010. I've added careful checking of the error return status, without otherwise disturbing the code, for SuspendThread, ResumeThread, and GetThreadContext. I haven't seen any hint of this behavior on Windows systems since I did that. Haven't gotten back to the Wine experiment.

Nov 2010: Strange. It seems that if I compile this under Visual Studio 2005, it fails on Windows Vista and 7, but not earlier OSes. If I compile under Visual Studio 2010, it doesn't fail anywhere. One might point a finger at Visual Studio 2005, but I'm suspicious of a location-sensitive problem, and the different optimizers in VS 2005 and VS 2010 place the code at slightly different places.

Nov 2012: Saga continues. We see this failure on a number of XP and Windows 7 machines, at a pretty low rate (once every several thousand runs). Our Suspend activities are applied to threads that mostly execute pure computational code but that sometimes make calls into Windows. I don't recall seeing this issue when the PC of the thread was in our computational code. Of course, I can't see the PC of the thread when it hangs because GetThreadContext won't give it to me, so I can't directly confirm that the problem only happens when executing system calls. But all our system calls are channeled through one point, and so far the evidence is that this point was being executed when we get the hang. So the indirect evidence suggests GetThreadContext on a thread only fails if a system call is being executed by that thread. I haven't had the energy to build a critical experiment to test this hypothesis yet.

Rile answered 9/8, 2010 at 21:4 Comment(1)
The Nov 2010 "get different results when compiling with VS 2005 vs VS 2010" might have to do with an alignment constraint added(?) in later versions of the OS. See https://mcmap.net/q/644809/-getthreadcontext-fails-after-a-successful-suspendthread-in-windows-7/120163 - Rile
M
4

Let me quote from Richter and Nasarre's "Windows via C/C++", 5th edition, which may shed some light:

DWORD SuspendThread(HANDLE hThread);

Any thread can call this function to suspend another thread (as long as you have the thread's handle). It goes without saying (but I'll say it anyway) that a thread can suspend itself but cannot resume itself. Like ResumeThread, SuspendThread returns the thread's previous suspend count. A thread can be suspended as many as MAXIMUM_SUSPEND_COUNT times (defined as 127 in WinNT.h). Note that SuspendThread is asynchronous with respect to kernel-mode execution, but user-mode execution does not occur until the thread is resumed.

In real life, an application must be careful when it calls SuspendThread because you have no idea what the thread might be doing when you attempt to suspend it. If the thread is attempting to allocate memory from a heap, for example, the thread will have a lock on the heap. As other threads attempt to access the heap, their execution will be halted until the first thread is resumed. SuspendThread is safe only if you know exactly what the target thread is (or might be doing) and you take extreme measures to avoid problems or deadlocks caused by suspending the thread.

...

Windows actually lets you look inside a thread's kernel object and grab its current set of CPU registers. To do this, you simply call GetThreadContext:

BOOL GetThreadContext( HANDLE hThread, PCONTEXT pContext);

To call this function, just allocate a CONTEXT structure, initialize some flags (the structure's ContextFlags member) indicating which registers you want to get back, and pass the address of the structure to GetThreadContext. The function then fills in the members you've requested.

You should call SuspendThread before calling GetThreadContext; otherwise, the thread might be scheduled and the thread's context might be different from what you get back. A thread actually has two contexts: user mode and kernel mode. GetThreadContext can return only the user-mode context of a thread. If you call SuspendThread to stop a thread but that thread is currently executing in kernel mode, its user-mode context is stable even though SuspendThread hasn't actually suspended the thread yet. But the thread cannot execute any more user-mode code until it is resumed, so you can safely consider the thread suspended and GetThreadContext will work.

My guess is that GetThreadContext may fail if you just called SuspendThread, while the thread is in kernel mode, and the kernel is locking the thread context block at this time.

Maybe on multicore systems, one core is still handling the kernel-mode execution of the thread whose user-mode execution was just suspended, and it keeps the thread's CONTEXT structure locked at exactly the moment the other core calls GetThreadContext.

Since this behaviour is not documented, I suggest contacting Microsoft.

Max answered 18/8, 2010 at 18:53 Comment(0)
G
3

There are some particular problems surrounding suspending a thread that owns a CriticalSection. I can't find a good reference to it now, but there is one mention of it on Raymond Chen's blog and another mention on Chris Brumme's blog. Basically, if you are unlucky enough to call SuspendThread while the thread is accessing an OS lock (e.g., heap lock, DllMain lock, etc.), then really strange things can happen. I would assume that this is the case that you are running into extremely rarely.

Does retrying the call to GetThreadContext work after a processor yield like Sleep(0)?
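
The retry idea could be sketched roughly as below. This is only an illustration under assumptions: the helper `retry_with_yield`, its callback types, and the `fail_n_times` stand-in are all hypothetical, and whether a retry ever succeeds for this particular failure is exactly the open question. The OS calls are passed in as callbacks so the loop itself can be exercised anywhere; on Windows the attempt would wrap GetThreadContext and the yield would call Sleep(0).

```c
#include <stddef.h>

/* Callback that makes one attempt; returns nonzero on success. */
typedef int (*attempt_fn)(void *arg);
/* Callback that gives up the processor between attempts. */
typedef void (*yield_fn)(void);

/* Calls attempt(arg) up to max_tries times, yielding between failed
   attempts. Returns the 1-based number of the attempt that succeeded,
   or 0 if every attempt failed. */
static int retry_with_yield(attempt_fn attempt, void *arg,
                            int max_tries, yield_fn yield)
{
    for (int i = 1; i <= max_tries; ++i) {
        if (attempt(arg))
            return i;
        if (yield != NULL && i < max_tries)
            yield();
    }
    return 0;
}

/* Stand-in attempt for demonstration: fails until *remaining reaches 0. */
static int fail_n_times(void *arg)
{
    int *remaining = (int *)arg;
    if (*remaining > 0) {
        --*remaining;
        return 0;
    }
    return 1;
}

/* Hypothetical Windows wiring (not compiled here):
 *
 *   struct ctx_request { HANDLE thread; CONTEXT *context; };
 *
 *   static int try_get_context(void *arg) {
 *       struct ctx_request *r = (struct ctx_request *)arg;
 *       return GetThreadContext(r->thread, r->context) != 0;
 *   }
 *   static void yield_cpu(void) { Sleep(0); }
 *
 *   // between SuspendThread and ResumeThread:
 *   int tries = retry_with_yield(try_get_context, &req, 10, yield_cpu);
 */
```

Returning the attempt number rather than a plain flag makes it easy to log how often a retry was actually needed, which is useful evidence for a failure this rare.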

Graven answered 9/8, 2010 at 21:33 Comment(7)
AFAIK, it doesn't matter if a thread owns a CriticalSection. If you suspend it, you suspend it owning the CriticalSection; that's no worse than suspending owning another resource (e.g., a block of dynamically allocated storage) unless the suspender attempts to use that resource. We aren't doing that. - Rile
... Which thread are you suggesting is doing the Sleep(0), the suspender or the suspendee? I can't see the point of the suspender doing Sleep(0), and the suspender can't make the suspendee do a Sleep(0) at his convenience, so I don't understand what is being suggested. - Rile
I looked at Chen's blog. Yes, if the suspender uses the same resource (including dynamic allocation) one can get deadlock. Our inspection thread doesn't do that (2 lines of code between SuspendThread and GetThreadContext, to set Context to what we want; see example code added to my question). And, we aren't seeing deadlock; rather, we are seeing GetThreadContext produce error 5, which makes no sense. - Rile
After your latest comments, my guess is that one of the IO calls in the guts of send_request or wait_reply in wine/dlls/ntdll/server.c is failing. Use a tracing tool like strace to trace the system calls and see which one is failing and why. - Graven
@Shawley: Hmm. strace might give some insight. I'm pretty worried that changing the timing of the calls will change the behaviour, since the problem appears to be thread-stop related, but it appears the experiment might be relatively easy. I'll look at giving it a try. - Rile
@Shawley: We struggled with strace and Wine. First, it produces an immense amount of output (100s of MB) just starting up Wine, but that's just an annoyance; it doesn't appear to produce any output from our Wine-emulated program. We're guessing that's because Wine forks a subprocess. We attempted to use -f with strace (to trace the fork) but we never see the start of our program's execution; Wine just hangs. So strace is unable to show us what is happening. (Wine will normally run our emulated program just fine modulo the occasional Access Denied response I've described). This is under Ubuntu 10. - Rile
Are you sure that you are not using the CRT in your suspending code? Otherwise you might dead-lock; See also: blog.kalmbachnet.de/?postid=16 - Cordless
L
3

Old issue, but good to see you have kept it updated with status changes after experiencing the issue for more than another 2 years.

The cause of your problem is that there is a bug in the translation layer of the x64 version of WoW64, as per:

http://social.msdn.microsoft.com/Forums/en/windowscompatibility/thread/1558e9ca-8180-4633-a349-534e8d51cf3a

There is a rather critical bug in GetThreadContext under WoW64 which makes it return stale contents, which makes it unusable in many situations. The contents are stored in user mode. This is why you think the value is not null, but in the stale contents it is still null.

This is why it fails on newer OSes but not older ones. Try running it on a 32-bit Windows 7 OS.

As for why this bug seems to happen less often with solutions built on Visual Studio 2010 / 2012, it is likely that there is something the compiler is doing which is mitigating most of the problem. For this you should inspect the code generated by both 2005 and 2010 and see what the differences are. For example, does the problem happen if the project is built without optimizations?

Finally, some further reading:

http://www.nynaeve.net/?p=129

Leaving answered 7/4, 2013 at 7:38 Comment(4)
Yes, we ran into that problem in addition to this one. This problem's symptom is that GetThreadContext returns an error, basically telling you that you didn't get the context. That is annoying but not fatal. The problem that Zach uncovered is that GetThreadContext under Wow64 will simply lie to you about the context you get; you just get wrong values. That's fatal if your application depends on discovering the PC of an interrupted thread, as ours does. This is as close to technical crime as you can get; any OS designer would look at you in horror if you told him you couldn't get thread state. - Rile
NEWS: I can now stably control when I ask for contexts and when I change them, and I have a chance to run thousands of trials on different systems. It appears that under WinXP-64, the behavior of GetThreadContext is just fine. Too bad MS has discontinued support for WinXP-64; it actually works. I don't have a 64-bit Windows Vista system. Several Windows 7 systems fail once every few thousand tries, which means long-running programs that use this call die unpredictably once in a while. Windows 8 claims to have a fix, which I have installed, but it does not appear to work. - Rile
@Ira Baxter: Do you happen to know if there's a work-around for this, like spinning in a loop with a short sleep, retrying the GetThreadContext call until it succeeds/returns non-zero registers? In my case the code isn't time-critical when I need to do this (in order to generate a stack trace from a debugger). - Senskell
I had trouble with SuspendThread before this, where it returned a failed status. I was successful by just ignoring it and trying again "later". MS claims to have fixed the problem with GetThreadContext in Windows 8.1; now you have to check the returned result, and if it says it failed, presumably you can try again. I have not had success with this, but I haven't pursued it very hard. - Rile
F
0

Maybe a thread safety issue. Are you sure that the hComputeThread struct isn't changing out from under you? Maybe the thread was exiting when you called suspend? This may cause suspend to succeed, but by the time you call get context it is gone and the handle is invalid.
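
One way to probe this hypothesis might be to log the error code whenever GetThreadContext fails. A closed or invalid handle is normally reported as ERROR_INVALID_HANDLE (6), whereas the question reports ERROR_ACCESS_DENIED (5). A minimal sketch follows; the helper name `describe_context_error` is hypothetical, and the fallback typedef and #defines only mirror the Windows header values so the fragment compiles off Windows.

```c
#ifdef _WIN32
#include <windows.h>
#else
typedef unsigned long DWORD;      /* stand-in type for non-Windows builds */
#define ERROR_ACCESS_DENIED  5L   /* same values as in winerror.h */
#define ERROR_INVALID_HANDLE 6L
#endif

/* Maps the GetLastError() value seen after a failed GetThreadContext
   to a short label for logging. */
static const char *describe_context_error(DWORD code)
{
    switch (code) {
    case ERROR_ACCESS_DENIED:  return "access denied";
    case ERROR_INVALID_HANDLE: return "invalid handle";
    default:                   return "other";
    }
}

/* Hypothetical use in the question's code:
 *
 *   if (!GetThreadContext((HANDLE)hComputeThread[threadId], &Context)) {
 *       DWORD err = GetLastError();  // read right after the failing call
 *       fprintf(stderr, "GetThreadContext: %s (%lu)\n",
 *               describe_context_error(err), (unsigned long)err);
 *   }
 */
```

If the log only ever shows code 5, that would be weak evidence against a stale or closed handle, though it would not rule out a thread that is in the middle of exiting.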

Febrile answered 20/8, 2010 at 17:15 Comment(2)
None of the answers seems to pan out. I'm handing you the points for at least a plausible explanation. I don't actually believe I have this problem but I'm adding a "GetHandleProperties" check to see if GetHandle complains. - Rile
(Nov 2010) 2 months since I added GetHandleProperties check-for-null-handle. Never triggers. - Rile
F
0

Calling SuspendThread on a thread that owns a synchronization object, such as a mutex or critical section, can lead to a deadlock if the calling thread tries to obtain a synchronization object owned by a suspended thread. - MSDN

Formenti answered 4/2, 2014 at 14:25 Comment(2)
Yes, that's classical and obvious. That's not the problem. The only thing done by the thread that does the suspend is to inspect or change the state of the suspended thread, and do local computation. No other locks or OS calls of any kind. - Rile
Are you sure that you are not using the CRT in your suspending code? Otherwise you might dead-lock; See also: blog.kalmbachnet.de/?postid=16 - Cordless
