Debug Win32 application hang

I'm having trouble finding the cause for a hang in a Win32 application. The software renders some data to an OpenGL visual in a tight loop:

std::vector<uint8_t> indices;
glPolygonMode(GL_FRONT_AND_BACK, GL_FILL);
glEnableClientState(GL_VERTEX_ARRAY);
glVertexPointer(2, GL_DOUBLE, 0, vertexDataBuffer);
while (...) {
    // get index type (1, 2, 4) and index count
    indices.resize(indexType * count);

    // get indices into "indices" buffer
    getIndices(indices.data(), indices.size()); //< seems to hang here!

    // draw (I'm using the correct parameters)
    glDrawElements(GL_TRIANGLES_*, count, GL_UNSIGNED_*, indices.data());
}
glDisableClientState(GL_VERTEX_ARRAY);

The code is compiled using VC11 Update 1 (CTP 3). When running the optimized binary, it hangs inside the call to getIndices() (more about this below) after a few of those loops. I already have...

  • triple-validated all buffers, and even appended CRCs to make sure there are no buffer overruns
  • added a call to HeapValidate() inside the loop to ensure the heap is not corrupt
  • used Application Verifier
  • enabled heap allocation monitoring using GFlags and PageHeap
  • broken into WinDbg when the application locks up
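
The overrun checks in the list above can be approximated with guard bytes around the buffer. The following is a minimal, portable sketch, not the code from the question: GuardedBuffer, kGuardSize, and kGuardPattern are hypothetical names, and the check only catches writes that land inside the guard regions.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kGuardSize = 16;        // assumed guard width in bytes
constexpr std::uint8_t kGuardPattern = 0xA5;  // assumed fill pattern

// Hypothetical helper: a byte buffer with guard regions before and after the payload.
class GuardedBuffer {
public:
    explicit GuardedBuffer(std::size_t payloadSize)
        : storage_(payloadSize + 2 * kGuardSize, kGuardPattern),
          payloadSize_(payloadSize) {}

    // Pointer to the payload, which starts right after the leading guard.
    std::uint8_t* data() { return storage_.data() + kGuardSize; }
    std::size_t size() const { return payloadSize_; }

    // Returns true if neither guard region has been overwritten.
    bool guardsIntact() const {
        for (std::size_t i = 0; i < kGuardSize; ++i) {
            if (storage_[i] != kGuardPattern) return false;
            if (storage_[kGuardSize + payloadSize_ + i] != kGuardPattern) return false;
        }
        return true;
    }

private:
    std::vector<std::uint8_t> storage_;
    std::size_t payloadSize_;
};
```

Calling guardsIntact() after every getIndices() call would narrow down which iteration writes out of bounds, provided the stray write is close enough to the buffer to hit a guard.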

I did not find any problems with the code accessing the allocated buffer, nor any heap corruption. However, if I disable the low-fragmentation heap, the issue vanishes. It also vanishes if I use a separate (low-fragmentation) heap for the indices buffer.
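
A separate heap for the index buffer, as described above, might be set up roughly like this. This is a sketch under assumptions: PrivateHeapAllocator, indexHeap, and IndexBuffer are hypothetical names, the heap is created with default options via HeapCreate, and a malloc fallback is included only so that the example compiles outside Windows. Moving the buffer changes the memory layout, so it may only hide a corruption rather than fix it.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <new>
#include <vector>
#ifdef _WIN32
#include <windows.h>

// One growable private heap, created on first use (hypothetical helper).
inline HANDLE indexHeap() {
    static HANDLE heap = HeapCreate(0, 0, 0);
    return heap;
}
#endif

// Hypothetical allocator that serves a std::vector from the private heap.
template <typename T>
struct PrivateHeapAllocator {
    using value_type = T;

    PrivateHeapAllocator() = default;
    template <typename U>
    PrivateHeapAllocator(const PrivateHeapAllocator<U>&) {}

    T* allocate(std::size_t n) {
#ifdef _WIN32
        HANDLE heap = indexHeap();
        void* p = heap ? HeapAlloc(heap, 0, n * sizeof(T)) : nullptr;
#else
        void* p = std::malloc(n * sizeof(T));  // fallback so the sketch builds elsewhere
#endif
        if (!p) throw std::bad_alloc();
        return static_cast<T*>(p);
    }

    void deallocate(T* p, std::size_t) {
#ifdef _WIN32
        HeapFree(indexHeap(), 0, p);
#else
        std::free(p);
#endif
    }
};

// All instances share the same heap, so any two allocators compare equal.
template <typename T, typename U>
bool operator==(const PrivateHeapAllocator<T>&, const PrivateHeapAllocator<U>&) { return true; }
template <typename T, typename U>
bool operator!=(const PrivateHeapAllocator<T>&, const PrivateHeapAllocator<U>&) { return false; }

using IndexBuffer = std::vector<std::uint8_t, PrivateHeapAllocator<std::uint8_t>>;
```

Declaring the buffer as IndexBuffer instead of std::vector<uint8_t> leaves the rest of the loop unchanged.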

Anyway, here is the stack trace leading to the deadlock:

0:000> kb
ChildEBP RetAddr  Args to Child              
0034e328 77b039c3 00000000 0034e350 00000000 ntdll!ZwWaitForKeyedEvent+0x15
0034e394 77b062bc 77b94724 080d36a8 0034e464 ntdll!RtlAcquireSRWLockExclusive+0x12e
0034e3c0 77aeb652 0034e464 0034e4b4 00000000 ntdll!RtlpCallVectoredHandlers+0x58
0034e3d4 77aeb314 0034e464 0034e4b4 77b94724 ntdll!RtlCallVectoredExceptionHandlers+0x12
0034e44c 77aa0133 0034e464 0034e4b4 0034e464 ntdll!RtlDispatchException+0x19
0034e44c 77b062c5 0034e464 0034e4b4 0034e464 ntdll!KiUserExceptionDispatcher+0xf
0034e7bc 77aeb652 0034e860 0034e8b0 00000000 ntdll!RtlpCallVectoredHandlers+0x61
0034e7d0 77aeb314 0034e860 0034e8b0 0034ec28 ntdll!RtlCallVectoredExceptionHandlers+0x12
0034e848 77aa0133 0034e860 0034e8b0 0034e860 ntdll!RtlDispatchException+0x19
0034e848 1c43c666 0034e860 0034e8b0 0034e860 ntdll!KiUserExceptionDispatcher+0xf
0034ebe8 1c43c4e5 0034ec28 080d35d0 080d35d6 lcdb4!lc::db::PackedIndices::unpackIndices<unsigned char>+0x86
0034ec14 1c45922d 0034ec28 080d35d0 00000006 lcdb4!lc::db::PackedIndices::unpack+0xb5
...
xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx getIndices

For completeness, I posted the code of lc::db::PackedIndices::unpackIndices(), including all code added for debugging, to http://ideone.com/sVVXX7.

The code triggering the call to KiUserExceptionDispatcher is (*p++) = static_cast<T>(index); (mov dword ptr [esp+10h],eax).

I just can't seem to figure out what's going on. An exception seems to have been thrown, but none of my exception handlers are called. The application just hangs. I checked for any deadlocked critical sections (!locks) but found none. Furthermore, I don't see why an exception should be raised, as the memory locations are all valid. Could anyone give me some hints?

Update

I tried to find the type of exception being thrown:

0:000> s -d esp L1000 1003f
0028ebdc  0001003f 00000000 00000000 00000000  ?...............
0028efd8  0001003f 00000000 00000000 00000000  ?...............
0:000> .cxr 0028ebdc
eax=77b94724 ebx=0804be30 ecx=00000002 edx=00000004 esi=77b94724 edi=0804be28
eip=77b062c5 esp=0028eec4 ebp=0028eee4 iopl=0         nv up ei ng nz na pe cy
cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00010287
ntdll!RtlpCallVectoredHandlers+0x61:
77b062c5 ff03            inc     dword ptr [ebx]      ds:002b:0804be30=00000001
0:000> .cxr 0028efd8
eax=0000003b ebx=00000001 ecx=0804bd98 edx=0028f340 esi=0028f340 edi=04b77580
eip=1c43c296 esp=0028f2c0 ebp=0028f2fc iopl=0         nv up ei pl nz na po nc
cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00010202
lcdb4!lc::db::PackedIndices::unpackIndices<unsigned char>+0x36:
1c43c296 8801            mov     byte ptr [ecx],al          ds:002b:0804bd98=3e
Garin answered 2/11, 2012 at 10:4 Comment(38)
Am I missing something? You don't seem to have posted the code for getIndices?Whim
@john: I'm sorry - it was probably just too obvious to me. getIndices() is just a tiny wrapper, eventually calling lc::db::PackedIndices::unpack.Garin
Any chance of count being zero? Your code would break if that were true.Whim
Show the code that calls unpackIndices() in the failure case. Also, does p point to the right place at the time of failure?Tradescantia
@john: no - cnt is never 0 (the code path would not be entered)Garin
Seems like the thread is hung trying to take an SRW lock exclusively. !locks command will only look at critical section objects and not SRW. Use !rwlock.Dispossess
@atzz: I added it to ideone.com/sVVXX7.Garin
@nanda: I can't find that command - is it new?Garin
@nanda: .load sosex.dll, !rwlock => Unable to initialize .NET data interface. The CLR has not yet been loaded in the process.... But this is a native app.Garin
Please show the code that calls PackedIndices::unpack (and maybe several layers above it -- as many as needed to see where the indices parameter comes from).Tradescantia
@atzz: indices is the first parameter passed to getIndices() in the code I showed in my question above. There is no code in between that would alter that pointer.Garin
I don't think the SRW lock is the issue. The stack trace shows it's an internal lock from the system exception mechanism. So far, my bet is that you are corrupting something vital (probably a SEH frame on the stack).Tradescantia
@atzz: trouble is, I cannot reproduce the error when running the application under the debugger. It's as if there is some kind of race condition. I was suspecting glDrawElements to still access the memory while I'm modifying it, but I'm not sure.Garin
The reason for the hang, judging from the thread's call stack, seems to be a slim reader/writer lock. But the lock itself belongs (again judging from the stack) to the OS exception-handling code (something like the loader lock). If there are other threads, and I suppose there are, you should look at their call stacks and see whether any of them is stuck in exception-handling code. If you find the real cause of the exception, the hang will of course go away, but the hang might hold clues to the cause of the exception.Dispossess
@nanda: I was already looking for something like this. The only thing I see is this: snipt.org/vhtf2Garin
I'd investigate where p was pointing at the moment of exception. The easiest way is probably to copy it to a global variable before each (*p++) = ... (which shouldn't distort timings).Tradescantia
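
The suggestion above could look roughly like this. The real unpackIndices() is not shown in the thread, so the signature below is an assumption, and g_lastWriteTarget and unpackIndicesTraced are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical diagnostic global. The pointer itself is volatile so that the
// optimizer cannot drop or reorder the stores to it.
const void* volatile g_lastWriteTarget = nullptr;

// Assumed shape of the unpack loop: widen or narrow each source index into dst.
template <typename T>
void unpackIndicesTraced(const std::uint32_t* src, std::size_t count, T* dst) {
    T* p = dst;
    for (std::size_t i = 0; i < count; ++i) {
        g_lastWriteTarget = p;  // remember where the next store goes
        (*p++) = static_cast<T>(src[i]);
    }
}
```

After the hang, the value of g_lastWriteTarget can be read in WinDbg and compared with the buffer bounds and with the faulting address from the exception record.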
Leaving the hang aside. The instruction that is causing the crash is a bit surprising. "(mov dword ptr [esp+10h],eax)" given the only memory that is referred in this instruction is on the stack I wonder what kind of an exception it is. You could also inspect the exception object to see what is the exception.Dispossess
So, one thread stops in exception handler due to a lock held in some thread that stopped in GetProcAddress... or other module-related function. I suggest getting rid of SEH, VEH or whatever... and using an UEF (Unhandled Exception Filter). I've had tons of problems with VEH (which is also used in '__except' implementation) during debugging. It doesn't necessarily mean a problem with your program. What about the Release build without debugger attached? Two cents from experience.Quartermaster
@user1240436: I just set an UEF - but it doesn't seem to change anything. It still hangs at the same location.Garin
@nanda: strange. .exr -1 yields: ExceptionAddress: 77aa000c (ntdll!DbgBreakPoint), ExceptionCode: 80000003 (Break instruction exception). So the exception is just me breaking into the deadlocked application?Garin
@nanda: I tried to locate the correct exception context (I'm not within my comfort zone here). Please see the update to the question.Garin
You should 1) remove the __try __except, 2) set a UEF 3) put a debugger breakpoint into it 4) run in debugger (or have something wait for you to analyze it). Have you tried exactly this? This should bypass any exception filters and give you the root exception context. But it might get tricky if you're using C++ exceptions. I suggest adding UEF only just before the problem occurs. If no exceptions occur, try expanding your search for an exception by placing UEF earlier in the code.Quartermaster
You could get the exception_record from the following stack frame. 0034e848 77aa0133 0034e860 0034e8b0 0034e860 ntdll!RtlDispatchException+0x19 - the argument to RtlDispatchException is pointer to EXCEPTION_RECORD. so if you type .exr 0034e860 you should be able to see the exception record corresponding to the call stack above.Dispossess
@nanda: ok, so it's: Attempt to write to address 06a28fd0 ExceptionAddress: 1c43c296 (lcdb4!lc::db::PackedIndices::unpackIndices<unsigned char>+0x00000036) ExceptionCode: c0000005 (Access violation).Garin
06a28fd0 is not on the stack of this thread which should have been the case if "mov dword ptr [esp+10h],eax" was the instruction causing the exception - this is more clear now. So, the address is probably on the heap - now you could closely look at the instruction and figure out from your code how the address correlates to the variables in your code.Dispossess
@nanda: so what? 0x06a28fd0 is the buffer I've allocated (the memory addresses have changed, it's a different run). I don't get it - writing to that memory location should be just fine.Garin
Really, if it's just an Access Violation, use the UEF to stop your application, then attach a debugger to see the actual stack.Quartermaster
@user1240436: but I did - I removed the SEH and added this: SetUnhandledExceptionFilter(uef);, where uef contains {DebugBreak(); return EXCEPTION_CONTINUE_SEARCH;}.Garin
By 'stop your application' I mean Sleep(-1) or a MessageBox(...). DebugBreak raises another terminal exception, or even an infinite exception handler recursion if you return CONTINUE_SEARCH.Quartermaster
@nanda: that's really weird. From another run, I still get an Access violation, this time for address 07751e90. So I search for !heap -p -a 07751e90, and get: address 07751e90 found in _HEAP @ 7a0000, HEAP_ENTRY Size Prev Flags UserPtr UserSize - state, 07751e88 0003 0000 [00] 07751e90 0000c - (busy). So that pointer is definitely valid.Garin
This is weird. Though it is a very remote possibility, I wonder whether some code in your process is modifying the page attributes of some memory pages. Try !address 7751e90 and see what the page attributes are for the page containing this memory.Dispossess
I managed to find two other threads that also try to acquire the SRW: snipt.org/vhvb6.Garin
@user1240436: thanks - did that but it won't enter the handler anyway.Garin
@nanda: there you are: Protect: 00000002 PAGE_READONLY. But it's in the heap, and no, I never modified the page attributes.Garin
Ok - it's as Hans Passant says. The heap does seem to be in a weird state. Something locked that memory page that belongs to it. Is it possible to set a breakpoint on heap protection changes?Garin
I see from your snippet that the heap is raising an exception to indicate corruption or an overrun. Maybe Windows changes the page attributes before raising the exception to stop further corruption; I am just guessing. Judging from your latest snippet, there seems to be some corruption in the OLE heap too. The root of the problem is probably some code corrupting a heap: the heap detects it and raises an exception, the exception-handling code hangs on the SRW lock, and then another thread touches an address that the heap has already protected because of the corruption.Dispossess
@nanda: Thanks for your help - I haven't found the source of the problem yet, but we can be reasonably sure that it's some kind of heap corruption. I'll probably have to spend countless hours to find the source of the problem. Would you like to summarize your comments in an answer so I can accept it?Garin
Daniel, thanks. I have summarized my comments to an answer.Dispossess

The thread is hung waiting for an exclusive SRW (slim reader/writer) lock that belongs to the OS exception-handling code, and the exception itself is caused by your code. You can find the exact exception and its details from the following stack frame: 0034e848 77aa0133 0034e860 0034e8b0 0034e860 ntdll!RtlDispatchException+0x19. The first argument to RtlDispatchException is a pointer to the EXCEPTION_RECORD, so .exr 0034e860 displays the exception record. If the exception is an access violation, the record tells you which address was being accessed.
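
To illustrate what the exception record contains: for an access violation, the documented layout of EXCEPTION_RECORD puts the access kind in ExceptionInformation[0] and the faulting address in ExceptionInformation[1]. The sketch below decodes those two values in code; describeAccessViolation and describe are hypothetical helper names, and the Windows-specific part is guarded so that the rest compiles elsewhere.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>
#ifdef _WIN32
#include <windows.h>
#endif

// kind: 0 = read, 1 = write, 8 = execute (DEP violation).
// address: the virtual address that could not be accessed.
std::string describeAccessViolation(std::uintptr_t kind, std::uintptr_t address) {
    const char* verb = kind == 0 ? "read from"
                     : kind == 1 ? "write to"
                     : kind == 8 ? "execute at"
                                 : "access";
    char buf[64];
    std::snprintf(buf, sizeof buf, "Attempt to %s address %08llx", verb,
                  static_cast<unsigned long long>(address));
    return std::string(buf);
}

#ifdef _WIN32
// Decodes a real exception record, e.g. one captured in an exception filter.
std::string describe(const EXCEPTION_RECORD& rec) {
    if (rec.ExceptionCode == EXCEPTION_ACCESS_VIOLATION && rec.NumberParameters >= 2) {
        return describeAccessViolation(rec.ExceptionInformation[0],
                                       rec.ExceptionInformation[1]);
    }
    return "not an access violation";
}
#endif
```

For the record discussed in the comments (a write to 06a28fd0), describeAccessViolation(1, 0x06a28fd0) produces the same wording that .exr printed.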

Following these steps, you found that the access violation was caused by a write to an address that you had legitimately allocated on the heap. You can check the protection attributes of the virtual page containing that address with the command !address <virtual address>.
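
The same page-protection check can be done from inside the process with VirtualQuery, which may help when the problem does not reproduce under a debugger. A minimal sketch, with protectionName and queryProtection as hypothetical helpers; the numeric values are the documented Win32 memory-protection constants.

```cpp
#include <cstdint>
#include <string>
#ifdef _WIN32
#include <windows.h>
#endif

// Maps the base protection value (low byte) to the name of its Win32 constant.
std::string protectionName(std::uint32_t protect) {
    switch (protect & 0xFF) {  // strip modifiers such as PAGE_GUARD (0x100)
        case 0x01: return "PAGE_NOACCESS";
        case 0x02: return "PAGE_READONLY";
        case 0x04: return "PAGE_READWRITE";
        case 0x08: return "PAGE_WRITECOPY";
        case 0x10: return "PAGE_EXECUTE";
        case 0x20: return "PAGE_EXECUTE_READ";
        case 0x40: return "PAGE_EXECUTE_READWRITE";
        case 0x80: return "PAGE_EXECUTE_WRITECOPY";
        default:   return "unknown";
    }
}

#ifdef _WIN32
// Returns the current protection of the page containing `address`.
std::string queryProtection(const void* address) {
    MEMORY_BASIC_INFORMATION mbi;
    if (VirtualQuery(address, &mbi, sizeof(mbi)) == 0) return "unknown";
    return protectionName(mbi.Protect);
}
#endif
```

Logging queryProtection(indices.data()) before each unpack would show whether the page is already read-only before the faulting write happens.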

You found that the page protection on those heap addresses had been changed (by some code) to PAGE_READONLY. Based on that and on the call stacks of the other threads, I have the following conjecture, which might help you find the root cause.

I am guessing that the Windows heap manager changes the page attributes before raising an exception to indicate heap corruption. Judging from the call stacks of the other threads you showed, there seems to be some corruption in the OLE heap too. The root of the problem is probably some code corrupting a heap. The heap subsequently detects the corruption and raises an exception; the OS exception-dispatching code then kicks in and hangs on the SRW lock before it can call the exception handler in your code or in another library. After that, another, unsuspecting thread in your code legitimately touches heap memory that the heap has already protected because of the corruption it detected, which causes another exception, and the exception-dispatching code falls into the same deadlock. Given that the problem is not reproducible when the program runs under the debugger, a timing issue or race condition is probably involved.

Dispossess answered 4/11, 2012 at 5:49 Comment(0)

The stack trace tells the story. Your program is crashing, good odds that this is an access violation exception, a typical failure mode for C++ code and usually triggered by heap corruption. Windows then tries to invoke the exception filters to look for any code that is willing to handle the exception. First up are the handlers installed by AddVectoredExceptionHandler(). It must take a lock to do so to prevent re-entry when one of those handlers in turn causes a crash.

And that's where the buck stops. Exactly why is unclear from the stack trace. It could be because another thread has also fallen over on the heap corruption and is busy handling the exception and has taken the lock. Use Debug + Windows + Threads to look at them. But more likely is that the process state is so mangled that the lock object itself got corrupted as well. Unlikely but it does happen.

And yes, switching off the low-fragmentation heap has a knack for hiding heap corruption. The memory layout will be very different so whatever code is causing the corruption may now have whacked something innocent. It is of course not a solution.

Debug + Exceptions, tick the Thrown checkbox for "Win32 Exceptions". The debugger will now stop when the exception is thrown. At least you'll know what exception is being thrown. Ultimately you do need to find out where the heap corruption occurs. It is never located at the code that crashed. Good luck debugging it.

Tun answered 2/11, 2012 at 12:4 Comment(2)
Thanks Hans. This all makes sense. Unfortunately for me, I can't reproduce the problem with the debugger attached (even if I attach it after the process has been created). So I can't break into the debugger when the exception is thrown.Garin
Well, you have a very good guess at what is wrong. Solving heap corruption with a debugger broken at the location where the program falls over isn't that useful anyway. You gotta find the code that actually caused the corruption.Tun

If you're using an ATI graphics card (with ATI drivers), it's a known issue that you must not leak any state; otherwise, memory corruption occurs later on.

Try to disable all the states you can (glDisableClientState), and use APITrace to find out which one you forgot.

One easy way to test for memory corruption in a graphics driver is either to test on another board/driver, or to force software rendering.
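
One way to avoid leaking client state is to tie the enable/disable pair to a scope. The sketch below is a hypothetical RAII helper (ScopedClientState is not part of OpenGL); it takes the enable and disable functions as parameters so that it compiles without the OpenGL headers.

```cpp
#include <functional>
#include <utility>

// Hypothetical RAII guard: calls enable(state) on construction and
// disable(state) when the guard goes out of scope, including on early return.
class ScopedClientState {
public:
    using StateFn = std::function<void(unsigned)>;

    ScopedClientState(unsigned state, const StateFn& enable, StateFn disable)
        : state_(state), disable_(std::move(disable)) {
        enable(state_);
    }

    ~ScopedClientState() { disable_(state_); }

    // Non-copyable, so the state is disabled exactly once.
    ScopedClientState(const ScopedClientState&) = delete;
    ScopedClientState& operator=(const ScopedClientState&) = delete;

private:
    unsigned state_;
    StateFn disable_;
};
```

With the OpenGL headers available, a declaration such as ScopedClientState vertexArray(GL_VERTEX_ARRAY, glEnableClientState, glDisableClientState); at the top of the drawing code would replace the manual glEnableClientState/glDisableClientState pair.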

Succinate answered 2/11, 2012 at 11:20 Comment(1)
Thanks - the driver is NVIDIA. But I've been using gDEBugger to ensure that I'm not leaking anything.Garin
