Debug Win32 application hang

Asked 2/11, 2012 at 10:4 Answered 4/11, 2012 at 5:49

Solved c++debugging exception windbg visual-c++-2012

I'm having trouble finding the cause for a hang in a Win32 application. The software renders some data to an OpenGL visual in a tight loop:

std::vector<uint8_t> indices;
glPolygonMode(GL_FRONT_AND_BACK, GL_FILL);
glEnableClientState(GL_VERTEX_ARRAY);
glVertexPointer(2, GL_DOUBLE, 0, vertexDataBuffer);
while (...) {
    // get index type (1, 2, 4) and index count
    indices.resize(indexType * count);

    // get indices into "indices" buffer
    getIndices(indices.data(), indices.size()); //< seems to hang here!

    // draw (I'm using the correct parameters)
    glDrawElements(GL_TRIANGLES_*, count, GL_UNSIGNED_*, indices.data());
}
glDisableClientState(GL_VERTEX_ARRAY);
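
The sizing step in the loop above can be sketched as follows. This is only an illustration under stated assumptions: indexBufferBytes and prepareIndexBuffer are hypothetical helper names that do not appear in the original code. The sketch rejects index widths other than 1, 2 or 4 bytes and reports an empty buffer, so the caller can skip the unpack and draw calls when there is nothing to draw.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <stdexcept>
#include <vector>

// Hypothetical helper (not from the original code): number of bytes needed
// for `count` indices that are `indexType` bytes wide each.
std::size_t indexBufferBytes(std::size_t indexType, std::size_t count) {
    if (indexType != 1 && indexType != 2 && indexType != 4)
        throw std::invalid_argument("index type must be 1, 2 or 4 bytes");
    if (count > std::numeric_limits<std::size_t>::max() / indexType)
        throw std::length_error("index buffer size overflows size_t");
    return indexType * count;
}

// Hypothetical helper: resizes the byte buffer and returns false when the
// buffer is empty, i.e. when there is nothing to unpack or draw.
bool prepareIndexBuffer(std::vector<std::uint8_t>& indices,
                        std::size_t indexType, std::size_t count) {
    indices.resize(indexBufferBytes(indexType, count));
    assert(indices.size() == indexType * count);
    return !indices.empty();
}
```

In the loop, a false return value would translate into a `continue` before the calls to getIndices() and glDrawElements().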

The code is compiled using VC11 Update 1 (CTP 3). When running the optimized binary, it hangs inside the call to getIndices() (more about this below) after a few iterations of the loop. So far I have:

triple-checked all buffers, and even appended CRCs to make sure there are no buffer overruns
added a call to HeapValidate() inside the loop to ensure the heap is not corrupt
run the application under Application Verifier
enabled heap allocation monitoring using GFlags and PageHeap
broken into WinDbg when the application locks up
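
As a rough illustration of the kind of overrun check described in the first item, here is a minimal sketch. Note that it swaps in a fixed guard pattern (a canary) instead of the CRCs mentioned above, and that GuardedBuffer and kCanary are hypothetical names, not part of the original code. A check like this only detects writes that run past the end of this one buffer; it says nothing about corruption elsewhere in the heap.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical guard pattern stored directly behind the payload.
constexpr std::uint8_t kCanary[4] = {0xDE, 0xAD, 0xBE, 0xEF};

// Hypothetical buffer wrapper: allocates the payload plus a trailing canary.
class GuardedBuffer {
public:
    explicit GuardedBuffer(std::size_t payloadBytes)
        : storage_(payloadBytes + sizeof(kCanary)), payload_(payloadBytes) {
        std::memcpy(storage_.data() + payload_, kCanary, sizeof(kCanary));
    }

    std::uint8_t* data() { return storage_.data(); }
    std::size_t size() const { return payload_; }

    // True while the bytes just past the payload are still untouched.
    bool intact() const {
        return std::memcmp(storage_.data() + payload_, kCanary,
                           sizeof(kCanary)) == 0;
    }

    // Aborts in builds with assertions enabled if the canary was overwritten.
    void check() const { assert(intact() && "write past end of buffer"); }

private:
    std::vector<std::uint8_t> storage_;
    std::size_t payload_;
};
```

Calling check() after each unpack step would flag an overrun at the iteration where it happens rather than at some later crash.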

I did not find any problems with the code accessing the allocated buffer, nor any heap corruption. However, if I disable the low-fragmentation heap, the issue vanishes. It also vanishes if I use a separate (low-fragmentation) heap for the indices buffer.

Anyway, here is the stack trace leading to the deadlock:

0:000> kb
ChildEBP RetAddr  Args to Child              
0034e328 77b039c3 00000000 0034e350 00000000 ntdll!ZwWaitForKeyedEvent+0x15
0034e394 77b062bc 77b94724 080d36a8 0034e464 ntdll!RtlAcquireSRWLockExclusive+0x12e
0034e3c0 77aeb652 0034e464 0034e4b4 00000000 ntdll!RtlpCallVectoredHandlers+0x58
0034e3d4 77aeb314 0034e464 0034e4b4 77b94724 ntdll!RtlCallVectoredExceptionHandlers+0x12
0034e44c 77aa0133 0034e464 0034e4b4 0034e464 ntdll!RtlDispatchException+0x19
0034e44c 77b062c5 0034e464 0034e4b4 0034e464 ntdll!KiUserExceptionDispatcher+0xf
0034e7bc 77aeb652 0034e860 0034e8b0 00000000 ntdll!RtlpCallVectoredHandlers+0x61
0034e7d0 77aeb314 0034e860 0034e8b0 0034ec28 ntdll!RtlCallVectoredExceptionHandlers+0x12
0034e848 77aa0133 0034e860 0034e8b0 0034e860 ntdll!RtlDispatchException+0x19
0034e848 1c43c666 0034e860 0034e8b0 0034e860 ntdll!KiUserExceptionDispatcher+0xf
0034ebe8 1c43c4e5 0034ec28 080d35d0 080d35d6 lcdb4!lc::db::PackedIndices::unpackIndices<unsigned char>+0x86
0034ec14 1c45922d 0034ec28 080d35d0 00000006 lcdb4!lc::db::PackedIndices::unpack+0xb5
...
xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx getIndices

For completeness, I posted the code of lc::db::PackedIndices::unpackIndices(), including all code added for debugging, to http://ideone.com/sVVXX7.

The code triggering the call to KiUserExceptionDispatcher is (*p++) = static_cast<T>(index); (mov dword ptr [esp+10h],eax).

I just can't seem to figure out what's going on. An exception seems to have been thrown, but none of my exception handlers are called. The application just hangs. I checked for any deadlocked critical sections (!locks) but found none. Furthermore, I don't see why an exception should be raised, as the memory locations are all valid. Could anyone give me some hints?

Update

I tried to find the type of exception being thrown:

0:000> s -d esp L1000 1003f
0028ebdc  0001003f 00000000 00000000 00000000  ?...............
0028efd8  0001003f 00000000 00000000 00000000  ?...............
0:000> .cxr 0028ebdc
eax=77b94724 ebx=0804be30 ecx=00000002 edx=00000004 esi=77b94724 edi=0804be28
eip=77b062c5 esp=0028eec4 ebp=0028eee4 iopl=0         nv up ei ng nz na pe cy
cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00010287
ntdll!RtlpCallVectoredHandlers+0x61:
77b062c5 ff03            inc     dword ptr [ebx]      ds:002b:0804be30=00000001
0:000> .cxr 0028efd8
eax=0000003b ebx=00000001 ecx=0804bd98 edx=0028f340 esi=0028f340 edi=04b77580
eip=1c43c296 esp=0028f2c0 ebp=0028f2fc iopl=0         nv up ei pl nz na po nc
cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00010202
lcdb4!lc::db::PackedIndices::unpackIndices<unsigned char>+0x36:
1c43c296 8801            mov     byte ptr [ecx],al          ds:002b:0804bd98=3e

Garin answered 2/11, 2012 at 10:4 Comment(38)

Am I missing something? You don't seem to have posted the code for getIndices? – Whim 2/11, 2012 at 10:12

@john: I'm sorry - it was probably just too obvious to me. getIndices() is just a tiny wrapper, eventually calling lc::db::PackedIndices::unpack. – Garin 2/11, 2012 at 10:14

Any chance of count being zero? Your code would break if that were true. – Whim 2/11, 2012 at 10:24

Show the code that calls unpackIndices() in the failure case. Also, does p point to the right place at the time of failure? – Tradescantia 2/11, 2012 at 10:25

@john: no - count is never 0 (the code path would not be entered) – Garin 2/11, 2012 at 10:27

Seems like the thread is hung trying to take an SRW lock exclusively. !locks command will only look at critical section objects and not SRW. Use !rwlock. – Dispossess 2/11, 2012 at 10:28

@atzz: I added it to ideone.com/sVVXX7. – Garin 2/11, 2012 at 10:28

@nanda: I can't find that command - is it new? – Garin 2/11, 2012 at 10:31

@nanda: .load sosex.dll, !rwlock => Unable to initialize .NET data interface. The CLR has not yet been loaded in the process.... But this is a native app. – Garin 2/11, 2012 at 10:37

Please show the code that calls PackedIndices::unpack (and maybe several layers above it -- as many as needed to see where the indices parameter comes from). – Tradescantia 2/11, 2012 at 10:39

@atzz: indices is the first parameter passed to getIndices() in the code I showed in my question above. There is no code in between that would alter that pointer. – Garin 2/11, 2012 at 10:42

I don't think the SRW lock is the issue. The stack trace shows it's an internal lock from the system exception mechanism. So far, my bet is that you are corrupting something vital (probably a SEH frame on the stack). – Tradescantia 2/11, 2012 at 10:43

@atzz: trouble is, I cannot reproduce the error when running the application under the debugger. It's as if there is some kind of race condition. I was suspecting glDrawElements to still access the memory while I'm modifying it, but I'm not sure. – Garin 2/11, 2012 at 10:46

Judging from the thread's call stack, the hang is caused by a slim reader/writer lock. But the lock itself belongs (again judging from the stack) to the OS exception-handling code (something like the loader lock). If there are other threads, and I suppose there are, you should look at their call stacks and see whether any of them is stuck in exception-handling code. If you find the real cause of the exception there would be no hang, of course, but the hang might hold clues to the cause of the exception. – Dispossess 2/11, 2012 at 10:53

@nanda: I was already looking for something like this. The only thing I see is this: snipt.org/vhtf2 – Garin 2/11, 2012 at 11:5

I'd investigate where p was pointing at the moment of exception. The easiest way is probably to copy it to a global variable before each (*p++) = ... (which shouldn't distort timings). – Tradescantia 2/11, 2012 at 11:11

Leaving the hang aside. The instruction that is causing the crash is a bit surprising. "(mov dword ptr [esp+10h],eax)" given the only memory that is referred in this instruction is on the stack I wonder what kind of an exception it is. You could also inspect the exception object to see what is the exception. – Dispossess 2/11, 2012 at 11:16

So, one thread stops in an exception handler due to a lock held by some thread that stopped in GetProcAddress... or another module-related function. I suggest getting rid of SEH, VEH or whatever... and using a UEF (Unhandled Exception Filter). I've had tons of problems with VEH (which is also used in the '__except' implementation) during debugging. It doesn't necessarily mean a problem with your program. What about the Release build without a debugger attached? Two cents from experience. – Quartermaster 2/11, 2012 at 11:31

@user1240436: I just set a UEF - but it doesn't seem to change anything. It still hangs at the same location. – Garin 2/11, 2012 at 11:54

@nanda: strange. .exr -1 yields: ExceptionAddress: 77aa000c (ntdll!DbgBreakPoint), ExceptionCode: 80000003 (Break instruction exception). So the exception is just me breaking into the deadlocked application? – Garin 2/11, 2012 at 12:0

@nanda: I tried to locate the correct exception context (I'm not within my comfort zone here). Please see the update to the question. – Garin 2/11, 2012 at 12:12

You should 1) remove the __try __except, 2) set a UEF 3) put a debugger breakpoint into it 4) run in debugger (or have something wait for you to analyze it). Have you tried exactly this? This should bypass any exception filters and give you the root exception context. But it might get tricky if you're using C++ exceptions. I suggest adding UEF only just before the problem occurs. If no exceptions occur, try expanding your search for an exception by placing UEF earlier in the code. – Quartermaster 2/11, 2012 at 12:13

You could get the exception_record from the following stack frame. 0034e848 77aa0133 0034e860 0034e8b0 0034e860 ntdll!RtlDispatchException+0x19 - the argument to RtlDispatchException is pointer to EXCEPTION_RECORD. so if you type .exr 0034e860 you should be able to see the exception record corresponding to the call stack above. – Dispossess 2/11, 2012 at 12:22

@nanda: ok, so it's: Attempt to write to address 06a28fd0 ExceptionAddress: 1c43c296 (lcdb4!lc::db::PackedIndices::unpackIndices<unsigned char>+0x00000036) ExceptionCode: c0000005 (Access violation). – Garin 2/11, 2012 at 12:25

06a28fd0 is not on the stack of this thread which should have been the case if "mov dword ptr [esp+10h],eax" was the instruction causing the exception - this is more clear now. So, the address is probably on the heap - now you could closely look at the instruction and figure out from your code how the address correlates to the variables in your code. – Dispossess 2/11, 2012 at 12:31

@nanda: so what? 0x06a28fd0 is the buffer I've allocated (the memory addresses have changed, it's a different run). I don't get it - writing to that memory location should be just fine. – Garin 2/11, 2012 at 12:36

Really, if it's just an Access Violation, use the UEF to stop your application, then attach a debugger to see the actual stack. – Quartermaster 2/11, 2012 at 12:40

@user1240436: but I did - I removed the SEH and added this: SetUnhandledExceptionFilter(uef);, where uef contains {DebugBreak(); return EXCEPTION_CONTINUE_SEARCH;}. – Garin 2/11, 2012 at 12:42

By 'stop your application' I mean Sleep(-1) or a MessageBox(...). DebugBreak raises another terminal exception, or even an infinite exception handler recursion if you return CONTINUE_SEARCH. – Quartermaster 2/11, 2012 at 12:45

@nanda: that's really weird. From another run, I still get an Access violation, this time for address 07751e90. So I search for !heap -p -a 07751e90, and get: address 07751e90 found in _HEAP @ 7a0000, HEAP_ENTRY Size Prev Flags UserPtr UserSize - state, 07751e88 0003 0000 [00] 07751e90 0000c - (busy). So that pointer is definitely valid. – Garin 2/11, 2012 at 12:49

This is weird. It is a very remote possibility, but I wonder whether some code in your process is modifying the page attributes of memory pages. Try !address 7751e90 and see what the page attributes are for the page containing this portion of memory. – Dispossess 2/11, 2012 at 12:55

I managed to find two other threads that also try to acquire the SRW: snipt.org/vhvb6. – Garin 2/11, 2012 at 13:3

@user1240436: thanks - did that but it won't enter the handler anyway. – Garin 2/11, 2012 at 13:3

@nanda: there you are: Protect: 00000002 PAGE_READONLY. But it's in the heap, and no, I never modified the page attributes. – Garin 2/11, 2012 at 13:6

Ok - it's as Hans Passant says. The heap does seem to be in a weird state. Something locked that memory page that belongs to it. Is it possible to set a breakpoint on heap protection changes? – Garin 2/11, 2012 at 13:14

I see from your snippet that the heap is raising an exception to indicate corruption or an overrun. Maybe Windows changes the page attributes before raising the exception to stop further corruption; I am just guessing. Judging from your latest snippet, there seems to be some corruption in the OLE heap too. The root of the problem is probably some code corrupting a heap: the heap detects it and raises an exception, the exception-handling code hangs on the SRW lock, and then another thread touches an address that the heap has already protected because of the corruption. – Dispossess 2/11, 2012 at 13:17

@nanda: Thanks for your help - I haven't found the source of the problem yet, but we can be reasonably sure that it's some kind of heap corruption. I'll probably have to spend countless hours to find the source of the problem. Would you like to summarize your comments in an answer so I can accept it? – Garin 2/11, 2012 at 15:31

Daniel, thanks. I have summarized my comments to an answer. – Dispossess 4/11, 2012 at 5:49

The thread is hung waiting for exclusive ownership of an SRW (slim reader/writer) lock that belongs to the OS exception-handling code, and the exception being handled was raised by your code. You can get the details of that exception from the following stack frame: 0034e848 77aa0133 0034e860 0034e8b0 0034e860 ntdll!RtlDispatchException+0x19. The first argument to RtlDispatchException is a pointer to the EXCEPTION_RECORD, so typing .exr 0034e860 displays the exception record. If the exception is an access violation, the record tells you which address was being accessed.
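
To make the exception record easier to read, here is a small sketch of how its fields are interpreted for an access violation. It relies on the documented layout of EXCEPTION_RECORD: for code c0000005, ExceptionInformation[0] is 0 for a read, 1 for a write and 8 for a DEP violation, and ExceptionInformation[1] is the address that was accessed. The function name describeAccessViolation and its output format are hypothetical, and the sketch takes plain integers so it does not depend on <windows.h>.

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>

// Value of EXCEPTION_ACCESS_VIOLATION, as shown by .exr in the thread.
constexpr std::uint32_t kAccessViolation = 0xC0000005u;

// Hypothetical helper: turns the exception code and the first two
// ExceptionInformation entries into a short human-readable description.
std::string describeAccessViolation(std::uint32_t code,
                                    std::uintptr_t info0,
                                    std::uintptr_t info1) {
    if (code != kAccessViolation)
        return "not an access violation";

    const char* kind = "unknown access to";
    switch (info0) {
        case 0: kind = "read from"; break;   // read of inaccessible data
        case 1: kind = "write to"; break;    // write to inaccessible data
        case 8: kind = "execute at"; break;  // DEP violation
        default: break;
    }

    std::ostringstream out;
    out << kind << " 0x" << std::hex << info1;
    return out.str();
}

// Usage with the values reported in the comments above.
inline void exampleFromThread() {
    assert(describeAccessViolation(0xC0000005u, 1, 0x06a28fd0) ==
           "write to 0x6a28fd0");
}
```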

Once these steps showed that the access violation was caused by a write to an address you had legitimately allocated on the heap, the next step was to check the protection attributes of the virtual page containing that address with the !address command, passing it the virtual address.
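
The protection value that !address reports can be interpreted along these lines. This is a sketch only: the constants mirror the documented values of the Win32 PAGE_* memory protection flags but are renamed (kPage...) so they cannot clash with <windows.h>, and allowsWrite is a hypothetical helper. Inside the process, the same value would be available from the Protect field that VirtualQuery fills in.

```cpp
#include <cassert>
#include <cstdint>

// Renamed copies of the documented Win32 memory protection constants.
constexpr std::uint32_t kPageReadOnly         = 0x02;
constexpr std::uint32_t kPageReadWrite        = 0x04;
constexpr std::uint32_t kPageWriteCopy        = 0x08;
constexpr std::uint32_t kPageExecuteReadWrite = 0x40;
constexpr std::uint32_t kPageExecuteWriteCopy = 0x80;
constexpr std::uint32_t kPageGuard            = 0x100;

// Hypothetical helper: true if a plain write to a page with this protection
// would succeed without raising an exception.
bool allowsWrite(std::uint32_t protect) {
    // The first access to a guard page raises an exception, so treat it as
    // not writable for the purpose of this check.
    if (protect & kPageGuard)
        return false;

    // The low byte holds the basic access type; modifiers sit above it.
    switch (protect & 0xFFu) {
        case kPageReadWrite:
        case kPageWriteCopy:
        case kPageExecuteReadWrite:
        case kPageExecuteWriteCopy:
            return true;
        default:
            return false;
    }
}

// Usage with the value reported in the thread (Protect: 00000002).
inline void exampleProtection() {
    assert(!allowsWrite(kPageReadOnly));
    assert(allowsWrite(kPageReadWrite));
}
```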

You found that the protection of those heap pages had been changed (by some code) to PAGE_READONLY. Based on that and on the call stacks of the other threads, I have the following conjecture, which I think might help you find the root cause.

I am guessing that the Windows heap manager changes the page attributes before raising an exception to indicate heap corruption. Judging from the call stacks of the other threads you showed, there seems to be some corruption in the OLE heap too. The root of the problem is probably some code corrupting a heap. The heap later detects the corruption and raises an exception; the OS exception-dispatching code then kicks in and hangs on the SRW lock before it can call any exception handler in your code or in a library. After that, another unsuspecting thread in your code legitimately touches heap memory that the heap has already protected because of the corruption it detected. That raises another exception, and the exception-dispatching code runs into the same deadlock. Given that the problem is not reproducible when the program runs under the debugger, it seems likely that a timing issue or race condition is involved.

Dispossess answered 4/11, 2012 at 5:49 Comment(0)

The stack trace tells the story. Your program is crashing, good odds that this is an access violation exception, a typical failure mode for C++ code and usually triggered by heap corruption. Windows then tries to invoke the exception filters to look for any code that is willing to handle the exception. First up are the handlers installed by AddVectoredExceptionHandler(). It must take a lock to do so to prevent re-entry when one of those handlers in turn causes a crash.

And that's where the buck stops. Exactly why is unclear from the stack trace. It could be because another thread has also fallen over on the heap corruption, is busy handling the exception and has taken the lock. Use Debug + Windows + Threads to look at them. It is also possible that the process state is so mangled that the lock object itself got corrupted as well. Unlikely, but it does happen.

And yes, switching off the low-fragmentation heap has a knack for hiding heap corruption. The memory layout will be very different so whatever code is causing the corruption may now have whacked something innocent. It is of course not a solution.

Debug + Exceptions, tick the Thrown checkbox for "Win32 Exceptions". The debugger will now stop when the exception is thrown. At least you'll know what exception is being thrown. Ultimately you do need to find out where the heap corruption occurs. It is never located at the code that crashed, good luck debugging it.

Tun answered 2/11, 2012 at 12:4 Comment(2)

Thanks Hans. This all makes sense. Unfortunately for me, I can't reproduce the problem with the debugger attached (even if I attach it after the process has been created). So I can't break into the debugger when the exception is thrown. – Garin 2/11, 2012 at 12:18

Well, you have a very good guess at what is wrong. Solving heap corruption with a debugger broken at the location where the program falls over isn't that useful anyway. You gotta find the code that actually caused the corruption. – Tun 2/11, 2012 at 12:42

If you're using an ATI graphics card (with ATI drivers), it's a known issue that you must not leak any state, or else memory corruption occurs later on.

Try to disable all the states you can (glDisableClientState), and use APITrace to find out which one you've forgotten.

One easy way to test for memory corruption in a graphics driver is either to test on another board/driver, or to force software rendering.

Succinate answered 2/11, 2012 at 11:20 Comment(1)

Thanks - the driver is NVIDIA. But I've been using gDEBugger to ensure that I'm not leaking anything. – Garin 2/11, 2012 at 11:52
