How do I debug a difficult-to-reproduce crash with no useful call stack?

Asked 17/1, 2011 at 23:48 Answered 18/1, 2011 at 10:3

Solved delphi crash c++builder callstack

I am encountering an odd crash in our software and I'm having a lot of trouble debugging it, and so I am seeking SO's advice on how to tackle it.

The crash is an access violation reading a NULL pointer:

First chance exception at $00CF0041. Exception class $C0000005 with message 'access violation at 0x00cf0041: read of address 0x00000000'.

It only happens 'sometimes' - I haven't managed to figure out any rhyme or reason, yet, for when - and only in the main thread. When it occurs, the call stack contains one incorrect entry:

Call stack with one line, Classes::TList::Get, address 0x00cf0041

For the main thread, which this is, it should show a large stack full of other items.

At this point, all other threads are inactive (mostly sitting in WaitForSingleObject or a similar function.) I have only seen this crash occur in the main thread. It always has the same call stack of one entry, in the same method at the same address. This method may or may not be related - we do use the VCL in our application. My bet, though, is that something (possibly quite a while ago) is corrupting the stack, and the address where it's crashing is effectively random. Note it has been the same address across several builds, though - it's probably not truly random.

Here is what I've tried:

Trying to reproduce it reliably at a certain point. I have found nothing that reproduces it every time, and a couple of things that occasionally do, or do not, for no apparent reason. These are not 'narrow' enough actions to narrow it down to a particular section of code. It may be timing related, but at the point the IDE breaks in, other threads are usually doing nothing. I can't rule out a threading problem, but think it's unlikely.
Building with extra debugging statements (extra debug info, extra asserts, etc.). After doing so, the crash never occurs.
Building with CodeGuard enabled. After doing so, the crash never occurs and CodeGuard shows no errors.

My questions:

1. How do I find what code caused the crash? How do I do the equivalent of walking back up the stack?

2. What general advice do you have for how to trace the cause of this crash?

I am using Embarcadero RAD Studio 2010 (the project mostly contains C++ Builder code and small amounts of Delphi.)

Edit: I thought I should add what actually caused this. There was a thread that called ReadDirectoryChangesW and then, using GetOverlappedResult, waited on an event to continue and do something with the changes. The event was also signalled in order to terminate the thread after setting a status flag. The problem was that when the thread exited it never called CancelIo. As a result, Windows was still tracking changes and probably still writing to the buffer when the directory changed, even though the buffer, overlapped structure and event no longer existed (nor did the thread context in which they were created). Once the thread called CancelIo before exiting, there were no more crashes.

Rani answered 17/1, 2011 at 23:48 Comment(3)

I'm not familiar with CodeGuard - does it also introduce stack canaries and validation? I ask because you are mixing C++ and Delphi - which means you might be mixing calling conventions without realizing it. That can very quickly mess up your stack in ways that would manifest as a seemingly random crash on your main thread with a corrupt call stack. – Carolynncarolynne 17/1, 2011 at 23:59

Codeguard fills the uninitialized portion of the stack with a byte pattern. It also (tries to) verify things like accessing freed memory, overruns in allocated memory, etc. Getting a calling convention wrong would definitely cause something like this, yes (and thanks for the suggestion!) but if so I've no idea where: C++ Builder is designed to interoperate with Delphi code and we'd have to have made an error in a declaration somewhere, and most are IDE- or compiler-managed. I guess the key question then is, how would I go about finding an incorrectly declared method? – Rani 18/1, 2011 at 0:34

I'm not putting this as an answer because it's vague, but you may want to try a different debugger. You can give e.g. WinDbg hints (or everything) to reconstruct the real callstack if it's been corrupted or confused. – Coreencorel 18/1, 2011 at 0:48

Even when the IDE-provided stack trace isn't very complete, that doesn't mean there isn't still useful information on the stack. Open up the CPU view and check out the stack pane; for every CALL opcode, a return address is pushed on the stack. Since the stack grows downwards, you'll find these return addresses above the current stack location, i.e. by scrolling upwards in the stack pane.

The stack for the main thread will be somewhere around $00120000 or $00180000 (address space randomization in Vista and upwards has made it more random). Code for the main executable will be somewhere around $00400000. You can speculatively investigate elements on the stack that don't look like integer data (low values) or stack addresses ($00120000+ range) by right-clicking on the stack entry and selecting Follow -> Near Code, which will cause the disassembly window to jump to that code address. If it looks like invalid code, it's probably not a valid entry in the stack trace. If it's valid code, it may be OS code (frequently around $77000000 and above) in which case you won't have meaningful symbols, but every so often you'll hit on an actual proper stack entry.

This technique, though somewhat laborious, can get you meaningful stack trace info when the debugger isn't able to trace things through. It doesn't help you if ESP (the stack pointer) has been screwed with, though. Fortunately, that's pretty rare.
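The manual scan described above can be sketched in code. This is only a hypothetical illustration in portable C++, not something the IDE provides: the names `AddressRange`, `ProcessLayout`, `classify_word` and `candidate_return_addresses` are made up, and the range bounds are assumptions that loosely follow the rough figures quoted above. Real ranges differ per process and OS version.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical half-open address range [begin, end).
struct AddressRange {
    std::uint32_t begin;
    std::uint32_t end;
    bool contains(std::uint32_t a) const { return a >= begin && a < end; }
};

enum class StackWordKind { SmallInteger, StackAddress, ModuleCode, SystemCode, Unknown };

// Assumed 32-bit layout; every bound here is a guess to be adjusted for the real process.
struct ProcessLayout {
    AddressRange stack{0x00120000u, 0x00190000u};        // main thread stack (assumed)
    AddressRange module_code{0x00400000u, 0x01000000u};  // main executable code (assumed)
    AddressRange system_code{0x77000000u, 0x80000000u};  // OS DLLs (assumed)
    std::uint32_t small_integer_limit = 0x00010000u;     // below this, treat as plain data
};

// Decide what a single 32-bit stack slot most likely holds.
StackWordKind classify_word(std::uint32_t word, const ProcessLayout& layout) {
    if (word < layout.small_integer_limit) return StackWordKind::SmallInteger;
    if (layout.stack.contains(word)) return StackWordKind::StackAddress;
    if (layout.module_code.contains(word)) return StackWordKind::ModuleCode;
    if (layout.system_code.contains(word)) return StackWordKind::SystemCode;
    return StackWordKind::Unknown;
}

// Return the indices of the slots worth following in the disassembly view:
// those whose value falls inside a code range. These are candidates only;
// a stale value left over from an earlier call looks exactly the same.
std::vector<std::size_t> candidate_return_addresses(const std::vector<std::uint32_t>& words,
                                                    const ProcessLayout& layout) {
    std::vector<std::size_t> candidates;
    for (std::size_t i = 0; i < words.size(); ++i) {
        StackWordKind kind = classify_word(words[i], layout);
        if (kind == StackWordKind::ModuleCode || kind == StackWordKind::SystemCode) {
            candidates.push_back(i);
        }
    }
    return candidates;
}
```

Feeding it the words copied out of the stack pane narrows down which entries to try with Follow -> Near Code; each candidate still has to be checked by eye.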

Clutter answered 18/1, 2011 at 5:11 Comment(2)

Thanks Barry! This is very helpful - and very useful info to know in general anyway. – Rani 19/1, 2011 at 22:49

This has just solved what may have been this bug (or another - either way, it's been very helpful. I've been uncovering some random code recently!) Thanks for taking the time to answer - I've just marked it as the answer to the question. – Rani 25/1, 2011 at 0:11

That is the reason I made the Process Stack viewer :-) http://code.google.com/p/asmprofiler/wiki/ProcessStackViewer

It can show the stack with raw stack tracing, so it will show the complete stack when normal stack tracing is not possible.
But beware: raw stack tracing will show "false positives"! Any address on the stack for which a function name can be found will be listed.

It helped me a number of times when I ran into the same problem as yours (no normal stack walking by Delphi possible due to an invalid stack state).
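To show why raw stack tracing produces false positives, here is a minimal sketch in portable C++. It is not how the Process Stack Viewer is implemented; `SymbolTable`, `resolve`, `RawFrame` and `raw_stack_trace` are hypothetical names, and the symbol table is assumed to have been filled from something like a linker map file.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical symbol table: function start address -> function name.
using SymbolTable = std::map<std::uint32_t, std::string>;

// Find the function whose start address is the greatest one not above `address`.
// Returns an empty string if the address is below the first symbol or at/after `code_end`.
std::string resolve(const SymbolTable& symbols, std::uint32_t code_end, std::uint32_t address) {
    if (address >= code_end) return {};
    auto it = symbols.upper_bound(address);  // first symbol that starts after `address`
    if (it == symbols.begin()) return {};
    --it;
    return it->second;
}

struct RawFrame {
    std::size_t slot;       // index of the stack word
    std::uint32_t address;  // value found in that slot
    std::string function;   // name it resolved to
};

// Raw tracing: list every stack word that resolves to some function name.
// Nothing here proves the word is a live return address, so stale values
// from earlier calls are reported too. Those are the false positives.
std::vector<RawFrame> raw_stack_trace(const std::vector<std::uint32_t>& stack_words,
                                      const SymbolTable& symbols,
                                      std::uint32_t code_end) {
    std::vector<RawFrame> frames;
    for (std::size_t i = 0; i < stack_words.size(); ++i) {
        std::string name = resolve(symbols, code_end, stack_words[i]);
        if (!name.empty()) {
            frames.push_back({i, stack_words[i], name});
        }
    }
    return frames;
}
```

The caller passes the words read from the thread's stack plus the end address of the code section; anything that lands inside a known function is reported, which is why the output needs a human to filter it.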

Edit: new version uploaded; the website had an old version (I use the new version a lot myself): http://asmprofiler.googlecode.com/files/AsmProfiler_Sampling%20v1.0.7.13.zip

Ent answered 18/1, 2011 at 10:3 Comment(5)

rrrr too busy lately, forgot all about it, thank you for reminding!! – Bonkers 18/1, 2011 at 10:48

+1 It sounds like you've automated the procedure I described. – Clutter 18/1, 2011 at 11:41

Sounds very useful! I'll try it out. – Rani 19/1, 2011 at 22:50

Andre, I have tried this and I get a little bit more information with the default stack trace it shows, but still not much. The "Raw stack tracing" checkbox is disabled, so I can't enable it. Any idea why? It happens on both Vista64 and 32. – Rani 2/2, 2011 at 2:20

David, sorry, new version uploaded – Costplus 2/2, 2011 at 7:10

Threading may be the reason here. The usual suspects are threads that use OVERLAPPED structures on the stack and threads that send pointers to objects that are on the stack to other threads.
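As a hedged illustration of the second suspect, the sketch below uses standard C++ threads rather than the Windows or VCL threading APIs. `Job` and `start_worker` are hypothetical names. The point is that the worker holds shared ownership of heap-allocated data, so the data outlives the stack frame of whoever started the thread, whereas a pointer to a local variable would dangle once that function returned.

```cpp
#include <memory>
#include <string>
#include <thread>

// Hypothetical unit of work handed to another thread.
struct Job {
    std::string path;
    int result = 0;
};

// The lambda copies the shared_ptr, so the Job stays alive until the worker
// has finished with it, even if the caller's stack frame is long gone.
// Capturing the address of a local Job here instead would leave the worker
// writing into a dead stack frame.
std::thread start_worker(std::shared_ptr<Job> job) {
    return std::thread([job] {
        job->result = static_cast<int>(job->path.size());
    });
}
```

The caller creates the `Job` with `std::make_shared`, keeps the returned `std::thread`, and joins it before reading `result`.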

It may be possible to recover partial stack information if you use the Debugging Tools for Windows and use the "dps" command.

Legitimatize answered 18/1, 2011 at 0:25 Comment(4)

Thanks John, and I'll look into this. I've written most of our threading code, and where it passes objects they are definitely dynamically allocated. I will still double-check though! – Rani 18/1, 2011 at 0:45

Will the Debugging Tools for Windows work with code not compiled with a Microsoft compiler and not using their debug info format? Embarcadero's tools don't produce PDB files, for example. – Rani 18/1, 2011 at 0:46

The Windows tools require a compatible symbols format (preferably PDB, but for this purpose even a DBG file would work). – Legitimatize 18/1, 2011 at 4:14

You can convert a Delphi map file to the (old) Windows DBG format: code.google.com/p/map2dbg However, DBG is not supported in the newest Visual Studio versions (2008, 2010), but WinDbg still accepts it? – Costplus 2/2, 2011 at 7:11

I'm not 100% sure, but from the image you provided I believe that somewhere during execution you're trying to access an object in a TList that is NULL, i.e.:

AList[Index].SomeProperty/SomeMethod/etc. <-- error if (AList[Index] == NULL)

Debugging and finding the actual place where the exception is raised is never an easy task, especially when there's not much information or it is hard to reproduce. In this case I usually:

go step by step from the main form's execution (if there is no exception up to that point)
while going step by step, if I find any unsafe code I wrap it in try...except and add conditions for indexes (if I have arrays, lists, expected values to be passed, etc.)
if the above fails to find the issue, check whether some libraries are failing
use EurekaLog; it sometimes fails as well (very rarely), but it usually points you in the right direction
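The "conditions for indexes" step can be sketched as follows. This is a hypothetical example that swaps in `std::vector` for the VCL `TList`, and `Item`, `checked_get` and `value_or` are made-up names. It only shows the shape of the guard: check the index, then check the pointer, and only then dereference.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical element type standing in for whatever the list holds.
struct Item {
    int value;
};

// Returns the element at `index`, or nullptr if the index is out of range.
// A slot that itself holds a null pointer is also returned as nullptr.
const Item* checked_get(const std::vector<Item*>& list, std::size_t index) {
    if (index >= list.size()) return nullptr;
    return list[index];
}

// Reads item->value only after both checks pass; otherwise returns `fallback`.
int value_or(const std::vector<Item*>& list, std::size_t index, int fallback) {
    const Item* item = checked_get(list, index);
    return item ? item->value : fallback;
}
```

Guards like this do not explain a corrupted stack, but they turn a silent NULL read into a condition that can be logged or asserted on at the place it first goes wrong.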

I have had numerous issues similar to yours and I can tell you that the issue was almost always extremely easy to fix; however, when the error popped up, I did not get anything pointing near the error.

Bonkers answered 18/1, 2011 at 3:43 Comment(2)

I know it looks like the code is accessing a TList, but it may not be. The stack is broken, so who knows if even that part of it is valid. Eureka Log is an interesting suggestion: I've heard of it but never used it before! – Rani 19/1, 2011 at 22:51

@Rani M well you should, it saves a lot of time. When I first heard about it I was skeptical, but after a few tests I was very impressed by how much time it saves. Again, there are situations in which Eureka fails, but these are very few. – Bonkers 20/1, 2011 at 7:10
