Critical section hang analysis with Windbg
Asked Answered
P

1

5

I recently got myself a dump file generated via procdump when my application remained unresponsive for a while.

When I run !locks on the dump file, I get a lone entry that goes something like:

0:000> !locks

CritSec +123456 at 00123456
WaiterWoken       No
LockCount         0
RecursionCount    2
OwningThread      aaaa
EntryCount        0
ContentionCount   0
*** Locked

Just that one listing. That's all. When I drill down further with:

0:000> dt RTL_CRITICAL_SECTION 00123456
MyModule!RTL_CRITICAL_SECTION
  +0x000 DebugInfo      : 0x00aabbcc _RTL_CRITICAL_SECTION_DEBUG
  +0x004 LockCount      : 0n-2
  +0x008 RecursionCount : 0n2
  +0x00c OwningThread   : 0x0000aaaa Void
  +0x010 LockSemaphore  : (null)
  +0x014 SpinCount      : 0

Questions:

1) Should I take MyModule!RTL_CRITICAL_SECTION as definitive clue this critical section is probably defined inside MyModule?

2) Is there a way to make Windbg display the actual variable name of this critical section? (i.e assuming #1 is true and this is a CS the application code has defined/access to.)

3) Why do the LockCount values in the aforementioned listings at odds with each other? (one has 0 and the other has -2.)

4) I think I understand LockCount well enough to know it can't go below -1. Not to mention RecursionCount seems to be badly out of step with LockCount.

I guess the bottom line question is should I just chalk this up to corrupted CS?

Price answered 10/8, 2016 at 17:2 Comment(0)
K
8

I could easily reproduce the effects of negative LockCount with a sample application, and here are my answers:

Responses to your questions

Regarding 1)

Should I take MyModule!RTL_CRITICAL_SECTION as definitive clue this critical section is probably defined inside MyModule?

Yes, it's your definition of a critical section, which may or may not correspond to Microsoft's definition. To use Microsoft's definition use dt nt!_RTL_CRITICAL_SECTION

Regarding 2)

Is there a way to make Windbg display the actual variable name of this critical section? (i.e assuming #1 is true and this is a CS the application code has defined/access to.)

Yes, if it's used by a function that's on the call stack. With .frame, navigate to the frme, and with ?? variableName display the variable, e.g.

0:000> k L2
 # ChildEBP RetAddr  
00 0116faa4 00d67419 KERNELBASE!DebugBreak+0x2
01 0116fc5c 00d67ebe CriticalSectionLeaveTwice!main+0x109
0:000> .frame 1
01 0116fc5c 00d67ebe CriticalSectionLeaveTwice!main+0x109 [c:\users\t\documents\visual studio 2015\projects\criticalsectionleavetwice\criticalsectionleavetwice\criticalsectionleavetwice.cpp @ 24]
0:000> ?? CriticalSection
struct _RTL_CRITICAL_SECTION
   +0x000 DebugInfo        : 0xffffffff _RTL_CRITICAL_SECTION_DEBUG
   +0x004 LockCount        : 0n-2
   +0x008 RecursionCount   : 0n2
   +0x00c OwningThread     : 0x00000d6c Void
   +0x010 LockSemaphore    : (null) 
   +0x014 SpinCount        : 0x20007d0

Regarding 3)

Why do the LockCount values in the aforementioned listings at odds with each other? (one has 0 and the other has -2.)

The field LockCount is not a true lock count any more, as explained in this answer. Relevant part:

In Microsoft Windows Server 2003 Service Pack 1 and later versions of Windows, the LockCount field is parsed as follows:

  • The lowest bit shows the lock status. If this bit is 0, the critical section is locked; if it is 1, the critical section is not locked.
  • The next bit shows whether a thread has been woken for this lock. If this bit is 0, then a thread has been woken for this lock; if it is 1, no thread has been woken.
  • The remaining bits are the ones-complement of the number of threads waiting for the lock.

IMHO the !locks command should do the interpretation for you.

The particular value of -2 is 11111111 ... 1111110 in binary, so the last bit is 0, meaning that the critical section is locked. The bit before is 1, so no thread has been woken. The one's complement of the remaining is 0, which corresponds to the lock count output of !locks.

This means, that the critical section you analyzed is not involved in a deadlock.

Regarding 4)

I think I understand LockCount well enough to know it can't go below -1. Not to mention RecursionCount seems to be badly out of step with LockCount.

See #3.

Regarding unnumbered question 5)

should I just chalk this up to corrupted CS?

No. It does not seem to be corrupted.

Using !dlk for deadlock analysis

To check if it's responsible for a deadlock, I suggest using 's (download) !dlk command. Although it's an extension for .NET primarily, I have once requested to make it work for critical sections even without .NET - and the feature was implemented in one of the newer versions of SOSex.

If it could not find .NET, it'll output a warning about that fact and then continue analyzing the critical sections:

In your case it may look like this:

0:000> !dlk
Unable to initialize .NET data interface. The CLR has not yet been loaded in the process (mscorwks/clr module not loaded).
Examining CriticalSections...
No deadlocks detected.

Usage:

.load c:\path\to\sosex.dll
!dlk

If it identifies a deadlock, it's very easy to read. If it does not, you still need to apply other techniques, i.e. it's not a guarantee that there's no deadlock of some other kind (e.g. in case the wait chain includes other types of synchronization objects like threads, events, ...).

Example output (not for critical section, but will be similar):

0:010> !dlk
Deadlock detected:
CLR thread 4 holds sync block 00000000024c6970 OBJ:000000007fff0f80[System.String] STRVAL=SYNC1
             waits sync block 00000000024c6928 OBJ:000000007fff0fa8[System.String] STRVAL=SYNC2
CLR thread 5 holds sync block 00000000024c6928 OBJ:000000007fff0fa8[System.String] STRVAL=SYNC2
             waits sync block 00000000024c6970 OBJ:000000007fff0f80[System.String] STRVAL=SYNC1
CLR Thread 4 is waiting at ConsoleTestApp.ConsoleTestApp.MonitorDeadlockThreadProc()+0xa4(IL) [C:\dev\ConsoleTestApp\ConsoleTestApp.cs, line 195]
CLR Thread 5 is waiting at ConsoleTestApp.ConsoleTestApp.MonitorDeadlockThreadProc()+0xa4(IL) [C:\dev\ConsoleTestApp\ConsoleTestApp.cs, line 195]

1 deadlock detected. 

Hang analysis

!analyze -hang and Debug Diag may help in hang analysis.

Kafka answered 10/8, 2016 at 17:15 Comment(9)
Apropos how the recursion count goes up, I understand. But if the recursion count is at 2, how can the lock count be at 0 (or -2 depending on which listing you believe)? esp. considering if a thread calls LeaveCriticalSection both recursion and lock counts go down.Price
@ForeverLearning: I have reproduced your case and updated my answer.Kafka
you are a Ninja! Thanks for all the help. I will look into this a bit longer and report back.Price
If I'm reading that all correctly, a LockCount field of -2 means that one thread holds the lock and no threads are waiting for it. Doesn't that pretty much rule out the possibility that this particular critical section is involved in a deadlock?Beseech
@HarryJohnston: exactly. I added the explanation of the -2 and this conclusion.Kafka
Thanks for getting Steve Johnson to implement !dlk for non-managed processes!Oxytocic
@MarcSherman: I'm also glad he implemented it. Helped me out a couple of times already.Kafka
@HarryJohnston Not sure how you figure that. LockCount starts with -1 and if a thread acquires the CS, it goes to 0. A count of 2 means, one thread acquired it and 2 other threads are waiting on it. So unless you release the CS more times than you have acquired it, how can it go below -1?Price
@HarryJohnston sorry. I did not notice Thomas' update to this thread. He did mention I should stick to the interpretation of !locks which reports 0 and that definitely makes sense to me.Price

© 2022 - 2024 — McMap. All rights reserved.