High thread count stuck in GCFrame causes high CPU usage

We have an application that uses Kestrel to serve HTTP requests. We've had some problems in the past where high load caused the thread pool to spawn thousands of threads, at which point we would get lock convoy symptoms. Most of the time, the threads would start blocking each other at Monitor.Enter() somewhere in our code, causing delays and more contention until the application became unresponsive with 100% CPU usage due to context switching. The problem would not go away until we restarted the application.

However, we've eliminated most locks and implemented a throttling mechanism so that we don't allow more than 1000 threads to enter the application. We're using the System.Threading.Semaphore class to allow only a set number of threads to continue. This has solved our lock contention problems, but possibly introduced a new problem:
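
(For illustration, here is a minimal sketch of the kind of throttle described above, written as ASP.NET Core middleware. The middleware name, the limit of 1000, and the use of SemaphoreSlim instead of the System.Threading.Semaphore class we actually use are all assumptions made for the example, not our production code.)

using System.Threading;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Http;

public class ThrottlingMiddleware
{
    // Hypothetical limit; in our case roughly 1000 concurrent requests.
    private const int MaxConcurrentRequests = 1000;

    private static readonly SemaphoreSlim Gate =
        new SemaphoreSlim(MaxConcurrentRequests, MaxConcurrentRequests);

    private readonly RequestDelegate _next;

    public ThrottlingMiddleware(RequestDelegate next)
    {
        _next = next;
    }

    public async Task Invoke(HttpContext context)
    {
        // Queue the request here instead of letting it fan out into more threads.
        await Gate.WaitAsync();
        try
        {
            await _next(context);
        }
        finally
        {
            // Always release, even when the downstream pipeline throws.
            Gate.Release();
        }
    }
}

(Registered with app.UseMiddleware<ThrottlingMiddleware>(). WaitAsync queues excess requests without blocking a thread-pool thread, which is the main reason the sketch uses SemaphoreSlim.)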

We still get cases of 100% CPU usage and a high thread count (500-1000 threads), although this time the threads are not blocked on Monitor.Enter(). Instead, when we take a thread dump (using Microsoft.Diagnostics.Runtime.ClrRuntime), we see the following call stack for hundreds of threads:

thread id = 892
  GCFrame
  GCFrame
  HelperMethodFrame
  System.Threading.TimerQueueTimer.Fire()
  System.Threading.QueueUserWorkItemCallback.System.Threading.IThreadPoolWorkItem.ExecuteWorkItem()
  System.Threading.ThreadPoolWorkQueue.Dispatch()
  DebuggerU2MCatchHandlerFrame
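
(For reference, a rough sketch of how such a dump can be produced with Microsoft.Diagnostics.Runtime. It is written against the ClrMD 2.x-style API and loads a dump file from a hypothetical path; method names differ slightly between ClrMD versions, so treat this as a sketch rather than our exact tooling.)

using System;
using Microsoft.Diagnostics.Runtime;

class ThreadDumper
{
    static void Main()
    {
        // Hypothetical dump path; ClrMD can also attach to a live process.
        using (DataTarget target = DataTarget.LoadDump(@"C:\dumps\app.dmp"))
        {
            ClrRuntime runtime = target.ClrVersions[0].CreateRuntime();

            foreach (ClrThread thread in runtime.Threads)
            {
                if (!thread.IsAlive)
                    continue;

                Console.WriteLine($"thread id = {thread.ManagedThreadId}");

                // Each frame's ToString() yields the method or runtime frame name
                // (GCFrame, HelperMethodFrame, managed methods, ...).
                foreach (ClrStackFrame frame in thread.EnumerateStackTrace())
                    Console.WriteLine($"  {frame}");

                Console.WriteLine();
            }
        }
    }
}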

In this case, the problem would cause the application to become unresponsive, but in most cases it resolves itself after a few minutes. Sometimes it takes hours.

What does a call stack like this mean? Is this a known problem with Kestrel or is it some kind of combination of Kestrel and Semaphore that is causing this?

UPDATE: A memory dump reveals that the HelperMethodFrame in the call stack is probably a call to Monitor.Enter() after all. However, we still cannot pinpoint whether that call is in our code, in Kestrel, or in some other library. When we had our lock convoy problems before, we would see our own code in the call stack. Now it seems to be a Monitor.Enter() call inside TimerQueueTimer instead, which we do not use in our code. The memory dump looks like this:

.NET stack trace:

Child SP               IP Call Site
0000005a92b5e438 00007ff8a11c0c6a [GCFrame: 0000005a92b5e438] 
0000005a92b5e660 00007ff8a11c0c6a [GCFrame: 0000005a92b5e660] 
0000005a92b5e698 00007ff8a11c0c6a [HelperMethodFrame: 0000005a92b5e698] System.Threading.Monitor.Enter(System.Object)
0000005a92b5e790 00007ff88f30096b System.Threading.TimerQueueTimer.Fire()
0000005a92b5e7e0 00007ff88f2e1a1d System.Threading.QueueUserWorkItemCallback.System.Threading.IThreadPoolWorkItem.ExecuteWorkItem()
0000005a92b5e820 00007ff88f2e1f70 System.Threading.ThreadPoolWorkQueue.Dispatch()
0000005a92b5ed48 00007ff890413753 [DebuggerU2MCatchHandlerFrame: 0000005a92b5ed48]

Full stack trace:

 # Child-SP          RetAddr           : Args to Child                                                           : Call Site
00 0000005a`9cf5e9a8 00007ff8`9e6513ed : 00000000`00000000 00000000`00000001 00000000`00000001 00000000`00000000 : ntdll!ZwWaitForMultipleObjects+0xa
01 0000005a`9cf5e9b0 00007ff8`904e92aa : 0000005a`9cf5ef48 00007ff5`fffce000 0000005a`00000000 00000000`00000000 : KERNELBASE!WaitForMultipleObjectsEx+0xe1
02 0000005a`9cf5ec90 00007ff8`904e91bf : 00000000`00000001 00000000`00000000 0000005a`66b48e20 00000000`ffffffff : clr!WaitForMultipleObjectsEx_SO_TOLERANT+0x62
03 0000005a`9cf5ecf0 00007ff8`904e8fb1 : 0000005a`66b48e20 00000000`00000001 00000000`00000018 00007ff8`00000000 : clr!Thread::DoAppropriateWaitWorker+0x243
04 0000005a`9cf5edf0 00007ff8`90731267 : 00000000`00000000 00007ff8`00000001 0000004f`a419c548 0000004f`a419c548 : clr!Thread::DoAppropriateWait+0x7d
05 0000005a`9cf5ee70 00007ff8`90834a56 : 0000005a`5aec0308 0000005a`9cf5f0d0 00000000`00000000 0000005a`66b48e20 : clr!CLREventBase::WaitEx+0x28e6b7
06 0000005a`9cf5ef00 00007ff8`9083495a : 0000005a`5aec0308 0000005a`66b48e20 00000000`00000000 00000050`22945ab8 : clr!AwareLock::EnterEpilogHelper+0xca
07 0000005a`9cf5efc0 00007ff8`90763c8c : 0000005a`66b48e20 0000005a`5aec0308 0000005a`5aec0308 00000000`002d0d01 : clr!AwareLock::EnterEpilog+0x62
08 0000005a`9cf5f020 00007ff8`908347ed : 00000000`00000000 0000005a`9cf5f0d0 0000005a`5aec0308 0000005a`5aec0301 : clr!AwareLock::Enter+0x24390c
09 0000005a`9cf5f050 00007ff8`908338a5 : 00000050`22945ab8 0000005a`9cf5f201 0000005a`66b48e20 00007ff8`90419050 : clr!AwareLock::Contention+0x2fd
0a 0000005a`9cf5f110 00007ff8`8f30096b : 0000005a`5aec0308 0000005a`9cf5f2d0 0000005a`9cf5f560 00000000`00000000 : clr!JITutil_MonContention+0xc5
0b 0000005a`9cf5f2a0 00007ff8`8f2e1a1d : 00000051`a2bb6bb0 00007ff8`90417d0e 00000050`229491d8 0000005a`9cf5f330 : mscorlib_ni+0x49096b
0c 0000005a`9cf5f2f0 00007ff8`8f2e1f70 : 00000000`00000000 0000005a`9cf5f3a8 00000000`00000001 0000005a`9cf5f370 : mscorlib_ni+0x471a1d
0d 0000005a`9cf5f330 00007ff8`90413753 : 00000000`00000004 00000000`00000000 0000005a`9cf5f600 0000005a`9cf5f688 : mscorlib_ni+0x471f70
0e 0000005a`9cf5f3d0 00007ff8`9041361c : 00000050`22945ab8 00000000`00000000 0000005a`9cf5f640 0000005a`9cf5f6c8 : clr!CallDescrWorkerInternal+0x83
0f 0000005a`9cf5f410 00007ff8`904144d3 : 00000000`00000000 00000000`00000004 0000005a`9cf5f858 0000005a`9cf5f688 : clr!CallDescrWorkerWithHandler+0x4e
10 0000005a`9cf5f450 00007ff8`9041b73d : 0000005a`9cf5fb70 0000005a`9cf5fb20 0000005a`9cf5fb70 00000000`00000001 : clr!MethodDescCallSite::CallTargetWorker+0x2af
11 0000005a`9cf5f5e0 00007ff8`90416810 : 00000000`00000007 00007ff8`00000000 ffffffff`fffffffe 0000005a`66b48e20 : clr!QueueUserWorkItemManagedCallback+0x2a
12 0000005a`9cf5f6d0 00007ff8`904167c0 : 00670061`00500064 00000000`00730065 ffffffff`fffffffe 0000005a`66b48e20 : clr!ManagedThreadBase_DispatchInner+0x29
13 0000005a`9cf5f710 00007ff8`90416705 : ffffffff`ffffffff 00007ff8`90414051 0000005a`9cf5f7b8 00000000`ffffffff : clr!ManagedThreadBase_DispatchMiddle+0x6c
14 0000005a`9cf5f810 00007ff8`90416947 : ffffffff`ffffffff 0000005a`66b48e20 0000005a`66b48e20 00000000`00000001 : clr!ManagedThreadBase_DispatchOuter+0x75
15 0000005a`9cf5f8a0 00007ff8`9041b6a2 : 0000005a`9cf5f988 00000000`00000000 00000000`00000001 00007ff8`9e651118 : clr!ManagedThreadBase_FullTransitionWithAD+0x2f
16 0000005a`9cf5f900 00007ff8`904158ba : 0000005a`9cf5fb70 0000005a`9cf5fb68 00000000`00000000 00000000`00000200 : clr!ManagedPerAppDomainTPCount::DispatchWorkItem+0x11c
17 0000005a`9cf5fa90 00007ff8`904157da : 0000010b`010b010b 0000005a`9cf5fb20 00000000`00000000 0000005a`66b48e20 : clr!ThreadpoolMgr::ExecuteWorkRequest+0x64
18 0000005a`9cf5fac0 00007ff8`90433e1e : 00000000`00000000 00000000`00000000 00000000`00000001 00000000`0000041d : clr!ThreadpoolMgr::WorkerThreadStart+0x3b5
19 0000005a`9cf5fb60 00007ff8`9e7c13d2 : 00007ff8`90433da8 0000005a`5add4db0 00000000`00000000 00000000`00000000 : clr!Thread::intermediateThreadProc+0x7d
1a 0000005a`9cf5fca0 00007ff8`a11454e4 : 00007ff8`9e7c13b0 00000000`00000000 00000000`00000000 00000000`00000000 : kernel32!BaseThreadInitThunk+0x22
1b 0000005a`9cf5fcd0 00000000`00000000 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : ntdll!RtlUserThreadStart+0x34

UPDATE 2: The WinDbg !syncblk command gives us this:

The version of SOS does not match the version of CLR you are debugging.  Please
load the matching version of SOS for the version of CLR you are debugging.
CLR Version: 4.6.1055.0
SOS Version: 4.6.1637.0
Index SyncBlock MonitorHeld Recursion Owning Thread Info  SyncBlock Owner
  148 0000005a5aec0308          426         0 0000000000000000     none    0000005022945ab8 System.Threading.TimerQueue
-----------------------------
Total           152
CCW             1
RCW             1
ComClassFactory 0
Free            66

UPDATE 3: More digging shows that we have about 42000 Timer objects:

00007ff8871bedd0    41728      1001472 System.Runtime.Caching.MemoryCacheEqualityComparer
00007ff88f4a0998    42394      1017456 System.Threading.TimerHolder
00007ff8871bbed0    41728      1335296 System.Runtime.Caching.UsageBucket[]
00007ff88f51ab30    41749      1335968 Microsoft.Win32.SafeHandles.SafeWaitHandle
00007ff88f519de0    42394      1356608 System.Threading.Timer
00007ff8871be870    41728      1669120 System.Runtime.Caching.CacheUsage
00007ff88f50ea80    41734      2003232 System.Threading.ManualResetEvent
00007ff8871be810    41728      2336768 System.Runtime.Caching.CacheExpires
00007ff88f519f08    42390      2712960 System.Threading.TimerCallback
00007ff8871be558    41728      3338240 System.Runtime.Caching.MemoryCacheStore
00007ff88f4a0938    42394      3730672 System.Threading.TimerQueueTimer
00007ff8871be8d0    41728      4005888 System.Runtime.Caching.UsageBucket
00007ff8871bb9c8    41728     11016192 System.Runtime.Caching.ExpiresBucket[]
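
(The listing above is in the style of SOS !dumpheap -stat output. For anyone who prefers to stay in Microsoft.Diagnostics.Runtime, here is a rough sketch of how similar per-type counts can be computed with the ClrMD 2.x-style API; the dump path and the type filter are illustrative.)

using System;
using System.Linq;
using Microsoft.Diagnostics.Runtime;

class HeapStats
{
    static void Main()
    {
        using (DataTarget target = DataTarget.LoadDump(@"C:\dumps\app.dmp")) // hypothetical path
        {
            ClrRuntime runtime = target.ClrVersions[0].CreateRuntime();

            // Group heap objects by type name, restricted to the timer/caching types
            // that showed up in the dump; the filter is purely illustrative.
            var stats = runtime.Heap.EnumerateObjects()
                .Where(o => o.Type?.Name != null &&
                            (o.Type.Name.StartsWith("System.Threading.Timer") ||
                             o.Type.Name.StartsWith("System.Runtime.Caching.")))
                .GroupBy(o => o.Type.Name)
                .Select(g => new { Type = g.Key, Count = g.Count(), Bytes = g.Sum(o => (long)o.Size) })
                .OrderBy(s => s.Bytes);

            foreach (var s in stats)
                Console.WriteLine($"{s.Count,10} {s.Bytes,14} {s.Type}");
        }
    }
}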

Checking a few of the _methodPtr references, they all point to:

00007ff8`871b22c0 0f1f440000      nop     dword ptr [rax+rax]
00007ff8`871b22c5 33d2            xor     edx,edx
00007ff8`871b22c7 4533c0          xor     r8d,r8d
00007ff8`871b22ca 488d055ffeffff  lea     rax,[System_Runtime_Caching_ni+0x32130 (00007ff8`871b2130)]
00007ff8`871b22d1 48ffe0          jmp     rax

And the !gcroot traces look similar to this:

0:000> !gcroot 00000055629e5ca0
The version of SOS does not match the version of CLR you are debugging.  Please
load the matching version of SOS for the version of CLR you are debugging.
CLR Version: 4.6.1055.0
SOS Version: 4.6.1637.0
Thread 27a368:
    0000005a61c4ed10 00007ff88f2d2490 System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
        r14: 
            ->  0000004b6296f840 System.Threading.ThreadHelper
            ->  0000004b6296f7a0 System.Threading.ThreadStart
            ->  0000004b6296f750 Quartz.Simpl.SimpleThreadPool+WorkerThread
            ->  0000004b6296f7e0 System.Threading.Thread
            ->  0000004b62959710 System.Runtime.Remoting.Contexts.Context
            ->  0000004aa29315a8 System.AppDomain
            ->  0000004c22c4b368 System.EventHandler
            ->  00000051e2eb5f48 System.Object[]
            ->  00000050629e6180 System.EventHandler
            ->  000000506298b268 System.Runtime.Caching.MemoryCache
            ->  000000506298b348 System.Runtime.Caching.MemoryCacheStore[]
            ->  000000506298d470 System.Runtime.Caching.MemoryCacheStore
            ->  000000506298d5a0 System.Runtime.Caching.CacheExpires
            ->  000000506298e868 System.Threading.Timer
            ->  000000506298eaa8 System.Threading.TimerHolder
            ->  000000506298e888 System.Threading.TimerQueueTimer
            ->  000000506298fe78 System.Threading.TimerQueueTimer
Radke answered 10/5, 2017 at 14:41 Comment(5)
Jeez Marie, if a thousand threads can't drive CPU usage up to 100% then all hope is lost. If you've got that many timers ticking then something went seriously wrong a while ago. It's not terribly obvious how it went from TimerQueueTimer.Fire() straight into Monitor.Enter(); perhaps you need to look for [MethodImpl(MethodImplOptions.Synchronized)]. Always a good way to cause an unintended deadlock. – Pierro
Thanks for the tip, we will look for that attribute. Some more digging suggests that this might not be related to Kestrel, but rather to Quartz.net, which our application is also using. Added the tag. – Radke
That seems very likely. Quartz is a Java library, and MethodImplOptions.Synchronized is a curse inherited from Java. – Pierro
Digging into the source of Quartz.net (v2.3.3, which we are using), I find explicit Monitor.Enter calls inside the Run() method of Quartz.Simpl.SimpleThreadPool+WorkerThread, but no MethodImplOptions.Synchronized attribute that I can see. We'll try to configure Quartz to use something other than its own SimpleThreadPool. – Radke
We've seen the same thing happening inside the RabbitMQ client library as well. It seems as though a lot of Timer objects are created, which causes the thread pools in Quartz and RabbitMQ (both use their own) to stall and create a lot of new threads. Could this be related to using GCLatencyMode.SustainedLowLatency? It's one of the changes we made in the version exhibiting this behaviour. – Radke

Just want to add this for future interwebz travellers. The root cause was that we used a System.Runtime.Caching.MemoryCache instance that we were re-creating frequently without proper disposal. Each MemoryCache creates internal timers (the CacheExpires entries in the heap dump above), and these timers were not cleared from memory when the cache was replaced. The timers would grab a thread-pool thread every now and then to check whether they should fire, so as the timers built up, so did the CPU usage.

The reason it was hard to detect was that it did not appear in the stack traces, even in dump files. Instead, we would see the classes (mostly lambdas) called by the timers. We found the issue by an extensive audit of all our code, basically. And the documentation for MemoryCache specifically says not to do what we did.

The solution was to dispose the old cache before creating a new one, and then the problem disappeared.
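
A minimal sketch of the before/after pattern, with illustrative names (our real code wraps the cache differently):

using System.Runtime.Caching;

public class CacheHolder
{
    // Illustrative wrapper; the cache name is made up.
    private MemoryCache _cache = new MemoryCache("app-cache");

    // What we used to do: drop the old instance on the floor. Its internal timers
    // (the CacheExpires / TimerQueueTimer objects in the heap dump above) stay alive.
    public void ResetCacheLeaky()
    {
        _cache = new MemoryCache("app-cache");
    }

    // The fix: dispose the old instance before replacing it so its timers are released.
    public void ResetCache()
    {
        MemoryCache old = _cache;
        _cache = new MemoryCache("app-cache");
        old.Dispose();
    }
}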

EDIT: Looking at the above stack traces again, they actually gave us quite good evidence that the problem was in System.Runtime.Caching. I guess we were just blind and didn't think the problem would be inside a System namespace.

Radke answered 12/12, 2017 at 15:35 Comment(2)
Thanks for posting a follow-up! – Appoggiatura
I really, really appreciate you adding how things worked out. Not sure it's the answer I need yet, but it has given me reason to dig deeper in a direction. – Materials
