WCF service stops handling calls for 15 seconds

I am facing strange behavior in one of my WCF services. The service worked fine for around 1.5 years, but for a few weeks now it has been showing some kind of "outages" (unfortunately I cannot post images because I'm new here).

The calls/second drop to 0, although calls are still coming in. The "outage" is always exactly 15 seconds long; after those 15 seconds the queued calls are processed. It cannot be network related, because 90% of all calls come from another WCF service on the same server, and none of the other services (10 in total) are affected by this behavior. The service itself keeps working during the outage: it continues calculating internal stuff, doing DB updates, etc., with no increase in the execution times of that internal work. The outage re-occurs roughly every 18-25 minutes, but always lasts 15 seconds.

OS: Windows Server 2012

Hosting: WCF running as a Windows service

WCF configuration:

InstanceContextMode = InstanceContextMode.PerCall,

ConcurrencyMode = ConcurrencyMode.Multiple,

UseSynchronizationContext = false,

IncludeExceptionDetailInFaults = true

Binding = WebHttpBinding

Concurrency Throttle settings:

MaxConcurrentCalls = 384,

MaxConcurrentInstances = 2784,

MaxConcurrentSessions = 2400
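
For context, here is a minimal self-hosting sketch of roughly how this configuration fits together. The contract, service class, operation and base address (`ICacheService`, `CacheService`, `Ping`, `http://localhost:8080/cache`) are placeholders for illustration, not the real names from the service.

```csharp
using System;
using System.ServiceModel;
using System.ServiceModel.Description;
using System.ServiceModel.Web;

[ServiceContract]
public interface ICacheService
{
    [OperationContract, WebGet(UriTemplate = "ping")]
    string Ping();
}

[ServiceBehavior(
    InstanceContextMode = InstanceContextMode.PerCall,
    ConcurrencyMode = ConcurrencyMode.Multiple,
    UseSynchronizationContext = false,
    IncludeExceptionDetailInFaults = true)]
public class CacheService : ICacheService
{
    public string Ping() { return "pong"; }
}

class Program
{
    static void Main()
    {
        // WebServiceHost adds a default WebHttpBinding endpoint for the base address.
        var host = new WebServiceHost(typeof(CacheService), new Uri("http://localhost:8080/cache"));

        // Apply (or update) the concurrency throttle on the host description.
        var throttle = host.Description.Behaviors.Find<ServiceThrottlingBehavior>();
        if (throttle == null)
        {
            throttle = new ServiceThrottlingBehavior();
            host.Description.Behaviors.Add(throttle);
        }
        throttle.MaxConcurrentCalls = 384;
        throttle.MaxConcurrentInstances = 2784;
        throttle.MaxConcurrentSessions = 2400;

        host.Open();
        Console.WriteLine("Listening... press Enter to stop.");
        Console.ReadLine();
        host.Close();
    }
}
```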

I already did some investigating:

  1. WCF throttle settings

I made a full dump of the service at the exact time it happened. Neither ConcurrentCalls nor ConcurrentSessions were exhausted. The dump didn't show any exception that could be the cause of the problem.

  2. Max TCP connections

Monitoring shows that the number of active TCP connections is far from its limit.

  3. Port trunking in the switch

Since no calls are coming in, not even from the local services (via localhost), I'm pretty sure it is not network related.

  4. Load problem

This behavior occurs with low load (see below) and also with high load (5 times as many incoming calls), and its frequency does not change with the load. I also tried to reproduce the behavior on my staging system with around 600-1000 calls/second. I managed to bring the service into a state where I sent more incoming calls per second than it could handle; the outstanding calls piled up and at some point the service crashed, of course. But this 15-second pause never showed up.

  5. Thread pool exhaustion

The problem occurs when the service is running with 50 threads as well as with 200 threads. Besides, if no more threads were available, there would be an error message about that (a quick way to monitor this is sketched below).
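
A minimal sketch of such a thread-pool check; the helper name and the one-second interval are just illustrative, not code from the actual service:

```csharp
using System;
using System.Threading;

static class ThreadPoolMonitor
{
    // Logs busy worker/IOCP thread counts once per second.
    // Keep a reference to the returned Timer so it is not garbage collected.
    public static Timer Start()
    {
        return new Timer(_ =>
        {
            int maxWorker, maxIo, freeWorker, freeIo;
            ThreadPool.GetMaxThreads(out maxWorker, out maxIo);
            ThreadPool.GetAvailableThreads(out freeWorker, out freeIo);

            Console.WriteLine("{0:O} busy worker threads: {1}/{2}, busy IOCP threads: {3}/{4}",
                DateTime.UtcNow, maxWorker - freeWorker, maxWorker, maxIo - freeIo, maxIo);
        }, null, 0, 1000);
    }
}
```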

I'm running out of possible causes for such behavior. My current suspicion is the GC blocking threads, as the service uses around 10 GB of RAM; it is essentially a memory cache service. Or it could be the OS (Windows Server 2012) or something related to the Windows service hosting itself.

Has anyone faced something like this, or does someone have another idea what could cause it?

Edit: Now I can post images:

[screenshot: calls/second graph showing the 15-second outage]

Edit: GC heap dump (thanks to usr)

[screenshot: GC heap dump]

Nearly 50% of the heap (70% in total, including the associated references) is taken up by one big dictionary with approx. 27 million entries (based on the memory dump heap). I will focus on refactoring it, since there are lots of unused items in it. Maybe that will help.
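
Since I haven't posted the real store code, this is only a rough sketch of the kind of cleanup pass I have in mind; CacheEntry, LastAccessUtc and the age cutoff are made up for illustration:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class CacheEntry
{
    public object Value;
    public DateTime LastAccessUtc;
}

static class CacheCleanup
{
    // Removes entries that have not been touched within maxAge.
    public static void EvictStaleEntries(Dictionary<string, CacheEntry> store, TimeSpan maxAge)
    {
        DateTime cutoff = DateTime.UtcNow - maxAge;

        // Collect the keys first so the dictionary is not mutated while enumerating.
        List<string> stale = store
            .Where(kvp => kvp.Value.LastAccessUtc < cutoff)
            .Select(kvp => kvp.Key)
            .ToList();

        foreach (string key in stale)
            store.Remove(key);

        Console.WriteLine("Evicted {0} stale entries, {1} remain.", stale.Count, store.Count);
    }
}
```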

Additionally, I will add the GC.WaitForFullGCApproach method (from MSDN) to see whether the GC is running while the service stops handling incoming requests.
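
Roughly, that notification loop would look like the sketch below. Note that GC.RegisterForFullGCNotification only works when concurrent GC is disabled (e.g. <gcConcurrent enabled="false"/> in app.config), so that setting would have to be switched off first; the 10/10 thresholds are arbitrary example values.

```csharp
using System;
using System.Threading;

static class FullGcLogger
{
    public static void Start()
    {
        // Ask the CLR to raise notifications when a full (gen 2 / LOH) GC is near.
        GC.RegisterForFullGCNotification(10, 10);

        var watcher = new Thread(() =>
        {
            while (true)
            {
                if (GC.WaitForFullGCApproach() == GCNotificationStatus.Succeeded)
                    Console.WriteLine("{0:O} full GC approaching", DateTime.UtcNow);

                if (GC.WaitForFullGCComplete() == GCNotificationStatus.Succeeded)
                    Console.WriteLine("{0:O} full GC completed", DateTime.UtcNow);
            }
        }) { IsBackground = true };

        watcher.Start();
    }
}
```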

I will keep you posted when I know more.

Edit: GC Stats (14 seconds of the outage included)

• CLR Startup Flags: CONCURRENT_GC
• Total CPU Time: 42.662 msec
• Total GC CPU Time: 2.748 msec
• Total Allocs: 1.524,637 MB
• MSec/MB Alloc: 1,802 msec/MB
• Total GC Pause: 2.977,2 msec
• % Time paused for Garbage Collection: 19,4%
• % CPU Time spent Garbage Collecting: 6,4%
• Max GC Heap Size: 11.610,333 MB
• Peak Process Working Set: 14.917,915 MB
• Peak Virtual Memory Usage: 15.326,974 MB

That's "just" 3 seconds of pausing. Anyway, that shouldn't be that high and I'm gonna refactor the memory store. But it does not explain the 15 seconds at all :(

Edit: During the weekend I did the following:

  1. Installed the latest Windows updates (the last update was 2 months ago)

  2. Restarted the Windows server

  3. Refactored the in-memory store of the 27 million objects. I managed to reduce the used memory from 11 GB to 6-8 GB (which is quite a lot). Very old code in there ;)

The problem has not re-occurred so far (approx. 17 hours of uptime now). That leads me to the assumption that the GC was pausing the service, or that some OS-related issue was causing the behavior.

I guess the problem is not really "solved" and will re-occur at some point, because the data will grow over time.

Thanks to everyone who spent time on this. I will continue investigating the dumps and try to find out in detail what happened. I'll keep you posted.

Governor answered 20/7, 2015 at 9:54 Comment(14)
Have you tried switching on WCF tracing? I have seen 15 second pauses whilst the channel is being created, and would also like to find out why.Wexler
Very good diagnosis attempts so far! I often observe a GC rate of 1-2GB/sec. That kind of fits the problem here. Look into the dump to see what all threads are doing. Also use some kind of GC profiler, PerfView is super easy and analyzes GC for you.Manufactory
@Polyfun: Yes I tried WCF tracing with log level = Warning. Nothing suspicious in it.Governor
@usr: thanks for the advice. Will take a look at it now.Governor
Does the event log bring any extra information? Is the application pool recycled during this time?Unction
Are you querying a database from the WCF service? Could a 15-second default timeout be set in your connection string?Filter
@RicardoPontual: No, there is nothing in it. I think app pool recycling would only be relevant if the WCF service were hosted in IIS, right? It's wrapped in a Windows service, though.Governor
@lloydom: There are a lot of database operations going on. The timeout in my connection string is 60 seconds and I see no database exceptions popping up. Most operations finish in < 100 ms.Governor
If you look at the dump file taken right at the point in time where you notice the delay, do you see the 15-second delay reflected in the call stacks? From there we can find out where the issue really is.Filter
Edited the post with the GC heap.Governor
PerfView can show you a list of GC events. It's in the tree view to the left. GC Stats or something like that.Manufactory
Added GC Stats report.Governor
@lloydom: I took the dump during the "outage", so I don't see any new calls. I also monitored the outage with PerfView, and the request report shows no requests/calls at the time of the outage.Governor
The Windows server ran into a blue screen with error code PAGE_FAULT_IN_NON_PAGED_AREA (0x00000050). Afterwards I thought it had "solved" the problem, because I didn't see any outage for over 1 hour. Unfortunately the same behavior re-occurred.Governor

If the outage is predictable enough, you can connect with windbg+SOS during the outage and:

  • Pause the service twice during the outage
  • Each time run !threads and ~*e!dumpstack to show the thread status and stacks

If you have 100 threads doing nothing productive for 15 seconds, this should be reflected in the stacks. With luck, the bulk of those 100 threads are:

  1. Stuck in one of your methods (check the 'current frame' for each thread)
  2. Stuck in WCF methods
  3. Executing a *WaitFor* call
  4. Executing a Sleep/Delay/IO Completion call
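
If windbg is inconvenient to run on the box, the same snapshot can be scripted. Below is a rough sketch using ClrMD (the Microsoft.Diagnostics.Runtime NuGet package; the 1.x API is assumed here) that attaches non-invasively, which briefly suspends the process much like pausing it in the debugger, and dumps every managed stack:

```csharp
using System;
using Microsoft.Diagnostics.Runtime; // NuGet: Microsoft.Diagnostics.Runtime (ClrMD)

class StackSnapshot
{
    static void Main(string[] args)
    {
        int pid = int.Parse(args[0]);

        // NonInvasive attach suspends the target while we read its memory.
        using (DataTarget target = DataTarget.AttachToProcess(pid, 5000, AttachFlag.NonInvasive))
        {
            ClrRuntime runtime = target.ClrVersions[0].CreateRuntime();

            foreach (ClrThread thread in runtime.Threads)
            {
                if (!thread.IsAlive)
                    continue;

                Console.WriteLine("Thread OSID {0:X}:", thread.OSThreadId);
                foreach (ClrStackFrame frame in thread.StackTrace)
                    Console.WriteLine("    {0}", frame.DisplayString);
            }
        }
    }
}
```
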
Guyenne answered 24/7, 2015 at 23:58 Comment(3)
Thanks for the proposal. As it is not occurring anymore, I cannot try it right now, but I will as soon as it happens again. One question: if I see the threads in one of the states you mention, how do they cause the behavior? Even if the threads are waiting or blocked, how do they keep the service from continuing to work and accepting new requests?Governor
If you see lots of threads stuck in 1 or 2, you have a livelock/deadlock in the service code, and you'll get a rough clue as to where the code is getting stuck. If most are stuck in 4, then requests may not be making it off the network. It's not an exact science, but I've found livelock-type issues this way before; the symptoms were pretty much as you describe, but with high CPU usage from the service.Guyenne
Ok, understood. Will try to get it when it happens again.Governor
