I am facing strange behavior in one of my WCF services. The service worked fine for around 1.5 years, but for a few weeks now it has been showing some kind of "outages" (unfortunately I cannot post images because I'm new here).
The calls/second drop to 0, although there are still incoming calls. The "outage" always lasts 15 seconds; after those 15 seconds the queued calls are processed. It cannot be network related, because 90% of all calls come from another WCF service on the same server, and none of the other services (10 in total) are affected by this behavior. The service itself keeps working internally (calculations, DB updates, etc.) with no increase in the execution times of that internal work. The outage re-occurs roughly every 18-25 minutes, but it always lasts exactly 15 seconds.
OS
Windows Server 2012
WCF running as Windows Service
WCF configuration:
InstanceContextMode = InstanceContextMode.PerCall,
ConcurrencyMode = ConcurrencyMode.Multiple,
UseSynchronizationContext = false,
IncludeExceptionDetailInFaults = true
Binding = WebHttpBinding
Concurrency Throttle settings:
MaxConcurrentCalls = 384,
MaxConcurrentInstances = 2784,
MaxConcurrentSessions = 2400
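For context, here is roughly how these settings are wired up; this is only a minimal sketch, and the contract/service names, the base address, and the hosting code are placeholders rather than the real ones:

using System;
using System.ServiceModel;
using System.ServiceModel.Description;
using System.ServiceModel.Web;

[ServiceContract]
public interface IMemoryCacheService            // placeholder contract
{
    [OperationContract]
    [WebGet(UriTemplate = "items/{key}")]
    string GetItem(string key);
}

// Behavior values match the configuration listed above.
[ServiceBehavior(
    InstanceContextMode = InstanceContextMode.PerCall,
    ConcurrencyMode = ConcurrencyMode.Multiple,
    UseSynchronizationContext = false,
    IncludeExceptionDetailInFaults = true)]
public class MemoryCacheService : IMemoryCacheService   // placeholder implementation
{
    public string GetItem(string key) { return "..."; }
}

public static class HostSetup
{
    // Called from the Windows service's OnStart.
    public static WebServiceHost CreateHost()
    {
        var host = new WebServiceHost(typeof(MemoryCacheService),
                                      new Uri("http://localhost:8080/cache")); // placeholder address
        host.AddServiceEndpoint(typeof(IMemoryCacheService), new WebHttpBinding(), "");

        // Concurrency throttle settings as listed above.
        host.Description.Behaviors.Add(new ServiceThrottlingBehavior
        {
            MaxConcurrentCalls = 384,
            MaxConcurrentInstances = 2784,
            MaxConcurrentSessions = 2400
        });
        return host;
    }
}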
I have already done some investigating:
- WCF throttle settings
I made a full dump of the service at the exact time it happened. Neither the ConcurrentCalls nor the ConcurrentSessions counters were exhausted. The dump did not show any exception that could be the cause of the problem.
- Max TCP connections
Monitoring of active TCP connections shows they are far from their limits.
- Port trunking in the switch
Since no calls get through during the outage, not even from the local services (which call via localhost), I'm pretty sure it is not network related.
- Load problem
This behavior occurs with low load (see below) as well as with high load (5 times as many incoming calls), and its frequency does not change with the load. I also tried to reproduce the behavior on my staging system with around 600-1000 calls/second. I managed to bring the service into a state where I sent more incoming calls per second than it could handle; the outstanding calls piled up and at some point the service crashed, of course, but this behavior never showed up.
- Thread Pool Exhaustion
The problem occurs when the service is running with 50 threads as well as with 200 threads. Besides, if no more threads were available, there would be an error message about that.
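To double-check this, I am going to log the available thread-pool threads once per second; a minimal sketch of what I have in mind (the logging target, Console here, is just a placeholder):

using System;
using System.Threading;

public static class ThreadPoolMonitor
{
    private static Timer _timer;   // kept in a field so the timer is not collected

    // Logs available vs. maximum worker/IO threads every second, so a dip to 0
    // during one of the 15-second outages would show up in the log.
    public static void Start()
    {
        _timer = new Timer(state =>
        {
            int workerAvailable, ioAvailable, workerMax, ioMax;
            ThreadPool.GetAvailableThreads(out workerAvailable, out ioAvailable);
            ThreadPool.GetMaxThreads(out workerMax, out ioMax);
            Console.WriteLine("{0:O} worker {1}/{2}, IO {3}/{4}",
                DateTime.UtcNow, workerAvailable, workerMax, ioAvailable, ioMax);
        }, null, TimeSpan.Zero, TimeSpan.FromSeconds(1));
    }
}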
I'm running out of possible causes for such behavior. I think it could be the GC blocking threads, as the service uses around 10 GB of RAM (it is a kind of memory cache service). Or it could be the OS (Windows Server 2012) or something related to the Windows service itself.
Has anyone faced something like this before, or does anyone have another idea what could cause it?
Edit: Now I can post images:
Edit: GC heap dump (thanks to usr)
Nearly 50% of the heap (70% in total, including the associated references) is held by one big dictionary with approx. 27 million entries (based on the memory dump). There are lots of unused items in it, so I will focus on refactoring it; maybe this will help.
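The rough idea for the refactoring is to prune the unused entries instead of keeping them around forever. A minimal sketch, assuming each entry carries a last-access timestamp; the CacheEntry type and the field names are hypothetical, not the real ones, and the locking around the shared dictionary is omitted:

using System;
using System.Collections.Generic;
using System.Linq;

public class CacheEntry                         // hypothetical entry type
{
    public DateTime LastAccessUtc { get; set; }
    public object Value { get; set; }
}

public static class CachePruner
{
    // Removes entries that have not been touched within maxAge. Intended to run
    // periodically so the dictionary does not grow to tens of millions of stale items.
    public static int Prune(Dictionary<string, CacheEntry> store, TimeSpan maxAge)
    {
        DateTime cutoff = DateTime.UtcNow - maxAge;

        List<string> staleKeys = store
            .Where(kvp => kvp.Value.LastAccessUtc < cutoff)
            .Select(kvp => kvp.Key)
            .ToList();

        foreach (string key in staleKeys)
        {
            store.Remove(key);
        }
        return staleKeys.Count;
    }
}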
Additionally, I will add the GC.WaitForFullGCApproach method from MSDN to see whether the GC is running while the service stops handling incoming requests.
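A minimal sketch of that GC notification approach, following the MSDN sample; the thresholds and the Console output are placeholders. As far as I understand the documentation, the notification only works when concurrent GC is disabled, so I may have to switch <gcConcurrent> off for this test:

using System;
using System.Threading;

public static class GcApproachLogger
{
    // Registers for full (gen-2) GC notifications and logs their timestamps,
    // so they can be correlated with the 15-second outages.
    public static void Start()
    {
        GC.RegisterForFullGCNotification(10, 10);   // thresholds 1-99, placeholders

        var worker = new Thread(() =>
        {
            while (true)
            {
                if (GC.WaitForFullGCApproach() == GCNotificationStatus.Succeeded)
                    Console.WriteLine("{0:O} full GC approaching", DateTime.UtcNow);

                if (GC.WaitForFullGCComplete() == GCNotificationStatus.Succeeded)
                    Console.WriteLine("{0:O} full GC completed", DateTime.UtcNow);
            }
        });
        worker.IsBackground = true;
        worker.Start();
    }
}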
I will keep you posted when I know more.
Edit: GC Stats (14 seconds of the outage included)
• CLR Startup Flags: CONCURRENT_GC
• Total CPU Time: 42,662 msec
• Total GC CPU Time: 2,748 msec
• Total Allocs: 1,524.637 MB
• MSec/MB Alloc: 1.802 msec/MB
• Total GC Pause: 2,977.2 msec
• % Time paused for Garbage Collection: 19.4%
• % CPU Time spent Garbage Collecting: 6.4%
• Max GC Heap Size: 11,610.333 MB
• Peak Process Working Set: 14,917.915 MB
• Peak Virtual Memory Usage: 15,326.974 MB
That's "just" 3 seconds of pausing. Anyway, that shouldn't be that high and I'm gonna refactor the memory store. But it does not explain the 15 seconds at all :(
Edit: Over the weekend I did the following:
Installed the latest Windows updates (the last update was 2 months ago)
Restarted the Windows server
Refactored the in-memory store of the 27 million objects. I managed to reduce the used memory from 11 GB to 6-8 GB (which is quite a lot). Very old code in there ;)
The problem has not re-occurred so far (the service has been running for approx. 17 hours now). That leads me to the assumption that either the GC was pausing the service or some OS-related issue was causing the behavior.
I guess the problem is not really "solved" and will re-occur at some point, because the data will keep growing over time.
Thanks to everyone who spent time on this. I will continue investigating the dumps and try to find out what happened in detail. I'll keep you posted.