WCF Reliable Sessions Fault When the Server Is Under Heavy CPU Load or the Thread Pool Is Busy
There appears to be a design flaw in WCF Reliable Sessions that prevents the issuing or acceptance of infrastructure keep-alive messages when the server is under high CPU load (the 80-100% range) or when there isn't an IO thread pool thread immediately available to handle the message. The symptom is apparently random channel aborts due to reliable session inactivity timeouts. However, the abort logic appears to run at a higher priority or via a different mechanism, because the abort timer still fires even though the keep-alive timer cannot run.
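
For concreteness, here is a minimal sketch of the kind of binding configuration involved. The values are illustrative, chosen to match the 15 second timeout discussed below, not copied from my actual config:

    using System;
    using System.ServiceModel;

    static class BindingSetup
    {
        // Illustrative only: a NetTcpBinding with reliable sessions enabled
        // and the 15 second inactivity timeout described in this report.
        public static NetTcpBinding CreateBinding()
        {
            var binding = new NetTcpBinding(SecurityMode.None, reliableSessionEnabled: true);
            binding.ReliableSession.InactivityTimeout = TimeSpan.FromSeconds(15);
            binding.ReliableSession.Ordered = true;
            // receiveTimeout must stay greater than the inactivity timeout (see below).
            binding.ReceiveTimeout = TimeSpan.FromMinutes(10);
            return binding;
        }
    }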
Digging into the reference source, the ChannelReliableSession uses an InterruptableTimer class for the inactivityTimer. When that timer fires, it invokes the PollingCallback set by the ReliableOutputSessionChannel, which creates an ACKRequestedMessage and sends it to the remote endpoint. The InactivityTimer uses the WCF-internal IOThreadTimer/IOThreadScheduler to schedule itself, which depends on a non-busy IO thread pool thread being available to service the timer. If CPU load is high, the thread pool appears not to spawn a new thread, so if enough threads are already executing (about 8 threads on my 4-core machine; with a 15 second inactivityTimeout, 7 of them will abort and fail), no thread is available to send the keep-alive. However, if you set the client's reliable session inactivity timeout longer than the server's, the server will still unilaterally abort the channel under these conditions because it expected a message within the shorter interval. So the abort logic appears to run at a higher priority, or it throws an exception into one of the executing threads (I'm not sure which); I expected the server's abort to be delayed by the high CPU load and the client's longer timeout to eventually hit, but this was not the case. If CPU load is lower, the exact same scenario works fine, even with concurrent calls that take 30-90 seconds to return.
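
To see the scheduling problem in isolation, here is a small standalone sketch (plain thread pool code, not WCF internals) that saturates the managed thread pool with CPU-bound work and then measures how late a pool-serviced timer callback fires. It only illustrates the starvation mechanism I'm describing; the WCF IOThreadTimer path is internal and may behave somewhat differently:

    using System;
    using System.Diagnostics;
    using System.Threading;

    class TimerStarvationDemo
    {
        static void Main()
        {
            // Queue more CPU-bound work items than there are cores; none of them
            // yield back to the pool for 30 seconds.
            int busyItems = Environment.ProcessorCount * 4;
            var stopwatch = Stopwatch.StartNew();

            for (int i = 0; i < busyItems; i++)
            {
                ThreadPool.QueueUserWorkItem(_ =>
                {
                    var spin = Stopwatch.StartNew();
                    while (spin.Elapsed < TimeSpan.FromSeconds(30)) { /* burn CPU */ }
                });
            }

            // Ask for a "keep-alive" callback in 5 seconds; its callback also runs
            // on a pool thread, so under starvation it may fire well after that.
            var timer = new Timer(_ =>
                Console.WriteLine($"Timer fired after {stopwatch.Elapsed.TotalSeconds:F1}s (requested 5s)"),
                null, TimeSpan.FromSeconds(5), Timeout.InfiniteTimeSpan);

            Console.ReadLine();
            GC.KeepAlive(timer);
        }
    }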
It is irrelevant what your InstanceMode is, what the maximum concurrent connections, sessions, or instances are, or what any of the other timeout values are (other than that receiveTimeout must be greater than the inactivityTimeout). It is entirely a design flaw in the WCF implementation; it should use an isolated high-priority or real-time thread to service the keep-alive messages so spurious aborts are not generated.
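
As an application-level approximation of that idea, here is a sketch of a keep-alive pump driven from a dedicated, elevated-priority thread instead of the pool. IPingService and Ping() are hypothetical names, and the send still goes through the WCF stack (and potentially its IO threads), so this is at best a partial mitigation, not a fix:

    using System;
    using System.ServiceModel;
    using System.Threading;

    [ServiceContract]
    public interface IPingService
    {
        [OperationContract]
        void Ping();   // hypothetical no-op operation used purely to generate traffic
    }

    public static class KeepAlivePump
    {
        // Sends a lightweight call on a dedicated thread so the reliable session
        // sees activity even when the thread pool is saturated.
        public static Thread Start(IPingService channel, TimeSpan interval)
        {
            var thread = new Thread(() =>
            {
                while (true)
                {
                    try { channel.Ping(); }                  // any traffic resets the inactivity timer
                    catch (CommunicationException) { break; } // channel already faulted/aborted
                    Thread.Sleep(interval);
                }
            });
            thread.IsBackground = true;
            thread.Priority = ThreadPriority.AboveNormal;    // don't get starved along with the pool
            thread.Start();
            return thread;
        }
    }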
The short version: I can issue 1000 concurrent requests that take 60 seconds to complete, with a 15 second reliable session inactivity timeout, with no problems, so long as CPU load stays low. As soon as CPU load gets heavy, calls randomly begin aborting, including calls that aren't using any CPU time and duplex sessions idling while they wait to be used. If incoming calls also add to CPU load, the service enters a death spiral: execution time is wasted on requests that are guaranteed to abort, while other requests sit in the inbound queue. The service cannot return to a healthy state until all requests are stopped, all in-flight threads finish, and CPU load drops. This behavior, paradoxically, appears to make Reliable Sessions one of the least reliable communication mechanisms.
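
For reference, a rough load-generator sketch of that scenario; ISlowService, DoWork(), and the endpoint address are assumed names (not my real contract), and throttling quotas are left at their defaults:

    using System;
    using System.Linq;
    using System.ServiceModel;
    using System.Threading.Tasks;

    [ServiceContract]
    public interface ISlowService
    {
        [OperationContract]
        void DoWork();   // assumed to take ~60 seconds on the server
    }

    class LoadTest
    {
        static async Task Main()
        {
            var binding = new NetTcpBinding(SecurityMode.None, reliableSessionEnabled: true);
            binding.ReliableSession.InactivityTimeout = TimeSpan.FromSeconds(15);
            binding.ReceiveTimeout = TimeSpan.FromMinutes(10);

            var factory = new ChannelFactory<ISlowService>(
                binding, new EndpointAddress("net.tcp://localhost:9000/slow"));

            // 1000 concurrent calls, each on its own channel/session. With low CPU
            // load on the server these all complete; once its CPU is pegged,
            // channels start aborting on the 15 second inactivity timeout.
            var calls = Enumerable.Range(0, 1000)
                .Select(_ => Task.Run(() => factory.CreateChannel().DoWork()));

            await Task.WhenAll(calls);
        }
    }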
The same behavior applies to clients. A WCF client may be at the mercy of other processes on the box, but under high CPU load it will randomly abort its reliable sessions unless every operation completes in less than the inactivityTimeout; even then, if you don't issue a new call quickly, WCF may still fail to send the keep-alive and the channel may fault.
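
Since the faults can't be fully prevented, the best I've found on the client side is to treat them as expected and recover. Here is a sketch of that coping pattern (recreating the channel when the session faults); it is a generic wrapper over any contract type and glosses over thread safety:

    using System;
    using System.ServiceModel;

    // Coping pattern, not a fix: rebuild the proxy whenever the reliable session
    // is aborted (e.g. by the inactivity timeout under high CPU load).
    public sealed class SelfHealingChannel<TChannel> where TChannel : class
    {
        private readonly ChannelFactory<TChannel> _factory;
        private TChannel _channel;

        public SelfHealingChannel(ChannelFactory<TChannel> factory)
        {
            _factory = factory;
            _channel = CreateChannel();
        }

        public TChannel Channel => _channel;

        private TChannel CreateChannel()
        {
            var channel = _factory.CreateChannel();
            ((ICommunicationObject)channel).Faulted += (sender, args) =>
            {
                // Discard the faulted session and build a fresh one.
                ((ICommunicationObject)sender).Abort();
                _channel = CreateChannel();
            };
            return channel;
        }
    }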