How can we troubleshoot intermittent "An existing connection was forcibly closed" errors caused by a Cisco CSS?
We have the "standard" three-tier architecture, with our middle tier hosted in IIS and accessed via .NET Remoting. These errors occur when our web and web services servers (front tier) make remoting calls to the app servers (middle tier). We get this error 3-10 times a day out of ~130K total calls per day.

The exception and stack trace always look similar to this:


Exception Type: System.Net.WebException
Message: The underlying connection was closed: An unexpected error occurred on a receive.

Server stack trace: 
   at System.Runtime.Remoting.Channels.Http.HttpClientTransportSink.ProcessResponseException(WebException webException, HttpWebResponse& response)
   at System.Runtime.Remoting.Channels.Http.HttpClientTransportSink.ProcessMessage(IMessage msg, ITransportHeaders requestHeaders, Stream requestStream, ITransportHeaders& responseHeaders, Stream& responseStream)
   at System.Runtime.Remoting.Channels.BinaryClientFormatterSink.SyncProcessMessage(IMessage msg)

Exception rethrown at [0]: 
   at System.Runtime.Remoting.Proxies.RealProxy.HandleReturnMessage(IMessage reqMsg, IMessage retMsg)
   at System.Runtime.Remoting.Proxies.RealProxy.PrivateInvoke(MessageData& msgData, Int32 type)
   at XXXXX.BusinessFacade.Interface.XXXXInterface.SubmitXXXX(
   at XXX.XXXXWebServicesLibrary.XXXXService.CreateXXXXXX.RunXXXXMethod()
   at XXX.XXXXWebServicesLibrary.XXXXService.XXXXXXMethod`2.RunMethod()
   at XXX.XXXXWebServicesLibrary.XXXXXWebMethod`2.Run()
Inner Exception: 

Exception Type: System.IO.IOException
Message: Unable to read data from the transport connection: An existing connection was forcibly closed by the remote host.
   at System.Net.Sockets.NetworkStream.Read(Byte[] buffer, Int32 offset, Int32 size)
   at System.Net.PooledStream.Read(Byte[] buffer, Int32 offset, Int32 size)
   at System.Net.Connection.SyncRead(HttpWebRequest request, Boolean userRetrievedStream, Boolean probeRead)
Inner Exception: 

Exception Type: System.Net.Sockets.SocketException
Message: An existing connection was forcibly closed by the remote host
   at System.Net.Sockets.Socket.Receive(Byte[] buffer, Int32 offset, Int32 size, SocketFlags socketFlags)
   at System.Net.Sockets.NetworkStream.Read(Byte[] buffer, Int32 offset, Int32 size)

No particular remoting call causes this to happen. It can be any of them, which seems to rule out any sort of application-specific cause. The only common denominator is the "Exception Type: System.Net.Sockets.SocketException Message: An existing connection was forcibly closed by the remote host" portion of the error.

The front and middle tiers are separated by a firewall and we are also utilizing a VIP device. I strongly suspect an issue with our network/firewall configuration but our network guys are just scratching their heads and not offering any suggestions.

Although a failure rate of roughly 0.002% to 0.008% (3-10 failures in ~130K calls) may seem insignificant, we have partners that scrutinize our communications very carefully and I am just waiting for this to become an issue they notice. I don't want to have to say "I don't know" when that time comes.

Does anyone have any ideas on how I could provide more information or any suggestions I could make to our network guys to get this resolved?

Ruffin answered 6/8, 2010 at 19:4 Comment(3)
Is the appdomain in IIS recycling when the exception occurs? - Tacheometer
The IIS worker process may recycle for a few reasons: lifetime reached (in minutes), number of requests reached, or memory limit reached. That is "normal" recycling, which depends on the IIS application pool configuration. If it recycles for an abnormal reason, you should see an event log entry like: System>W3SVC|Warning:A process serving application pool 'xxx' suffered a fatal communication ... For IIS 7 the source is 'WAS', not 'W3SVC'. - Hardaway
I reviewed the logs and I'm not seeing anything like that. - Ruffin
R
7

The problem was the Cisco CSS. We determined this by pointing the tier 1 servers directly to the tier 2 servers and going 48 hours without observing the problem. Once we determined it was the CSS, we corrected this problem by adjusting the insanely low default value for this parameter:

"Default flow inactivity timeouts, in seconds, for the TCP or UDP port. If a flow is idle for the amount of time specified in the timeout value, the CSS tears down the flow and reclaims the flow resources."

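We set this to 84, which means 84 16-second increments (84 x 16 = 1,344 seconds, a little over 22 minutes). The default value was too low because the default keep-alive for HTTP is 120 seconds: the CSS could tear down an idle flow while the client still considered the connection open, so the next call on that connection failed.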
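The arithmetic behind that setting can be sketched in a few lines of Python. The 16-second increment and the 120-second keep-alive are the figures quoted in this answer; the function names are made up for illustration and are not part of any Cisco tooling.

```python
# Figures taken from the answer above; names are illustrative only.
CSS_INCREMENT_SECONDS = 16      # the CSS setting counts 16-second increments
HTTP_KEEP_ALIVE_SECONDS = 120   # HTTP keep-alive quoted in the answer


def flow_timeout_seconds(multiplier: int) -> int:
    """Convert the CSS setting into an inactivity timeout in seconds."""
    return multiplier * CSS_INCREMENT_SECONDS


def outlives_keep_alive(multiplier: int,
                        keep_alive_seconds: int = HTTP_KEEP_ALIVE_SECONDS) -> bool:
    """True if the CSS keeps an idle flow longer than the HTTP keep-alive lasts."""
    return flow_timeout_seconds(multiplier) > keep_alive_seconds


if __name__ == "__main__":
    print(flow_timeout_seconds(84))   # 1344
    print(outlives_keep_alive(84))    # True
```

Any setting whose timeout is shorter than the keep-alive leaves a window in which the load balancer has forgotten a connection that the client may still reuse.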

Ruffin answered 15/4, 2011 at 20:14 Comment(0)
C
2

To check recycling of the application pool, open IIS Manager and go to the Properties of the application pool in which your remoting service is running. You can configure recycling of application pools by time interval, by number of requests, or at specific times.

You could remove the current recycling rules and schedule recycling for a time when no connections are expected, such as 3:00 at night. Then see if the exceptions still occur.

Comprehensible answered 21/2, 2011 at 20:20 Comment(1)
The default recycling rules are in place (1740 minutes). Based on the description there, I don't see how this would be the problem since "normal" recycling only occurs on idle worker processes and the connections aren't tied to the worker processes. - Ruffin
P
2

It could be a network component causing this. The way to rule this out would be to place both machines (or test machines) on the same subnet, then run a load test, and verify that you do not get the same error.
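Alongside a load test, one lightweight way to compare two network paths is to check how long an idle connection survives on each. The Python sketch below is an illustration under stated assumptions, not a method from this thread: the function name, hosts, port, and idle durations are placeholders. It opens a TCP connection, leaves it idle, then sends an HTTP request and reports whether the peer still answers.

```python
import socket
import time


def probe_idle_connection(host, port, idle_seconds, timeout=5.0):
    """Open a TCP connection, idle for idle_seconds, then send a HEAD request.

    Returns "ok" if any response bytes arrive, "closed" if the peer closed
    the connection cleanly, "reset" if it was reset or aborted, and
    "timeout" if nothing came back within the socket timeout.
    """
    request = (
        f"HEAD / HTTP/1.1\r\nHost: {host}\r\nConnection: keep-alive\r\n\r\n"
    ).encode("ascii")
    with socket.create_connection((host, port), timeout=timeout) as sock:
        time.sleep(idle_seconds)  # leave the connection idle
        try:
            sock.sendall(request)
            data = sock.recv(1024)
        except socket.timeout:
            return "timeout"
        except (ConnectionResetError, ConnectionAbortedError, BrokenPipeError):
            return "reset"
    return "ok" if data else "closed"


# Hypothetical usage: compare the load-balanced path with a direct path.
# for idle in (30, 60, 90):
#     print(idle,
#           probe_idle_connection("vip.example.internal", 80, idle),
#           probe_idle_connection("app01.example.internal", 80, idle))
```

If the connection fails through the intermediate device at idle durations where the direct path still answers, that points at the device rather than the servers. Keep in mind that the server itself will also close idle connections once its own keep-alive timeout expires.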

The other things that could be causing it could be:

Propound answered 23/2, 2011 at 19:30 Comment(5)
Those are all good suggestions. Unfortunately, we have done load tests in our "test" environment with loads that far exceed our production volume without reproducing the issue. We aren't using WCF, so the configuration options you mentioned aren't relevant. I've checked the message size in the IIS log when we've gotten this failure and it's not large at all. I will probably award you the bounty tomorrow morning if no one else has answered, just so those points don't go to waste. :) - Ruffin
Which firewall and VIP device are you using? - Propound
Turns out that it was a problem with the Cisco CSS we had between our front and middle tiers to load balance. When we pointed each front tier server directly to a middle tier server, we no longer had this problem. I will post more info as it becomes available. - Ruffin
Hi @Ruffin, did you manage to resolve this issue? I'm having the same problem: we get the same error message when we go through a load balancer, but the problem doesn't occur when we bypass the load balancer and go straight to a particular server. - Faroff
@Ciaran Bruen we have not. We have just isolated the problem to the CSS. - Ruffin
