Unexpected fault on ReliableSession in NetTcpBinding (WCF)
I have a client server application. My scenario:

  • .Net Framework 4.6.1
  • Quad Core i7 machine with hyperthreading enabled
  • Server CPU load from 20 - 70 %
  • Network load < 5% (GBit NIC)
  • 100 users
  • 30 services (some administrative ones, some generic ones per datatype) running and each user is connected to all services
  • NetTcpBinding (compression enabled)
  • ReliableSession enabled
  • every second I trigger (server side) an update notification and all clients load approx. 100 kB from the server
  • additionally a heartbeat is running (a 15-second interval for testing) which simply returns the server time in UTC

Sometimes the WCF connections change to the faulted state. Usually when this happens the server has no network upstream at all. I took a memory dump and saw that lots of WCF threads were waiting on some WaitQueue. The call stack is:

Server stack trace: 
   at System.ServiceModel.Channels.TransmissionStrategy.WaitQueueAdder.Wait(TimeSpan timeout)
   at System.ServiceModel.Channels.TransmissionStrategy.InternalAdd(Message message, Boolean isLast, TimeSpan timeout, Object state, MessageAttemptInfo& attemptInfo)
   at System.ServiceModel.Channels.ReliableOutputConnection.InternalAddMessage(Message message, TimeSpan timeout, Object state, Boolean isLast)
   at System.ServiceModel.Channels.ReliableDuplexSessionChannel.OnSend(Message message, TimeSpan timeout)
   at System.ServiceModel.Channels.DuplexChannel.Send(Message message, TimeSpan timeout)
   at System.ServiceModel.Dispatcher.DuplexChannelBinder.Send(Message message, TimeSpan timeout)
   at System.ServiceModel.Channels.ServiceChannel.Call(String action, Boolean oneway, ProxyOperationRuntime operation, Object[] ins, Object[] outs, TimeSpan timeout)
   at System.ServiceModel.Channels.ServiceChannelProxy.InvokeService(IMethodCallMessage methodCall, ProxyOperationRuntime operation)
   at System.ServiceModel.Channels.ServiceChannelProxy.Invoke(IMessage message)

I tweaked the settings and the situation seems eased: fewer clients are faulting now. My settings:

  • ReliableSession.InactivityTimeout: 01:30:00
  • ReliableSession.Enabled: True
  • ReliableSession.Ordered: False
  • ReliableSession.FlowControlEnabled: False
  • ReliableSession.MaxTransferWindowSize: 4096
  • ReliableSession.MaxPendingChannels: 16384
  • MaxReceivedMessageSize: 1073741824
  • ReaderQuotas.MaxStringContentLength: 8388608
  • ReaderQuotas.MaxArrayLength: 1073741824
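For reference, the subset of these settings that is expressible in app.config maps onto the standard netTcpBinding configuration roughly like this (a sketch only; the binding name is made up, and the values mirror the list above):

```xml
<bindings>
  <netTcpBinding>
    <binding name="reliableTcp"
             maxReceivedMessageSize="1073741824">
      <readerQuotas maxStringContentLength="8388608"
                    maxArrayLength="1073741824" />
      <reliableSession enabled="true"
                       ordered="false"
                       inactivityTimeout="01:30:00" />
      <!-- FlowControlEnabled, MaxTransferWindowSize and MaxPendingChannels are
           not exposed on netTcpBinding's reliableSession element; they live on
           the ReliableSessionBindingElement of a CustomBinding and have to be
           set in code. -->
    </binding>
  </netTcpBinding>
</bindings>
```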

I am stuck. Why do all calls wait on some WaitQueue in the TransmissionStrategy? I do not care about messages being sent out of order (I handle ordering myself). I was already thinking about disabling reliable messaging, but the application is used in a company network worldwide, and I need to know that my messages were delivered.

Any ideas how to teach WCF to just send the messages and not care about anything else?

EDIT

The values for service throttling are set to Int32.MaxValue.

I also tried setting MaxConnections and ListenBacklog (on NetTcpBinding) to their maximum values. As far as I can tell, it did not change anything.

EDIT 2

Checking the WCF traces, they tell me (the message is in German, so this is a rough translation) that there is no space left in the reliable messaging transfer window; after that all I get are timeouts because no more messages are sent.

What's going on there? Is it possible that reliable messaging confuses itself?

Apprehensible answered 11/1, 2019 at 8:4 Comment(8)
Turning on tracing should show when a channel faults, and why. I think it's better to give tracing a chance before tinkering with a production appMicrocopy
@Microcopy please check Edit 2Apprehensible
It looks like there's a deeper problem if the 4096 transfer window is not enough. It seems messages are not being acknowledged. Are the client and server on the same network? If so, reliable sessions are not required. If not, maybe the acks are getting droppedMicrocopy
@Microcopy in my testing environment most of them are on the same switch - however in production it is a world wide company networkApprehensible
And the issue is on the testing net?Microcopy
@Microcopy I did set up this testing environment because I have this behavior in productionApprehensible
And this issue reproduces in the test env?Microcopy
Let us continue this discussion in chat.Microcopy
Long story short:

It turns out that my WCF settings are just fine.

The ThreadPool is the limiting factor. In high-traffic (and therefore high-load) situations I generate too many messages that have to be sent to the clients. They are queued up because there are not enough worker threads to send them. At some point the queue is full, and there you are.

For more details check this question & answer from Russ Bishop.

Interesting detail: this even decreased the CPU load in high-traffic situations, from spiking wildly between 30 and 80 percent to an (almost) steady value around 30 percent. I can only assume that it is because of thread pool thread creation and cleanup.

EDIT

I did the following:

ThreadPool.SetMinThreads(1000, 500)

Those values might be like using a sledgehammer to crack a nut, but it works.
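For completeness, a minimal sketch of how that call can sit in startup code, before the ServiceHost is opened (the 1000/500 values are the ones from above, not a recommendation; the console output is just for illustration):

```csharp
using System;
using System.Threading;

class Startup
{
    static void Main()
    {
        // Raise the thread pool floor so bursts of outgoing notifications do
        // not have to wait for the pool's slow thread-injection heuristic.
        // SetMinThreads returns false if the values are rejected (negative or
        // above the configured maximums).
        if (!ThreadPool.SetMinThreads(workerThreads: 1000, completionPortThreads: 500))
        {
            Console.WriteLine("ThreadPool.SetMinThreads rejected the requested values.");
        }

        ThreadPool.GetMinThreads(out int worker, out int iocp);
        Console.WriteLine($"Min worker threads: {worker}, min IOCP threads: {iocp}");

        // ... create and open the ServiceHost here ...
    }
}
```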

Apprehensible answered 17/1, 2019 at 12:17 Comment(3)
Thank you for the update! But what did you do to solve the problem?Microcopy
Also, if the server is calling clients only with a notification, it should not take long to handle and thus will not drain the server's resources. It seems to me that remote calls should be as short as possibleMicrocopy
@Microcopy To further clarify: when a data object is modified, an update notification is sent to the clients, which then load the new version of the object. If there were many updates, it seems the worker threads were not able to keep pace...Apprehensible
The wait queue can be related to WCF's built-in throttling behavior: https://learn.microsoft.com/en-us/dotnet/framework/configure-apps/file-schema/wcf/servicethrottling The best way to troubleshoot is to enable WCF tracing (https://learn.microsoft.com/en-us/dotnet/framework/wcf/diagnostics/tracing/configuring-tracing) and find out exactly what the root cause is.
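For reference, those throttling values live in the serviceThrottling service behavior (a configuration sketch; 2147483647 is Int32.MaxValue, which the asker reports already using for all three):

```xml
<behaviors>
  <serviceBehaviors>
    <behavior>
      <serviceThrottling maxConcurrentCalls="2147483647"
                         maxConcurrentSessions="2147483647"
                         maxConcurrentInstances="2147483647" />
    </behavior>
  </serviceBehaviors>
</behaviors>
```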

Capreolate answered 11/1, 2019 at 15:51 Comment(2)
Nope. I did set all 3 of them to Int.MaxValueApprehensible
Tracing just tells me that the reliable transfer window is full (see my updated question). Any further ideas?Apprehensible
Do you use connectionManagement to set maxconnection for your client (if your session is duplex)? https://learn.microsoft.com/en-us/dotnet/framework/configure-apps/file-schema/network/connectionmanagement-element-network-settings

Your MaxPendingChannels is set to 16384, which will make too many clients wait in the queue; if the server cannot deal with the clients in time, the channel may turn to the faulted state.

FlowControlEnabled determines whether to keep sending messages when the receiving side has no space left to buffer them. You had better set it to true.

InactivityTimeout determines whether to close the session when there is no message exchange within a certain period of time. You had better set it to a suitable value.

In addition, have you set your binding's timeouts?

  <netTcpBinding>
    <binding closeTimeout="" openTimeout="" receiveTimeout="" sendTimeout="" />
  </netTcpBinding>
Taper answered 14/1, 2019 at 6:43 Comment(1)
I do use default values except for the receive timeout.Apprehensible
