Akka.Net remote disconnection
I am using Akka.Net in a very simple client/server configuration; nothing advanced at this point. After about three or four days of sending messages back and forth, the entire system seems to get stuck in a disconnected state. Restarting the services makes everything reconnect, and there are no further issues. Before it gets stuck, connections do drop now and then, but they reconnect right away.

During this time both machines are accessible on the network and don't seem to have any actual connection problems.

I am not sure where to go from here.

Client config (the server's is very similar):

// Requires: using System; using Akka.Configuration;
return ConfigurationFactory
                .ParseString(string.Format(@"
                akka {{
                    loggers = [""XYZ.AkkaLogger, XYZ""]

                    actor {{
                        provider = ""Akka.Remote.RemoteActorRefProvider, Akka.Remote""
                        serializers {{
                            json = ""XYZ.AkkaSerializer, XYZ""
                        }}

                        # debug logging flags are read from akka.actor.debug
                        debug {{
                            receive = on
                            autoreceive = on
                            lifecycle = on
                            event-stream = on
                            unhandled = on
                        }}
                    }}

                    remote {{
                        helios.tcp {{
                            transport-class = ""Akka.Remote.Transport.Helios.HeliosTcpTransport, Akka.Remote""
                            applied-adapters = []
                            transport-protocol = tcp
                            port = 0
                            hostname = {0}
                            send-buffer-size = 512000b
                            receive-buffer-size = 512000b
                            maximum-frame-size = 1024000b
                            tcp-keepalive = on
                        }}

                        transport-failure-detector {{
                            heartbeat-interval = 60 s # default 4s
                            acceptable-heartbeat-pause = 20 s # default 10s
                        }}
                    }}

                    stdout-loglevel = DEBUG
                    loglevel = DEBUG
                }}
                ", Environment.MachineName));

At first this cycle is sporadic, but after a while it keeps repeating and nothing connects again until the service is restarted. Client log excerpt:

WARN  2015-07-31 07:22:12,994 [1584] - Association with remote system akka.tcp://SystemName@Server:8081 has failed; address is now gated for 5000 ms. Reason is: [Disassociated]
ERROR 2015-07-31 07:22:12,994 [1584] - Disassociated
Akka.Remote.EndpointDisassociatedException: Disassociated
   at Akka.Remote.EndpointWriter.PublishAndThrow(Exception reason, LogLevel level)
   at Akka.Remote.EndpointWriter.Unhandled(Object message)
   at Akka.Remote.EndpointWriter.Writing(Object message)
   at Akka.Actor.ActorCell.<>c__DisplayClass3e.<Akka.Actor.IUntypedActorContext.Become>b__3d(Object m)
   at Akka.Actor.ActorBase.AroundReceive(Receive receive, Object message)
   at Akka.Actor.ActorCell.ReceiveMessage(Object message)
   at Akka.Actor.ActorCell.AutoReceiveMessage(Envelope envelope)
   at Akka.Actor.ActorCell.Invoke(Envelope envelope)
DEBUG 2015-07-31 07:22:12,996 [1494] - Disassociated [akka.tcp://SystemName@Client:57284] -> akka.tcp://SystemName@Server:8081
DEBUG 2015-07-31 07:23:13,033 [1469] - Drained buffer with maxWriteCount: 50, fullBackoffCount: 1,smallBackoffCount: 0, noBackoffCount: 0,adaptiveBackoff: 10000
ERROR 2015-07-31 07:24:13,019 [1601] - No response from remote. Handshake timed out or transport failure detector triggered.
DEBUG 2015-07-31 07:24:13,020 [1569] - Disassociated [akka.tcp://SystemName@Client:57284] -> akka.tcp://SystemName@Server:8081
WARN  2015-07-31 07:24:13,020 [1601] - Association with remote system akka.tcp://SystemName@Server:8081 has failed; address is now gated for 5000 ms. Reason is: [Disassociated]
ERROR 2015-07-31 07:24:13,021 [1601] - Disassociated
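The ERROR line "No response from remote. Handshake timed out or transport failure detector triggered." points at the transport-failure-detector block in the config. The minimal C# sketch below models a deadline-style detector so you can reason about the heartbeat-interval = 60 s / acceptable-heartbeat-pause = 20 s combination. It is a simplified model and not Akka.NET code. The class name DeadlineModel is hypothetical. The sketch assumes the peer is declared unavailable once no heartbeat has arrived for heartbeat-interval + acceptable-heartbeat-pause. That is how Akka's DeadlineFailureDetector on the JVM defines its deadline, but the Akka.NET version in use may differ.

```csharp
using System;

// Simplified, hypothetical model (not Akka.NET's actual class) of a
// deadline-style transport failure detector. Assumption: the peer counts as
// unavailable once no heartbeat has arrived for at least
// heartbeat-interval + acceptable-heartbeat-pause.
public sealed class DeadlineModel
{
    private DateTime? _lastHeartbeat;

    public DeadlineModel(TimeSpan heartbeatInterval, TimeSpan acceptableHeartbeatPause)
    {
        if (heartbeatInterval <= TimeSpan.Zero)
            throw new ArgumentOutOfRangeException(nameof(heartbeatInterval));
        if (acceptableHeartbeatPause < TimeSpan.Zero)
            throw new ArgumentOutOfRangeException(nameof(acceptableHeartbeatPause));

        Deadline = heartbeatInterval + acceptableHeartbeatPause;
    }

    // Longest silence tolerated before the peer is declared unavailable.
    public TimeSpan Deadline { get; }

    // Record that a heartbeat arrived at the given time.
    public void Heartbeat(DateTime at)
    {
        _lastHeartbeat = at;
    }

    // Before the first heartbeat the peer is treated as available;
    // afterwards it stays available while the silence is shorter than Deadline.
    public bool IsAvailable(DateTime now)
    {
        return !_lastHeartbeat.HasValue || now - _lastHeartbeat.Value < Deadline;
    }
}
```

Under this model, the question's settings tolerate about 80 s of silence, compared with about 14 s for the defaults quoted in the config comments. Because heartbeats are only sent every 60 s, a single heartbeat delayed by more than about 20 s past its schedule would trip the detector. That could happen, for example, if the heartbeat were queued behind a large write on the same connection.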
Asked by Moynahan, 31/7/2015 at 18:45

Comments (4):
Tyrannosaur: Saw your issue on GitHub for this; it might be a bug related to elapsed system time (I'd check for an OverflowException somewhere). That's one possibility. I'll be looking into this more next week!
Moynahan: I'm not sure about that. I actually have two parallel systems that don't interact; they are virtually identical except for the messages and the port. One seems stable and the other much less so, although both do disconnect from time to time. If a long-running operation is received, are heartbeats skipped?
Tyrannosaur: It could be the issue described here: petabridge.com/blog/large-messages-and-sockets-in-akkadotnet. If a massive message occasionally goes over the socket (one that takes longer than 5 s to process end to end), that could do it. The remoting system runs on its own threads, so it should be safe from side effects of long-running operations. The only exception is if CPU utilization spikes close to 100% and its dedicated heartbeat threads can't get scheduled.
Moynahan: Do you mean a large message that takes longer than 5 s to send end to end? I refactored my system to use many small messages, although some do take 45 s to process.
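This follows up on the large-message discussion and the refactoring to many small messages. One way to keep any single remote message well under the configured maximum-frame-size = 1024000b is to split a serialized payload into bounded chunks and reassemble them on the receiver. The sketch below is plain C# with no Akka.NET dependency. PayloadChunker, Split and Join are hypothetical names. The idea that smaller frames let heartbeats reach the socket sooner is an assumption drawn from the comment above, not something the thread verifies. Choosing a chunk size comfortably below maximum-frame-size, to leave room for serializer and envelope overhead, is also an assumption.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not part of Akka.NET): splits a serialized payload into
// chunks of at most maxChunkBytes and joins them back together on the
// receiving side. Each chunk would be sent as its own remote message.
public static class PayloadChunker
{
    public static List<byte[]> Split(byte[] payload, int maxChunkBytes)
    {
        if (payload == null) throw new ArgumentNullException(nameof(payload));
        if (maxChunkBytes <= 0) throw new ArgumentOutOfRangeException(nameof(maxChunkBytes));

        var chunks = new List<byte[]>();
        int offset = 0;
        while (offset < payload.Length)
        {
            // The last chunk may be shorter than maxChunkBytes.
            int size = Math.Min(maxChunkBytes, payload.Length - offset);
            var chunk = new byte[size];
            Buffer.BlockCopy(payload, offset, chunk, 0, size);
            chunks.Add(chunk);
            offset += size;
        }
        return chunks;
    }

    // Concatenates chunks in list order; the caller must supply them in send order.
    public static byte[] Join(IReadOnlyList<byte[]> chunks)
    {
        if (chunks == null) throw new ArgumentNullException(nameof(chunks));

        int total = 0;
        foreach (var chunk in chunks) total += chunk.Length;

        var result = new byte[total];
        int position = 0;
        foreach (var chunk in chunks)
        {
            Buffer.BlockCopy(chunk, 0, result, position, chunk.Length);
            position += chunk.Length;
        }
        return result;
    }
}
```

A real protocol would also tag each chunk with a message id, an index and a total count. Akka delivers messages between a given sender and receiver pair in order, but a disassociation in the middle of a sequence can still drop chunks, so the receiver should be able to detect gaps. Chunking does not help with messages that take 45 s to process. Per the comment above, that work runs on actor threads rather than on the remoting threads.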

© 2022 - 2024 — McMap. All rights reserved.