Distributed Erlang - network split recovery and using heart with distributed applications

I have a standard situation: two distributed Erlang nodes, one master and one standby.

When I stop the master, the standby comes up (failover); when I start the master again, the standby stops (takeover). Everything works fine as long as heart is not turned on and there is no network split.

However, when I disconnect the master from the network, after 60 seconds or so the standby gives me the error message ** removing (timedout) connection ** and starts up as if the master node had stopped. This makes sense to me: the standby doesn't know whether the master is alive or not, and since the connection to the master node has timed out, the master is removed from the nodes() list. Let's pretend for a moment that this is the desired outcome.
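
For what it's worth, the roughly 60-second window matches the kernel's default net_ticktime, which controls how long an unresponsive distribution connection is kept before the ** removing (timedout) connection ** message fires. A minimal sys.config sketch, in case you want to tune it (60 seconds is the Erlang default):

    %% sys.config (kernel section)
    %% net_ticktime is in seconds; lower values detect a split sooner,
    %% higher values tolerate slow links longer before dropping the node.
    [{kernel, [
        {net_ticktime, 60}
    ]}].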

The problem is that, when the connection is restored, I have the master and the standby running at the same time, and the standby is oblivious to the fact that the master is running. Pinging the standby during the master's init does not solve the issue. I checked nodes() on the standby after doing so; it sees the master node but still continues to run.

My solution for now has been to create a process that monitors all nodes above the current node in the hierarchy; if any of them are online (can be pinged), the process calls erlang:halt() to terminate the standby node. It works for simple situations, but maybe someone can tell me if there is a better way? I found a similar problem described on the Elixir forum, so it is probably a known Erlang problem without an easy solution. https://elixirforum.com/t/distributed-application-network-split/10007
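
For reference, a minimal sketch of that watchdog process, assuming the list of higher-priority node names is passed in and a 5-second poll interval (both arbitrary choices):

    -module(split_watchdog).
    -export([start_link/1]).

    %% HigherNodes: nodes above this one in the hierarchy. If any of them
    %% becomes reachable (can be pinged), halt this node so it stops acting
    %% as master; heart or an OS supervisor can then restart it as standby.
    start_link(HigherNodes) ->
        {ok, spawn_link(fun() -> loop(HigherNodes) end)}.

    loop(HigherNodes) ->
        case lists:any(fun(N) -> net_adm:ping(N) =:= pong end, HigherNodes) of
            true  -> erlang:halt();   %% a higher-priority node is alive
            false -> ok
        end,
        timer:sleep(5000),            %% poll interval
        loop(HigherNodes).

On the standby you would start it with the master's (hypothetical) node name, e.g. split_watchdog:start_link(['master@host']).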

If you don't want two nodes running in parallel during a network split, I'm guessing an outside monitoring application needs to be used?

The second major issue is heart. If heart is turned on as-is, the failover never happens. If heart runs with a sleep before it calls start, it still stops the failover node at the moment it calls application start. So even when it can't start the master (due to it not having access to vital resources, for example), it stops the failover node, and it doesn't bring the failover back up after it fails to start the master. I don't know whether heart is simply not supposed to be used with a distributed application, or whether there is an option to run some Erlang code that checks if the resources are available before attempting to start the node and before stopping the failover node.

The documentation on heart is not great; it's very hard to find any examples of HEART_COMMAND. I found a way to set HEART_COMMAND to a script from within my application, but there is a limit on how long the argument can be, and from what I can tell it's shorter than the documentation states. This, for example, sets a 60-second sleep before calling application start again: heart:set_cmd("sleep 60; ./bin/myapp start"). It doesn't solve any issues, because after 60 seconds it stops the failover node and hangs if the master node can't start.
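
For anyone else digging through this, the moving parts as I understand them: heart is enabled with the -heart emulator flag (e.g. in vm.args), and when the VM dies heart executes whatever HEART_COMMAND holds; heart:set_cmd/1, as used above, overrides that value at runtime. A configuration sketch with placeholder paths:

    ## vm.args: enable heart for this node
    -heart

    ## environment, set before starting the node (the runtime
    ## alternative is heart:set_cmd/1)
    export HEART_COMMAND="./bin/myapp start"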

The solution I've ended up with for now is letting heart of the main release start another release, a pre-loader, which does a preliminary check that all resources are available: if they are, it starts the main release/application, and if they are not, it continues checking forever. This way the main app keeps running on the failover node without interruption. So the main release has heart turned on, and the pre-loader does not. I ended up using a bash script because I needed to do more work than I could fit in heart:set_cmd/1, so I'm still calling heart:set_cmd(Dir ++ "/priv/myHeartScript.sh " ++ Arg1 ++ " " ++ Arg2), but don't get carried away with the Args as there is a limit on size! I also used environment variables, set in vm.args using -env, to pass data such as the pre-loader path/name to the script. This allowed me to avoid having to edit the script during deployment.
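
A minimal sketch of the pre-loader's loop, where resources_available/0 is a placeholder for whatever vital-resource checks apply and "./bin/myapp start" stands in for the main release's start script:

    -module(preloader).
    -export([start/0]).

    %% Keep checking until the vital resources are available, then launch
    %% the main release; never give up, so in the meantime the failover
    %% node keeps running the application without interruption.
    start() ->
        case resources_available() of
            true ->
                os:cmd("./bin/myapp start");   %% placeholder release path
            false ->
                timer:sleep(10000),            %% retry every 10 s
                start()
        end.

    resources_available() ->
        %% Placeholder: substitute real checks (database reachable,
        %% mounts present, ports free, ...).
        filelib:is_dir("priv").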

If anyone has a better solution PLEASE let me know.

UPDATE

The team at Erlang Solutions was kind enough to shed some light on the subject. Basically, nobody they know uses Erlang's built-in distributed model. Everything revolves around the data: as long as it is available on redundant databases, you can spin up new application instances at any time. They recommend using cloud hosts that can spin up new servers when one goes down, or a redundant node design: have, say, five nodes up in parallel, and if a few go down you can restart them manually or by other means.

As for me, I can say that getting heart to start a pre-loader release/app gets the job done, but it gets complicated fast: launching the app now requires provisioning several extra sys.config/vm.args/rebar.config files. I will be looking into their suggestions for the next iteration.

UPDATE

Moved away from using the Erlang distributed model. Now using RabbitMQ to send heartbeats to all nodes, including the sender itself. If a node receives heartbeats only from itself, it's the master; if it receives heartbeats from more than one node, use any attribute, such as the node name, to choose the master. You don't have to use RabbitMQ, but you need to make sure all nodes can reach the same destination and consume from it.
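
A minimal sketch of just the election rule, assuming the transport (RabbitMQ or anything else) hands you the list of node names whose heartbeats arrived in the last window:

    -module(hb_elect).
    -export([decide/1]).

    %% Nodes: node names whose heartbeats were received recently,
    %% normally including our own.
    decide([]) ->
        standby;                  %% not even our own heartbeat: stay passive
    decide([Self]) when Self =:= node() ->
        master;                   %% the only heartbeat we see is our own
    decide(Nodes) ->
        case lists:min(Nodes) of  %% deterministic tie-break by node name
            Self when Self =:= node() -> master;
            _Other                    -> standby
        end.

With heartbeats visible from several nodes, the one whose name sorts lowest keeps running as master and the others stand by, so a healed split converges without manual intervention.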

Also, DevOps opposes using heart; they prefer standard Linux tools to monitor the application's status and restart it after a crash or a server reboot.

Ashlan answered 12/4, 2018 at 21:31 (6 comments)
Can you show the contents of your config file? Your system should be noticing that the primary has restarted and it should perform a takeover from the secondary. Do you have sasl enabled, or have you tried using observer, so you can see more details about what's going on? - Bechuana
Hi Steve, the config is pretty standard; sasl is enabled, though it might start after my app, which could be a problem, I have to check. I don't know what observer would tell me in this case. I've also noticed a big problem with heart: when heart tries to restart the master node it stops the failover node, and after it tries and fails it doesn't fail over to the standby again. [{kernel, [{distributed, [{myapp, 5000, [[email protected], [email protected]]}]}, {sync_nodes_mandatory, []}, {sync_nodes_optional, [[email protected]]}, {sync_nodes_timeout, 20000}]}]. - Ashlan
Are you using releases, or some other approach, to ensure your application starts as part of each node's bootup? - Bechuana
I'm using rebar3 release and testing this way, not rebar3 as prod. Would that make any difference? - Ashlan
No, rebar3 release should be fine. Another question: have you tried setting your primary as a mandatory sync node? Seems like, as things currently stand, your application can start without requiring any nodes to be running. And regarding heart, I would avoid it at first until you can get a takeover from secondary to primary to occur, and once that works, then reintroduce heart. - Bechuana
Steve, failover and takeover work fine without heart, except when a network split happens. With a network split, the failover node would not be able to restart using heart if the master were mandatory, which is my logic for making it optional. - Ashlan
