I'm building a monitoring tool in Erlang. When run on a cluster, it should run a set of data collection functions on all nodes and record that data using RRD on a single "recorder" node.
The current version has a supervisor running on the master node (rolf_node_sup) which attempts to run a second supervisor on each node in the cluster (rolf_service_sup). Each of the on-node supervisors should then start and monitor a bunch of processes which send messages back to a gen_server on the master node (rolf_recorder).
This only works locally. No supervisor is started on any remote node. I use the following code to attempt to load the on-node supervisor from the recorder node:
rpc:call(Node, supervisor, start_child, [{global, rolf_node_sup}, [Services]])
I've found a couple of people suggesting that supervisors are really only designed for local processes.
What is the most OTP way to implement my requirement to have supervised code running on all nodes in a cluster?
- A distributed application is suggested as one alternative to a distributed supervisor tree. These don't fit my use case: they provide for failover between nodes, but not for keeping code running on a set of nodes.
- The pool module is interesting. However, it provides for running a job on the node which is currently the least loaded, rather than on all nodes.
- Alternatively, I could create a set of supervised "proxy" processes (one per node) on the master which use proc_lib:spawn_link to start a supervisor on each node. If something goes wrong on a node, the proxy process should die and then be restarted by its supervisor, which in turn should restart the remote processes. The slave module could be very useful here.
- Or maybe I'm overcomplicating. Is directly supervising nodes a bad idea? Perhaps I should instead architect the application to gather data in a more loosely coupled way: build a cluster by running the app on multiple nodes, tell one to be the master, and leave it at that!
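To make the "proxy" option from the list above concrete, here is a minimal sketch of what I have in mind. The module name node_proxy_sketch, the start_remote/2 and holder/2 functions, and the {Name, {M, F, A}} shape of Services are all hypothetical, not part of my current code. It also assumes the module is already on the code path of the remote node.

```erlang
-module(node_proxy_sketch).
-behaviour(supervisor).

-export([start_remote/2, holder/2]).
-export([init/1]).

%% Called by a proxy process on the master. Spawns a linked "holder"
%% process on Node; the holder starts a supervisor there and then blocks.
%% Assumption: this module is loadable on Node.
start_remote(Node, Services) ->
    Holder = proc_lib:spawn_link(Node, ?MODULE, holder, [self(), Services]),
    receive
        {Holder, started, Sup} -> {ok, Holder, Sup}
    after 5000 ->
        {error, timeout}
    end.

%% Runs on the remote node. supervisor:start_link/2 links the supervisor
%% to this process, so the supervisor lives only as long as the holder.
holder(Parent, Services) ->
    {ok, Sup} = supervisor:start_link(?MODULE, Services),
    Parent ! {self(), started, Sup},
    receive
        stop -> ok
    end.

%% Supervisor callback. Services is assumed to be a list of
%% {Name, {Module, Function, Args}} tuples (hypothetical shape).
init(Services) ->
    SupFlags = #{strategy => one_for_one, intensity => 5, period => 10},
    Children = [#{id => Name, start => MFA} || {Name, MFA} <- Services],
    {ok, {SupFlags, Children}}.
```

The idea would be that each proxy calls node_proxy_sketch:start_remote(Node, Services) when it starts. Because the proxy is linked to the holder, losing the connection to Node should make the proxy exit, so its own supervisor on the master can restart it and try again. I have not tested this across real nodes.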
Some requirements:
- The architecture should be able to cope with nodes joining and leaving the pool without manual intervention.
- I'd like to build a single-master solution, at least initially, for the sake of simplicity.
- I would prefer to use existing OTP facilities over hand-rolled code in my implementation.
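For the first requirement, one existing facility I am aware of is net_kernel:monitor_nodes/1, which delivers {nodeup, Node} and {nodedown, Node} messages to the subscribing process. A rough sketch of a process on the master that tracks cluster membership this way follows; the module name node_watcher_sketch and the known_nodes/1 function are hypothetical, and the place where collectors would actually be started is left as a comment.

```erlang
-module(node_watcher_sketch).
-behaviour(gen_server).

-export([start_link/0, known_nodes/1]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

start_link() ->
    gen_server:start_link(?MODULE, [], []).

%% Returns the nodes this watcher currently believes are connected.
known_nodes(Pid) ->
    gen_server:call(Pid, known_nodes).

init([]) ->
    %% Subscribe first, then read nodes(), so a node that connects in
    %% between is not missed (it may be seen twice; usort handles that).
    ok = net_kernel:monitor_nodes(true),
    {ok, lists:usort(nodes())}.

handle_call(known_nodes, _From, Nodes) ->
    {reply, Nodes, Nodes}.

handle_cast(_Msg, Nodes) ->
    {noreply, Nodes}.

handle_info({nodeup, Node}, Nodes) ->
    %% This is where collectors for Node would be started.
    {noreply, lists:usort([Node | Nodes])};
handle_info({nodedown, Node}, Nodes) ->
    %% This is where any per-node state for Node would be cleaned up.
    {noreply, lists:delete(Node, Nodes)};
handle_info(_Other, Nodes) ->
    {noreply, Nodes}.
```

This only covers noticing that nodes come and go. How the collectors on each node get started and supervised is still the open question above.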