Can an OTP supervisor monitor a process on a remote node?

I'd like to use Erlang's OTP supervisor in a distributed application I'm building, but I'm having trouble figuring out how a supervisor of this kind can monitor a process running on a remote node. Unlike Erlang's spawn_link function, start_child has no parameter for specifying the node on which the child will be spawned.

Is it possible for an OTP supervisor to monitor a remote child, and if not, how can I achieve this in Erlang?

Glenda answered 24/11, 2016 at 17:3 Comment(2)
I think the recommended way is to have a supervisor on each node. – Confirmand
@Confirmand Indeed. Actually, not just a supervisor, but usually a full replica of whatever system is being distributed, so that work requests can be made cross-node without requiring any dramatic changes to the code. – Onstad

supervisor:start_child/2 can be used across nodes.

The reason for your confusion is just a mix-up regarding the context of execution (which is admittedly a bit hard to keep straight sometimes). There are three processes involved in any OTP spawn:

  • The Requestor
  • The Supervisor
  • The Spawned Process

The context of the Requestor is the one in which supervisor:start_child/2 is called -- not the context of the supervisor itself. You would normally provide a supervisor interface by exporting a function that wraps the call to supervisor:start_child/2:

%% Called in the context of the requestor, not the supervisor.
do_some_crashable_work(Data) ->
    supervisor:start_child(sooper_dooper_sup, [Data]).

That might be defined and exported from the supervisor module, be defined internally within a "manager" sort of process according to the "service manager/supervisor/workers" idiom, or whatever. In all cases, though, some process other than the supervisor is making this call.

Now look carefully at the Erlang docs for supervisor:start_child/2 again (on erlang.org, or in one of the mirrored copies of the R19.1 docs, since erlang.org sometimes has a hard time). Note that the sup_ref() type can be a registered name, a pid(), {global, Name}, or a {Name, Node} tuple. In other words, the requestor may be on any node and can call a supervisor on any other node by using a pid(), {global, Name}, or {Name, Node} tuple.
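
To make those sup_ref() forms concrete, here is a hedged sketch; the wrapper function names are made up, and sooper_dooper_sup is just the registered supervisor name used elsewhere in this answer:

%% By {Name, Node}: reach a locally registered supervisor on a given node.
start_on_node(SupNode, Data) ->
    supervisor:start_child({sooper_dooper_sup, SupNode}, [Data]).

%% By {global, Name}: reach a globally registered supervisor, wherever it lives.
start_global(Data) ->
    supervisor:start_child({global, sooper_dooper_sup}, [Data]).

%% By pid(): pids are location transparent, so this also works across nodes.
start_by_pid(SupPid, Data) ->
    supervisor:start_child(SupPid, [Data]).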

The supervisor doesn't just randomly kick things off, though. It has a child_spec() it is going off of, and the spec tells the supervisor what to call to start that new process. This first call into the child module is made in the context of the supervisor, and it is a function we write ourselves. Though we typically name it something like start_link/N, it can do whatever we want as part of startup, including declaring a specific node on which to spawn. So now we wind up with something like this:

%% Usually defined in the requestor or supervisor module
do_some_crashable_work(SupNode, WorkerNode, Data) ->
    supervisor:start_child({sooper_dooper_sup, SupNode}, [WorkerNode, Data]).

With a child spec of something like:

%% Usually in the supervisor code
SooperWorker = {sooper_worker,
                {sooper_worker, start_link, []},
                temporary,
                brutal_kill,
                worker,
                [sooper_worker]},

Which indicates that the first call would be to sooper_worker:start_link/2:

%% The exported start_link function in the worker module.
%% Called in the context of the supervisor, which expects {ok, Pid} back.
start_link(Node, Data) ->
    Pid = proc_lib:spawn_link(Node, ?MODULE, init, [self(), Data]),
    {ok, Pid}.

%% The first thing the newly spawned process will execute
%% in its own context, assuming here it is going to be a gen_server.
init(_Parent, Data) ->
    %% proc_lib has already recorded the spawning supervisor as this
    %% process's parent, so gen_server:enter_loop/3 can find it on its own.
    Debug = [],  %% standard gen_server debug flags, e.g. [trace, log]
    {ok, State} = initialize_some_state(Data),
    gen_server:enter_loop(?MODULE, [{debug, Debug}], State).
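
To tie the pieces together, here is a minimal sketch of what sooper_dooper_sup itself might look like, assuming a simple_one_for_one strategy; that assumption is why the [WorkerNode, Data] list handed to start_child/2 gets appended to the empty argument list in the child spec:

%% A minimal, illustrative supervisor module matching the child spec above.
-module(sooper_dooper_sup).
-behaviour(supervisor).

-export([start_link/0]).
-export([init/1]).

start_link() ->
    supervisor:start_link({local, sooper_dooper_sup}, ?MODULE, []).

init([]) ->
    SooperWorker = {sooper_worker,
                    {sooper_worker, start_link, []},
                    temporary,
                    brutal_kill,
                    worker,
                    [sooper_worker]},
    %% With simple_one_for_one, children are started only on demand via
    %% supervisor:start_child/2, and the extra arguments are appended to
    %% the empty argument list in the spec above.
    {ok, {{simple_one_for_one, 5, 60}, [SooperWorker]}}.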

You might be wondering what all that mucking about with proc_lib was for. It turns out that while it is possible to ask for a spawn anywhere in a multi-node system from anywhere else in that system, it just isn't a very useful way of doing business, so the gen_* behaviors and even proc_lib:start_link/N don't provide a way to declare the node on which to spawn a new process.

What you ideally want is nodes that know how to initialize themselves and join the cluster once they are running. Whatever services your system provides are usually best replicated on the other nodes within the cluster; then all you have to write is a way of picking a node, and the business of startup stays node-local in every case. With that in place your ordinary manager/supervisor/worker code doesn't have to change -- stuff just happens, and it doesn't matter that the requestor's PID happens to be on another node, even if that PID is the address to which results must be returned.

Stated another way, we don't really want to spawn workers on arbitrary nodes; what we really want to do is step up to a higher level and request that some work get done by another node, without caring much about how that happens. Remember, to spawn a particular function from an {M,F,A} the node you are calling must have access to the target module and function -- and if it already has a copy of the code, why isn't it just a duplicate of the calling node?
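
As a rough illustration of that "pick a node and ask it to do the work" idea, here is a deliberately naive sketch; pick_node/0, do_some_crashable_work_somewhere/1, and the random choice are placeholders for whatever placement policy actually fits your system:

%% Every node runs the same supervision tree, so the requestor only
%% decides *where* to ask for work, not *how* the worker is started.
pick_node() ->
    Candidates = [node() | nodes()],
    lists:nth(rand:uniform(length(Candidates)), Candidates).

do_some_crashable_work_somewhere(Data) ->
    Node = pick_node(),
    %% Ask the replica supervisor on Node to start the worker there
    %% (Node is passed through to sooper_worker:start_link/2).
    supervisor:start_child({sooper_dooper_sup, Node}, [Node, Data]).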

Hopefully this answer explained more than it confused.

Onstad answered 26/11, 2016 at 10:49 Comment(0)
