erlang OTP Supervisor crashing
Asked Answered
B

3

10

I'm working through the Erlang documentation, trying to understand the basics of setting up an OTP gen_server and supervisor. Whenever my gen_server crashes, my supervisor crashes as well. In fact, whenever I have an error on the command line, my supervisor crashes.

I expect the gen_server to be restarted when it crashes. I expect command line errors to have no bearing whatsoever on my server components. My supervisor shouldn't be crashing at all.

The code I'm working with is a basic "echo server" that replies with whatever you send in, and a supervisor that will restart the echo_server 5 times per minute at most (one_for_one). My code:

echo_server.erl

-module(echo_server).
-behaviour(gen_server).

-export([start_link/0]).
-export([echo/1, crash/0]).
-export([init/1, handle_call/3, handle_cast/2]).

start_link() ->
    gen_server:start_link({local, echo_server}, echo_server, [], []).

%% public api
echo(Text) ->
    gen_server:call(echo_server, {echo, Text}).
crash() ->
    gen_server:call(echo_server, crash)..

%% behaviours
init(_Args) ->
    {ok, none}.
handle_call(crash, _From, State) ->
    X=1,
    {reply, X=2, State}.
handle_call({echo, Text}, _From, State) ->
    {reply, Text, State}.
handle_cast(_, State) ->
    {noreply, State}.

echo_sup.erl

-module(echo_sup).
-behaviour(supervisor).
-export([start_link/0]).
-export([init/1]).

start_link() ->
    supervisor:start_link(echo_sup, []).
init(_Args) ->
    {ok,  {{one_for_one, 5, 60},
       [{echo_server, {echo_server, start_link, []},
             permanent, brutal_kill, worker, [echo_server]}]}}.

Compiled using erlc *.erl, and here's a sample run:

Erlang R13B01 (erts-5.7.2) [source] [smp:2:2] [rq:2] [async-threads:0] [kernel-p
oll:false]

Eshell V5.7.2  (abort with ^G)
1> echo_sup:start_link().
{ok,<0.37.0>}
2> echo_server:echo("hi").
"hi"
3> echo_server:crash().   

=ERROR REPORT==== 5-May-2010::10:05:54 ===
** Generic server echo_server terminating 
** Last message in was crash
** When Server state == none
** Reason for termination == 
** {'function not exported',
       [{echo_server,terminate,
            [{{badmatch,2},
              [{echo_server,handle_call,3},
               {gen_server,handle_msg,5},
               {proc_lib,init_p_do_apply,3}]},
             none]},
        {gen_server,terminate,6},
        {proc_lib,init_p_do_apply,3}]}

=ERROR REPORT==== 5-May-2010::10:05:54 ===
** Generic server <0.37.0> terminating 
** Last message in was {'EXIT',<0.35.0>,
                           {{{undef,
                                 [{echo_server,terminate,
                                      [{{badmatch,2},
                                        [{echo_server,handle_call,3},
                                         {gen_server,handle_msg,5},
                                         {proc_lib,init_p_do_apply,3}]},
                                       none]},
                                  {gen_server,terminate,6},
                                  {proc_lib,init_p_do_apply,3}]},
                             {gen_server,call,[echo_server,crash]}},
                            [{gen_server,call,2},
                             {erl_eval,do_apply,5},
                             {shell,exprs,6},
                             {shell,eval_exprs,6},
                             {shell,eval_loop,3}]}}
** When Server state == {state,
                            {<0.37.0>,echo_sup},
                            one_for_one,
                            [{child,<0.41.0>,echo_server,
                                 {echo_server,start_link,[]},
                                 permanent,brutal_kill,worker,
                                 [echo_server]}],
                            {dict,0,16,16,8,80,48,
                                {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
                                 []},
                                {{[],[],[],[],[],[],[],[],[],[],[],[],[],[],
                                  [],[]}}},
                            5,60,
                            [{1273,79154,701110}],
                            echo_sup,[]}
** Reason for termination == 
** {{{undef,[{echo_server,terminate,
                          [{{badmatch,2},
                            [{echo_server,handle_call,3},
                             {gen_server,handle_msg,5},
                             {proc_lib,init_p_do_apply,3}]},
                           none]},
             {gen_server,terminate,6},
             {proc_lib,init_p_do_apply,3}]},
     {gen_server,call,[echo_server,crash]}},
    [{gen_server,call,2},
     {erl_eval,do_apply,5},
     {shell,exprs,6},
     {shell,eval_exprs,6},
     {shell,eval_loop,3}]}
** exception exit: {{undef,
                        [{echo_server,terminate,
                             [{{badmatch,2},
                               [{echo_server,handle_call,3},
                                {gen_server,handle_msg,5},
                                {proc_lib,init_p_do_apply,3}]},
                              none]},
                         {gen_server,terminate,6},
                         {proc_lib,init_p_do_apply,3}]},
                    {gen_server,call,[echo_server,crash]}}
     in function  gen_server:call/2
4> echo_server:echo("hi").
** exception exit: {noproc,{gen_server,call,[echo_server,{echo,"hi"}]}}
     in function  gen_server:call/2
5>
Bortz answered 5/5, 2010 at 16:39 Comment(0)
G
18

The problem testing supervisors from the shell is that the supervisor process is linked to the shell process. When gen_server process crashes the exit signal is propagated up to the shell which crashes and get restarted.

To avoid the problem add something like this to the supervisor:

start_in_shell_for_testing() ->
    {ok, Pid} = supervisor:start_link(echo_sup, []),
    unlink(Pid).
Gyno answered 5/5, 2010 at 18:20 Comment(2)
I would add that this is only a valid approach when developing or debugging. In a live production system, it is better to try and wrap your code in a standard OTP application under a normal supervision tree.Gluey
Take a look here #6720972 I haven't written erlang code for some time now, but if I remember well, what is happening is that, when the gen_server crashes, the exit signal is propagated to all linked processes. You can trap the exit signal as in the link above, but you almost never want to do it. It's erlang let it crash philosophy.Gyno
H
10

I would suggest you to debug/trace your application to check what's going on. It's very helpful in understanding how things work in OTP.

In your case, you might want to do the following.

Start the tracer:

dbg:tracer().

Trace all function calls for your supervisor and your gen_server:

dbg:p(all,c).
dbg:tpl(echo_server, x).
dbg:tpl(echo_sup, x).

Check which messages the processes are passing:

dbg:p(new, m).

See what's happening to your processes (crash, etc):

dbg:p(new, p).

For more information about tracing:

http://www.erlang.org/doc/man/dbg.html

http://aloiroberto.wordpress.com/2009/02/23/tracing-erlang-functions/

Hope this can help for this and future situations.

HINT: The gen_server behaviour is expecting the callback terminate/2 to be defined and exported ;)

UPDATE: After the definition of the terminate/2 the reason of the crash is evident from the trace. This is how it looks:

We (75) call the crash/0 function. This is received by the gen_server (78).

(<0.75.0>) call echo_server:crash()
(<0.75.0>) <0.78.0> ! {'$gen_call',{<0.75.0>,#Ref<0.0.0.358>},crash}
(<0.78.0>) << {'$gen_call',{<0.75.0>,#Ref<0.0.0.358>},crash}
(<0.78.0>) call echo_server:handle_call(crash,{<0.75.0>,#Ref<0.0.0.358>},none)

Uh, problem on the handle call. We have a badmatch...

(<0.78.0>) exception_from {echo_server,handle_call,3} {error,{badmatch,2}}

The terminate function is called. The server exits and it gets unregistered.

(<0.78.0>) call echo_server:terminate({{badmatch,2},
 [{echo_server,handle_call,3},
  {gen_server,handle_msg,5},
  {proc_lib,init_p_do_apply,3}]},none)
(<0.78.0>) returned from echo_server:terminate/2 -> ok
(<0.78.0>) exit {{badmatch,2},
 [{echo_server,handle_call,3},
  {gen_server,handle_msg,5},
  {proc_lib,init_p_do_apply,3}]}
(<0.78.0>) unregister echo_server

The Supervisor (77) receive the exit signal from the gen_server and it does its job:

(<0.77.0>) << {'EXIT',<0.78.0>,
                      {{badmatch,2},
                       [{echo_server,handle_call,3},
                        {gen_server,handle_msg,5},
                        {proc_lib,init_p_do_apply,3}]}}
(<0.77.0>) getting_unlinked <0.78.0>
(<0.75.0>) << {'DOWN',#Ref<0.0.0.358>,process,<0.78.0>,
                      {{badmatch,2},
                       [{echo_server,handle_call,3},
                        {gen_server,handle_msg,5},
                        {proc_lib,init_p_do_apply,3}]}}
(<0.77.0>) call echo_server:start_link()

Well, it tries... Since it happens what Filippo said...

Humankind answered 5/5, 2010 at 18:20 Comment(3)
thanks for the debug tips. The "function not defined ... terminate" error confused me, too. The gen_server behaviour shouldn't expect terminate to be defined, since echo_server is not trapping exits. That is per the docs, anyway; I haven't read OTP's code yet.Bortz
Well, defining and exporting the terminate/2 will remove the UNDEF, showing the real reason of the crash (badmatch on 2). The linking stuff is another story... You confused me a bit now. What do you mean exactly with "the gen_server behaviour shouldn't expect terminate to be defined, since echo_server is not trapping exits"?Humankind
Awesome update, thank you for that. The docs I'm referring to are the OTP Design Manual > gen_server > stopping. "If it is necessary to clean up before termination, the shutdown strategy must be a timeout value and the gen_server must be set to trap exit signals in the init function. When ordered to shutdown, the gen_server will then call the callback function terminate(shutdown, State)" My interpretation is that defining terminate/2 is optional, based on the needs of your gen_server. Is that not the case?Bortz
S
2

On the other hand, if at all restart-strategy has to be tested from within console, use console to start the supervisor and check with pman to kill the process.

You would see that pman refreshes with same supervisor Pid but with different worker Pids depending upon the MaxR and MaxT you have set in restart-strategy.

Stifle answered 17/3, 2012 at 12:0 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.