wait3 (waitpid alias) returns -1 with errno set to ECHILD when it should not
Asked Answered
X

1

31

Context is this Redis issue. We have a wait3() call that waits for the AOF rewriting child to create the new AOF version on disk. When the child is done, the parent is notified via wait3() in order to substitute the old AOF with the new one.

However in the context of the above issue the user notified us about a bug. I modified a bit the implementation of Redis 3.0 in order to clearly log when wait3() returned -1 instead of crashing because of this unexpected condition. So this is what happens apparently:

  1. wait3() is called when we have pending children to wait for.
  2. the SIGCHLD should be set to SIG_DFL, there is no code setting this signal at all in Redis, so it's the default behavior.
  3. When the first AOF rewrite happens, wait3() successfully works as expected.
  4. Starting from the second AOF rewrite (the second child created), wait3() starts to return -1.

AFAIK it is not possible in the current code that we call wait3() while there are no pending children, since when the AOF child is created, we set server.aof_child_pid to the value of the pid, and we reset it only after a successful wait3() call.

So wait3() should have no reason to fail with -1 and ECHILD, but it does, so probably the zombie child is not created for some unexpected reason.

Hypothesis 1: It is possible that Linux during certain odd conditions will discard the zombie child, for example because of memory pressure? Does not look reasonable since the zombie has just metadata attached to it but who knows.

Note that we call wait3() with WNOHANG. And given that SIGCHLD is set to SIG_DFL by default, the only condition that should lead to failing and returning -1 and ECHLD should be no zombie available to report the information.

Hypothesis 2: Other thing that could happen but there is no explanation if it happens, is that after the first child dies, the SIGCHLD handler is set to SIG_IGN, causing wait3() to return -1 and ECHLD.

Hypothesis 3: Is there some way to remove the zombie children externally? Maybe this user has some kind of script that removes zombie processes in the background so that then the information is no longer available for wait3()? To my knowledge it should never be possible to remove the zombie if the parent does not wait for it (with waitpid or handling the signal) and if the SIGCHLD is not ignored, but maybe there is some Linux specific way.

Hypothesis 4: There is actually some bug in the Redis code so that we successfully wait3() the child the first time without correctly resetting the state, and later we call wait3() again and again but there are no longer zombies, so it returns -1. Analyzing the code it looks impossible, but maybe I'm wrong.

Another important thing: we never observed this in the past. Only happens in this specific Linux system apparently.

UPDATE: Yossi Gottlieb proposed that the SIGCHLD is received by another thread in the Redis process for some reason (does not happen normally, only on this system). We already mask SIGALRM in bio.c threads, perhaps we could try masking SIGCHLD from I/O threads as well.

Appendix: selected parts of Redis code

Where wait3() is called:

/* Check if a background saving or AOF rewrite in progress terminated. */
if (server.rdb_child_pid != -1 || server.aof_child_pid != -1) {
    int statloc;
    pid_t pid;

    if ((pid = wait3(&statloc,WNOHANG,NULL)) != 0) {
        int exitcode = WEXITSTATUS(statloc);
        int bysignal = 0;

        if (WIFSIGNALED(statloc)) bysignal = WTERMSIG(statloc);

        if (pid == -1) {
            redisLog(LOG_WARNING,"wait3() returned an error: %s. "
                "rdb_child_pid = %d, aof_child_pid = %d",
                strerror(errno),
                (int) server.rdb_child_pid,
                (int) server.aof_child_pid);
        } else if (pid == server.rdb_child_pid) {
            backgroundSaveDoneHandler(exitcode,bysignal);
        } else if (pid == server.aof_child_pid) {
            backgroundRewriteDoneHandler(exitcode,bysignal);
        } else {
            redisLog(REDIS_WARNING,
                "Warning, detected child with unmatched pid: %ld",
                (long)pid);
        }
        updateDictResizePolicy();
    }
} else {

Selected parts of backgroundRewriteDoneHandler:

void backgroundRewriteDoneHandler(int exitcode, int bysignal) {
    if (!bysignal && exitcode == 0) {
        int newfd, oldfd;
        char tmpfile[256];
        long long now = ustime();
        mstime_t latency;

        redisLog(REDIS_NOTICE,
            "Background AOF rewrite terminated with success");

        ... more code to handle the rewrite, never calls return ...

    } else if (!bysignal && exitcode != 0) {
        server.aof_lastbgrewrite_status = REDIS_ERR;

        redisLog(REDIS_WARNING,
            "Background AOF rewrite terminated with error");
    } else {
        server.aof_lastbgrewrite_status = REDIS_ERR;

        redisLog(REDIS_WARNING,
            "Background AOF rewrite terminated by signal %d", bysignal);
    }

cleanup:
    aofClosePipes();
    aofRewriteBufferReset();
    aofRemoveTempFile(server.aof_child_pid);
    server.aof_child_pid = -1;
    server.aof_rewrite_time_last = time(NULL)-server.aof_rewrite_time_start;
    server.aof_rewrite_time_start = -1;
    /* Schedule a new rewrite if we are waiting for it to switch the AOF ON. */
    if (server.aof_state == REDIS_AOF_WAIT_REWRITE)
        server.aof_rewrite_scheduled = 1;
}

As you can see all the code paths must execute the cleanup code that reset server.aof_child_pid to -1.

Errors logged by Redis during the issue

21353:C 29 Nov 04:00:29.957 * AOF rewrite: 8 MB of memory used by copy-on-write

27848:M 29 Nov 04:00:30.133 ^@ wait3() returned an error: No child processes. rdb_child_pid = -1, aof_child_pid = 21353

As you can see aof_child_pid is not -1.

Xerophagy answered 30/11, 2015 at 8:46 Comment(20)
To me this sounds as if you are testing to fast, to early, the child simply did not end yet.Undersell
Perhaps you might like to elaborate on how you make sure this: "wait3() is called when we have pending children to wait for." really is the case, as it obviously isn't. I have to admit, I do not know the Redis code, but which other mechanics would you use to synchronize processes regarding their live time, but using calls to wait*()? I'd say you are facing a race.Undersell
@alk: About the "testing too fast" comment, wait3 should return 0 if the child is still working not -1, as I used WNOHANG as specified. About your second comment, you can find the relevant code snippet in the appendix where it is highlighted where wait3 is called, how the work it does is handled by the function backgroundRewriteDoneHandler(), and how this function can only result, AFAIK, into setting aof_child_pid to -1.Xerophagy
Perhaps the moment you call wait3() the child did not even started yet (completely)?Undersell
Not possible because of fork() semantics. The parent is notified with the child pid when fork() returns, and the child process exists from now on. However if you check the original issue, there is a continuous look of wait3() calls so even in the future it never returns something different.Xerophagy
For debugging: Handle pid==-1/ECHLD the same as pid==0 and log whether wait3() finds the process in question later.Undersell
I am not sure I understood the set up correctly: Do you observer the described behaviour also when explicitly setting signal(SIGCHLD, SIG_DFL);?Undersell
Per man 7 signal the default for SIGCHLD is to be ignored.Undersell
Also to have more portable code (and probably less such issues as you are observing) you want to replace signal() by sigaction().Undersell
SIGCHLD by default is set to SIG_DFL, just the documentation is a bit confusing. Also note that this works in all the Redis installations out there but in this system explicitly and reproducibly. The default action for SIGCHLD is to ignore the signal in the sense no signal hander is called, but zombie processes will queue waiting for waitpid() to receive their status. Btw if you check at the end of the Redis issue at Github, it was suggested for the user to explicitly set the signal to SIG_DFL and retry. However note that the first time the child exists, waitpid() works.Xerophagy
@Xerophagy The older unix signal did reset the signal handler to default (SIG_DFL) after the first handling of a signal. So it's possible that hypothesis 2 to happen. Simply replace signal() call with sigaction() (which doesn't reset to SIG_DFL) to see if this is true.Arietta
@Blue Moon: yep I though to this, but it's a recent Linux system, Ubuntu 12.04, and the paradox is, SIG_DFL is what we want, so that is almost impossible to be the cause. Hypothesis two is that, incredibly, it would be reset to SIG_IGN instead of SIG_DFL. Anyway se use sigaction() in the rest of the code but for those two signals, so indeed, it's better to switch everything to sigaction() for clearness, but in this specific matter should not be related.Xerophagy
@Xerophagy with regards to your updates about threads and masking signals, it might help to mask all signals on all but the thread(s) that you're okay to receive signals on. Not really related, but for example, in PulseAudio, in the client library, we explicitly mask receiving any signals at all since we want no interruptions in a potentially real-time I/O thread.Ambry
@antirez, since apparently you know the pid of the child you are waiting for, and you want to wait specifically for that particular child, it seems it would be wiser to use wait4() than wait3(). I doubt that will fix the issue you asked about, but it may save you grief later.Cotenant
@antirez, on the other hand, is there any chance that another thread collects the expected child first?Cotenant
In any event, if indeed the behavior can be reproduced only in a specific system, then that strongly suggests that some unusual characteristic of that system is influencing it. Perhaps it is the system that is broken, rather than Redis. Perhaps a buggy kernel module or a misconfigured custom kernel is in play.Cotenant
Redis has another wait3() call in sentinelCollectTerminatedScripts(), can we be certain this doesn't eat up the processes identified by rdb_child_pid /server.aof_child_pid in this case ?Posture
Building on the comment by @nos, stopAppendOnly in aof.c also kills and wait3s, specifically for the process in question. Can you be sure some other code hasn't already won the race to reap the child? It may be a red herring, but in your bug report the errors all happen very near five minute boundaries — perhaps cron or some other timed maintenance job is causing the unexpected reap?Galenism
I would start by running the code behind strace(1), make sure to use -e so that you only see the calls to wait3(2). This will tell you if the code is actually trying to reap the same child more than once or if it's something external that is wiping the zombie (misconfigured or buggy kernel, or other weird scripts running on that user's machine).Planometer
Also, were you able to reproduce this issue? Maybe installing the same kernel image in a virtual machine?Planometer
T
6

TLDR: you are currently relying on unspecified behaviour of signal(2); use sigaction (carefully) instead.

Firstly, SIGCHLD is strange. From the manual page for sigaction;

POSIX.1-1990 disallowed setting the action for SIGCHLD to SIG_IGN. POSIX.1-2001 allows this possibility, so that ignoring SIGCHLD can be used to prevent the creation of zombies (see wait(2)). Nevertheless, the historical BSD and System V behaviors for ignoring SIGCHLD differ, so that the only completely portable method of ensuring that terminated children do not become zombies is to catch the SIGCHLD signal and perform a wait(2) or similar.

And here's the bit from wait(2)'s manual page:

POSIX.1-2001 specifies that if the disposition of SIGCHLD is set to SIG_IGN or the SA_NOCLDWAIT flag is set for SIGCHLD (see sigaction(2)), then children that terminate do not become zombies and a call to wait() or waitpid() will block until all children have terminated, and then fail with errno set to ECHILD. (The original POSIX standard left the behavior of setting SIGCHLD to SIG_IGN unspecified. Note that even though the default disposition of SIGCHLD is "ignore", explicitly setting the disposition to SIG_IGN results in different treatment of zombie process children.) Linux 2.6 conforms to this specification. However, Linux 2.4 (and earlier) does not: if a wait() or waitpid() call is made while SIGCHLD is being ignored, the call behaves just as though SIGCHLD were not being ignored, that is, the call blocks until the next child terminates and then returns the process ID and status of that child.

Note the effect of that is that if the signal's handling behaves like SIG_IGN is set, then (under Linux 2.6+) you will see the behaviour you are seeing - i.e. wait() will return -1 and ECHLD because the child will have been automatically reaped.

Secondly, signal handling with pthreads (which I think you are using here) is notoriously hard. The way it's meant to work (as I'm sure you know) is that process directed signals get sent to an arbitrary thread within the process that has the signal unmasked. But whilst threads have their own signal mask, there is a process wide action handler.

Putting these two things together, I think you are running across a problem I've run across before. I have had problems getting SIGCHLD handling to work with signal() (which is fair enough as that was deprecated prior to pthreads), which were fixed by moving to sigaction and carefully setting per thread signal masks. My conclusion at the time was that the C library was emulating (with sigaction) what I was telling it to do with signal(), but was getting tripped up by pthreads.

Note that you are currently relying on unspecified behaviour. From the manual page of signal(2):

The effects of signal() in a multithreaded process are unspecified.

Here's what I recommend you do:

  1. Move to sigaction() and pthread_sigmask(). Explicitly set the handling of all the signals you care about (even if you think that's the current default), even when setting them to SIG_IGN or SIG_DFL. I block signals whilst I do this (possibly overabundance of caution but I copied the example from somewhere).

Here's what I am doing (roughly):

sigset_t set;
struct sigaction sa;

/* block all signals */
sigfillset (&set);
pthread_sigmask (SIG_BLOCK, &set, NULL);

/* Set up the structure to specify the new action. */
memset (&sa, 0, sizeof (struct sigaction));
sa.sa_handler = handlesignal;        /* signal handler for INT, TERM, HUP, USR1, USR2 */
sigemptyset (&sa.sa_mask);
sa.sa_flags = 0;
sigaction (SIGINT, &sa, NULL);
sigaction (SIGTERM, &sa, NULL);
sigaction (SIGHUP, &sa, NULL);
sigaction (SIGUSR1, &sa, NULL);
sigaction (SIGUSR2, &sa, NULL);

sa.sa_handler = SIG_IGN;
sigemptyset (&sa.sa_mask);
sa.sa_flags = 0;
sigaction (SIGPIPE, &sa, NULL);     /* I don't care about SIGPIPE */

sa.sa_handler = SIG_DFL;
sigemptyset (&sa.sa_mask);
sa.sa_flags = 0;
sigaction (SIGCHLD, &sa, NULL);     /* I want SIGCHLD to be handled by SIG_DFL */

pthread_sigmask (SIG_UNBLOCK, &set, NULL);
  1. Where possible set all your signal handlers and masks etc. prior to any pthread operations. Where possible do not change signal handlers and masks (you might need to do this prior to and subsequent to fork() calls).

  2. If you need to a signal handler for SIGCHLD (rather than relying on SIG_DFL), if possible let it be received by any thread, and use the self-pipe method or similar to alert the main program.

  3. If you must have threads that do/don't handle certain signals, try to restrict yourself to pthread_sigmask in the relevant thread rather than sig* calls.

  4. Just in case you run headlong into the next issue I ran into, ensure that after you have fork()'d, you set up again the signal handling from scratch (in the child) rather than relying on whatever you might inherit from the the parent process. If there's one thing worse than signals mixed with pthread, it's signals mixed with pthread with fork().

Note I cannot explain exactly entirely why change (1) works, but it has fixed what looks like a very similar issue for me and was after all relying on something that was 'unspecified' previously. It's closest to your 'hypothesis 2' but I think it is really incomplete emulation of legacy signal functions (specifically emulating the previously racy behaviour of signal() which is what caused it to be replaced by sigaction() in the first place - but this is just a guess).

Incidentally, I suggest you use wait4() or (as you aren't using rusage) waitpid() rather than wait3(), so you can specify a specific PID to wait for. If you have something else that generates children (I've had a library do it), you may end up waiting for the wrong thing. That said, I don't think that's what's happening here.

Thyratron answered 8/12, 2015 at 11:42 Comment(5)
Not germane, I think. Both the OP and a grep of the sources suggest that SIGCHLD's disposition is never set in the code. Neither is it masked (not that that would matter here). The symptoms, further, suggest that SIG_IGN could not be inherited, since wait3 works normally at least once.Galenism
I would still suggest setting it up properly with sigaction. It working the first time doesn't surprise me; it worked unreliably for me, and some semantics for an emulated signal require the signal handling be reset in the handler. If it has no effect then it's still the right way to set up signal handling.Thyratron
Wise advice but again not applicable here. It isn't "set up" at all — there are no calls to signal(SIGCHLD, ...) nor sigaction(SIGCHLD,...), and we can infer that the process starts with SIGCHLD set to SIG_DFL.Galenism
Right, but there is still a default state of SIGCHLD handling; that might be 'as handled by signal()'. In my case life was more complicated (linking with libxl which does its own forks and signal handling if you want the gory details), but I can only say that the above incantation fixed things. Given explicitly setting signal handling is not a bad idea anyway (particularly given the default for SIGCHLD is somewhat opaque), I think this is worth OP trying. It might be nothing to do with what I've suggested, in which case the OP merely gets a cleaner signal setup.Thyratron
We might be headed into chat territory here, but the default SIGCHLD disposition is well-specified, even on those older systems where signal() behaved like SA_RESETHAND. This answer is a fine technical exposition, but it's not an answer to the question asked.Galenism

© 2022 - 2024 — McMap. All rights reserved.