The context is this Redis issue. We have a wait3() call that waits for the AOF rewriting child to create the new AOF version on disk. When the child is done, the parent is notified via wait3() in order to substitute the old AOF with the new one.
However, in the context of the above issue, the user notified us about a bug. I slightly modified the implementation of Redis 3.0 so that it clearly logs when wait3() returns -1, instead of crashing because of this unexpected condition. So this is apparently what happens:
- wait3() is called when we have pending children to wait for.
- SIGCHLD should be set to SIG_DFL: there is no code setting this signal at all in Redis, so it's the default behavior.
- When the first AOF rewrite happens, wait3() works as expected.
- Starting from the second AOF rewrite (the second child created), wait3() starts to return -1.
AFAIK it is not possible in the current code that we call wait3() while there are no pending children, since when the AOF child is created we set server.aof_child_pid to the value of the pid, and we reset it only after a successful wait3() call.
So wait3() should have no reason to fail with -1 and ECHILD, but it does, so probably the zombie child is not created for some unexpected reason.
Hypothesis 1: Is it possible that Linux, under certain odd conditions, discards the zombie child, for example because of memory pressure? This does not look reasonable, since the zombie has just metadata attached to it, but who knows.
Note that we call wait3() with WNOHANG. Given that SIGCHLD is set to SIG_DFL by default, the only condition that should lead to failing and returning -1 with ECHILD is that no zombie is available to report the information.
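To make the expected behavior concrete, here is a minimal sketch (not Redis code) of the failure case described above: once the only child has been reaped, a non-blocking wait3() has nothing to report and fails with ECHILD. The helper names poll_children and errno_after_last_child_reaped are hypothetical, and the sketch assumes the calling process has no other children.

```c
#define _DEFAULT_SOURCE
#include <errno.h>
#include <signal.h>
#include <stddef.h>
#include <sys/resource.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: poll once for any terminated child without blocking.
 * Returns the reaped pid, 0 if children exist but none has exited yet,
 * or -1 with errno set (ECHILD when there is nothing to wait for). */
pid_t poll_children(int *status) {
    return wait3(status, WNOHANG, NULL);
}

/* Hypothetical demo: fork one child, reap it, then poll again.
 * Returns the errno left by the second poll, 0 if that poll did not fail,
 * or -1 if the setup itself failed.
 * Assumption: the calling process has no other children. */
int errno_after_last_child_reaped(void) {
    int status = 0;
    pid_t child;

    /* Force the default disposition so a terminated child becomes a zombie. */
    if (signal(SIGCHLD, SIG_DFL) == SIG_ERR) return -1;

    child = fork();
    if (child == -1) return -1;
    if (child == 0) _exit(0);   /* child: exit immediately */

    /* Blocking wait for that specific child: this consumes its zombie. */
    if (waitpid(child, &status, 0) != child) return -1;

    /* No children are left, so the non-blocking poll is expected to fail. */
    errno = 0;
    if (poll_children(&status) != -1) return 0;
    return errno;
}
```

In the scenario from the question the surprising part is that the same ECHILD shows up while the parent still believes an AOF child is pending.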
Hypothesis 2: Another thing that could happen, although I have no explanation for why it would, is that after the first child dies the SIGCHLD handler is set to SIG_IGN, causing wait3() to return -1 and ECHILD.
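Hypothesis 2 can be checked in isolation. The sketch below (not Redis code; both function names are hypothetical) first reads back the current SIGCHLD disposition with sigaction(), then shows the effect being described: with SIGCHLD explicitly ignored, Linux and other POSIX.1-2001 conforming systems do not keep a zombie for the child, so a wait3() call fails with ECHILD once the child is gone.

```c
#define _DEFAULT_SOURCE
#include <errno.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>
#include <sys/resource.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical diagnostic: returns 1 if SIGCHLD is currently ignored,
 * 0 if it is not, and -1 if the disposition could not be read. */
int sigchld_is_ignored(void) {
    struct sigaction sa;
    if (sigaction(SIGCHLD, NULL, &sa) == -1) return -1;
    return sa.sa_handler == SIG_IGN;
}

/* Hypothetical demo: fork a short-lived child while SIGCHLD is ignored and
 * wait for it. Returns the errno left by wait3(), 0 if wait3() reaped the
 * child after all, or -1 if the setup failed. The previous disposition is
 * restored before returning. */
int wait3_errno_with_sigchld_ignored(void) {
    struct sigaction ign, old;
    int status = 0, result;
    pid_t child, reaped;

    memset(&ign, 0, sizeof(ign));
    ign.sa_handler = SIG_IGN;
    sigemptyset(&ign.sa_mask);
    if (sigaction(SIGCHLD, &ign, &old) == -1) return -1;

    child = fork();
    if (child == -1) {
        sigaction(SIGCHLD, &old, NULL);
        return -1;
    }
    if (child == 0) _exit(0);   /* child: exit immediately */

    /* Blocking wait: with SIGCHLD ignored the call waits until every child
     * has terminated and then fails, because no zombie was kept. */
    errno = 0;
    reaped = wait3(&status, 0, NULL);
    result = (reaped == -1) ? errno : 0;

    sigaction(SIGCHLD, &old, NULL);
    return result;
}
```

Logging the result of a check like sigchld_is_ignored() next to the wait3() error would show whether the disposition really changed between the first and the second rewrite.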
Hypothesis 3: Is there some way to remove the zombie children externally? Maybe this user has some kind of script that removes zombie processes in the background, so that the information is no longer available for wait3()? To my knowledge it should never be possible to remove the zombie if the parent does not wait for it (with waitpid or by handling the signal) and if SIGCHLD is not ignored, but maybe there is some Linux-specific way.
Hypothesis 4: There is actually some bug in the Redis code, so that we successfully wait3() the child the first time without correctly resetting the state, and later we call wait3() again and again but there are no longer zombies, so it returns -1. Analyzing the code it looks impossible, but maybe I'm wrong.
Another important thing: we never observed this in the past. It apparently only happens on this specific Linux system.
UPDATE: Yossi Gottlieb proposed that SIGCHLD is received by another thread in the Redis process for some reason (this does not happen normally, only on this system). We already mask SIGALRM in bio.c threads; perhaps we could try masking SIGCHLD from I/O threads as well.
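A minimal sketch of what masking SIGCHLD in a background thread could look like, using pthread_sigmask(). This is not the Redis implementation: the function names are hypothetical, and whether such masking has any effect on the wait3() failure is not established here.

```c
#define _DEFAULT_SOURCE
#include <pthread.h>
#include <signal.h>
#include <stddef.h>

/* Hypothetical background-thread entry point: blocks SIGCHLD for this thread
 * only, then reads the thread's mask back and stores 1 in *arg if SIGCHLD is
 * blocked, 0 if it is not, or -1 on error. */
void *worker_blocking_sigchld(void *arg) {
    int *blocked = arg;
    sigset_t set, current;

    sigemptyset(&set);
    sigaddset(&set, SIGCHLD);
    if (pthread_sigmask(SIG_BLOCK, &set, NULL) != 0) {
        *blocked = -1;
        return NULL;
    }

    /* A NULL "set" argument leaves the mask unchanged and only reports it. */
    if (pthread_sigmask(SIG_BLOCK, NULL, &current) != 0) {
        *blocked = -1;
        return NULL;
    }
    *blocked = sigismember(&current, SIGCHLD);
    return NULL;
}

/* Hypothetical driver: starts the worker, joins it, and returns what the
 * worker observed about its own signal mask. */
int run_worker_with_sigchld_blocked(void) {
    pthread_t tid;
    int blocked = -1;

    if (pthread_create(&tid, NULL, worker_blocking_sigchld, &blocked) != 0)
        return -1;
    if (pthread_join(tid, NULL) != 0)
        return -1;
    return blocked;
}
```

The mask set this way applies only to the thread that calls pthread_sigmask(), so the main thread keeps receiving SIGCHLD as before.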
Appendix: selected parts of Redis code
Where wait3() is called:
/* Check if a background saving or AOF rewrite in progress terminated. */
if (server.rdb_child_pid != -1 || server.aof_child_pid != -1) {
    int statloc;
    pid_t pid;

    if ((pid = wait3(&statloc,WNOHANG,NULL)) != 0) {
        int exitcode = WEXITSTATUS(statloc);
        int bysignal = 0;

        if (WIFSIGNALED(statloc)) bysignal = WTERMSIG(statloc);

        if (pid == -1) {
            redisLog(LOG_WARNING,"wait3() returned an error: %s. "
                "rdb_child_pid = %d, aof_child_pid = %d",
                strerror(errno),
                (int) server.rdb_child_pid,
                (int) server.aof_child_pid);
        } else if (pid == server.rdb_child_pid) {
            backgroundSaveDoneHandler(exitcode,bysignal);
        } else if (pid == server.aof_child_pid) {
            backgroundRewriteDoneHandler(exitcode,bysignal);
        } else {
            redisLog(REDIS_WARNING,
                "Warning, detected child with unmatched pid: %ld",
                (long)pid);
        }
        updateDictResizePolicy();
    }
} else {
Selected parts of backgroundRewriteDoneHandler:
void backgroundRewriteDoneHandler(int exitcode, int bysignal) {
    if (!bysignal && exitcode == 0) {
        int newfd, oldfd;
        char tmpfile[256];
        long long now = ustime();
        mstime_t latency;

        redisLog(REDIS_NOTICE,
            "Background AOF rewrite terminated with success");

        ... more code to handle the rewrite, never calls return ...

    } else if (!bysignal && exitcode != 0) {
        server.aof_lastbgrewrite_status = REDIS_ERR;

        redisLog(REDIS_WARNING,
            "Background AOF rewrite terminated with error");
    } else {
        server.aof_lastbgrewrite_status = REDIS_ERR;

        redisLog(REDIS_WARNING,
            "Background AOF rewrite terminated by signal %d", bysignal);
    }

cleanup:
    aofClosePipes();
    aofRewriteBufferReset();
    aofRemoveTempFile(server.aof_child_pid);
    server.aof_child_pid = -1;
    server.aof_rewrite_time_last = time(NULL)-server.aof_rewrite_time_start;
    server.aof_rewrite_time_start = -1;
    /* Schedule a new rewrite if we are waiting for it to switch the AOF ON. */
    if (server.aof_state == REDIS_AOF_WAIT_REWRITE)
        server.aof_rewrite_scheduled = 1;
}
As you can see, all the code paths must execute the cleanup code that resets server.aof_child_pid to -1.
Errors logged by Redis during the issue
21353:C 29 Nov 04:00:29.957 * AOF rewrite: 8 MB of memory used by copy-on-write
27848:M 29 Nov 04:00:30.133 ^@ wait3() returned an error: No child processes. rdb_child_pid = -1, aof_child_pid = 21353
As you can see, aof_child_pid is not -1.
[...] wait*()? I'd say you are facing a race. – Undersell
[...] wait3() the child did not even start yet (completely)? – Undersell
[...] pid==-1 / ECHILD the same as pid==0 and log whether wait3() finds the process in question later. – Undersell
[...] signal(SIGCHLD, SIG_DFL);? – Undersell
[...] man 7 signal the default for SIGCHLD is to be ignored. – Undersell
[...] signal() by sigaction(). – Undersell
[...] SIG_DFL) after the first handling of a signal. So it's possible for hypothesis 2 to happen. Simply replace the signal() call with sigaction() (which doesn't reset to SIG_DFL) to see if this is true. – Arietta
[...] wait4() than wait3(). I doubt that will fix the issue you asked about, but it may save you grief later. – Cotenant
[...] rdb_child_pid / server.aof_child_pid in this case? – Posture
[...] stopAppendOnly in aof.c also kills and wait3s, specifically for the process in question. Can you be sure some other code hasn't already won the race to reap the child? It may be a red herring, but in your bug report the errors all happen very near five minute boundaries; perhaps cron or some other timed maintenance job is causing the unexpected reap? – Galenism
[...] strace(1), make sure to use -e so that you only see the calls to wait3(2). This will tell you if the code is actually trying to reap the same child more than once or if it's something external that is wiping the zombie (misconfigured or buggy kernel, or other weird scripts running on that user's machine). – Planometer