Programmatically check for zombie child process in Linux using C

I have written a simple C program in RedHat Linux which waits for a child process using waitpid after calling execv.

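#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main( int argc, char * argv[] )
{
    pid_t pid;
    pid_t wait_ret;
    int status = 0;
    const char * process_path;

    if ( argc < 2 ) //check argc before reading argv[1]
    {
        exit( EXIT_FAILURE );
    }

    process_path = argv[1];

    pid = fork(); //spawn child process

    if ( -1 == pid ) //fork failed, no child was created
    {
        printf( "ERROR: fork failed: %s\n", strerror( errno ) );
        exit( EXIT_FAILURE );
    }

    if ( 0 == pid ) //child
    {
        execv( process_path, &argv[1] );

        //execv only returns on failure
        printf( "execv failed: %s\n", strerror( errno ) );
        exit( EXIT_FAILURE );
    }

    //wait for the child to terminate
    wait_ret = waitpid( pid, &status, WUNTRACED );

    if ( -1 == wait_ret )
    {
        printf( "ERROR: Failed to wait for process termination\n" );
        exit( EXIT_FAILURE );
    }

    // ... handlers for child exit status ...

    return 0;
}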

I am using this as a simple watchdog for some processes I am running.

My problem is that one process in particular is not being reaped by waitpid upon exiting and instead remains forever in a Zombie state while waitpid is hung. I am not sure why waitpid is unable to reap this process once it becomes a Zombie (maybe a leaked file descriptor or something).

I could use the WNOHANG flag and poll the child's stat proc file to check for the Zombie state but I would prefer a more elegant solution. Maybe there is some function that I could use to get the Zombie status from without polling this file?
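
For reference, the polling fallback described above could look roughly like the sketch below. It is only an illustration: the function names stat_line_state and is_zombie are made up for this example, and it assumes the state letter is the field that follows the parenthesised command name in /proc/<PID>/stat, as in the dumps further down.

```c
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/* Extract the state character from the text of a /proc/<pid>/stat line.
 * The command name is wrapped in parentheses and may itself contain spaces
 * or parentheses, so look for the LAST ')' instead of splitting on spaces.
 * Returns the state character, or '\0' if the line is malformed. */
char stat_line_state(const char *line)
{
    const char *close_paren = strrchr(line, ')');

    if (close_paren == NULL || close_paren[1] != ' ')
        return '\0';
    return close_paren[2]; /* '\0' if the line ends right after ") " */
}

/* Hypothetical helper: returns 1 if pid is a zombie, 0 if it is in any
 * other state, and -1 if the stat file cannot be read or parsed. */
int is_zombie(pid_t pid)
{
    char path[64];
    char line[1024];
    char state;
    FILE *fp;

    snprintf(path, sizeof path, "/proc/%ld/stat", (long)pid);
    fp = fopen(path, "r");
    if (fp == NULL)
        return -1;
    if (fgets(line, sizeof line, fp) == NULL) {
        fclose(fp);
        return -1;
    }
    fclose(fp);

    state = stat_line_state(line);
    if (state == '\0')
        return -1;
    return state == 'Z';
}
```

The watchdog would call is_zombie(pid) on a timer, which is exactly the polling I would like to avoid.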

Does anyone know an alternative to waitpid which WILL return when the process becomes a Zombie?

Additional Information:

The child process is being closed by a call to exit( EXIT_FAILURE); in one of its threads.

cat /proc/<CHILD_PID>/stat (before exit):

1037 (my_program) S 1035 58 58 0 -1 4194560 1309 0 22 0 445 1749 0 0 20 0 13 0 4399 22347776 1136 4294967295 3336716288 3338455332 3472776112 3472775232 3335760920 0 0 4 31850 4294967295 0 0 17 0 0 0 26 0 0 3338489412 3338507560 3338600448

cat /proc/<CHILD_PID>/stat (after exit):

1037 (my_program) Z 1035 58 58 0 -1 4227340 1316 0 22 0 464 1834 0 0 20 0 2 0 4399 0 0 4294967295 0 0 0 0 0 0 0 4 31850 4294967295 0 0 17 0 0 0 26 0 0 0 0 0

Note that the child PID is 1037 and the parent PID is 1035 in this case.

Heptagonal answered 25/5, 2018 at 19:7 Comment(14)
@HaukeLaging Last I checked there was not a C specific Linux stack exchange. Do you want me to ask in Stack Overflow, Software Engineering, or Code Review? This is a Linux specific question not a programming question.Heptagonal
What happens if the child exits before the parent has a chance to execute waitpid()?Urina
@Urina From 'man waitpid': "If a child has already changed state, then these calls return immediately. Otherwise, they block until either a child changes state or a signal handler interrupts the call". That said, I am triggering the exit so it has been a long time since waitpid was called.Heptagonal
This is a question for Stack Overflow. There is no problem with a question being Linux-specific on SO. Questions which are only relevant for C programmers are off-topic here. We'll see if a majority decides to move the question there.Dwyer
@HaukeLaging I suppose one could argue that all questions here could belong either on Stack Overflow or Super User. Since waitpid is a Linux specific operation, and my question is related only to Linux process behavior, I thought it better to ask here.Heptagonal
It’s not clear how much this is a programming question and how much it is a Unix question. Either way, the obvious response is “What research have you done, what have you done to diagnose this problem, and what have you learned?” Does this always happen with this one program? Have you tried (e.g., by code analysis and/or strace) to see what it is doing to get into this persistent zombie state? Have you examined its /proc data; have you run lsof on it? If you kill your watchdog process, what happens to the zombie? … If it remains a zombie after its parent is gone, this is a Unix question.Teleutospore
This just occurred to me: is it possible that the problematic process is changing its UID, GID, process group ID, session ID, or anything like that? Also, are you sure that it’s dead and not just stopped?Teleutospore
@NathanOwen: Yes, this is an ongoing fight. Our Help Center says that questions about the “UNIX C API and System Interfaces” are on-topic (within reason). Your question is clearly not a C programming question, it’s a Unix system interface question.  And yet some people want to throw questions like this, which clearly require Unix knowledge to solve, over to a community of programmers who might know little to nothing about Unix process management.Teleutospore
@G-Man I have added a bit of additional information to the question. Sadly there is no strace or lsof on the embedded RedHat system I am working on. However, when I do an ls -al on the /proc/PID/fd/ directory there is nothing listed after the process exits (there are many fds listed before it exits).Heptagonal
@G-Man as for how much research I have done, I always do a few hours of research before asking a question on here. I have already tried adding O_CLOEXEC flags to all file descriptor opens so everything should be getting cleaned up. In any case the reason for the child process entering the zombie state is less of an issue to me at the moment. My job is to ensure the watchdog process catches this state (and any other crash/exit conditions). The process in question should remain running forever. I am wondering if waitid might be able to catch this state somehow.Heptagonal
Thanks for the update. (1) Perhaps I wasn’t clear: I don’t really care why the child enters a zombie state (i.e., exits while its parent isn’t waiting for it); I care why it enters a persistent zombie state (where waitpid doesn’t work). Unfortunately nothing is jumping out at me as a reason or explanation. (2) I’m just realizing that I may have partially misread the question. Are you asking how to reap this zombie, or how to detect that the process has become a zombie? (3) Try catching SIGCHLD. Good luck.Teleutospore
Can you post an mcve for this, with a common Unix executable?Vittle
@G-Man In this case I would like to know how to detect that it has become a Zombie. I have already implemented a 1 second poll that inspects the stat file for the Zombie flag as a temporary fix. However, I would like a way to catch this with signals as polling is undesirable. Obviously figuring out what might cause this persistent zombie is also of interest to me but that task will take longer and will likely not be answered here. And, thanks! I'll need it...Heptagonal
@PSkocik sadly the program that is becoming a persistent Zombie is proprietary non-releasable (not to mention verrrry large and non-portable). To create a small example I would likely have to know why it is becoming a persistent zombie in the first place.Heptagonal

Any process that terminates becomes a zombie until it is collected by a wait call. Here the wait does not seem to happen in all cases.

From the code given I can't figure out why the wait does not happen and the process remains a zombie. (not without running it anyway)

But instead of waiting on a specific pid only, you can wait on any child by passing -1 as the first argument to waitpid. Don't use WNOHANG, as it requires busy polling (don't do that).

You may also want to drop WUNTRACED unless you have a specific reason to include it. There is no harm in dropping it to see what difference it makes.
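
A minimal sketch of that suggestion follows, assuming the watchdog only ever has the one child. The function name run_and_wait_any and its return convention (exit code, 128 plus the signal number, or -1 on error) are invented for this example and are not from the question.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork, exec argv[0], then block in waitpid(-1, ..., 0) until a child
 * terminates. Returns the child's exit code, 128 + signal number if it
 * was killed by a signal, or -1 on error. */
int run_and_wait_any(char *const argv[])
{
    int status = 0;
    pid_t reaped;
    pid_t child = fork();

    if (child == -1)
        return -1;

    if (child == 0) {
        execv(argv[0], argv);
        _exit(127); /* only reached if execv failed */
    }

    do {
        /* -1: any child; 0: no WNOHANG and no WUNTRACED */
        reaped = waitpid(-1, &status, 0);
    } while (reaped == -1 && errno == EINTR); /* retry if a signal interrupted us */

    if (reaped == -1)
        return -1;
    if (WIFEXITED(status))
        return WEXITSTATUS(status);
    if (WIFSIGNALED(status))
        return 128 + WTERMSIG(status);
    return -1;
}
```

With more than one child, the pid returned by waitpid would have to be compared against the pid being watched, since -1 matches whichever child terminates first.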

Apheliotropic answered 24/12, 2020 at 13:36 Comment(1)
Hi koder, thank you for your answer, however this post about zombie process is two years old and a bit of a zombie post at this point. I no longer work at the company where I encountered this issue or have access to any of the hardware or operating system builds I would need to test further. Currently I use init services like systemd or runit to monitor processes and handle unexpected process termination.Heptagonal

"My problem is that one process in particular is not being reaped by waitpid upon exiting and instead remains forever in a Zombie state while waitpid is hung"

If I understand correctly, you don't want the child to become a zombie. In that case, use the SA_NOCLDWAIT flag. From the manual page of sigaction():

SA_NOCLDWAIT (since Linux 2.6) If signum is SIGCHLD, do not transform children into zombies when they terminate. See also waitpid(2). This flag is meaningful only when establishing a handler for SIGCHLD, or when setting that signal's disposition to SIG_DFL.

              If the SA_NOCLDWAIT flag is set when establishing a  handler
              for SIGCHLD, POSIX.1 leaves it unspecified whether a SIGCHLD
              signal is generated when a  child  process  terminates.   On
              Linux,  a  SIGCHLD signal is generated in this case; on some
              other implementations, it is not.

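The idea is this: when the child process finishes first, the parent receives SIGCHLD (signal number 17 on most Linux architectures) and the child becomes a zombie, because the parent is still running and has not yet waited for it. To have the child removed as soon as it terminates instead of lingering as a zombie, set the SA_NOCLDWAIT flag.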

Here is some sample code:

#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

void my_isr(int n) {
        (void)n; /* signal number is unused */
        /* error handling */
}
int main(void) {
        if(fork()==0) { /* child process */
                printf("In child process ..c_pid: %d and p_pid : %d\n",(int)getpid(),(int)getppid());
                sleep(5);
                printf("sleep over .. now exiting \n");
        }
        else { /*parent process */
                struct sigaction v;
                memset(&v,0,sizeof v); /* leave no field uninitialised */
                v.sa_handler=my_isr;/* SET THE HANDLER TO ISR */
                v.sa_flags=SA_NOCLDWAIT; /* it will not let child to become zombie */
                sigemptyset(&v.sa_mask);
                sigaction(SIGCHLD,&v,NULL);/* when parent receives SIGCHLD, IT GETS CALLED */
                while(1) pause(); /* for observation purpose, keeps parent alive without spinning */
        }
        return 0;
}

Comment or uncomment the v.sa_flags=SA_NOCLDWAIT; line and compare the behavior: run a.out in one terminal and check ps -el | grep pts/0 in another terminal.

"Does anyone know an alternative to waitpid which WILL return when the process becomes a Zombie?"

Use WNOHANG, as you mentioned; see the manual page of waitpid(). On the WUNTRACED flag that your code passes, the same page says:

WUNTRACED also return if a child has stopped (but not traced via ptrace(2)). Status for traced children which have stopped is provided even if this option is not specified.
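
As a rough sketch of the WNOHANG approach (the helper names spawn_child and poll_child are made up for this example), the check can be wrapped in a function that never blocks. Whether it would notice the problem process from the question is not established here, since the asker reports that waitpid does not reap it.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork and exec argv[0].
 * Returns the child's pid in the parent, or -1 if fork failed. */
pid_t spawn_child(char *const argv[])
{
    pid_t child = fork();

    if (child == 0) {
        execv(argv[0], argv);
        _exit(127); /* only reached if execv failed */
    }
    return child;
}

/* Non-blocking check using WNOHANG.
 * Returns 1 if the child has terminated (it is reaped and *status is
 * filled in), 0 if it has not changed state yet, and -1 on error. */
int poll_child(pid_t child, int *status)
{
    pid_t r = waitpid(child, status, WNOHANG);

    if (r == 0)
        return 0;
    return (r == child) ? 1 : -1;
}
```

The caller has to invoke poll_child repeatedly, for example once per second, so this trades the blocking wait for periodic polling.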

Waterlog answered 2/6, 2018 at 7:54 Comment(3)
Using sigaction with the SA_NOCLDWAIT did not solve the problem. The child is still ending up in the zombie state. Indeed the ISR is never called indicating a SIGCHLD interrupt is never received. It does work on other programs though, just not the problem one.Heptagonal
Have you checked the ps output? Is the child really still ending up in the zombie state? No, you didn't run my code and analyze it.Waterlog
Yes, I ran ps, how else would I know the process was in a zombie state? Yes I did run your code, it did not work for the problem process as I stated. It was good code and did work for normal well-behaved programs. As stated, the ISR is never called. I suspect the child process is at fault somehow but have not yet found the root cause (few hundred thousand lines of code to analyze).Heptagonal
