Zombie process in python multiprocessing daemon

After researching Python daemons, this walkthrough seemed to be the most robust: http://www.jejik.com/articles/2007/02/a_simple_unix_linux_daemon_in_python/

Now I am trying to implement a pool of workers inside the daemon class. I believe it is working (I have not thoroughly tested the code), except that on shutdown I am left with a zombie process. I have read that I need to wait for the return code from the child, but I cannot yet see exactly how to do that.

Here are some code snippets:

def stop(self):
    ...
    try:
        while 1:
            self.pool.close()
            self.pool.join()
            os.kill(pid, SIGTERM)
            time.sleep(0.1)
    ...

Here I have tried os.killpg and a number of the os.wait methods, with no improvement. I have also tried closing/joining the pool both before and after the os.kill. This loop, as it stands, never ends, and as soon as it hits the os.kill I get a zombie process. self.pool = Pool(processes=4) occurs in the __init__ section of the daemon. From run(self), which is executed after start(self), I will call self.pool.apply_async(self.runCmd, [cmd, 10], callback=self.logOutput). However, I wanted to address this zombie process before looking into that.
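For reference, here is a stripped-down, self-contained sketch of the pool lifecycle I am describing. runCmd and logOutput are reduced to placeholders and the daemonizing machinery is omitted; the point is that close() followed by join() is what actually reaps the worker processes, so run in the right process it should leave no zombies:

```python
import multiprocessing

def run_cmd(cmd, timeout):
    # placeholder for the real runCmd worker
    return "ran " + cmd

class PoolOwner(object):
    """Stripped-down sketch of the pool lifecycle in the daemon."""

    def __init__(self):
        self.results = []
        self.pool = multiprocessing.Pool(processes=4)

    def log_output(self, result):
        # placeholder for logOutput; callbacks run in a thread
        # of the parent process
        self.results.append(result)

    def run(self, cmds):
        for cmd in cmds:
            self.pool.apply_async(run_cmd, [cmd, 10],
                                  callback=self.log_output)

    def stop(self):
        self.pool.close()   # no new tasks; queued tasks still finish
        self.pool.join()    # waits for the workers and reaps them

if __name__ == "__main__":
    owner = PoolOwner()
    owner.run(["one", "two"])
    owner.stop()
    print(sorted(owner.results))   # prints ['ran one', 'ran two']
```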

How can I properly implement the pool inside the daemon to avoid this zombie process?

Bernhard answered 21/6, 2011 at 16:35 Comment(2)
Do you have a handler in your daemon program to handle SIGCHLD? – Caliber
No, I do not have such a handler. The only handler I have is for a timeout I use in the runCmd() function, installed with signal.signal(signal.SIGALRM, self.handler); that handler throws a custom exception saying the command has exceeded its allocated execution time. Why would I need a SIGCHLD handler? I thought multiprocessing took care of that in pool.close and pool.join. Frankly, I don't know where the zombie process is coming from, as I have not called apply_async and so have no workers or callback threads. – Bernhard

It is not possible to be 100% confident in an answer without knowing what is going on in the child/daemon process, but consider whether this could be it. Since you have worker threads in your child process, you actually need to build in some logic to join all of those threads once you receive SIGTERM. Otherwise your process may not exit (and even if it does, it may not exit gracefully). To do this you need to:

  • write a signal handler to be used in the child/daemon process that captures the SIGTERM signal and triggers an event for your main thread
  • install the signal handler in the main thread (very important) of the child/daemon process
  • the event handler for SIGTERM must issue stop instructions to ALL threads in the child/daemon process
  • all threads must be join()ed when they are done (if you were assuming that the SIGTERM would automatically destroy everything you may have to implement this logic too)
  • once everything is joined and cleaned up you can exit the main thread
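A minimal sketch of the steps above (the worker body, thread count, and timeout values are placeholders):

```python
import signal
import threading

stop_event = threading.Event()

def handle_sigterm(signum, frame):
    # keep the handler tiny: just flag the event and let the
    # main thread do the actual shutdown work
    stop_event.set()

def worker():
    # placeholder worker: do a bounded unit of work, then re-check
    while not stop_event.is_set():
        stop_event.wait(0.1)

def main():
    # the handler must be installed from the main thread
    signal.signal(signal.SIGTERM, handle_sigterm)
    threads = [threading.Thread(target=worker) for _ in range(4)]
    for t in threads:
        t.start()
    stop_event.wait()      # block until SIGTERM fires
    for t in threads:
        t.join()           # every thread joined; now it is safe to exit
```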

If you have threads for I/O and all kinds of things then this will be a real chore.

Also, I have found through experiment that the particular strategy for your event listener matters when you are using signal handlers. For example, if you use select.select() you must use a time-out and retry if the time-out occurs; otherwise your signal handler will not run. Likewise, if your event listener blocks on a Queue.Queue object's .get() method, you must use a timeout, or your signal handler will not run. (The "real" signal handler, implemented in C inside the interpreter, does run, but your Python-level signal handler does not get a chance to execute unless you use timeouts.)
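Concretely, the queue case looks roughly like this (using the Python 3 spelling `queue` for the Queue module; the shutdown flag and the event payloads are placeholders):

```python
import queue      # named Queue in the Python 2 of the original post
import signal
import threading

shutdown = threading.Event()

def handle_sigterm(signum, frame):
    shutdown.set()

def next_event(events):
    """Return the next event, or None once shutdown is flagged."""
    while not shutdown.is_set():
        try:
            # the timeout is essential: a bare .get() blocks inside
            # C code, and the Python-level SIGTERM handler will not
            # run until an item happens to arrive
            return events.get(timeout=1.0)
        except queue.Empty:
            continue
    return None

def main():
    # the handler must be installed from the main thread
    signal.signal(signal.SIGTERM, handle_sigterm)
    events = queue.Queue()
    while True:
        event = next_event(events)
        if event is None:
            break     # SIGTERM received; fall through to cleanup
```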

Good luck!

Stab answered 21/6, 2011 at 20:26 Comment(1)
I looked at it in a bit more detail and I do not believe this is the issue. However, I realized that this post was extremely lacking in description of the problem. If you're still interested in the question, I have re-asked it with much greater detail here: #6517008 – Bernhard

© 2022 - 2024 — McMap. All rights reserved.