Zombie process in python multiprocessing daemon

After researching Python daemons, this walkthrough seemed to be the most robust: http://www.jejik.com/articles/2007/02/a_simple_unix_linux_daemon_in_python/

Now I am trying to implement a pool of workers inside the daemon class. I believe it is working (I have not thoroughly tested the code), except that on shutdown I am left with a zombie process. I have read that I need to wait for the return code from the child, but I cannot yet see exactly how to do that.

Here are some code snippets:

def stop(self):
    ...
    try:
        while 1:
            self.pool.close()
            self.pool.join()
            os.kill(pid, SIGTERM)
            time.sleep(0.1)
    ...

Here I have tried os.killpg and a number of the os.wait methods, with no improvement. I have also tried closing/joining the pool both before and after the os.kill. This loop, as it stands, never ends, and as soon as it hits the os.kill I get a zombie process. self.pool = Pool(processes=4) occurs in the __init__ section of the daemon. From run(self), which is executed after start(self), I will call self.pool.apply_async(self.runCmd, [cmd, 10], callback=self.logOutput). However, I wanted to address this zombie process before looking into that.
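For reference, here is a stripped-down, self-contained sketch of the pool lifecycle I am describing. runCmd and logOutput are reduced to placeholders and the daemonizing machinery is omitted; the point is that close() followed by join() is what actually reaps the worker processes, so run in the right process it should leave no zombies:

```python
import multiprocessing

def run_cmd(cmd, timeout):
    # placeholder for the real runCmd worker
    return "ran " + cmd

class PoolOwner(object):
    """Stripped-down sketch of the pool lifecycle in the daemon."""

    def __init__(self):
        self.results = []
        self.pool = multiprocessing.Pool(processes=4)

    def log_output(self, result):
        # placeholder for logOutput; callbacks run in a thread
        # of the parent process
        self.results.append(result)

    def run(self, cmds):
        for cmd in cmds:
            self.pool.apply_async(run_cmd, [cmd, 10],
                                  callback=self.log_output)

    def stop(self):
        self.pool.close()   # no new tasks; queued tasks still finish
        self.pool.join()    # waits for the workers and reaps them

if __name__ == "__main__":
    owner = PoolOwner()
    owner.run(["one", "two"])
    owner.stop()
    print(sorted(owner.results))   # prints ['ran one', 'ran two']
```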

How can I properly implement the pool inside the daemon to avoid this zombie process?

Bernhard answered 21/6, 2011 at 16:35 Comment(2)
Do you have a handler in your daemon program to handle SIGCHLD? – Caliber
No, I do not have such a handler. The only handler I have is for a timeout I use in the runCmd() function, installed with signal.signal(signal.SIGALRM, self.handler); that handler throws a custom exception saying the command has exceeded its allocated execution time. Why would I need a SIGCHLD handler? I thought multiprocessing took care of that in pool.close and pool.join. Frankly, I don't know where the zombie process is coming from, as I have not called apply_async and so have no workers or callback threads. – Bernhard

It is not possible to be 100% confident in an answer without knowing what is going on in the child/daemon process, but consider whether this could be it. Since you have worker threads in your child process, you actually need to build in some logic to join all of those threads once you receive SIGTERM. Otherwise your process may not exit (and even if it does, it may not exit gracefully). To do this you need to:

  • write a signal handler to be used in the child/daemon process that captures the SIGTERM signal and triggers an event for your main thread
  • install the signal handler in the main thread (very important) of the child/daemon process
  • the event handler for SIGTERM must issue stop instructions to ALL threads in the child/daemon process
  • all threads must be join()ed when they are done (if you were assuming that the SIGTERM would automatically destroy everything you may have to implement this logic too)
  • once everything is joined and cleaned up you can exit the main thread
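A minimal sketch of the steps above (the worker body, thread count, and timeout values are placeholders):

```python
import signal
import threading

stop_event = threading.Event()

def handle_sigterm(signum, frame):
    # keep the handler tiny: just flag the event and let the
    # main thread do the actual shutdown work
    stop_event.set()

def worker():
    # placeholder worker: do a bounded unit of work, then re-check
    while not stop_event.is_set():
        stop_event.wait(0.1)

def main():
    # the handler must be installed from the main thread
    signal.signal(signal.SIGTERM, handle_sigterm)
    threads = [threading.Thread(target=worker) for _ in range(4)]
    for t in threads:
        t.start()
    stop_event.wait()      # block until SIGTERM fires
    for t in threads:
        t.join()           # every thread joined; now it is safe to exit
```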

If you have threads for I/O and all kinds of things then this will be a real chore.

Also, I have found through experiment that the particular strategy for your event listener matters when you are using signal handlers. For example, if you use select.select() you must use a time-out and retry if the time-out occurs; otherwise your signal handler will not run. Likewise, if your event listener blocks on a Queue.Queue object's .get() method, you must use a timeout, or your signal handler will not run. (The "real" signal handler, implemented in C inside the interpreter, does run, but your Python-level signal handler does not get a chance to execute unless you use timeouts.)
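Concretely, the queue case looks roughly like this (using the Python 3 spelling `queue` for the Queue module; the shutdown flag and the event payloads are placeholders):

```python
import queue      # named Queue in the Python 2 of the original post
import signal
import threading

shutdown = threading.Event()

def handle_sigterm(signum, frame):
    shutdown.set()

def next_event(events):
    """Return the next event, or None once shutdown is flagged."""
    while not shutdown.is_set():
        try:
            # the timeout is essential: a bare .get() blocks inside
            # C code, and the Python-level SIGTERM handler will not
            # run until an item happens to arrive
            return events.get(timeout=1.0)
        except queue.Empty:
            continue
    return None

def main():
    # the handler must be installed from the main thread
    signal.signal(signal.SIGTERM, handle_sigterm)
    events = queue.Queue()
    while True:
        event = next_event(events)
        if event is None:
            break     # SIGTERM received; fall through to cleanup
```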

Good luck!

Stab answered 21/6, 2011 at 20:26 Comment(1)
I looked at it in a bit more detail and I do not believe this is the issue. However, I realized that this post was extremely lacking in description of the problem. If you're still interested in the question, I have re-asked it with much greater detail here: #6517008 – Bernhard

© 2022 - 2024 — McMap. All rights reserved.