Python inside GNU Screen eventually becomes idle if Screen is detached

I have a python script which uses multiprocessing and subprocess to launch multiple external commands in parallel with different arguments. The code can be found here.

For convenience I launch this script inside a GNU Screen session. The machine where this script is running has 12 processors which are idle until processes become active.

Each of the processes takes between a few hours and a couple of days to run, so I often disconnect from the machine and detach the screen session.

However, recently I've noticed a behavior which I never experienced before. On several occasions I've returned to the machine to find it idle with a load of zero. If I get a list of active processes either via ps ux or top I can still find the script (and the subprocesses) on the list of processes. I then reattach the screen session to check the state of the program and immediately a new batch of processes is sent to the queue and the load of the system goes back to 12 in a matter of seconds. Note that I did absolutely nothing to the script other than reattaching the screen session.

I've installed a monitoring tool on the system, and what happens is that some processes finish after a certain time and no new processes are launched. So the system stays active as long as subprocesses are busy and becomes idle as soon as no more jobs are released from the queue.

So my question is, does anyone know of any reason that explains this behavior?

EDIT: After a year or so, this problem is no longer reproducible, perhaps because of a patch to either screen or python itself. I'm accepting the answer as it provided good directions for testing.

Cloche answered 8/5, 2011 at 1:58 Comment(3)
Can you let us know what version of python and screen you were using when the problem was occurring, and what versions you are using now that the problem no longer occurs? I'm having a very similar problem myself.Sampan
Sorry SpoonMeiser, the problem was too long ago and I don't have that information any more. Since then I started using tmux instead of screen. As for workarounds, I used file logging instead of printing to stdout/stderr.Cloche
Almost a decade later, and I have the exact same problem. Don't see any others with the issue online. Anyone else experience this and any fixes?Havens
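
The file-logging workaround mentioned in the comments above could look roughly like the following. This is only a sketch, not the asker's code: `make_file_logger`, the logger name, and the file name `dispatcher.log` are hypothetical. The idea is that the dispatcher's progress messages go to a file and never to the terminal that Screen manages.

```python
import logging


def make_file_logger(name, log_path):
    """Return a logger that writes to log_path and never to the terminal."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    # Do not pass records up to the root logger, whose handlers may write to stderr.
    logger.propagate = False
    if not logger.handlers:
        handler = logging.FileHandler(log_path)
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    # Hypothetical names; replace with whatever suits the real script.
    log = make_file_logger("dispatcher", "dispatcher.log")
    log.info("launched job %d", 1)
```

Because the script uses multiprocessing, it is probably safest to give each worker process its own log file; the standard logging module does not support several processes writing to a single file.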

I can't explain the reason for what you are seeing. However, I do have an idea of what you can try next.

  1. Try piping the output of the script through tee: | tee out.txt. If that has no effect, try the next step.
  2. Run screen on another [hop] host. From there, SSH into your worker host and run your script in the non-emulated shell. Then feel free to disconnect and reconnect from your hop to check on the process. This should hide from the worker that screen is in any way involved.
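
The first suggestion moves the script's output away from Screen's terminal. The same thing can be tried from inside the script. Here is a minimal sketch, assuming the external commands are started with subprocess; `run_logged`, the log file name, and the placeholder command are hypothetical and not taken from the asker's code.

```python
import subprocess
import sys
from pathlib import Path


def run_logged(cmd, log_path):
    """Run cmd with stdout and stderr appended to log_path, not the terminal."""
    log_path = Path(log_path)
    with log_path.open("ab") as log:
        result = subprocess.run(
            cmd,
            stdin=subprocess.DEVNULL,   # the child never reads from the terminal
            stdout=log,                 # standard output goes to the log file
            stderr=subprocess.STDOUT,   # standard error follows standard output
        )
    return result.returncode


if __name__ == "__main__":
    # Placeholder command; substitute the real external program and arguments.
    code = run_logged([sys.executable, "-c", "print('job done')"], "job_0.log")
    print("exit code:", code)
```

If the stall disappears with output redirected this way, that would point at terminal output as the trigger, although it would not by itself explain why.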

Please comment back with the results of these tests. That will give me more to go on.

Nictitate answered 8/5, 2011 at 2:28 Comment(3)
I'm not running exactly what you described in step 2, because I don't want to stop the script. Instead I launched a screen on the hop host, ssh'ed to the server, and attached the server's screen inside it. Then I detached the hop screen, leaving the server one attached inside.Cloche
With all the calculations now complete, the two-screen approach prevented any further occurrence of the issue. However, I still don't understand the reasons behind it, so I would like people to elaborate on further possibilities that explain what happened. In any case, +1 for the workaround.Cloche
You used a context manager on line 31. I would suggest using them in queue_dispatcher as well (dpaste.com/hold/545561). However, I still think the problem is happening in bin/launcher.py or whatever it is launching.Nictitate
