Using sleep and wait -n to implement simple timeout in bash, race condition or not?

Asked 21/6, 2018 at 18:45 Answered 25/6, 2018 at 10:28

If I do this in a bash script:

sleep 10 &
sleep_pid=$!
some_command &
cmd_pid=$!
wait -n

if kill -0 $sleep_pid 2> /dev/null; then
    # all ok
    kill $sleep_pid
else
    # some_command hung
    ...code to log diagnostics and then kill -9 $cmd_pid...
fi

where some_command is something that should be quick but can hang due to rare errors.

Is there then a risk that some_command can be done and cleaned up before "wait -n" starts, so there is only the sleep to wait for? Or does the '&' after one command guarantee that the shell won't call waitpid() on it until the next line of input has been handled?

It works in interactive shells. If you do:

sleep 10 &
sleep 0 &
wait -n

then the "wait -n" returns right away even if you wait a couple of seconds before running it. But I'm not sure if it can be trusted for non-interactive shells?

EDIT: Clarifying need for diagnostics + some grammar.

Melyndamem answered 21/6, 2018 at 18:45 Comment(3)

It's more trustworthy in non-interactive shells -- you don't have your process-table entries getting reaped to give the user interactive feedback on jobs that completed. I wouldn't particularly trust this code in an interactive shell, but it should be quite solid in a noninteractive one. – Heartburning 21/6, 2018 at 21:59

@CharlesDuffy So non-interactive shells don't do waitpid()/wait() unless explicitly asked to via the wait builtin? That means I should stop worrying about this and start looking for process leaks in all my other long running scripts instead. :) – Melyndamem 21/6, 2018 at 22:30

And a solution similar to yours is suggested in an answer there: https://mcmap.net/q/295544/-bash-wait-with-timeout I find this kind of solution elegant and clever; I couldn't come up with something similar myself. – Corky 9/2, 2022 at 18:0

As @CharlesDuffy pointed out in comments, the answer is no, there is no race (provided it is run in a non-interactive shell).

Also there is no need (in non-interactive shells) to make sure the wait comes directly after the command, as non-interactive shells don't do automatic reaping of children.

But I guess one should wrap this in a sub-shell, so "wait -n" won't return early due to some previously started unrelated background job.
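
A minimal sketch of that sub-shell wrapping, assuming bash 4.3 or later (where wait -n is available). The helper name run_with_timeout, the 124 return value (borrowed from timeout(1)), and the echo standing in for real diagnostics are illustrative assumptions, not something from the original script:

```bash
#!/usr/bin/env bash
# run_with_timeout is a hypothetical helper name, not part of bash.
# Usage: run_with_timeout SECONDS COMMAND [ARGS...]
# Returns the command's own exit status, or 124 if the timer expired first
# (124 mirrors the timeout(1) convention; that choice is an assumption).
run_with_timeout() {
    local secs=$1
    shift
    (
        # The sub-shell starts with no background jobs of its own, so
        # "wait -n" can only return for one of the two jobs started here.
        sleep "$secs" &
        sleep_pid=$!
        "$@" &
        cmd_pid=$!

        wait -n
        status=$?

        if kill -0 "$sleep_pid" 2>/dev/null; then
            # The timer is still pending, so the command finished first.
            kill "$sleep_pid"
            exit "$status"
        fi

        # The timer has been reaped, so the command is presumed hung.
        echo "timed out after ${secs}s: $*" >&2   # placeholder for real diagnostics
        kill -KILL "$cmd_pid" 2>/dev/null
        exit 124
    )
}
```

For example, `run_with_timeout 10 some_command` returns some_command's status if it finishes within 10 seconds. If the command is a shell function, kill -KILL only hits the backgrounded sub-shell running it, so any children it started may need separate cleanup.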

Melyndamem answered 25/6, 2018 at 10:28 Comment(0)

I believe you may be able to use the timeout command to do this.
http://man7.org/linux/man-pages/man1/timeout.1.html

timeout 10s command_to_run

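You can check the exit status of timeout to find out whether it timed out. GNU timeout exits with status 124 when the time limit is reached (unless --preserve-status is given); otherwise it passes through the command's own exit status, so a plain "non-zero" test cannot tell a timeout from an ordinary failure.

timeout 2s sleep 10
status=$?

if [[ $status -eq 124 ]]; then
  echo "it timed out"
elif [[ $status -ne 0 ]]; then
  echo "it failed with status $status"
else
  echo "it was successful"
fi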

Discard answered 21/6, 2018 at 18:49 Comment(1)

timeout works for most cases, but it doesn’t work if the thing you want to run is a function in your shell script, or if you want to get a stack trace before killing. (Maybe some versions of timeout allow the latter?) – Melyndamem 21/6, 2018 at 20:57
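
Both gaps can sometimes be worked around with GNU timeout itself. The sketch below assumes bash and GNU coreutils; slow_step and timeout_function are hypothetical names. It exports a function so that a child bash can run it under timeout, and uses -k so that KILL follows if the first signal (TERM by default, or one chosen with -s) is ignored:

```bash
#!/usr/bin/env bash
# slow_step is a hypothetical stand-in for a script function that may hang.
slow_step() {
    sleep "$1"
}
export -f slow_step   # exported functions are visible to child bash processes

# timeout_function is a hypothetical wrapper.
# Usage: timeout_function LIMIT FUNCTION [ARGS...]
timeout_function() {
    local limit=$1
    shift
    # -k 2: if the command survives the first signal, send KILL 2 seconds later.
    # The child bash finds "$1" among the exported functions and runs it.
    timeout -k 2 "$limit" bash -c '"$@"' bash "$@"
}
```

With this, `timeout_function 10 slow_step 30` gives up after 10 seconds and returns 124 if the function stops on the first signal. The cost is that the function runs in a separate bash process, so it cannot change variables in the calling script.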

By using the $! variable, we avoid relying on interactive job control features. Try this:

...long executing command... &
pid_long=$!

sleep 3 &
pid_sleep=$!

wait -n
kill -KILL $pid_long

The problem here is PID recycling. Unlikely to happen in 3 seconds, though.

When the command finishes before the sleep (and its PID has not been recycled for a new process), kill prints an error message; we could redirect its stderr to /dev/null.

We should probably also kill the sleep in case it is the one that is lingering.

Fadil answered 21/6, 2018 at 19:22 Comment(5)

PID recycling isn't going to happen if the old entry is still in the process table, and if waitpid() or wait() hasn't been called (which is automatic only in interactive shells), it'll still be there as a zombie. – Heartburning 21/6, 2018 at 21:56

In my effort to keep the question short and to the point, I left out some detail that matters in this case (added in edit now). Speculative killing won't let me detect the problem and collect diagnostic info before cleaning up (which I didn't mention in the original question). – Melyndamem 21/6, 2018 at 21:58

@CharlesDuffy But wait has been called. When "wait -n" happens to reap the interesting command rather than sleep, then $pid_long is no longer a valid PID. kill will produce an error about a nonexistent PID. I reproduced this case in testing. – Fadil 21/6, 2018 at 22:31

@CharlesDuffy Non-interactive script. – Fadil 22/6, 2018 at 2:26

Oh -- I misread what you were saying. Yes, you're right -- if wait -n reaps the interesting command, it's no longer running, so it can't be killed, and yes, the PID could potentially be repurposed in that case (since it no longer has a zombie). – Heartburning 22/6, 2018 at 3:27

