Bash: wait with timeout

In a Bash script, I would like to do something like:

app1 &
pidApp1=$!
app2 &
pidApp2=$!

timeout 60 wait $pidApp1 $pidApp2
kill -9 $pidApp1 $pidApp2

I.e., launch two applications in the background, and give them 60 seconds to complete their work. Then, if they don't finish within that interval, kill them.

Unfortunately, the above does not work, since timeout is an external executable, while wait is a shell builtin. I tried changing it to:

timeout 60 bash -c wait $pidApp1 $pidApp2

But this still does not work, since wait only accepts PIDs of children of the shell that runs it, and the new bash process did not launch app1 or app2.

Any ideas?
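A minimal demonstration of why the bash -c attempt fails, assuming a Linux system with Bash: the inner bash is a sibling of the background job, not its parent, so its wait rejects the PID, while the launching shell can still wait for it. The sleep stand-in for an application is an assumption for illustration.

```shell
#!/bin/bash
# The background job is a child of THIS shell only.
sleep 1 &
pid=$!

# A second bash (what "timeout 60 bash -c wait ..." starts) is not its parent,
# so its wait builtin refuses the PID and returns a non-zero status.
status=0
bash -c 'wait "$1"' _ "$pid" 2>/dev/null || status=$?
echo "wait in a separate bash returned: $status"

# The shell that launched the job can wait for it normally.
wait "$pid"
echo "wait in the launching shell returned: $?"
```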

Guardsman asked 5/4, 2012 at 12:42 Comment(4)
Could you sleep 60 instead? Not as efficient, but much simpler. – Tarantella
"60" has to be an upper bound on the execution time. The actual runtime of the applications might be a lot lower. So no, it would be too inefficient for me. – Guardsman
If these programs really require you to use kill -9, they are broken. See also iki.fi/era/unix/award.html#kill – Showcase
The bash wait doesn't support a timeout because it's implemented with the wait() or waitpid() syscall, depending on whether you pass a PID to it. Neither of those supports a timeout natively. It might be possible to use signal handlers and alarm to break out of the wait without actually waiting for the children, but I haven't tested whether it would actually work. – Vial
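The signal-handler idea from the last comment can be sketched in Bash itself, without alarm(): the Bash manual states that when the shell is waiting in the wait builtin and a signal arrives for which a trap is set, wait returns immediately with a status greater than 128 and the trap then runs. The function name wait_all_or_timeout, the use of SIGALRM (any existing ALRM trap is overwritten) and the return value 124 (borrowed from timeout) are choices made for this sketch. It must be called from the main shell rather than from a subshell such as $(...), because the timer signals $$, and a job that finishes exactly at the deadline may still be reported as a timeout.

```shell
#!/bin/bash
# Hypothetical helper: wait for the given child PIDs, but give up after SECONDS.
# Returns 124 on timeout (timeout's convention), otherwise what wait returned.
wait_all_or_timeout() {  # usage: wait_all_or_timeout SECONDS PID...
  local limit=$1 timed_out=0 status=0 timer
  shift
  trap 'timed_out=1' ALRM                  # a trapped signal interrupts wait
  ( sleep "$limit"; kill -ALRM $$ ) &      # timer; $$ is still the main shell here
  timer=$!
  wait "$@" || status=$?
  kill "$timer" 2>/dev/null || true        # stop the timer (harmless if it already fired)
  wait "$timer" 2>/dev/null || true        # make sure it can no longer send ALRM
  trap - ALRM                              # only now restore the default action
  if (( timed_out )); then
    return 124
  fi
  return "$status"
}

# Usage with stand-ins: app1 finishes in time, app2 does not
sleep 1 & pidApp1=$!
sleep 5 & pidApp2=$!
rc=0
wait_all_or_timeout 2 "$pidApp1" "$pidApp2" || rc=$?
if (( rc == 124 )); then
  kill "$pidApp1" "$pidApp2" 2>/dev/null || true   # the caller decides what to kill
fi
echo "wait_all_or_timeout returned $rc"
```

On timeout the jobs are left running on purpose, so the caller can choose between kill, kill -9 or a process-group kill.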

Write the PIDs to files and start the apps like this:

pidFile=...
( app ; rm "$pidFile" ; ) &
pid=$!
echo "$pid" > "$pidFile"
( sleep 60 ; if [[ -e $pidFile ]]; then killChildrenOf "$pid" ; fi ; ) &
killerPid=$!

wait "$pid"
kill "$killerPid"

That creates another process that sleeps for the timeout and kills the process if it hasn't completed by then.

If the process completes faster, the PID file is deleted and the killer process is terminated.

killChildrenOf is a script that fetches all processes and kills all children of a certain PID. See the answers to this question for different ways to implement this functionality: Best way to kill all child processes
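killChildrenOf itself is left open above. One possible implementation, assuming procps pgrep is installed (pgrep -P lists the direct children of a PID), walks the process tree recursively and signals the deepest descendants first, so they are found before they can be re-parented. The function body is an illustration, not the answer author's script.

```shell
#!/bin/bash
# Sketch of the killChildrenOf helper (hypothetical; not defined in the answer).
# Signals every descendant of PID, but not PID itself.
killChildrenOf() {  # usage: killChildrenOf PID [SIGNAL]
  local parent=$1 sig=${2:-TERM} child
  for child in $(pgrep -P "$parent"); do
    killChildrenOf "$child" "$sig"                # grandchildren first
    kill -s "$sig" "$child" 2>/dev/null || true   # it may already be gone
  done
}
```

In the killer subshell above, killChildrenOf "$pid" would then take down the app and anything it forked, after which the subshell itself exits on its own.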

If you want to step outside of Bash, you could write PIDs and timeouts into a directory and have a separate watcher process monitor that directory: every minute or so, read the entries and check which processes are still around and whether they have timed out.
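The directory-watching idea can be sketched as follows. The layout (one file per PID, named after the PID and holding the epoch second of its deadline) and the function names register_pid and sweep_once are assumptions made for illustration. The watcher has to run as the same user as the watched processes, because kill -0 fails without permission to signal them and the entry would then be treated as finished.

```shell
#!/bin/bash
# Hypothetical layout: DIR/<pid> contains the epoch second after which PID is killed.
register_pid() {  # usage: register_pid DIR PID TIMEOUT_SECONDS
  echo $(( $(date +%s) + $3 )) > "$1/$2"
}

sweep_once() {  # usage: sweep_once DIR   (called periodically by the watcher)
  local dir=$1 file pid deadline now
  now=$(date +%s)
  for file in "$dir"/*; do
    [[ -f $file ]] || continue            # empty directory: the glob stays literal
    pid=${file##*/}
    deadline=$(<"$file")
    if ! kill -0 "$pid" 2>/dev/null; then
      rm -f "$file"                       # process already gone
    elif (( now >= deadline )); then
      kill "$pid"                         # timed out
      rm -f "$file"
    fi
  done
}
```

A watcher would then run something like `while sleep 60; do sweep_once "$dir"; done`, and each script calls `register_pid "$dir" "$!" 60` right after starting an app.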

EDIT: If you want to know whether the process has actually died, you can use kill -0 $pid, which sends no signal and only checks whether the PID still exists.

EDIT2: Or you can try process groups. kevinarpe said: To get the PGID for a PID (146322):

ps -fjww -p 146322 | tail -n 1 | awk '{ print $4 }'

In my case: 145974. Then the PGID can be used with a special form of kill to terminate all processes in the group: kill -- -145974
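A sketch of the process-group route that avoids parsing ps -f output: with job control switched on via set -m, Bash puts each background job into its own process group whose PGID equals the job's PID, so one negative-PID kill reaches the job and all of its descendants (unless they moved to another group). The bash -c stand-in for an application is an assumption; ps -o pgid= is simply a shorter way to read the PGID.

```shell
#!/bin/bash
set -m                                   # job control: new process group per job
bash -c 'sleep 30 & sleep 30 & wait' &   # stand-in for an app that forks children
pid=$!
set +m                                   # later commands run without job control

pgid=$(ps -o pgid= -p "$pid" | tr -d ' ')
echo "pid=$pid pgid=$pgid"

kill -TERM -- "-$pgid"                   # negative value: signal the whole group
status=0
wait "$pid" || status=$?
echo "job exited with status $status"
```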

Everrs answered 5/4, 2012 at 12:52 Comment(10)
This does not work. wait requires a PID to be a child of the current shell. I get the following error: "wait.sh: line 2: wait: pid 22603 is not a child of this shell". – Guardsman
I just noticed one problem with my approach: this only kills the shell that runs the app. Get a process list and look for a process with $pid as parent PID; that should be the app. – Everrs
Works like a charm. Thanks! I somewhat simplified the inner part: ( sleep 60 ; kill -9 $pids ) & killerPid=$!. I think it is sufficient for my purpose. – Guardsman
Killing the parent does not always affect the children. If your app hangs (kill doesn't work while kill -9 does), you will get zombie processes. – Everrs
This should help: #392522 – Everrs
You shouldn't kill -9 – Ogden
I would add a sleep 5 or something after the killChildrenOf $pid to ensure that the kill $killerPid really does kill the process you think it's killing. – Claimant
@histumness: You can do that or try kill -0 $pid to check whether the process is still there. – Everrs
I just noticed one problem with my approach...: Did you consider using process groups? Here is one method to get the PGID for a PID (146322): ps -fjww -p 146322 | tail -n 1 | awk '{ print $4 }'. (In my case it outputs 145974.) Then the PGID can be used with a special mode of kill to terminate all processes in the group: kill -- -145974 – Antiar
A similar solution is suggested right in the question https://mcmap.net/q/297547/-using-sleep-and-wait-n-to-implement-simple-timeout-in-bash-race-condition-or-not/94687. (And I find your idea quite elegant and clever; I couldn't come up with this myself.) – Malaria

Both your example and the accepted answer are overly complicated. Why not simply use timeout, since that is exactly its use case? The timeout command even has a built-in option (-k) to send SIGKILL if the command is still running a given time after the initial signal (SIGTERM by default) was sent (see man timeout).

If the script doesn't need to wait and then resume its control flow, it's simply a matter of

timeout -k 60s 60s app1 &
timeout -k 60s 60s app2 &
# [...]

If it does, however, that's just as easy by saving the PIDs of the timeout processes instead:

pids=()
timeout -k 60s 60s app1 &
pids+=($!)
timeout -k 60s 60s app2 &
pids+=($!)
wait "${pids[@]}"
# [...]

E.g.

$ cat t.sh
#!/bin/bash

echo "$(date +%H:%M:%S): start"
pids=()
timeout 10 bash -c 'sleep 5; echo "$(date +%H:%M:%S): job 1 terminated successfully"' &
pids+=($!)
timeout 2 bash -c 'sleep 5; echo "$(date +%H:%M:%S): job 2 terminated successfully"' &
pids+=($!)
wait "${pids[@]}"
echo "$(date +%H:%M:%S): done waiting. both jobs terminated on their own or via timeout; resuming script"


$ ./t.sh
08:59:42: start
08:59:47: job 1 terminated successfully
08:59:47: done waiting. both jobs terminated on their own or via timeout; resuming script
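If the script also needs to know which jobs hit the limit, the exit status of each timeout process can tell it: GNU timeout exits with 124 when the command timed out, and with 137 (128+9) when it had to escalate to SIGKILL via -k. A sketch, assuming the commands are given as strings and run through bash -c (run_all_with_timeout is a hypothetical name):

```shell
#!/bin/bash
# Run every command string under GNU timeout in parallel, then report per job.
run_all_with_timeout() {  # usage: run_all_with_timeout SECONDS CMD...
  local limit=$1; shift
  local -a pids=() names=()
  local cmd i status
  for cmd in "$@"; do
    timeout -k "$limit" "$limit" bash -c "$cmd" &
    pids+=("$!")
    names+=("$cmd")
  done
  for i in "${!pids[@]}"; do
    status=0
    wait "${pids[i]}" || status=$?
    case $status in
      0)   echo "ok: ${names[i]}" ;;
      124) echo "timed out: ${names[i]}" ;;
      137) echo "killed after grace period: ${names[i]}" ;;
      *)   echo "failed ($status): ${names[i]}" ;;
    esac
  done
}
```

For example, `run_all_with_timeout 60 app1 app2` waits for both, at most about 120 seconds in the worst case (60 s plus the -k grace period).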
Convoy answered 14/3, 2014 at 7:49 Comment(4)
According to gnu.org/software/coreutils/manual/html_node/…, the -k should go before 60s and, in addition, you must specify a timeout for -k. So, for example, the first code example should be timeout -k 60s 60s app1 &. – Johannessen
timeout does not appear to be available on OS X 10.12. – Daphinedaphna
@Daphinedaphna This was a Linux-specific question, so that point is moot, especially since the question is built around timeout to begin with. However, it's easy to install GNU timeout on OS X/macOS via e.g. Homebrew. – Driedup
"why do you not only use timeout since that is exactly its use case?" Because they don't know how long a forked job should take, only that it should exit within N seconds after a certain point. – Bleacher

Here's a simplified version of Aaron Digulla's answer, which uses the kill -0 trick that Aaron Digulla leaves in a comment:

app &
pidApp=$!
( sleep 60 ; echo 'timeout'; kill $pidApp ) &
killerPid=$!

wait $pidApp
kill -0 $killerPid && kill $killerPid

In my case, I wanted the script to be safe under set -e -x and to return the status code, so I used:

set -e -x
app &
pidApp=$!
( sleep 45 ; echo 'timeout'; kill $pidApp ) &
killerPid=$!

status=0
wait $pidApp || status=$?    # a bare failing wait would make set -e exit here
(kill -0 $killerPid && kill $killerPid) || true

exit $status

An exit status of 143 (128 + 15) indicates SIGTERM, almost certainly from our timeout.
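To turn such a status into a signal name, Bash's kill -l also accepts an exit status of the form 128+N. A small sketch (describe_status is a hypothetical name), shown together with the watchdog pattern above, shortened to a 1-second limit with sleep 5 standing in for app:

```shell
#!/bin/bash
# Hypothetical helper: statuses above 128 conventionally mean "killed by signal (status-128)".
describe_status() {  # usage: describe_status STATUS
  local status=$1 name
  if (( status > 128 )) && name=$(kill -l "$status" 2>/dev/null); then
    echo "killed by SIG$name"
  else
    echo "exited with status $status"
  fi
}

# The watchdog pattern, with a 1-second limit and sleep 5 standing in for app
sleep 5 &
pidApp=$!
( sleep 1; kill "$pidApp" ) &
killerPid=$!
status=0
wait "$pidApp" || status=$?
kill -0 "$killerPid" 2>/dev/null && kill "$killerPid" 2>/dev/null || true
describe_status "$status"
```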

Hopeh answered 13/3, 2014 at 16:20 Comment(1)
A similar solution is suggested right in the question https://mcmap.net/q/297547/-using-sleep-and-wait-n-to-implement-simple-timeout-in-bash-race-condition-or-not/94687. (And I find this kind of solution to the problem quite elegant and clever; I couldn't come up with this myself.) – Malaria
app1 &
app2 &
sleep 60 &

wait -n
Septimal answered 26/6, 2019 at 15:11 Comment(1)
This does not work. If a single command, say app1, finishes before the timeout, then the wait will return. The question here is how to wait for all commands, but with a certain maximum time. – Leverrier

I wrote a bash function that waits until the given PIDs have finished or until a timeout expires. It returns non-zero if the timeout was exceeded and prints the PIDs that have not finished.

function wait_timeout {
  local limit=$1
  shift
  local pids=("$@")
  local count=0
  while true
  do
    # keep only the PIDs that are still running
    local remaining=()
    local pid
    for pid in "${pids[@]}"; do
      if kill -0 "$pid" &>/dev/null; then
        remaining+=("$pid")
      fi
    done
    pids=("${remaining[@]}")
    if (( ${#pids[@]} == 0 )); then
      return 0                # all PIDs finished
    fi
    if (( count >= limit )); then
      echo "${pids[@]}"       # PIDs still running at the timeout
      return 1
    fi
    count=$(( count + 1 ))
    sleep 1
  done
}

To use it, just call wait_timeout $timeout $PID1 $PID2 ...

Rubbing answered 12/12, 2016 at 19:21 Comment(1)
Note that the sed will remove wrong PID numbers if you're unlucky enough to have e.g. a $pid value of 123 while the list of all PIDs contains PID 1231. You end up with a modified list that is going to wait for PID 1, which is obviously not going to go away. – Vial

To put in my 2c, we can boil down Teixeira's solution to:

try_wait() {
    # Usage: [PID]...
    for ((i = 0; i < $#; i += 1)); do
        kill -0 $@ && sleep 0.001 || return 0
    done
    return 1 # timeout or no PIDs
} &>/dev/null

GNU sleep accepts fractional seconds, and 0.001 s = 1 ms (a 1 kHz polling rate) is plenty of time. However, UNIX has no loopholes when it comes to files and processes: try_wait accomplishes very little.

$ cat &
[1] 16574
$ try_wait %1 && echo 'exited' || echo 'timeout'
timeout
$ kill %1
$ try_wait %1 && echo 'exited' || echo 'timeout'
exited

We have to answer some hard questions to get further.

Why does wait have no timeout parameter? Maybe because the timeout, kill -0, wait and wait -n commands can tell the machine more precisely what we want.

Why is wait built into Bash in the first place, so that timeout wait PID does not work? Maybe only so that Bash can implement proper signal handling.

Consider:

$ timeout 30s cat &
[1] 6680
$ jobs
[1]+    Running   timeout 30s cat &
$ kill -0 %1 && echo 'running'
running
$ # now meditate a bit and then...
$ kill -0 %1 && echo 'running' || echo 'vanished'
bash: kill: (NNN) - No such process
vanished

Whether in the material world or in machines, as we require some ground on which to run, we require some ground on which to wait too.

  • When kill fails you hardly know why. Unless you wrote the process, or its manual names the circumstances, there is no way to determine a reasonable timeout value.

  • When you have written the process, you can implement a proper TERM handler or even respond to "Auf Wiedersehen!" sent to it through a named pipe. Then you have some ground even for a spell like try_wait :-)

Wall answered 15/11, 2018 at 22:44 Comment(1)
The precision and elegance of this answer is like a surgeon's knife. – Pumpkin

You could use the timeout option of the read builtin (read -t).

The following waits at most 60 seconds and then displays the names of the jobs that completed by then (unfinished jobs are not killed, see below):

( (job1; echo -n "job1 ")& (job2; echo -n "job2 ")&) | (read -t 60 -a jobarr; echo ${jobarr[*]} ${#jobarr[*]} )

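It works by running all the background jobs inside a subshell whose output is piped to read. Each job prints its name when it finishes, without a newline, so read returns only when all jobs are done (EOF) or when the 60-second timeout expires. The names are read into a bash array variable, which can be used as desired (in this example by printing the array and its element count).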

Be sure to reference ${jobarr} in the same subshell as the read command (hence the parentheses); otherwise ${jobarr} will be empty.

After the read command finishes, the remaining background jobs are not killed; anything they print afterwards is simply lost. You have to kill them yourself.

Banana answered 25/9, 2022 at 15:13 Comment(0)

Yet another timeout script

Running many subprocesses with an overall timeout. Using recent Bash features ($EPOCHREALTIME, Bash 5.0+), I wrote this:

#!/bin/bash
maxTime=5.0 jobs=() pids=() cnt=1 Started=${EPOCHREALTIME/.}
if [[ $1 == -m ]] ;then maxTime=$2; shift 2; fi

for cmd ;do  # $cmd is unquoted in order to use strings as command + args
    $cmd &
    jobs[$!]=$cnt pids[cnt++]=$!
done

printf -v endTime %.6f $maxTime
endTime=$(( Started + 10#${endTime/.} ))
exec {pio}<> <(:) # Pseudo FD for "builtin sleep" by using "read -t" 
while ((${#jobs[@]})) && (( ${EPOCHREALTIME/.} < endTime ));do
    for cnt in ${jobs[@]};do
        if ! jobs $cnt &>/dev/null;then
            Elap=00000$(( ${EPOCHREALTIME/.} - Started ))
            printf 'Job %d (%d) ended after %.4f secs.\n' \
                   $cnt ${pids[cnt]} ${Elap::-6}.${Elap: -6}
            unset jobs[${pids[cnt]}] pids[cnt]
        fi
    done
    read -ru $pio -t .02 _
done
if ((${#jobs[@]})) ;then
    Elap=00000$(( ${EPOCHREALTIME/.} - Started ))
    for cnt in ${jobs[@]};do
        printf 'Job %d (%d) killed after %.4f secs.\n' \
               $cnt ${pids[cnt]} ${Elap::-6}.${Elap: -6}
    done
    kill ${pids[@]}
fi

Sample run:

  • Commands with arguments can be submitted as strings.
  • The -m switch lets you choose a floating-point maximum time in seconds.
$ ./execTimeout.sh -m 2.3 "sleep 1" 'sleep 2' sleep\ {3,4}  'cat /dev/tty'
Job 1 (460668) ended after 1.0223 secs.
Job 2 (460669) ended after 2.0424 secs.
Job 3 (460670) killed after 2.3100 secs.
Job 4 (460671) killed after 2.3100 secs.
Job 5 (460672) killed after 2.3100 secs.

For testing this, I wrote a script that:

  • chooses a random duration between 1.0000 and 9.9999 seconds;
  • outputs a random number of lines between 0 and 8 (it may output nothing at all);
  • prints, on each line, its process ID ($$), the number of lines left to print and the total duration in seconds.
#!/bin/bash

tslp=$RANDOM lnes=${RANDOM: -1}
printf -v tslp %.6f ${tslp::1}.${tslp:1}
slp=00$((${tslp/.}/($lnes?$lnes:1)))
printf -v slp %.6f ${slp::-6}.${slp: -6}
# echo >&2 Slp $lnes x $slp == $tslp
exec {dummy}<> <(: -O)
while read -rt $slp -u $dummy; ((--lnes>0)); do
    echo $$ $lnes $tslp
done

Running this script 5 times at once, with a timeout of 5.0 seconds:

$ ./execTimeout.sh -m 5.0 ./tstscript.sh{,,,,}
2869814 6 2.416700
2869815 5 3.645000
2869814 5 2.416700
2869814 4 2.416700
2869815 4 3.645000
2869814 3 2.416700
2869813 5 8.414000
2869812 1 3.408000
2869814 2 2.416700
2869815 3 3.645000
2869814 1 2.416700
2869815 2 3.645000
Job 3 (2869814) ended after 2.4511 secs.
2869813 4 8.414000
2869815 1 3.645000
Job 1 (2869812) ended after 3.4518 secs.
Job 4 (2869815) ended after 3.6757 secs.
2869813 3 8.414000
Job 2 (2869813) killed after 5.0159 secs.
Job 5 (2869816) killed after 5.0159 secs.
Heading answered 26/9, 2022 at 9:24 Comment(0)

Some processes do not work well when invoked from timeout. I ran into an issue where I needed to put a timeout around a qemu instance, and if you invoke

timeout 900 qemu 

it will always hang.

My solution

# Collect a PID and all of its descendants into the global array allpids
descendent_pids(){
   local pid
   allpids+=("$1")
   for pid in $(pgrep -P "$1"); do
      descendent_pids "$pid"
   done
}

./qemu_cmd &
qemuPid=$!
# GNU tail exits when $qemuPid dies; timeout returns 124 if 900 s pass first
timeout 900 tail --pid="$qemuPid" -f /dev/null
ret=$?
if [ "$ret" != "0" ]; then
   allpids=()
   descendent_pids "$qemuPid"
   for pid in "${allpids[@]}"; do
      kill -9 "$pid"
   done
fi

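Also note that timeout will not always kill descendant processes, depending on how the spawned command manages its children: timeout signals its own process group, so descendants that move to a new process group or session (e.g. via setsid) are not reached.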
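One more thing that might be worth trying for commands like qemu that use the terminal: by default GNU timeout runs the command in a separate background process group, and a background process that reads from the terminal gets stopped, which looks like a hang. GNU timeout's --foreground option skips that process group, at the cost that children of the command are then not timed out. Whether this is the cause in the qemu case is an assumption (the comment below points to -nographic / -serial stdio), and the wrapper name is hypothetical.

```shell
#!/bin/bash
# Hypothetical wrapper: keep COMMAND in the terminal's foreground process group.
# Trade-off: with --foreground only COMMAND itself is signalled, not its children.
run_fg_with_timeout() {  # usage: run_fg_with_timeout SECONDS COMMAND [ARGS...]
  local limit=$1
  shift
  timeout --foreground -k 10 "$limit" "$@"
}

# e.g.: run_fg_with_timeout 900 ./qemu_cmd; the status is 124 on timeout
```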

Yonina answered 17/3, 2023 at 20:42 Comment(1)
Faced the same problem and figured out that it is caused by the qemu -nographic or -serial stdio options. – Nanine

Better late than never: a solution that uses wait without polling (it is still a loop, though) and still stops as soon as possible.

app1 &
pidApp1=$!
app2 &
pidApp2=$!

# timeout 60 wait $pidApp1 $pidApp2
declare -A pidApps=( [$pidApp1]=running [$pidApp2]=running )
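# Timer job: this pipeline finishes after 60 seconds
{ sleep 60; echo "stop"; } | read &
pidTmout=$!
while [[ ${#pidApps[@]} -gt 0 ]]; do
    wait -np pidStop    # -p (store the finished job's PID) needs Bash 5.1 or later
    [[ $pidStop == $pidTmout ]] && break
    unset "pidApps[$pidStop]"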
done
[[ ${pidApps[$pidApp1]} == running ]] && kill -9 $pidApp1
[[ ${pidApps[$pidApp2]} == running ]] && kill -9 $pidApp2
Kanpur answered 9/7, 2023 at 21:43 Comment(0)

Using wait -n -p varname, you can wait for multiple processes and identify which of them finished first. By making one of the processes the timeout, you know whether a timeout or a successful completion occurred.

It's quite a monstrosity, but it allows you to be more explicit in your error handling. If you need this level of control, it's better to move to a more full-bodied language like Python.

As the top answer says, using timeout is likely enough for 99% of simpler cases. I came up with parts of the solution below when I needed a richer watchdog process that observes a heartbeat instead of a plain timeout. Don't see it as a full drop-in replacement; it's more something to copy pieces from.

#!/bin/bash

# Start 2 processes to wait for, for sake of example, use sleep and let one of them fail
remaining=()
sleep 1 && false &
remaining+=($!)
sleep 2 && true &
remaining+=($!)

# Start the watchdog process, for sake of example, use sleep as a fixed timeout
sleep 2 &
pid_timeout=$!


# Keep looping until no more processes left
while [ ${#remaining[@]} != 0 ]
do
    echo "Waiting for remaining processes ${remaining[*]}"
    wait -n -p firstFinished "${remaining[@]}" $pid_timeout
    first_exit_code=$?
    
    if [ $firstFinished = $pid_timeout ]
    then
        echo "Timeout, killing remaining processes ${remaining[*]}"
        kill "${remaining[@]}"
        exit 124
    else 
        # Remove the finished process from remaining list
        echo "Process $firstFinished finished with code $first_exit_code"
        
        new_array=()
        for rem_pid in "${remaining[@]}"
        do
            if [ "$rem_pid" = "$firstFinished" ] 
            then
                echo "Process $rem_pid reported finished by wait, remove from list"
                continue
            fi 

            if ! kill -0 "$rem_pid" 
            then
                echo "Process $rem_pid no longer running, removing from wait list"
                # potential race condition when multiple processes finish before next iteration of loop
                # TODO: How to get exit code?
                continue
            fi 

            new_array+=("$rem_pid")
        done
        remaining=("${new_array[@]}")
        unset new_array
    fi
done
kill $pid_timeout || true
echo "all done"
Oculomotor answered 9/9, 2023 at 7:23 Comment(0)
