My issue was that the child sleep
process was not being killed when the context timed out. The sh
parent process was being killed, but the child sleep
was being left around.
This would normally still allow the cmd.Wait()
call to succeed, but the problem is that cmd.Wait()
waits for both the process to exit and for outputs to be copied. Because we've assigned cmd.Stdout
, we have to wait for the read-end of the sleep
process' stdout pipe to close, but it never closes because the process is still running.
In order to kill child processes, we can instead start the process as its own process group leader by setting the Setpgid
bit, which will then allow us to kill the process using its negative PID to kill the process as well as any subprocesses.
Here is a drop-in replacement for exec.CommandContext
I came up with that does exactly this:
type Cmd struct {
ctx context.Context
terminated chan struct{}
*exec.Cmd
}
// NewCommand is like exec.CommandContext but ensures that subprocesses
// are killed when the context times out, not just the top level process.
func NewCommand(ctx context.Context, command string, args ...string) *Cmd {
return &Cmd{
ctx: ctx,
terminated: make(chan struct{}),
Cmd: exec.Command(command, args...),
}
}
func (c *Cmd) Start() error {
// Force-enable setpgid bit so that we can kill child processes when the
// context times out or is canceled.
if c.Cmd.SysProcAttr == nil {
c.Cmd.SysProcAttr = &syscall.SysProcAttr{}
}
c.Cmd.SysProcAttr.Setpgid = true
err := c.Cmd.Start()
if err != nil {
return err
}
go func() {
select {
case <-c.terminated:
return
case <-c.ctx.Done():
}
p := c.Cmd.Process
if p == nil {
return
}
// Kill by negative PID to kill the process group, which includes
// the top-level process we spawned as well as any subprocesses
// it spawned.
_ = syscall.Kill(-p.Pid, syscall.SIGKILL)
}()
return nil
}
func (c *Cmd) Run() error {
if err := c.Start(); err != nil {
return err
}
return c.Wait()
}
func (c *Cmd) Wait() error {
defer close(c.terminated)
return c.Cmd.Wait()
}
--- UPDATE ---
Since writing this code I've run into cases where subprocesses sometimes want to join their own process groups, and the setpgid trick no longer works because it will not kill processes in those new process groups. A more robust solution might be to manually traverse the process tree using something like go-ps
, and for each descendant process, use the following pseudocode:
// KillProcessTree kills an entire process tree using SIGKILL.
func KillProcessTree(pid int) error {
// Send SIGSTOP to prevent new children from being spawned
_ = syscall.Signal(pid, syscall.SIGSTOP)
// TODO: implement ChildProcesses
for _, c := range ChildProcesses(pid) {
_ = KillProcessTree(c.Pid)
}
// Now that the process is stopped and all descendants
// are guaranteed to be killed, we can safely SIGKILL
// this process, without worrying about descendant
// processes being reparented to pid 1 or anything
// like that.
_ = syscall.Signal(pid, syscall.SIGKILL)
return nil // TODO: better error handling :)
}
os/exec
package. – Fibrin