Go exec.CommandContext is not being terminated after context timeout
Asked Answered
E

2

5

In golang, I can usually use context.WithTimeout() in combination with exec.CommandContext() to get a command to automatically be killed (with SIGKILL) after the timeout.

But I'm running into a strange issue that if I wrap the command with sh -c AND buffer the command's outputs by setting cmd.Stdout = &bytes.Buffer{}, the timeout no longer works, and the command runs forever.

Why does this happen?

Here is a minimal reproducible example:

package main

import (
    "bytes"
    "context"
    "os/exec"
    "time"
)

func main() {
    ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
    defer cancel()

    cmdArgs := []string{"sh", "-c", "sleep infinity"}
    bufferOutputs := true

    // Uncommenting *either* of the next two lines will make the issue go away:

    // cmdArgs = []string{"sleep", "infinity"}
    // bufferOutputs = false

    cmd := exec.CommandContext(ctx, cmdArgs[0], cmdArgs[1:]...)
    if bufferOutputs {
        cmd.Stdout = &bytes.Buffer{}
    }
    _ = cmd.Run()
}

I've tagged this question with Linux because I've only verified that this happens on Ubuntu 20.04 and I'm not sure whether it would reproduce on other platforms.

Erythromycin answered 2/4, 2022 at 1:38 Comment(1)
I am facing the same issue on my MacOS, when setting cmd.Stdout, the command does not terminate after cancellation of context. I guess there is an issue with os/exec package.Fibrin
E
6

My issue was that the child sleep process was not being killed when the context timed out. The sh parent process was being killed, but the child sleep was being left around.

This would normally still allow the cmd.Wait() call to succeed, but the problem is that cmd.Wait() waits for both the process to exit and for outputs to be copied. Because we've assigned cmd.Stdout, we have to wait for the read-end of the sleep process' stdout pipe to close, but it never closes because the process is still running.

In order to kill child processes, we can instead start the process as its own process group leader by setting the Setpgid bit, which will then allow us to kill the process using its negative PID to kill the process as well as any subprocesses.

Here is a drop-in replacement for exec.CommandContext I came up with that does exactly this:

type Cmd struct {
    ctx context.Context
    terminated chan struct{}
    *exec.Cmd
}

// NewCommand is like exec.CommandContext but ensures that subprocesses
// are killed when the context times out, not just the top level process.
func NewCommand(ctx context.Context, command string, args ...string) *Cmd {
    return &Cmd{
        ctx:        ctx,
        terminated: make(chan struct{}),
        Cmd:        exec.Command(command, args...),
    }
}

func (c *Cmd) Start() error {
    // Force-enable setpgid bit so that we can kill child processes when the
    // context times out or is canceled.
    if c.Cmd.SysProcAttr == nil {
        c.Cmd.SysProcAttr = &syscall.SysProcAttr{}
    }
    c.Cmd.SysProcAttr.Setpgid = true
    err := c.Cmd.Start()
    if err != nil {
        return err
    }
    go func() {
        select {
        case <-c.terminated:
            return
        case <-c.ctx.Done():
        }
        p := c.Cmd.Process
        if p == nil {
            return
        }
        // Kill by negative PID to kill the process group, which includes
        // the top-level process we spawned as well as any subprocesses
        // it spawned.
        _ = syscall.Kill(-p.Pid, syscall.SIGKILL)
    }()
    return nil
}

func (c *Cmd) Run() error {
    if err := c.Start(); err != nil {
        return err
    }
    return c.Wait()
}

func (c *Cmd) Wait() error {
    defer close(c.terminated)
    return c.Cmd.Wait()
}

--- UPDATE ---

Since writing this code I've run into cases where subprocesses sometimes want to join their own process groups, and the setpgid trick no longer works because it will not kill processes in those new process groups. A more robust solution might be to manually traverse the process tree using something like go-ps, and for each descendant process, use the following pseudocode:

// KillProcessTree kills an entire process tree using SIGKILL.
func KillProcessTree(pid int) error {
  // Send SIGSTOP to prevent new children from being spawned
  _ = syscall.Signal(pid, syscall.SIGSTOP)
  // TODO: implement ChildProcesses
  for _, c := range ChildProcesses(pid) {
    _ = KillProcessTree(c.Pid)
  }
  // Now that the process is stopped and all descendants
  // are guaranteed to be killed, we can safely SIGKILL
  // this process, without worrying about descendant
  // processes being reparented to pid 1 or anything
  // like that.
  _ = syscall.Signal(pid, syscall.SIGKILL)
  return nil // TODO: better error handling :)
}
Erythromycin answered 2/4, 2022 at 2:14 Comment(2)
So basically you are killing the main head process and because of the code Setpgid = true, all the sub processes are being killed, right ? @ErythromycinTyne
@Tyne yep, that is how I understand itErythromycin
F
2

By setting cmd.WaitDelay we can make sure the process will be killed even if the io Pipes are not closed.

This was introduced in go 1.20.

   cmd.WaitDelay = 1 * time.Duration
Froward answered 4/5 at 14:25 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.