Many years later, and so probably with a completely different root cause: this blog post explains why attaching a tracer might fix hung system calls: https://ayende.com/blog/198849-C/production-postmortem-the-heisenbug-server?Key=1eeda567-02a8-4bbb-b90f-557523973233. It looks like running strace (or any other tool that uses the ptrace system call) can cause "hung" system calls to return, failing with the error EINTR.
Quoting the ptrace man page:
Some system calls return with EINTR if a signal was sent to a
tracee, but delivery was suppressed by the tracer. (This is very
typical operation: it is usually done by debuggers on every
attach, in order to not introduce a bogus SIGSTOP). As of Linux
3.2.9, the following system calls are affected (this list is
likely incomplete): epoll_wait(2), and read(2) from an inotify(7)
file descriptor. The usual symptom of this bug is that when you
attach to a quiescent process with the command
strace -p <process-ID>
then, instead of the usual and expected one-line output such as
restart_syscall(<... resuming interrupted call ...>_
or
select(6, [5], NULL, [5], NULL_
('_' denotes the cursor position), you observe more than one
line. For example:
clock_gettime(CLOCK_MONOTONIC, {15370, 690928118}) = 0
epoll_wait(4,_
What is not visible here is that the process was blocked in
epoll_wait(2) before strace(1) has attached to it. Attaching
caused epoll_wait(2) to return to user space with the error
EINTR. In this particular case, the program reacted to EINTR by
checking the current time, and then executing epoll_wait(2)
again. (Programs which do not expect such "stray" EINTR errors
may behave in an unintended way upon an strace(1) attach.)