'strace' fixes hung process
G

3

6

I have a single-threaded Unix process that communicates over TCP with other processes.

The problem is the following. When I start up the process it hangs (no busy loop) until I kill it.

The funny thing is, as soon as I attach with strace to it, it continues to run with the expected behavior as if there wasn't any problem at all (always reproducible).

What could be the reason for this behavior? What effect does strace have on the state of a process?


strace changed the behavior because we were using OpenOnload, which had a bug. As soon as we attached strace, the network stack was moved back to the kernel and the problem was gone.

Gainer answered 28/11, 2013 at 18:35 Comment(4)
If the code is threaded, a race condition might be avoided by a controlling process that forces context switches at different points in code execution than when the code runs natively. Running a debugger on threaded code that has problems has sometimes resulted in the code not displaying the problem - for me.Kwok
True, but the OP said "single threaded" :)Therese
I have a similar situation... a hung process works alright if I attach strace to it. Can anyone elaborate on the explanation?Hypersensitize
There is now an article about that: ayende.com/blog/198849-C/…Khartoum
B
4

Many years later, and so probably with a completely different root cause, this blog post explains why attaching a tracer might fix hung system calls: https://ayende.com/blog/198849-C/production-postmortem-the-heisenbug-server?Key=1eeda567-02a8-4bbb-b90f-557523973233. It looks like running strace (or any other tool that uses the ptrace system call) can cause "hung" system calls to return (with the error EINTR).

Quoting the ptrace man page:

Some system calls return with EINTR if a signal was sent to a tracee, but delivery was suppressed by the tracer. (This is very typical operation: it is usually done by debuggers on every attach, in order to not introduce a bogus SIGSTOP). As of Linux 3.2.9, the following system calls are affected (this list is likely incomplete): epoll_wait(2), and read(2) from an inotify(7) file descriptor. The usual symptom of this bug is that when you attach to a quiescent process with the command

   strace -p <process-ID>

then, instead of the usual and expected one-line output such as

   restart_syscall(<... resuming interrupted call ...>_

or

   select(6, [5], NULL, [5], NULL_

('_' denotes the cursor position), you observe more than one line. For example:

    clock_gettime(CLOCK_MONOTONIC, {15370, 690928118}) = 0
    epoll_wait(4,_

What is not visible here is that the process was blocked in epoll_wait(2) before strace(1) has attached to it. Attaching caused epoll_wait(2) to return to user space with the error EINTR. In this particular case, the program reacted to EINTR by checking the current time, and then executing epoll_wait(2) again. (Programs which do not expect such "stray" EINTR errors may behave in an unintended way upon an strace(1) attach.)
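
The usual defence against such "stray" EINTR errors is to restart the interrupted call. Below is a minimal sketch of that retry pattern in Python. The names `retry_on_eintr` and `FlakyWait` are hypothetical and only serve the illustration; `FlakyWait` stands in for a blocking call such as epoll_wait(2). Note that Python 3.5 and later already retries most interrupted system calls automatically (PEP 475), so an explicit loop like this mainly matters in C or in older runtimes.

```python
import errno


def retry_on_eintr(call, *args):
    """Run call(*args), restarting it whenever it fails with EINTR."""
    while True:
        try:
            return call(*args)
        except InterruptedError:
            # EINTR: a signal (or a tracer attach) interrupted the call.
            # Nothing actually failed, so simply issue the call again.
            continue


class FlakyWait:
    """Hypothetical stand-in for a blocking call such as epoll_wait()."""

    def __init__(self, interruptions):
        self.remaining = interruptions
        self.calls = 0

    def __call__(self):
        self.calls += 1
        if self.remaining > 0:
            self.remaining -= 1
            raise InterruptedError(errno.EINTR, "Interrupted system call")
        return "ready"


wait = FlakyWait(interruptions=2)
result = retry_on_eintr(wait)
```

A program that instead treats EINTR as a real result (for example as "data is ready" or as a fatal error) is the kind that may change behavior when strace attaches.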

Bromidic answered 27/1, 2023 at 15:22 Comment(2)
Can you summarise here? The link will break sooner or later.Chalky
Actually, the ptrace man page is the summary. But I added a sentence to summarize it.Bromidic
W
0

I had this problem only once and it was related to signal handling. It is one source of race conditions in single-threaded code.

Wake answered 29/1, 2014 at 13:51 Comment(0)
C
0

Most likely the strace output simply slows the process down, making deadlocks much less likely. I have seen this happen before with strace, and it can also happen when adding other debug printing or debug calls.

Deadlocks are most often seen with multi-threaded interaction, but in your case you have multiple processes. If strace frees up the processes every time, then I guess the way you open the sockets or handshake on them is what is hanging. Buffering and blocking on the socket could be getting you into a process-deadlocked state.

Similar question, but with a multi-threaded process and a deadlock between threads instead of between separate processes: Using strace fixes hung memory issue

It is hard to generalise examples, especially as I don't know what your different processes are doing or whether they share resources in some way. I will try . . .

  1. Example with one object/resource which should be protected:
    One process starts making changes on an object (e.g. adding items to a list/db table)
    Another process starts iterating the list/table.
    Danger: the iterating process's loop gets confused and never exits, OR it does something worse like writing to invalid memory.

  2. Example where object/resource is protected by mutexes
    The classic simple two-resource deadlock problem (simpler than dining philosophers).
    One thread/process grabs mutex on object A, does some work.
    Another thread/process grabs mutex on object B, does some work.
    That second thread/process needs to update object A, waits for mutex on A.
    Original thread/process needs to access object B, waits for mutex on B.
    . . . . . . . . . . . . @ . . . . . . . . . . .
    Silence except for the noise of the wind and a tumbleweed blowing across the landscape.
    Deadlocked.
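
The second example can be sketched in Python as follows. This is only an illustration under stated assumptions: it uses two threads rather than two processes, the names `demo_lock_order_deadlock`, `mutex_a` and `mutex_b` are hypothetical, and a timeout on the second acquire is added purely so that the demonstration terminates instead of hanging for real.

```python
import threading


def demo_lock_order_deadlock(timeout=0.2):
    """Show that opposite lock ordering leaves both workers stuck.

    Returns a dict mapping worker name to whether it got its second mutex.
    """
    mutex_a = threading.Lock()
    mutex_b = threading.Lock()
    holding_first = threading.Barrier(2)  # both workers hold one mutex
    both_tried = threading.Barrier(2)     # both workers tried the other
    got_second = {}

    def worker(name, first, second):
        with first:
            holding_first.wait()
            # A plain second.acquire() would block forever here; the
            # timeout exists only so the demonstration can finish.
            acquired = second.acquire(timeout=timeout)
            got_second[name] = acquired
            if acquired:
                second.release()
            # Keep holding `first` until the other worker has given up too.
            both_tried.wait()

    threads = [
        threading.Thread(target=worker, args=("t1", mutex_a, mutex_b)),
        threading.Thread(target=worker, args=("t2", mutex_b, mutex_a)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return got_second


result = demo_lock_order_deadlock()
print(result)
```

Neither worker obtains its second mutex, so both values in `result` are False. Without the timeout, both would wait forever. The usual fix is to acquire the mutexes in the same order everywhere.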

Crossfade answered 10/11, 2014 at 16:50 Comment(0)
