How to debug a futex contention shown in strace?
Asked Answered
U

1

16

I'm debugging an issue in a multi-threaded linux process, where a certain thread appears to not execute for few seconds. Looking at strace output revealed it waits for futex e.g.
1673109 14:36:28.600329 futex(0x44b8d20, FUTEX_WAIT_PRIVATE,
1673109 14:36:33.221850 <... futex resumed> ) = 0 <4.621514>

How can I find out out what this futex(0x44b8d20) refers to in user space, i.e. how to map this to a locking construct which is using futex internally.

Undercoat answered 27/7, 2016 at 22:5 Comment(0)
N
13

I would use a simple systemtap script so that would help you to quickly find out addresses of contended futex locks. When I say address, I am referring to the first argument of futex() syscall.

  1. You can download the simple system tap script that finds the contended user space locks here:
    https://sourceware.org/systemtap/examples/process/futexes.stp

If you have systemtap installed on your system,
just start this system tap script: stap futexes.stp

  1. Capture the strace output as you did before.

  2. If you end the system tap script execution by simply doing Ctrl-C,
    you will get the output of contended futexes.

  3. Finally in your strace output,
    search for futex calls in which the second argument (operation type) is FUTEX_WAIT.
    For example : futex(0x7f58a31999d0, FUTEX_WAIT, 4508, NULL) = 0

  4. Then you can search for the first argument in system tap script output.
    Something like : ome[4489] lock 0x7f58a31999d0 contended 1 times, 7807 avg us

If you look at this system tap script,
it nicely prints the process name and process/thread id for you.
This makes it easy to find what you are looking for.

However one note is that executing the systemtap script will actually hook a syscall system wide.


Script for reference if link goes unavailable:

#! /usr/bin/env stap

# This script tries to identify contended user-space locks by hooking
# into the futex system call.

global FUTEX_WAIT = 0 /*, FUTEX_WAKE = 1 */
global FUTEX_PRIVATE_FLAG = 128 /* linux 2.6.22+ */
global FUTEX_CLOCK_REALTIME = 256 /* linux 2.6.29+ */

global lock_waits # long-lived stats on (tid,lock) blockage elapsed time
global process_names # long-lived pid-to-execname mapping

global entry_times%, uaddrs%

probe syscall.futex {
  if ((op & ~(FUTEX_PRIVATE_FLAG|FUTEX_CLOCK_REALTIME)) != FUTEX_WAIT) next
  entry_times[tid()] = gettimeofday_us()
  uaddrs[tid()] = futex_uaddr
}

probe syscall.futex.return {
  if (!(entry_times[tid()])) next
  elapsed = gettimeofday_us() - entry_times[tid()]
  lock_waits[pid(), uaddrs[tid()]] <<< elapsed
  delete entry_times[tid()]
  delete uaddrs[tid()]
  if (!(pid() in process_names))
    process_names[pid()] = execname()
}

probe end {
  foreach ([pid+, lock] in lock_waits) 
    printf ("%s[%d] lock %p contended %d times, %d avg us\n",
            process_names[pid], pid, lock, @count(lock_waits[pid,lock]),
            @avg(lock_waits[pid,lock]))
}
Nicolasanicolau answered 28/7, 2016 at 23:5 Comment(3)
This is really smart, but unfortunately I'm using OS X. Thanks to your post I've now started investigating using DTrace with Ruby probes though. That's definitely an avenue I hadn't even thought of exploring before.Scoter
great answer! Sadly i seemed to have an issue with it as authored on ubuntu lts 14.04.05 sudo ./futexes.stp semantic error: while resolving probe point: identifier 'syscall' at ./futexes.stp:15:7 source: probe >syscall.futex { semantic error: no match Pass 2: analysis failed. [man error::pass2] Number of similar error messages suppressed: 1. Rerun with -v to see them. Minim
@Minim You need to run futexes.stp with the systemtap utility. On Ubuntu this can be installed with apt-get install systemtap and then executed (as root) stap futexes.stp. More info here: sourceware.org/systemtap/SystemTap_Beginners_Guide/…Manxman

© 2022 - 2024 — McMap. All rights reserved.