What instructions does qemu trace?

I wrote the following piece of code that steps through /bin/ls and counts its instructions:

#include <stdio.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <sys/user.h>
#include <sys/reg.h>    
#include <sys/syscall.h>

int main()
{   
    pid_t child;
    child = fork(); //create child
    
    if(child == 0) {
        ptrace(PTRACE_TRACEME, 0, NULL, NULL);
        char* child_argv[] = {"/bin/ls", NULL};
        execv("/bin/ls", child_argv);
    }
    else {
        int status;
        long long ins_count = 0;
        while(1)
        {
            //stop tracing if child terminated successfully
            wait(&status);
            if(WIFEXITED(status))
                break;

            ins_count++;
            ptrace(PTRACE_SINGLESTEP, child, NULL, NULL);
        }

        printf("\n%lld Instructions executed.\n", ins_count);
    }
    
    return 0;
}

Running this code gives me about 500,000 instructions executed. As far as I know, most of these instructions should come from the dynamic linker. When I trace /bin/ls with QEMU using qemu-x86_64 -singlestep -D log -d in_asm /bin/ls, I get about 17,000 instructions. What do I have to adjust so that I start and stop counting at the same points QEMU does (i.e. so that I count the same instructions)?

I also traced a minimal program that does nothing but return from main() with QEMU, and it resulted in 7,840 instructions while my code gave me 109,025, so QEMU seems to trace more than just main() but fewer instructions than my code.
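
For reference, the minimal test program is essentially an empty main() along these lines (a sketch; only its behaviour matters here):

/* Minimal test program: main() does nothing and returns immediately.
 * Even this runs thousands of instructions in the dynamic linker and
 * libc startup/teardown code around main(). */
int main(void)
{
    return 0;
}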

My goal is to later compare the traced instructions, which is why I want to step through the same ones QEMU does.

Ivanovo answered 15/11, 2020 at 17:1 Comment(1)
Looks like ptrace counts cycles in the kernel while qemu doesn't. 17,000 cycles is a bit too few to do all the kernel stuff anyway.Mascot

QEMU's "in_asm" logging is not a log of executed instructions. It logs every time an instruction is translated (i.e. when QEMU generates a bit of host code corresponding to it). That translation is then cached, and if the guest loops around and executes the same instruction again QEMU will simply re-use the same translation, so it won't be logged by in_asm. "in_asm reports many fewer instructions" is therefore expected.
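
As an illustration (my example, not part of the original answer): in a guest program like the one below, -d in_asm logs the loop body once, even though the guest executes it a million times.

/* Hypothetical guest program: the loop body is translated once, and the
 * cached translation is re-executed on every iteration, so it appears
 * only once in the in_asm log. */
int main(void)
{
    volatile long sum = 0;
    for (long i = 0; i < 1000000; i++) {
        sum += i;
    }
    return 0;
}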

Logging every executed instruction via the -d options is a bit tricky -- you need to look at the 'cpu' and 'exec' traces, to use the 'nochain' suboption of -d to disable a QEMU optimisation that would otherwise result in some blocks not being logged, to use '-singlestep' to force one instruction per block, and also to account for a few corner cases where we print an execution trace and then don't actually execute the instruction. This is because the -d option is not intended as a way for users to introspect the behaviour of their program -- it is a debug option intended to allow debugging of what QEMU and the guest program are doing together, and so it prints information that requires a little understanding of QEMU internals to interpret correctly.
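
Putting those pieces together, an invocation along the lines of qemu-x86_64 -singlestep -d exec,nochain -D log /bin/ls (adding cpu to the -d list if you also want register state) should log one exec line per executed instruction, modulo the corner cases mentioned above. This is an illustrative combination of the options named here; check qemu-x86_64 -d help for the exact names supported by your QEMU version.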

You might find it simpler to write a QEMU "plugin" instead: https://qemu.readthedocs.io/en/latest/devel/tcg-plugins.html -- this API is designed to make it fairly straightforward to write instrumentation like "count instructions executed". If you're lucky then one of the sample plugins might even be sufficient for your purposes.
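
A rough sketch of what such a plugin can look like (untested; it follows the plugin API described at the link above, and the insn sample plugin shipped with QEMU does essentially the same job -- check the exact function signatures against your QEMU version):

/* Minimal instruction-counting TCG plugin sketch.
 * Build as a shared object against qemu-plugin.h from the QEMU source tree,
 * then run with: qemu-x86_64 -plugin ./libcount.so /bin/ls
 * (requires a QEMU built with plugin support).
 * Note: the counter is not atomic, which is fine for a single-threaded
 * user-mode guest like /bin/ls. */
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <qemu-plugin.h>

QEMU_PLUGIN_EXPORT int qemu_plugin_version = QEMU_PLUGIN_VERSION;

static uint64_t insn_count;

/* Called once for every executed instruction. */
static void insn_exec(unsigned int vcpu_index, void *userdata)
{
    insn_count++;
}

/* Called whenever a translation block is translated: register an
 * execution callback on each instruction in the block. */
static void tb_trans(qemu_plugin_id_t id, struct qemu_plugin_tb *tb)
{
    size_t n = qemu_plugin_tb_n_insns(tb);
    for (size_t i = 0; i < n; i++) {
        struct qemu_plugin_insn *insn = qemu_plugin_tb_get_insn(tb, i);
        qemu_plugin_register_vcpu_insn_exec_cb(insn, insn_exec,
                                               QEMU_PLUGIN_CB_NO_REGS, NULL);
    }
}

static void plugin_exit(qemu_plugin_id_t id, void *userdata)
{
    fprintf(stderr, "%" PRIu64 " instructions executed\n", insn_count);
}

QEMU_PLUGIN_EXPORT int qemu_plugin_install(qemu_plugin_id_t id,
                                           const qemu_info_t *info,
                                           int argc, char **argv)
{
    qemu_plugin_register_vcpu_tb_trans_cb(id, tb_trans);
    qemu_plugin_register_atexit_cb(id, plugin_exit, NULL);
    return 0;
}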

Gui answered 17/11, 2020 at 13:28 Comment(5)
Are you familiar with the QEMU source? I just looked into the source and found the function cpu_exec_nocache, which executes code without caching. Wouldn't that solve my problem of reused instructions? Do you know where I would need to replace the normal execution function with the nocache function?Ivanovo
That's part of QEMU's internals -- it gets called for certain odd corner cases where it is necessary to execute a bit of code without caching the TB. There is no support for running code with no TB caching at all.Gui
Does exec_tb from trace-events report every instruction?Brushwood
It reports every TB executed from C code, which is not quite the same thing unless you're using -singlestep to make sure every TB has exactly one instruction and using nochain to stop TBs being chained into each other. You'll also see an exec log for TBs that are abandoned without executing any instruction. And there's a weird special case for the M-profile SG instruction. But if you do all those things then it's kinda close enough.Gui
In other words, it's basically the same info as the -d exec logging.Gui

I modified your program to make it run on a dedicated CPU core (e.g. number 7) adding the following code before the fork():

#define _GNU_SOURCE
#include <sched.h>
#include <errno.h>
[...]
  cpu_set_t set;
  int rc;

  CPU_ZERO(&set);
  CPU_SET(7, &set);

  // Migrate the calling process on the target cpu
  rc = sched_setaffinity(0, sizeof(cpu_set_t), &set);
  if (0 != rc) {
    fprintf(stderr, "sched_setaffinity(): '%m' (%d)\n", errno);
    return -1;
  }

  // Dummy system call to trigger the migration. Actually, the online
  // manual says that the previous call will make the current process
  // migrate, but I saw in cpuid's source code that its author calls sleep(0)
  // to make sure that the migration is done. In my opinion, it may
  // be safer to call sched_yield()
  rc = sched_yield();
  if (0 != rc) {
    fprintf(stderr, "sched_yield(): '%m' (%d)\n", errno);
    return -1;
  }

  // Create child
  child = fork();
[...]

My PC is running Ubuntu/Linux 5.4.0 on:

# Intel(R) Core(TM) i7-3770K CPU @ 3.50GHz
# Code name     : Ivy Bridge
# cpu family    : 6
# model     : 58
# microcode : 0x21
# Number of physical cores: 4
# Number of hardware threads: 8
# Base frequency: 3.50 GHz
# Turbo frequency: 3.90 GHz
# cpu MHz: 1604.615
# cache size    : 8192 KB
# cache_alignment: 64
# Address sizes: 36 bits physical, 48 bits virtual
#
# PMU version: 3
# Maximum number of fixed counters: 3
# Fixed counter bit width: 48
# Maximum number of programmable counters: 4
# Programmable counter bit width: 48

If I launch the modified program with ptrace() activated, I get almost the same numbers as yours:

$ test/progexec
[...]
548765 Instructions executed.

I designed a tool which reads the Intel PMU counters. The fixed counter#0 is:

# INST_RETIRED.ANY
#
# Number of instructions that retire execution. For instructions that consist of multiple
# uops, this event counts the retirement of the last uop of the instruction. The counter
# continues counting during hardware interrupts, traps, and inside interrupt handlers.
#

Reading the above counter on CPU core #7, where the program runs, gives the following results:

  • 1871879 Instructions for User + Kernel space execution (rings 0-3)
  • 546874 Instructions for User space execution (ring 3)
  • 1324451 Instructions for Kernel space execution (ring 0)

So, according to the above numbers, the program with ptrace(PTRACE_SINGLESTEP) counts the instructions executed while the program is running in user space (Intel protection ring 3).

N.B.: Linux uses ring 0 for kernel space and ring 3 for user space.
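
A comparable user-space-only count can also be obtained programmatically through the perf_event_open() interface. The following is a sketch, not the PMU tool mentioned above: it counts PERF_COUNT_HW_INSTRUCTIONS for a child running /bin/ls with exclude_kernel set, and keeps error handling minimal:

/* Sketch: count retired user-space instructions of a child running /bin/ls
 * using perf_event_open(). Illustrative only. */
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    int pipefd[2];
    pipe(pipefd);                 /* used to delay the child's exec */

    pid_t child = fork();
    if (child == 0) {
        char buf;
        close(pipefd[1]);
        read(pipefd[0], &buf, 1); /* wait until the counter is set up */
        execl("/bin/ls", "ls", (char *)NULL);
        _exit(127);
    }

    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_INSTRUCTIONS; /* INST_RETIRED.ANY on Intel */
    attr.disabled = 1;
    attr.enable_on_exec = 1;      /* start counting exactly at exec */
    attr.exclude_kernel = 1;      /* user space (ring 3) only */
    attr.exclude_hv = 1;

    int fd = syscall(SYS_perf_event_open, &attr, child, -1, -1, 0);
    if (fd < 0) {
        perror("perf_event_open");
        return 1;
    }

    close(pipefd[0]);
    write(pipefd[1], "x", 1);     /* let the child exec */
    close(pipefd[1]);

    waitpid(child, NULL, 0);

    uint64_t count = 0;
    read(fd, &count, sizeof(count));
    printf("%llu user-space instructions\n", (unsigned long long)count);
    return 0;
}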

Housebound answered 16/11, 2020 at 12:11 Comment(1)
perf stat -e task-clock,cycles:u,instructions:u will record the HW perf counter for you, programming it to count only in user-space. (:k I think for kernel only, and no modifier for both). As in Why do these instruction counts of ls differ so much? (ptrace vs perf vs qemu). You can also use perf stat --all-user ./a.out or --all-kernel in more modern perf versions.Comorin
