How to Configure and Sample Intel Performance Counters In-Process

In a nutshell, I'm trying to achieve the following inside a userland benchmark process (pseudo-code, assuming x86_64 and a UNIX system):

results[] = ...
for (iteration = 0; iteration < num_iterations; iteration++) {
    pctr_start = sample_pctr();
    the_benchmark();
    pctr_stop = sample_pctr();
    results[iteration] = pctr_stop - pctr_start;
}

FWIW, the performance counter I am thinking of using is CPU_CLK_UNHALTED.THREAD_ALL, to read the number of core cycles independent of clock frequency changes. (In an earlier question I had been planning to use the TSC register for this, but alas, that is not what this register measures at all.)

My initial intention was to use inline assembler to first configure a counter using WRMSR, then to read the counter using RDPMC inside sample_pctr().

I stumbled at the first hurdle, as writing MSRs requires kernel privileges. It seems like you can in fact read the counters from user space (if configured correctly), but the act of configuring the counter (with an MSR) needs to be undertaken by the kernel.
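For reference, the read side could look roughly like the sketch below. It assumes GCC or Clang inline assembly on x86_64, and it assumes the kernel has already programmed the counter and enabled user-mode RDPMC; if not, the instruction faults. The helper names (rdpmc_fixed_index, rdpmc) are made up for illustration, and the choice of fixed-function counter 1 (unhalted core cycles on Intel) is an assumption, not something I have verified against CPU_CLK_UNHALTED.THREAD_ALL specifically.

```c
#include <stdint.h>

/* On Intel CPUs, setting bit 30 of ECX selects the fixed-function counters. */
#define RDPMC_FIXED (1u << 30)

/* Hypothetical helper: build the ECX value for fixed-function counter n. */
static inline uint32_t rdpmc_fixed_index(uint32_t n)
{
    return RDPMC_FIXED | n;
}

/* Execute RDPMC: counter index goes in ECX, result comes back in EDX:EAX.
 * Faults unless the kernel has enabled user-mode RDPMC. */
static inline uint64_t rdpmc(uint32_t counter)
{
    uint32_t lo, hi;
    __asm__ volatile("rdpmc" : "=a"(lo), "=d"(hi) : "c"(counter));
    return ((uint64_t)hi << 32) | lo;
}

/* The sample_pctr() from the pseudo-code, assuming fixed counter 1. */
static inline uint64_t sample_pctr(void)
{
    return rdpmc(rdpmc_fixed_index(1));
}
```

Note that RDPMC is not a serializing instruction, so the CPU may reorder it relative to the code being measured; whether that matters depends on how short the_benchmark() is.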

Does anyone know a lightweight way to ask the kernel to configure the performance counters from user space so that I can then use RDPMC from within my benchmark harness?

Stuff I've looked into/thought about:

  • Perf tools for Linux. Seems to be geared up for sampling over the whole lifetime of a process, not within a process at specific points (before and after each iteration).
  • Use perf syscalls directly (i.e. perf_event_open). Looks like the counter value will only update periodically (using a sample rate) or after the counter exceeds a threshold. I need the counter value precisely at the moment I ask. This is why RDPMC seemed so attractive. I imagine that sampling frequently will itself skew the performance counter readings.
  • PAPI builds on perf, so probably inherits the above problem.
  • Write a kernel module -- too much effort, too error prone.
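On the perf_event_open option: as far as I can tell, an event opened with no sample period runs in plain counting mode, and read() on its file descriptor returns the current count whenever it is called. The sketch below is a minimal illustration of that, assuming Linux; open_counter and read_counter are hypothetical helper names, and the call may fail depending on perf_event_paranoid or sandboxing.

```c
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Open a counting (non-sampling) event for the calling thread on any CPU.
 * Returns a file descriptor, or -1 on failure. */
static int open_counter(uint32_t type, uint64_t config)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof attr);
    attr.size = sizeof attr;
    attr.type = type;
    attr.config = config;
    attr.exclude_kernel = 1;  /* count user-space only */
    attr.exclude_hv = 1;
    /* sample_period stays 0, so the event just counts. */
    return (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
}

/* Read the current count; with read_format == 0 this is a single u64.
 * Returns 0 on success, -1 on failure. */
static int read_counter(int fd, uint64_t *value)
{
    return read(fd, value, sizeof *value) == (ssize_t)sizeof *value ? 0 : -1;
}
```

For core cycles one would presumably call open_counter(PERF_TYPE_HARDWARE, PERF_COUNT_HW_CPU_CYCLES) once, then read_counter() before and after each iteration. Each read() is a system call, so its cost lands inside the measured interval; the perf mmap page also advertises whether RDPMC can be used directly (cap_user_rdpmc), which would avoid that.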

Ideally I would like a solution which works on OpenBSD and Linux, but somehow I think that is a tall order. Perhaps just for Linux for now.

Any help is most appreciated. Thanks.

EDIT: I just found the Linux msr device node, which would probably suffice. I'll leave the question up in case a better answer shows up.

Remonstrance answered 18/8, 2016 at 15:6 Comment(3)
You can program counters from user space, but you probably want to pin your threads to cores because PMCs aren't saved/restored on context switches. See agner.org/optimize for an already-written kernel module for Linux that gives you PMC access, and also stackoverflow.com/questions/38848914/… for some discussion of using them. - Pericranium
Thanks! Can you comment on how much overhead perf's sampling model would impose? Would those perf routines themselves be included in any readings I take? - Remonstrance
No idea; I just use perf by putting a loop I want to microbenchmark into its own stand-alone program and using perf stat. - Pericranium

It seems the best way -- for Linux at least -- is to use the msr device node.

You open the device node for the CPU of interest (/dev/cpu/N/msr, provided by the msr driver), seek to the address of the MSR required, and read or write 8 bytes.
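A minimal sketch of that, assuming Linux with the msr driver loaded and sufficient privileges (normally root). It uses pread/pwrite so the seek and the transfer happen in one call. The function names are hypothetical, and which MSR addresses and values to write for a given counter has to come from the Intel manuals.

```c
#define _XOPEN_SOURCE 700
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Open the msr node for one logical CPU, e.g. /dev/cpu/0/msr.
 * Returns a file descriptor, or -1 on failure. */
static int msr_open(int cpu)
{
    char path[64];
    snprintf(path, sizeof path, "/dev/cpu/%d/msr", cpu);
    return open(path, O_RDWR);
}

/* Read the 8-byte MSR at address reg; the file offset is the MSR address.
 * Returns 0 on success, -1 on failure. */
static int msr_read(int fd, uint32_t reg, uint64_t *value)
{
    return pread(fd, value, sizeof *value, (off_t)reg)
               == (ssize_t)sizeof *value ? 0 : -1;
}

/* Write an 8-byte value to the MSR at address reg.
 * Returns 0 on success, -1 on failure. */
static int msr_write(int fd, uint32_t reg, uint64_t value)
{
    return pwrite(fd, &value, sizeof value, (off_t)reg)
               == (ssize_t)sizeof value ? 0 : -1;
}
```

Since each node refers to one logical CPU, the benchmark thread should be pinned to the same CPU whose node was written, otherwise the counter being read may not be the one that was configured.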

OpenBSD is harder, since (at the time of writing) there is no user-space proxy to the MSRs. So you would need to write a kernel module or implement a sysctl by hand.

Remonstrance answered 19/8, 2016 at 11:21 Comment(1)
Instead of lseek/write, just use pwrite to write at a specified offset, as described in stackoverflow.com/questions/38848914/… - Pericranium
