rdpmc: surprising behavior

Asked 17/5, 2019 at 19:43 Answered 18/5, 2019 at 0:30

Solved performance assembly x86 performancecounter intel-pmu

I'm trying to understand the rdpmc instruction. As such I have the following asm code:

segment .text
global _start

_start:
    xor eax, eax
    mov ebx, 10
.loop:
    dec ebx
    jnz .loop

    mov ecx, 1<<30
    ; calling rdpmc with ecx = (1<<30) gives number of retired instructions
    rdpmc
    ; but only if you do a bizarre incantation: (Why u do dis Intel?)
    shl rdx, 32
    or  rax, rdx

    mov rdi, rax ; return number of instructions retired.
    mov eax, 60
    syscall

(The implementation is a translation of rdpmc_instructions().) I count that this code should execute 2*ebx+3 instructions before hitting the rdpmc instruction, so I expect (in this case) that I should get a return status of 23.

If I run perf stat -e instructions:u ./a.out on this binary, perf tells me that I've executed 30 instructions, which looks about right. But if I execute the binary directly, I get a return status of 58 or 0; the result is not deterministic.

What have I done wrong here?

Casebook answered 17/5, 2019 at 19:43 Comment(15)

You can't get a return status of 306 because only least significant 8 bits of the exit value are returned to the parent process. – Archaeornis 17/5, 2019 at 20:3

Have you tried counting a delta between entry to _start vs. at the end? Have you tried increasing the iteration count to see if the result varies with instructions executed at all? – Circumnutate 17/5, 2019 at 20:4

code review: a better translation of for(i=0 ; i<1000; i++) is a mov-immediate to register with the loop counter. Or cmp eax, 1000. Using a dq 100 is just clutter; inline small read-only constants. (Use equ if you still want the definition ahead of code). The correct translation of 1<<30 is mov ecx, 1<<30, not a runtime shift. A more efficient loop structure is dec ebx / jnz .loop. rdpmc writes EAX and EDX, implicitly zero-extending into RAX and RDX, you don't need to zero them first. Also, you might as well ignore RDX unless it's possible for the count to be > 2^32. – Circumnutate 17/5, 2019 at 20:10

Also don't forget to use default rel so [a] uses a RIP-relative addressing mode. (Unless you're trying to experiment with the difference between rel and abs addressing modes). – Circumnutate 17/5, 2019 at 20:12

Also, if you don't do anything special to reset the performance counter before your program runs, there's no reason to expect it to start counting from zero. That's the point of using perf. But you could take a delta. – Circumnutate 17/5, 2019 at 20:23

@RossRidge: Edited to make sure that the number of instruction is less than 256. – Casebook 17/5, 2019 at 20:24

You forgot to update your text, so now it doesn't match the code. Like I just commented, try taking a delta because the counter probably starts at some arbitrary 64-bit value. – Circumnutate 17/5, 2019 at 20:25

@PeterCordes: Thanks! It's hard to find people who have good taste in assembly, so this was very helpful. I've tried commenting out the loop; I sometimes get 58, sometimes 0. The result is not deterministic. – Casebook 17/5, 2019 at 20:25

@PeterCordes: Tried taking the delta, now I'm getting zero identically every time. – Casebook 17/5, 2019 at 20:28

Then probably the counter isn't enabled. Try running your program under perf, so it's profiling itself as well as being profiled by perf. That should get perf to have the fixed counters enabled. – Circumnutate 17/5, 2019 at 20:37

@PeterCordes: When I run it under perf, I get 27 instructions, deterministically, which is about right. – Casebook 17/5, 2019 at 20:44

Cool, that confirms my guess :) The fixed counters are only counting when enabled. – Circumnutate 17/5, 2019 at 20:47

@PeterCordes: So ostensibly there's some CPU flag that needs to be set to get the counters to operate? – Casebook 17/5, 2019 at 21:15

@Casebook The IA32_PERF_GLOBAL_CTRL and IA32_FIXED_CTR_CTRL MSRs have to be modified (see Chapter 18 in Volume 3 of Intel's "Software Developer’s Manual"). – Beleaguer 17/5, 2019 at 22:52

@AndreasAbel: Could you edit Peter's answer to set the correct bits in that register so we can have an authoritative answer to this question? I think it is of general interest. – Casebook 17/5, 2019 at 23:55

The fixed counters don't count all the time, only when software has enabled them. Normally (the kernel side of) perf does this, along with resetting them to zero before starting a program.

The fixed counters (like the programmable counters) have bits that control whether they count in user, kernel, or user+kernel (i.e. always). I assume Linux's perf kernel code leaves them set to count neither when nothing is using them.

If you want to use raw RDPMC yourself, you need to either program / enable the counters (by setting the corresponding bits in the IA32_PERF_GLOBAL_CTRL and IA32_FIXED_CTR_CTRL MSRs), or get perf to do it for you by still running your program under perf. e.g. perf stat ./a.out

If you use perf stat -e instructions:u ./perf ; echo $?, the fixed counter will actually be zeroed before entering your code so you get consistent results from using rdpmc once. Otherwise, e.g. with the default -e instructions (not :u) you don't know the initial value of the counter. You can fix that by taking a delta, reading the counter once at start, then once after your loop.

The exit status is only 8 bits wide, so this little hack to avoid printf or write() only works for very small counts.
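To make the truncation concrete, here is a minimal shell sketch. The function name exit_status_for is made up for illustration, and it only models the masking with arithmetic instead of actually exiting with that value.

```shell
#!/bin/sh
# Hypothetical helper: model what the parent process sees when a count
# is passed to exit/exit_group. Only the low 8 bits survive.
exit_status_for() {
    echo $(( $1 & 0xff ))
}

exit_status_for 23     # small counts pass through unchanged
exit_status_for 306    # 306 = 0x132, so only 0x32 (decimal 50) is left
```

This is why the earlier version of the question could never have returned 306.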

It also means it's pointless to construct the full 64-bit rdpmc result: the high 32 bits of the inputs don't affect the low 8 bits of a sub result because carry propagates only from low to high. In general, unless you expect counts > 2^32, just use the EAX result. Even if the raw 64-bit counter wrapped around during the interval you measured, your subtraction result will still be a correct small integer in a 32-bit register.
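The wraparound argument can be checked with a bit of shell arithmetic. This is only a sketch: delta32 is a hypothetical helper, and the two counter readings are invented so that the low 32 bits wrap between them.

```shell
#!/bin/sh
# Hypothetical helper: subtract two readings the way a 32-bit register
# would, by keeping only the low 32 bits of (end - start).
delta32() {
    echo $(( ($2 - $1) & 0xffffffff ))
}

# Made-up 64-bit readings; the counter crosses a 2^32 boundary between them.
start=$(( 0x2fffffff0 ))
end=$((   0x300000007 ))

# Feed in only what EAX would hold after each rdpmc (the low 32 bits).
delta32 $(( start & 0xffffffff )) $(( end & 0xffffffff ))
```

Even though the low half of the end reading is numerically smaller than the low half of the start reading, the masked difference is still the true number of events.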

Simplified even more than in your question. Also note indenting the operands so they can stay at a consistent column even for mnemonics longer than 3 letters.

segment .text
global _start

_start:
    mov   ecx, 1<<30      ; fixed counter: instructions
    rdpmc
    mov   edi, eax        ; start

    mov   edx, 10
.loop:
    dec   edx
    jnz   .loop

    rdpmc                 ; ecx = same counter as before

    sub   eax, edi        ; end - start

    mov   edi, eax
    mov   eax, 231
    syscall               ; sys_exit_group(rdpmc).  sys_exit isn't wrong, but glibc uses exit_group.

Running this under perf stat ./a.out or perf stat -e instructions:u ./a.out, we always get 23 from echo $? (instructions:u shows 30, which is 1 more than the actual number of instructions this program runs, including syscall).

23 instructions is exactly the number of instructions strictly after the first rdpmc, but including the 2nd rdpmc.
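The counts can be reproduced by hand from the listing. The sketch below just does that arithmetic; the function names are hypothetical, and it assumes (as observed above) that the second rdpmc is included in its own reading.

```shell
#!/bin/sh
# Instructions counted between the two reads:
#   mov edi,eax + mov edx,10, then (dec + jnz) per iteration,
#   plus the second rdpmc itself.
measured_count() {
    echo $(( 2 + 2 * $1 + 1 ))
}

# Whole program: mov ecx + first rdpmc, the measured part,
# then sub, mov, mov, syscall.
total_count() {
    echo $(( 2 + $(measured_count "$1") + 4 ))
}

measured_count 10    # matches the exit status
total_count 10       # one less than what perf reports for instructions:u
```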

If we comment out the first rdpmc and run it under perf stat -e instructions:u, we consistently get 26 as the exit status, and 29 from perf. rdpmc is the 24th instruction to be executed. (And RAX starts out initialized to zero because this is a Linux static executable, so the dynamic linker didn't run before _start). I wonder if the sysret in the kernel gets counted as a "user" instruction.

But with the first rdpmc commented out, running under perf stat -e instructions (not :u) gives arbitrary values as the starting value of the counter isn't fixed. So we're just taking (some arbitrary starting point + 26) mod 256 as the exit status.

But note that RDPMC is not a serializing instruction, and can execute out of order. In general you may need lfence, or (as John McCalpin suggests in the thread you linked) giving ECX a false dependency on the results of instructions you care about. e.g. and ecx, 0 / or ecx, 1<<30 works, because unlike xor-zeroing, and ecx,0 is not dependency-breaking.

Nothing weird happens in this program because the front-end is the only bottleneck, so all the instructions execute basically as soon as they're issued. Also, the rdpmc is right after the loop, so probably a branch mispredict of the loop-exit branch prevents it from being issued into the OoO back-end before the loop finishes.

PS for future readers: one way to enable user-space RDPMC on Linux without any custom modules beyond what perf requires is documented in perf_event_open(2):

echo 2 | sudo tee /sys/devices/cpu/rdpmc    # enable RDPMC always, not just when a perf event is open
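Before changing that file it can be worth checking what it currently holds. Here is a small sketch; rdpmc_mode is a made-up helper, and the meanings of 0, 1 and 2 are my reading of the perf_event_open(2) description, so double-check them against the man page for your kernel.

```shell
#!/bin/sh
# Hypothetical helper: translate the value stored in the rdpmc sysfs file.
rdpmc_mode() {
    case "$1" in
        0) echo "rdpmc disabled in user space" ;;
        1) echo "rdpmc allowed only while a perf event is mapped" ;;
        2) echo "rdpmc always allowed" ;;
        *) echo "unknown" ;;
    esac
}

f=/sys/devices/cpu/rdpmc
if [ -r "$f" ]; then
    rdpmc_mode "$(cat "$f")"
else
    echo "$f not readable on this system"
fi
```

Reading the file does not need root; only writing to it does.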

Circumnutate answered 17/5, 2019 at 21:36 Comment(14)

This instruction is somewhat strange in that it doesn't segfault when counters aren't enabled, it just. . . does the wrong thing. Also, I can't find anything in the Intel manual saying what needs to be done to get counters to run. – Casebook 17/5, 2019 at 22:0

Note that rdpmc is not a serializing instruction. To get reliable results, it has to be sandwiched between serializing instructions such as lfence. – Beleaguer 17/5, 2019 at 22:10

@AndreasAbel ah good point. This program doesn't include any bottlenecks other than the front-end, so instructions are all going to execute as quickly as their uops enter the out-of-order back end. And the branch miss on loop exit probably helps. One of John McCalpin's posts on the thread the OP linked includes the idea of giving ECX a false dependency on the result of code you want to measure. (e.g. and ecx,0 (not dep-breaking) / or ecx, 1<<30). – Circumnutate 17/5, 2019 at 22:21

@PeterCordes But this wouldn't prevent later instructions (in this example, e.g., mov eax, 60) from potentially being executed before rdpmc. – Beleaguer 17/5, 2019 at 22:34

@AndreasAbel: That's true, so you might still want lfence after rdpmc, even if you use that trick to avoid one before. But in this case we don't have to worry about later instructions: they can't retire before rdpmc executes, because retirement is in-order. The 1<<30 fixed counter counts inst_retired.any, IIRC. – Circumnutate 17/5, 2019 at 22:50

OK, but then there is, in general, no guarantee that the earlier instructions have retired by the time rdpmc is executed, so the trick with the false dependency on ECX doesn't seem to be correct. – Beleaguer 17/5, 2019 at 22:59

@AndreasAbel: That's true. It maybe makes sense for a counter like uops_executed.thread or dispatched, rather than a retirement event counter. Or a cycle counter, not instructions/uops. Or for being approximately in the right place for memory events. But if it's at the end of a long dependency chain, if all older instructions have to have executed for a result to be ready (given limited ROB size), then it could still be useful. OTOH, if you already know that there are few to no in-flight uops, then you might as well just use lfence unless you're counting cycles and don't want overhead – Circumnutate 17/5, 2019 at 23:23

@PeterCordes What if no IA32_PERFEVTSEL is programmed to count a specific perf event one set as an rdpmc operand. – Goldwin 30/4, 2020 at 21:33

@SomeName: I really don't know. I assume the event counter doesn't increment and you'll get the same output every time from rdpmc. Possibly always 0, IDK. – Circumnutate 1/5, 2020 at 0:19

@PeterCordes fixed counter returns crap for me (it's enabled and i tried 1 << 30 and I also tried 1 << 32 seeing as thats the position of the control bit in the control register -- i'm only interested in counter 0 (INST_RETIRED.ANY)). programmed counters work fine. UOPS_RETIRED.ALL works as expected, despite supposedly not being supported on kbl/skl – Whimsy 17/4, 2021 at 18:14

@SomeName do you mean if the PMC is disabled, if the PMC doesnt exist, or if the PMC hasn't been programmed yet, or if the PMC has been disabled then reenabled, or if the PMC is programmed with a non event? I can find out – Whimsy 17/4, 2021 at 18:22

@PeterCordes oops 1 << 32 wouldn't even fit in ecx anyway. To get INST_RETIRED.ANY on my KBL you have to program perfevtsel0 event 0 umask 1 and then rdpmc (1<<30). It returns 0 if you do rdpmc(0) – Whimsy 17/4, 2021 at 19:20

@PeterCordes if you're using perf_event_open(2) with multiple programmable PERF_TYPE_RAW counters (say LSD_UOPS and IDQ_DSB_UOPS) how do you figure out which programmable event corresponds with which index value you should set for ecx for a given rdpmc 'call'? Seeing some weird values using ecx=0 for the first opened and ecx=1 for the second. So far unable to find a guide. jevents has something close but it's using sampling rather than counting. – Confidant 21/7, 2021 at 1:28

@Noah: IDK, I've only used perf stat and perf record, not the syscall interface. – Circumnutate 21/7, 2021 at 1:34

The first step is to ensure that the performance counters you want to use are enabled in the IA32_PERF_GLOBAL_CTRL MSR register, whose layout is shown in Figure 18-8 of the Intel Manual Volume 3 (January 2019). You can easily do this by loading the MSR kernel module (sudo modprobe msr) and executing the following command:

sudo rdmsr -a 0x38F

The value 0x38F is the address of the IA32_PERF_GLOBAL_CTRL MSR register and the -a option specifies that the rdmsr instruction should be executed on all logical cores. By default, this should print 7000000ff (when HT is disabled) or 70000000f (when HT is enabled) for all logical cores. For the INST_RETIRED.ANY fixed-function performance counter, the bit at index 32 is the one that enables it, so it should be 1. The value 7000000ff indicates that all three of the fixed-function counters and all eight of the programmable counters are enabled.
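If you would rather not count bits by eye, the printed value can be decoded with shell arithmetic. This is a sketch with hypothetical helper names; it assumes the layout described above (programmable enables in the low bits, fixed-function enables starting at bit 32). Note that rdmsr prints the value without a 0x prefix, so add one before passing it in.

```shell
#!/bin/sh
# Is fixed-function counter $2 enabled in the value $1? Prints 1 or 0.
fixed_enabled() {
    echo $(( ($1 >> (32 + $2)) & 1 ))
}

# How many of the low 8 (programmable counter) enable bits are set in $1?
programmable_enabled() {
    v=$(( $1 & 0xff ))
    n=0
    while [ "$v" -ne 0 ]; do
        n=$(( n + (v & 1) ))
        v=$(( v >> 1 ))
    done
    echo "$n"
}

fixed_enabled 0x7000000ff 0          # counter 0 is INST_RETIRED.ANY
programmable_enabled 0x7000000ff     # value seen with HT disabled
programmable_enabled 0x70000000f     # value seen with HT enabled
```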

The IA32_PERF_GLOBAL_CTRL register has one enable bit for each performance counter per logical core. Each programmable performance counter also has its own dedicated control register, and there is a single control register for all of the fixed-function counters. In particular, the control register for the INST_RETIRED.ANY fixed-function performance counter is IA32_FIXED_CTR_CTRL, whose layout is shown in Figure 18-7 of the Intel Manual Volume 3. There are 12 defined bits in the register; the first 4 bits control the behavior of the first fixed-function counter, i.e., INST_RETIRED.ANY (the order is shown in Table 19-2). Before modifying the register, you should first check how it got initialized by the OS by executing:

sudo rdmsr -a 0x38D

It should print 0xb0, by default. This indicates that the second fixed-function counter (unhalted core cycles) is enabled and configured to count in both supervisor mode and user mode. To enable INST_RETIRED.ANY and configure it to count only user mode events while keeping the unhalted core cycles counter as is, execute the following command:

sudo wrmsr -a 0x38D 0xb2
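The value 0xb2 above can be derived rather than memorized. The following is a sketch with hypothetical helper names, assuming the 4-bit-per-counter layout from Figure 18-7: within each field, bit 0 enables counting in ring 0, bit 1 enables counting in user mode, bit 2 is AnyThread and bit 3 requests a PMI on overflow.

```shell
#!/bin/sh
# Print the 4-bit control field of fixed counter $2 in the value $1.
fixed_field() {
    echo $(( ($1 >> (4 * $2)) & 0xf ))
}

# Print the value $1 with counter $2's field replaced by $3, in hex.
# All other fields are left untouched.
set_fixed_field() {
    printf '%x\n' $(( ($1 & ~(0xf << (4 * $2))) | ($3 << (4 * $2)) ))
}

fixed_field 0xb0 1         # 11 = binary 1011: ring 0 + user + PMI
fixed_field 0xb0 0         # 0: INST_RETIRED.ANY is not counting
set_fixed_field 0xb0 0 2   # counter 0 set to user-only (field value 2)
```

Building the new value from what rdmsr actually reported, instead of hard-coding 0xb2, avoids clobbering the other fields if your system's default differs.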

Once this command is executed, the events are counted immediately. You can check this by reading the first fixed-function counter IA32_FIXED_CTR0 (see Table 19-2):

sudo rdmsr -a 0x309

You can execute that command multiple times and see how the counts on each core are changing. Unfortunately, this means that by the time your program is run, the current value in IA32_FIXED_CTR0 will be basically some random value. You can try to reset the counter by executing:

sudo wrmsr -a 0x309 0

But the fundamental problem remains; you cannot instantaneously reset the counter and run your program. As suggested in @Peter's answer, the right way to use any performance counter is to wrap the region of interest between rdpmc instructions and take the difference.

The MSR kernel module is very convenient because the only way to access MSR registers is in kernel mode. However, there is an alternative to wrapping the code between rdpmc instructions. You can write your own kernel module and place your code in the kernel module immediately after the instruction that enables the counter. You can even disable interrupts. Typically, this level of accuracy is not worth the effort.

You can use the -p option instead of -a to specify a particular logical core. However, you'll have to make sure that the program is run on the same core with taskset -c 3 ./a.out to run on core #3, for example.

Fluid answered 18/5, 2019 at 0:30 Comment(1)

I've run through these instructions, and they work! – Casebook 18/5, 2019 at 21:8
