The fixed counters don't count all the time, only when software has enabled them. Normally (the kernel side of) perf does this, along with resetting them to zero before starting a program.
The fixed counters (like the programmable counters) have bits that control whether they count in user, kernel, or user+kernel (i.e. always). I assume Linux's perf kernel code leaves them set to count neither when nothing is using them.
If you want to use raw RDPMC yourself, you need to either program / enable the counters (by setting the corresponding bits in the IA32_PERF_GLOBAL_CTRL and IA32_FIXED_CTR_CTRL MSRs), or get perf to do it for you by still running your program under perf, e.g. perf stat ./a.out
If you use perf stat -e instructions:u ./perf ; echo $?, the fixed counter will actually be zeroed before entering your code, so you get consistent results from using rdpmc once. Otherwise, e.g. with the default -e instructions (not :u), you don't know the initial value of the counter. You can fix that by taking a delta: read the counter once at the start, then once after your loop.
The exit status is only 8 bits wide, so this little hack to avoid printf or write() only works for very small counts.
It also means it's pointless to construct the full 64-bit rdpmc result: the high 32 bits of the inputs don't affect the low 8 bits of a sub result, because carry propagates only from low to high. In general, unless you expect counts > 2^32, just use the EAX result. Even if the raw 64-bit counter wrapped around during the interval you measured, your subtraction result will still be a correct small integer in a 32-bit register.
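For the less common case where a count could exceed 2^32, here is a minimal sketch of combining EDX:EAX into one 64-bit value and taking a 64-bit delta. It is a fragment, not a complete program, and it assumes user-space RDPMC is already enabled and that the measured code leaves RDI and ECX alone; using RDI for the start value is just an illustrative choice.

```nasm
    mov     ecx, 1<<30          ; fixed counter 0: instructions
    rdpmc                       ; EDX:EAX = counter; upper halves of RAX/RDX are zeroed
    shl     rdx, 32
    or      rax, rdx            ; RAX = full counter value
    mov     rdi, rax            ; start

    ; ... code being measured (assumed not to clobber RDI or ECX) ...

    rdpmc                       ; ECX still selects the same counter
    shl     rdx, 32
    or      rax, rdx
    sub     rax, rdi            ; RAX = end - start, as a 64-bit delta
```

For small counts this gives the same low bits as the plain sub eax, edi version, which is why the extra shift/or work is skipped in the program below.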
Simplified even more than in your question. Also note indenting the operands so they can stay at a consistent column even for mnemonics longer than 3 letters.
segment .text
global _start
_start:
    mov     ecx, 1<<30      ; fixed counter: instructions
    rdpmc
    mov     edi, eax        ; start

    mov     edx, 10
.loop:
    dec     edx
    jnz     .loop

    rdpmc                   ; ecx = same counter as before
    sub     eax, edi        ; end - start

    mov     edi, eax
    mov     eax, 231
    syscall                 ; sys_exit_group(rdpmc). sys_exit isn't wrong, but glibc uses exit_group.
Running this under perf stat ./a.out or perf stat -e instructions:u ./a.out, we always get 23 from echo $? (instructions:u shows 30, which is 1 more than the actual number of instructions this program runs, including syscall).
23 instructions is exactly the number of instructions strictly after the first rdpmc, but including the 2nd rdpmc.
If we comment out the first rdpmc and run it under perf stat -e instructions:u, we consistently get 26 as the exit status, and 29 from perf. rdpmc is the 24th instruction to be executed. (And RAX starts out initialized to zero because this is a Linux static executable, so the dynamic linker didn't run before _start). I wonder if the sysret in the kernel gets counted as a "user" instruction.
But with the first rdpmc commented out, running under perf stat -e instructions (not :u) gives arbitrary values, because the starting value of the counter isn't fixed. So we're just taking (some arbitrary starting point + 26) mod 256 as the exit status.
But note that RDPMC is not a serializing instruction, and can execute out of order. In general you may need lfence, or (as John McCalpin suggests in the thread you linked) to give ECX a false dependency on the results of instructions you care about. e.g. and ecx, 0 / or ecx, 1<<30 works, because unlike xor-zeroing, and ecx,0 is not dependency-breaking.
Nothing weird happens in this program because the front-end is the only bottleneck, so all the instructions execute basically as soon as they're issued. Also, the rdpmc is right after the loop, so probably a branch mispredict of the loop-exit branch prevents it from being issued into the OoO back-end before the loop finishes.
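As a rough sketch of the two ordering techniques mentioned above (lfence, or a false dependency through ECX), the fragments below show how a counter read could be pinned in place. They are not a complete program. They assume user-space RDPMC is enabled, and the second one assumes the measured code leaves its last result in EDX (as the dec / jnz loop does); that register choice is only for illustration.

```nasm
    ; Option 1: fence on both sides of the read.
    lfence                      ; wait until earlier instructions have completed
    mov     ecx, 1<<30          ; fixed counter 0: instructions
    rdpmc
    lfence                      ; later instructions can't start until rdpmc is done
    mov     edi, eax            ; start count

    ; Option 2: make rdpmc's ECX input depend on the measured code's result
    ; (assumed here to be in EDX).
    mov     ecx, edx            ; ECX now depends on the final EDX value
    and     ecx, 0              ; ECX = 0, but still dependent (not a zeroing idiom)
    or      ecx, 1<<30          ; select fixed counter 0 again
    rdpmc                       ; can't execute until that EDX value is ready
```

Both options add instructions of their own to the count, so the measured delta will be a few higher than with a bare rdpmc; subtract that fixed overhead if exact numbers matter.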
PS for future readers: one way to enable user-space RDPMC on Linux without any custom modules beyond what perf requires is documented in perf_event_open(2):
echo 2 | sudo tee /sys/devices/cpu/rdpmc # enable RDPMC always, not just when a perf event is open
_start vs. at the end? Have you tried increasing the iteration count to see if the result varies with instructions executed at all? – Circumnutate
for(i=0 ; i<1000; i++) is a mov-immediate to register with the loop counter. Or cmp eax, 1000. Using a dq 100 is just clutter; inline small read-only constants. (Use equ if you still want the definition ahead of code). The correct translation of 1<<30 is mov ecx, 1<<30, not a runtime shift. A more efficient loop structure is dec ebx / jnz .loop. rdpmc writes EAX and EDX, implicitly zero-extending into RAX and RDX, so you don't need to zero them first. Also, you might as well ignore RDX unless it's possible for the count to be > 2^32. – Circumnutate
default rel so [a] uses a RIP-relative addressing mode. (Unless you're trying to experiment with the difference between rel and abs addressing modes). – Circumnutate
perf. But you could take a delta. – Circumnutate
perf, so it's profiling itself as well as being profiled by perf. That should get perf to have the fixed counters enabled. – Circumnutate