Update 2: I think Brendan's answer is right. I should maybe delete this, but I think the ocperf.py suggestion is still useful for future readers. It might also explain extra TLB misses on CPUs without Process-Context Identifiers (PCID) when the kernel mitigates Meltdown.
Update: the guess below was wrong. New guess: mmap has to modify your process's page tables, so perhaps that alone causes some TLB invalidation. My recommendation to use ocperf.py record to find which asm instructions cause the TLB misses still stands. Even with optimization enabled, the code will still access the stack: each call to a glibc wrapper function stores a return address, and each ret loads it back.
Perhaps your kernel has kernel/user page-table isolation (KPTI) enabled to mitigate Meltdown, so on return from kernel to user, all TLB entries have been invalidated (by modifying CR3 to point to page tables that don't include the kernel mappings at all). Look for Kernel/User page tables isolation: enabled in your dmesg output. You can try booting with pti=off (or nopti) as a kernel option to disable it, if you don't mind being vulnerable to Meltdown while testing.
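If grepping dmesg is inconvenient (the boot messages may have rotated out of the ring buffer), the same information can usually be read programmatically. Below is a minimal sketch, assuming a Linux kernel new enough to expose /sys/devices/system/cpu/vulnerabilities/meltdown and an x86 /proc/cpuinfo with a "flags" line. The function names kpti_enabled and cpu_has_pcid are made up for illustration.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: returns 1 if the kernel reports PTI as its Meltdown
 * mitigation, 0 if it reports anything else, and -1 if the sysfs file can't
 * be read (older kernel, non-Linux system, restricted container). */
int kpti_enabled(void)
{
    char line[256];
    FILE *f = fopen("/sys/devices/system/cpu/vulnerabilities/meltdown", "r");
    if (!f)
        return -1;
    if (!fgets(line, sizeof line, f)) {
        fclose(f);
        return -1;
    }
    fclose(f);
    /* Typical contents: "Mitigation: PTI", "Vulnerable", "Not affected". */
    return strstr(line, "PTI") != NULL;
}

/* Hypothetical helper: returns 1 if a "flags" line in /proc/cpuinfo lists
 * pcid, 0 if none does, and -1 if the file can't be opened. */
int cpu_has_pcid(void)
{
    char line[8192];
    int found = 0;
    FILE *f = fopen("/proc/cpuinfo", "r");
    if (!f)
        return -1;
    while (!found && fgets(line, sizeof line, f)) {
        if (strncmp(line, "flags", 5) != 0)
            continue;
        /* Split the flags line on whitespace and look for an exact match. */
        for (char *tok = strtok(line, " \t\n"); tok; tok = strtok(NULL, " \t\n")) {
            if (strcmp(tok, "pcid") == 0) {
                found = 1;
                break;
            }
        }
    }
    fclose(f);
    return found;
}
```

If kpti_enabled() returns 1 and cpu_has_pcid() returns 0, the "every kernel exit flushes the TLB" explanation is at least plausible on that machine.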
Because you're using C, you're making the mmap and munmap system calls through their glibc wrappers, not with inline syscall instructions directly. The ret instruction in that wrapper needs to load the return address from the stack, and that load misses in the TLB.
The extra store misses probably come from call instructions pushing a return address, although I'm not sure that's right, because the current stack page should already be in the TLB from the ret from the previous system call.
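To make the call/ret pattern concrete, here is a minimal sketch of the kind of loop being discussed. The question's actual test program isn't reproduced here, so this is an assumed stand-in: the function name mmap_munmap_loop and its parameters are hypothetical, and it assumes Linux with glibc. Each iteration makes two calls into glibc wrappers (each call stores a return address on the stack) and returns from both (each ret loads it back).

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* for MAP_ANONYMOUS / MAP_POPULATE under strict -std= modes */
#endif
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical stand-in for the test loop: map and unmap `len` bytes of
 * anonymous memory `iters` times through the glibc mmap/munmap wrappers.
 * If `populate` is nonzero, MAP_POPULATE asks the kernel to pre-fault the
 * pages inside the mmap call. Returns the number of iterations in which
 * both calls succeeded. */
long mmap_munmap_loop(long iters, size_t len, int populate)
{
    int flags = MAP_PRIVATE | MAP_ANONYMOUS | (populate ? MAP_POPULATE : 0);
    long ok = 0;

    for (long i = 0; i < iters; i++) {
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, flags, -1, 0);
        if (p == MAP_FAILED)
            break;
        if (munmap(p, len) != 0)
            break;
        ok++;
    }
    return ok;
}
```

Calling this from main with a large iteration count gives a small workload where nearly all user-space memory accesses are the stack accesses from call and ret, which makes it easier to see where the TLB-miss samples land.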
You can profile with ocperf.py to get symbolic names for uarch-specific events. Assuming you're on a recent Intel CPU, use ocperf.py record -e mem_inst_retired.stlb_miss_stores,page-faults,dTLB-load-misses to find which instructions cause store misses. (Then use ocperf.py report -Mintel.) If report doesn't make it easy to choose which event to see counts for, record with only a single event.
mem_inst_retired.stlb_miss_stores is a "precise" event, unlike most of the other store TLB events, so the counts should be attributed to the instruction that actually missed, rather than to some later instruction as can happen with imprecise perf events. (See Andy Glew's trap vs. exception answer for some details about why some performance counters can't easily be precise; many store events aren't.)
MAP_POPULATE, even though the OP mentioned page faults. derp. – Crossways