On Intel x86, Linux uses the event l1d.replacement to implement its L1-dcache-load-misses event.
This event is defined as follows:
Counts L1D data line replacements including opportunistic replacements, and replacements that require stall-for-replace or block-for-replace.
Perhaps naively, I would have expected perf to use something like mem_load_retired.l1_miss, which supports PEBS and is defined as:
Counts retired load instructions with at least one uop that missed in the L1 cache. (Supports PEBS)
The two event counts are usually not very close, and sometimes they differ wildly. For example:
$ ocperf stat -e mem_inst_retired.all_loads,l1d.replacement,mem_load_retired.l1_hit,mem_load_retired.l1_miss,mem_load_retired_fb_hit head -c100M /dev/urandom > /dev/null
Performance counter stats for 'head -c100M /dev/urandom':
445,662,315 mem_inst_retired_all_loads
92,968 l1d_replacement
443,864,439 mem_load_retired_l1_hit
1,694,671 mem_load_retired_l1_miss
28,080 mem_load_retired_fb_hit
There are more than 17 times as many "L1 misses" as measured by mem_load_retired.l1_miss as there are as measured by l1d.replacement. Conversely, you can also find examples where l1d.replacement is much higher than the mem_load_retired counters.
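For reference, the size of the gap can be read straight off the counts above. The snippet below is only arithmetic on the quoted numbers; the variable names are mine and nothing here is measured independently.

```python
# Counts copied verbatim from the perf output quoted above.
l1d_replacement = 92_968      # l1d_replacement
l1_miss = 1_694_671           # mem_load_retired_l1_miss

# How many retired "L1 miss" loads there are per replaced L1D line.
ratio = l1_miss / l1d_replacement
print(f"mem_load_retired.l1_miss / l1d.replacement = {ratio:.1f}")
```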
What exactly is l1d.replacement measuring, why was it chosen in the kernel, and is it a better proxy for L1 d-cache misses than mem_load_retired.l1_miss?
l1d.replacement measures lines that miss, I'd assume, rather than instructions that miss, so there's some sense in that. (The name implies it measures evictions or allocations in L1d.) But that would also count store misses, which L1-dcache-load-misses claims not to. Yuck. Looks like yet another reason not to trust those generic event names, along with how to interpret perf iTLB-loads,iTLB-load-misses. – Pillowcase
[...] mem_load_retired
also makes that distinction by breaking L1 load accesses into three categories: l1_hit, l1_miss and fb_hit. So you should only get one l1_miss per missed line, more or less, and the rest would be fb_hit. Although maybe fb_hit isn't working as I think, because if it is I can't reconcile the numbers above. – Preconscious
[...] DWORDs you'd only get one true L1 miss for the first DWORD
and then 15 more l1-miss-but-hit-FB for the next 15 accesses to the same line, and you wouldn't want to fill up all your FBs. – Preconscious
[...] fb_hit
event might not be exclusive with the l1_miss event. So a load instruction generates an l1_miss if it isn't satisfied on the fast path, and also an fb_hit event if that happens. Does that fit the data? Very few of your l1d misses are to the same line? l1d_replacement seems very low, though, for that many l1_miss with few of the misses being fb_hits. Does store-forwarding count as an l1_miss? – Pillowcase
[...] l1_miss
case is inclusive of fb_hit because in many cases the fb_hit count ends up higher, e.g., ocperf stat -e mem_inst_retired.all_loads,l1d.replacement,mem_load_retired.l1_hit,mem_load_retired.l1_miss,mem_load_retired_fb_hit true. Note that in most workloads the replacement value isn't an order of magnitude different, but this was just an interesting one where it is. Good question about store-forwarding. – Preconscious
[...] l1_miss
being inclusive of fb_hit, then, to narrow down the space for guesswork, or at least mention it. (I guess you're really asking for an authoritative answer, but still we're often inclined to guess.) – Pillowcase
[...] l1_hit|miss|fb
counts don't add up to exactly the inst_retired.all_loads value. Also, the language for the "inst retired" events talks about "any uop from the instruction", so I guess some instructions that do two memory loads could increment two counters but only increment the inst counter by 1 (but the observed counting problem is in the opposite direction). – Preconscious
[...] 75125
discrepancy? perf would collect all the PEBS data at once if an interrupt triggered, though, right? Rather than accumulating l1_miss events while collecting the l1_hit events? If you're right that there's a real discrepancy, then maybe store-forwarding? Re: multiple accesses per instruction: that's rare unless cache-line splits count. cmps, gather, maybe memory-dst adc? – Pillowcase
[...] l1_fb_hit
to be lost in the noise. Maybe randomly select a (normally cold in L1D) cache line, then do 4 dword loads from it? If we can predict what the HW is probably doing, we might divine what the counters mean. – Pillowcase
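
The two numeric points raised in the comments can be checked with a few lines of arithmetic on the counts from the question. This is a rough sketch, not a measurement: the "sequential DWORD" model assumes 64-byte lines and 4-byte loads, and assumes (as one comment hypothesizes, without confirmation) that l1_hit, l1_miss and fb_hit are mutually exclusive categories. All variable names are illustrative.

```python
# Counts copied verbatim from the perf output in the question.
all_loads = 445_662_315   # mem_inst_retired_all_loads
l1_hit = 443_864_439      # mem_load_retired_l1_hit
l1_miss = 1_694_671       # mem_load_retired_l1_miss
fb_hit = 28_080           # mem_load_retired_fb_hit

# If the three categories were exclusive and exhaustive, they would sum
# to all_loads. The leftover is the discrepancy mentioned in the comments.
residual = all_loads - (l1_hit + l1_miss + fb_hit)
print("loads not covered by hit/miss/fb_hit:", residual)

# Hypothetical model (an assumption, not documented behavior): reading a
# cold 64-byte line one DWORD at a time gives one true miss followed by
# fill-buffer hits for the remaining loads to that line.
LINE_BYTES = 64
DWORD_BYTES = 4
loads_per_line = LINE_BYTES // DWORD_BYTES
model_fb_per_miss = loads_per_line - 1

# What the question's run actually shows.
observed_fb_per_miss = fb_hit / l1_miss
print("model fb_hit per l1_miss:   ", model_fb_per_miss)
print(f"observed fb_hit per l1_miss: {observed_fb_per_miss:.4f}")
```

Under that model a purely sequential DWORD scan would show about 15 fb_hit events per l1_miss, whereas the run in the question shows far fewer than one, which is the mismatch the comments are trying to explain.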