Why does Linux perf use event l1d.replacement for "L1 dcache misses" on x86?
On Intel x86, Linux perf uses the event l1d.replacement to implement its L1-dcache-load-misses event.

This event is defined as follows:

Counts L1D data line replacements including opportunistic replacements, and replacements that require stall-for-replace or block-for-replace.

Perhaps naively, I would have expected perf to use something like mem_load_retired.l1_miss, which supports PEBS and is defined as:

Counts retired load instructions with at least one uop that missed in the L1 cache. (Supports PEBS)

The two events usually don't produce values that are even close, and sometimes they differ wildly. For example:

$ ocperf stat -e mem_inst_retired.all_loads,l1d.replacement,mem_load_retired.l1_hit,mem_load_retired.l1_miss,mem_load_retired.fb_hit head -c100M /dev/urandom > /dev/null

 Performance counter stats for 'head -c100M /dev/urandom':

       445,662,315      mem_inst_retired_all_loads                                   
            92,968      l1d_replacement                                             
       443,864,439      mem_load_retired_l1_hit                                     
         1,694,671      mem_load_retired_l1_miss                                    
            28,080      mem_load_retired_fb_hit                                     

There are more than 17 times as many "L1 misses" as measured by mem_load_retired.l1_miss as there are l1d.replacement events. Conversely, you can also find examples where l1d.replacement is much higher than the mem_load_retired counters.
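Checking the quoted ratio against the counts above (plain arithmetic on the numbers already shown, nothing new measured):

```python
# Counts from the ocperf output above
l1d_replacement = 92_968
l1_miss = 1_694_671

ratio = l1_miss / l1d_replacement
print(round(ratio, 1))  # 18.2, i.e. "more than 17 times"
```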

What exactly is l1d.replacement measuring, why was it chosen in the kernel, and is it a better proxy for L1 d-cache misses than mem_load_retired.l1_miss?

Preconscious asked 4/9, 2018 at 20:20. Comments (11)
l1d.replacement measures lines that miss, I'd assume, rather than instructions that miss. So there's some sense in that. (The name implies it measures evictions or allocations in L1d.) But that would also measure store misses, which L1-dcache-load-misses claims not to count. Yuck. Looks like yet another reason not to trust those generic event names, along with how to interpret perf iTLB-loads,iTLB-load-misses.Pillowcase
@PeterCordes - but the mem_load_retired events also make that distinction, by breaking L1 load accesses into three categories: l1_hit, l1_miss and fb_hit. So you should only get one l1_miss per missed line, more or less, and the rest would be fb_hit. Although maybe fb_hit isn't working the way I think, because if it is I can't reconcile the numbers above.Preconscious
Hmm, can a load miss in L1 and then hit in a fill-buffer instead of initiating a new line fill? I haven't played with those events.Pillowcase
@PeterCordes - definitely! The fill buffer would be quite terrible if it didn't work that way. The basic idea is that when you miss in L1, the next place you look is the fill buffers, and if the line you missed on is already in a FB you just sleep the load, since you don't want to allocate a redundant FB. This behavior is pretty critical: in a normal linear access of, say, DWORDs, you'd only get one true L1 miss for the first DWORD and then 15 more l1-miss-but-hit-FB events for the next 15 accesses to the same line, and you wouldn't want to fill up all your FBs.Preconscious
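As an illustration of the behavior described in this comment, here is a toy model (my own sketch, not actual hardware behavior): the first touch of a 64-byte line counts as an l1_miss that allocates a fill buffer, and every later touch of that still-in-flight line counts as an fb_hit.

```python
LINE = 64  # cache line size in bytes

def classify(addresses):
    """Toy classifier: first touch of a line -> l1_miss (allocates a FB),
    later touches of the same line -> fb_hit. Ignores FB completion,
    eviction, and capacity for simplicity."""
    in_flight = set()
    l1_miss = fb_hit = 0
    for addr in addresses:
        line = addr // LINE
        if line in in_flight:
            fb_hit += 1
        else:
            in_flight.add(line)
            l1_miss += 1
    return l1_miss, fb_hit

# Linear DWORD (4-byte) loads over 100 cold cache lines
miss, fb = classify(range(0, 100 * LINE, 4))
print(miss, fb)  # 100 1500 -> one true miss plus 15 FB hits per line
```

Under this rule l1_miss effectively counts lines touched, matching the "one l1_miss per missed line" expectation in the comment above.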
Right, I already expected the hardware to work that way, but I mean the fb_hit event might not be exclusive with the l1_miss event. So a load instruction generates an l1_miss if it isn't satisfied on the fast path, and also an fb_hit event if that happens. Does that fit the data? Very few of your l1d misses are to the same line? l1d_replacement seems very low, though, for that many l1_miss with few of the misses being fb_hits. Does store-forwarding count as a l1_miss?Pillowcase
I don't think the l1_miss case is inclusive of fb_hit, because in many cases the fb_hit count ends up higher, e.g., ocperf stat -e mem_inst_retired.all_loads,l1d.replacement,mem_load_retired.l1_hit,mem_load_retired.l1_miss,mem_load_retired.fb_hit true. Note that in most workloads the replacement value isn't an order of magnitude different; this was just an interesting case where it is. Good question about store-forwarding.Preconscious
Might want to update your question with some other outputs that rule out l1_miss being inclusive of fb_hit, then, to narrow down the space for guesswork, or at least mention it. (I guess you're really asking for an authoritative answer, but still we're often inclined to guess.)Pillowcase
@PeterCordes another weird thing is that the l1_hit|miss|fb counts don't add up exactly to the mem_inst_retired.all_loads value. Also, the language for the "inst retired" events talks about "any uop from the instruction", so I guess some instructions that do two memory loads could increment two of those counters but only increment the instruction counter by 1 (but the observed counting problem is in the opposite direction).Preconscious
With PEBS I guess we should expect the total to be very close, if they were all exclusive and covered every possibility? Any chance that context-switching or handling of perf interrupts could account for the 75,125 discrepancy? perf would collect all the PEBS data at once if an interrupt triggered, though, right? Rather than accumulating l1_miss events while collecting the l1_hit events? If you're right that there's a real discrepancy, then maybe store-forwarding? Re: multiple accesses per instruction: that's rare unless cache-line splits count. cmps, gather, maybe memory-dst adc?Pillowcase
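The 75,125 figure mentioned here is just the gap between all_loads and the sum of the three exclusive-looking subcounts in the output above:

```python
# Counts from the ocperf output in the question
all_loads = 445_662_315
l1_hit, l1_miss, fb_hit = 443_864_439, 1_694_671, 28_080

discrepancy = all_loads - (l1_hit + l1_miss + fb_hit)
print(discrepancy)  # 75125
```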
@Peter - yeah, it seems like more than the normal "non-atomic reads" issue with perf counters, and I feel like perf stat should be mostly immune to that anyway when you are reading the counters for the lifetime of the application. Good point about PEBS; I'm not totally sure how it works. When you have multiple PEBS events, I guess they all go to the same buffer? Perf also has this distinction between events that can use "large PEBS" (a buffer for more than one event) and those where they just use size-1 buffers, but AFAICT it's hard to tell which is being used.Preconscious
Can you construct a test-case with more concurrent misses to the same cache line, and fewer hits? So counts are more evenly distributed between the three events, with no chance for fb_hit to be lost in the noise. Maybe randomly select a (normally cold in L1D) cache line, then do 4 dword loads from it? If we can predict what the HW is probably doing, we might divine what the counters mean.Pillowcase
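One way to prototype this suggested test case before touching perf is to generate the address trace it describes and run it through a toy rule (first touch of a line counts as l1_miss, any repeat touch of that line counts as fb_hit); this is a hypothetical sketch of the expected event split, not a hardware measurement.

```python
import random

LINE = 64  # cache line size in bytes
random.seed(0)

# Pick 1000 distinct (i.e. cold) cache lines at random, then do
# 4 DWORD loads from each, as the comment suggests
lines = random.sample(range(1_000_000), 1000)
trace = [ln * LINE + 4 * i for ln in lines for i in range(4)]

in_flight = set()
l1_miss = fb_hit = 0
for addr in trace:
    ln = addr // LINE
    if ln in in_flight:
        fb_hit += 1        # line already has a fill buffer allocated
    else:
        in_flight.add(ln)  # first touch: miss, allocate a FB
        l1_miss += 1
print(l1_miss, fb_hit)  # 1000 3000 -> a clean 1:3 split
```

With a 1:3 expected split, fb_hit can no longer be lost in the noise, so a large deviation on real hardware would say something about what the counters actually mean.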
