PERF STAT does not count memory-loads but counts memory-stores

Linux Kernel : 4.10.0-20-generic (also tried this on 4.11.3)

Ubuntu : 17.04

I have been trying to collect stats of memory accesses using perf stat. I am able to collect counts for memory-stores, but memory-loads always returns a count of 0.

Below are the details for memory-stores:

perf stat -e cpu/mem-stores/u ./libquantum_base.arnab 100
N = 100, 37 qubits required
Random seed: 33
Measured 3277 (0.200012), fractional approximation is 1/5.
Odd denominator, trying to expand by 2.
Possible period is 10.
100 = 4 * 25

 Performance counter stats for './libquantum_base.arnab 100':

       158,115,510      cpu/mem-stores/u                                            

       0.559922797 seconds time elapsed

For memory-loads, I get a count of 0, as can be seen below:

perf stat -e cpu/mem-loads/u ./libquantum_base.arnab 100
N = 100, 37 qubits required
Random seed: 33
Measured 3277 (0.200012), fractional approximation is 1/5.
Odd denominator, trying to expand by 2.
Possible period is 10.
100 = 4 * 25

 Performance counter stats for './libquantum_base.arnab 100':

                 0      cpu/mem-loads/u                                             

       0.563806170 seconds time elapsed

I cannot understand why this does not count properly. Should I use a different event to get proper data?

Angieangil answered 9/6, 2017 at 21:7 Comment(6)
Stack Overflow is a site for programming and development questions. This question appears to be off-topic because it is not about programming or development. See What topics can I ask about here in the Help Center. Perhaps Super User or Unix & Linux Stack Exchange would be a better place to ask. Also see Where do I post questions about Dev Ops? - Chopine
Hardware performance events are specific to the CPU you are using. What is the exact model? Not every possible perf hardware event is mapped to some real event (I think around half of them are not; some CPUs may have no raw L1 load/store counters at all). For Intel CPUs, use ocperf.py from pmu-tools (github.com/andikleen/pmu-tools/blob/master/ocperf.py) to encode real supported events into the raw encodings of the perf_event API (perf_event_open, or the -e rXXXXX event specifiers of the perf CLI tool). - Monogamous
Hi @osgx, it is a Broadwell server CPU. The model is E5-2620 v4, running at 2.10 GHz. I will try Andi Kleen's pmu-tools to see if I can get memory-load events counted. - Angieangil
I have the same problem with a Skylake i7-6700HQ, so it seems the mem-loads event is broken on this hardware on recent kernels. - Wingo
Yes @Wingo, you are correct. IIRC I had to use a symbolic event value to obtain the memory-load events; I will update my answer as soon as I can determine the symbolic event number. - Angieangil
@ArnabjyotiKalita - cool, I look forward to it. - Wingo

The mem-loads event is mapped to the MEM_TRANS_RETIRED.LOAD_LATENCY_GT_3 performance monitoring unit event on Intel processors. The events MEM_TRANS_RETIRED.LOAD_LATENCY_* are special and can only be counted by using the p modifier. That is, you have to specify mem-loads:p to perf to use the event correctly.
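As a sketch (the binary name is taken from the question; exact modifier support varies with the perf version and CPU), the event has to be sampled rather than counted:

```shell
# mem-loads is precise-only: sample it with a /p modifier instead of counting
# it with plain "perf stat", which reads 0 as shown in the question.
# The commands are echoed as templates, since actually running them requires
# perf and an Intel PMU:
BIN="./libquantum_base.arnab 100"            # binary from the question
echo "perf record -e cpu/mem-loads/p $BIN"   # /p = precise level 1
echo "perf mem record $BIN"                  # perf mem drives the same latency event
```

After a perf mem record run, perf mem report summarizes the sampled load latencies.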

MEM_TRANS_RETIRED.LOAD_LATENCY_* is a precise event and it only makes sense to be counted at the precise level. According to this Intel article (emphasis mine):

When a user elects to sample one of these events, special hardware is used that can keep track of a data load from issue to completion. This is more complicated than simply counting instances of an event (as with normal event-based sampling), and so only some loads are tracked. Loads are randomly chosen, the latency determined for each, and the correct event(s) incremented (latency >4, >8, >16, etc). Due to the nature of the sampling for this event, only a small percentage of an application's data loads can be tracked at any one time.

As you can see, the MEM_TRANS_RETIRED.LOAD_LATENCY_* events by no means count the total number of loads, nor are they designed for that purpose.

If you want to determine which instructions in your code issue load requests that take more than a specific number of cycles to complete, then MEM_TRANS_RETIRED.LOAD_LATENCY_* is the right performance event to use. In fact, that is exactly the purpose of perf-mem, which uses this event.

If you want to count the total number of load uops retired, then you should use L1-dcache-loads, which is mapped to the MEM_UOPS_RETIRED.ALL_LOADS performance event on Intel processors.

On the other hand, mem-stores and L1-dcache-stores are mapped to the exact same performance event on all current Intel processors, namely, MEM_UOPS_RETIRED.ALL_STORES, which does count all retired store uops.

So in summary, if you are using perf-stat, you should (almost) always use L1-dcache-loads and L1-dcache-stores to count retired loads and stores, respectively. These are mapped to the raw events you used in the answer you posted, but are more portable because they also work on AMD processors.
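A concrete counting invocation might look like this (a sketch; the binary name is taken from the question, and the counts themselves vary per machine and run):

```shell
# Count retired load and store uops with the portable symbolic event names.
# Echoed as a template, since the actual run requires perf and the benchmark:
BIN="./libquantum_base.arnab 100"
echo "perf stat -e L1-dcache-loads,L1-dcache-stores $BIN"
```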

Underexpose answered 23/1, 2019 at 0:52 Comment(0)

I used a Broadwell server machine (Xeon E5-2620 v4) to collect all of the events below.

To collect memory-load events, I had to use a numeric (raw) event value. I ran the following command:

./perf record -e "r81d0:u" -c 1 -d -m 128 ../../.././libquantum_base 20

Here r81d0 represents the raw event that counts memory loads among all retired instructions, and the u modifier restricts counting to user space.

The following command, on the other hand,

./perf record -e "r82d0:u" -c 1 -d -m 128 ../../.././libquantum_base 20

uses the raw event r82d0:u, which counts memory stores among all retired instructions, again restricted to user space.
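For what it's worth, perf's rXXXX syntax on Intel packs the umask and event code together as r&lt;umask&gt;&lt;event&gt;, so these raw codes can be reconstructed from the documented MEM_UOPS_RETIRED encodings (event 0xd0 with umask 0x81 for ALL_LOADS, umask 0x82 for ALL_STORES):

```shell
# Rebuild the raw perf event codes from the Intel event/umask pairs.
# MEM_UOPS_RETIRED.ALL_LOADS  = event 0xd0, umask 0x81
# MEM_UOPS_RETIRED.ALL_STORES = event 0xd0, umask 0x82
printf 'r%02x%02x\n' 0x81 0xd0   # prints r81d0 (loads)
printf 'r%02x%02x\n' 0x82 0xd0   # prints r82d0 (stores)
```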

Angieangil answered 18/4, 2018 at 8:21 Comment(6)
What hardware? Not all CPUs have the same event numbers. Also, that's a numeric event value, not symbolic. On my Skylake i7-6700k with perf 4.15, mem-loads is also broken. But ocperf.py stat -e mem_inst_retired.all_loads works. (Note that that's only counting retired load instructions, not loads from the page walker or instruction-fetch, if mem-loads is supposed to count any of those. And not loads from mis-speculated load instructions that never retired.) - Dextrogyrate
Hi Peter, these numeric events do indeed count the number of retired loads and stores. But the "retired store" event counts did match the actual number of mem-stores events. Whether this similarity would also apply to mem-loads or not, I am not sure. - Angieangil
Yeah, probably mem-loads was supposed to be mapped to mem_inst_retired.all_loads, then. The ocperf.py wrapper uses perf stat -e cpu/event=0xd0,umask=0x81,name=mem_inst_retired_all_loads/ for that. I didn't see an event for loads that includes page-walks and/or mis-speculated load instructions. Hmm, I also don't see a specific event for L1-dcache-loads (which wouldn't include MOVNTDQA loads from WC memory, I hope). Maybe it's another load instruction counter, not actually total L1 references. Possibly synthesized from mem_load_retired.l1_hit + ..._miss on Skylake? - Dextrogyrate
Exactly, Peter. I had the same concerns when I was using the numeric event numbers, and I also looked into other sources of load events but did not find any event measuring them. - Angieangil
There are events for dtlb_load_misses.miss_causes_a_walk and ..._store_..., but those don't tell you how many actual L1d accesses the HW page walker did, and how many it saved by keeping page-directory entries cached internally. For non-retired stores, there's uops_dispatched_port.port_4 (the store-data port). Oh, stores never commit to L1d if they don't reach retirement; stopping speculative stores from becoming globally visible is part of the point of the store buffer >.< It'd be cool to have an event for actual commits to L1d to measure merging of adjacent stores in the store buffer. - Dextrogyrate
But for speculative loads, uops_dispatched_port.port_2 and 3 could help, except that store-address uops run on those ports, too. - Dextrogyrate

© 2022 - 2024 — McMap. All rights reserved.