Perf tool stat output: multiplex and scaling of "cycles"
I am trying to understand the multiplex and scaling of "cycles" event in the "perf" output.

The following is the output of the perf tool:

 144094.487583      task-clock (msec)         #    1.017 CPUs utilized
  539912613776      instructions              #    1.09  insn per cycle           (83.42%)
  496622866196      cycles                    #    3.447 GHz                      (83.48%)
     340952514      cache-misses              #   10.354 % of all cache refs      (83.32%)
    3292972064      cache-references          #   22.854 M/sec                    (83.26%)
 144081.898558      cpu-clock (msec)          #    1.017 CPUs utilized
       4189372      page-faults               #    0.029 M/sec
             0      major-faults              #    0.000 K/sec
       4189372      minor-faults              #    0.029 M/sec
    8614431755      L1-dcache-load-misses     #    5.52% of all L1-dcache hits    (83.28%)
  156079653667      L1-dcache-loads           # 1083.223 M/sec                    (66.77%)

 141.622640316 seconds time elapsed

I understand that the kernel uses multiplexing to give each event a chance to access the hardware, and hence the final output is an estimate.

The "cycles" event shows (83.48%). I am trying to understand how was this number derived ?

I am running "perf" on Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz.

Ervin answered 24/1, 2018 at 4:24 Comment(2)
FWIW, if you turn off hyperthreading, you'll get double the number of counters (e.g., 8 programmable counters).Functionary
I know it's been over a year, but do you remember which kernel version you were using and whether hyperthreading was enabled?Millhon

Peter Cordes' answer is on the right track.

PMU events are quite complicated: the number of counters is limited, some events are special, some logical events may be composed of multiple hardware events, and there may even be conflicts between events.

I believe Linux isn't aware of these limitations; it just tries to activate events - to be more precise, event groups - from the list. It stops if it cannot activate all of them[1], and it enables multiplexing. Whenever the multiplexing timer expires, it rotates the list of events, effectively starting the activation with the second one, then the third, and so on. Linux doesn't know that it could still activate the cycles event because that event is special.
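
The rotation interval itself is tunable per PMU through sysfs. The path below assumes a typical (non-hybrid) x86 box whose core PMU is named cpu, so it may differ on your system; a smaller value means more frequent rotation and finer-grained estimates at slightly higher overhead:

$ cat /sys/devices/cpu/perf_event_mux_interval_ms
$ echo 2 | sudo tee /sys/devices/cpu/perf_event_mux_interval_ms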

There is a barely documented option to pin certain events, giving them priority, by adding :D after the event name. An example on my system:

$ perf stat -e cycles -e instructions -e cache-misses -e cache-references -e  L1-dcache-load-misses -e L1-dcache-loads ...

   119.444.297.774      cycles:u                                                      (55,88%)
   130.133.371.858      instructions:u            #    1,09  insn per cycle                                              (67,81%)
        38.277.984      cache-misses:u            #    7,780 % of all cache refs      (72,92%)
       491.979.655      cache-references:u                                            (77,00%)
     3.892.617.942      L1-dcache-load-misses:u   #   15,57% of all L1-dcache hits    (82,19%)
    25.004.563.072      L1-dcache-loads:u                                             (43,85%)

Pinning instructions and cycles:

$ perf stat -e cycles:D -e instructions:D -e cache-misses -e cache-references -e  L1-dcache-load-misses -e L1-dcache-loads ...
   120.683.697.083      cycles:Du                                                   
   132.185.743.504      instructions:Du           #    1,10  insn per cycle                                            
        27.917.126      cache-misses:u            #    4,874 % of all cache refs      (61,14%)
       572.718.930      cache-references:u                                            (71,05%)
     3.942.313.927      L1-dcache-load-misses:u   #   15,39% of all L1-dcache hits    (80,38%)
    25.613.635.647      L1-dcache-loads:u                                             (51,37%)

This results in the same multiplexing as omitting cycles and instructions entirely:

$ perf stat -e cache-misses -e cache-references -e  L1-dcache-load-misses -e L1-dcache-loads ...

    35.333.318      cache-misses:u            #    7,212 % of all cache refs      (62,44%)
   489.922.212      cache-references:u                                            (73,87%)
 3.990.504.529      L1-dcache-load-misses:u   #   15,40% of all L1-dcache hits    (84,99%)
25.918.321.845      L1-dcache-loads:u

Note that you can also group events (-e \{event1,event2\}), which means the events are always read together - or not at all, if the combination cannot be activated together.
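
For example, a minimal sketch (the backslashes just stop the shell from brace-expanding the groups; replace ... with your workload as in the commands above):

$ perf stat -e \{cycles,instructions\} -e \{cache-references,cache-misses\} ...

Because each pair is scheduled on the PMU together, derived ratios such as insn per cycle or the cache-miss rate are computed over exactly the same time slices.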

[1]: There is an exception for software events, which can always be added. The relevant parts of the kernel code are in kernel/events/core.c.
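
You can see this in the question's output: task-clock, cpu-clock and the fault counts never get a "(xx.xx%)" annotation, because software events are counted by the kernel and never compete for PMU counters. A rough way to reproduce it (sleep 1 is just a stand-in workload; enough hardware events are requested to force multiplexing on a PMU with 4 programmable counters):

$ perf stat -e task-clock -e page-faults -e cycles -e instructions -e cache-references -e cache-misses -e branches -e branch-misses -e L1-dcache-loads -e L1-dcache-load-misses sleep 1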

Nutt answered 25/1, 2018 at 17:38 Comment(3)
But why do the events require multiplexing in this particular case? I would have expected instructions and cycles to be counted using fixed counters and the other 4 events to be counted using the 4 programmable counters available on Broadwell (even when HT is enabled).Millhon
That's very curious. I don't have a Broadwell system, but on a Skylake-SP system they are all counted, on a Haswell-EP system they are multiplexed, even though SKL/BDW/HSW should all have the same configuration of fixed and general purpose counters. All tested with Linux 4.15.0 and HT enabled.Nutt
I went through the scheduling algorithm's source code. On Broadwell, most probably the OP has hyperthreading enabled and the NMI watchdog is also enabled. So 5 general-purpose counters are actually needed, but only 4 are available. I've also tested this on a Broadwell processor with HT disabled, and no multiplexing occurred in this configuration. This applies to all kernel versions that support Broadwell.Millhon

IDK why there's any multiplexing at all for cycles or instructions, because there are dedicated counters for those 2 events on your CPU, which can't be programmed to count anything else.

But for the others, I'm pretty sure the percentages are the fraction of CPU time during which a hardware counter was counting that event.

e.g. cache-references was counted for 83.26% of the 144094.487583 CPU-milliseconds your program was running for, or ~119973.07 ms. The total count is extrapolated from the time it was counting.
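
The extrapolation is essentially reported_count ≈ raw_count * time_enabled / time_running, where the percentage in parentheses is time_running / time_enabled. A quick sanity check of the counting time using the numbers above (plain shell arithmetic, nothing perf-specific):

$ echo '144094.487583 * 0.8326' | bc -l
119973.0703616058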

Ecumenicist answered 24/1, 2018 at 18:35 Comment(9)
AFAIK, perf doesn't use the fixed counters, at least when you specify things like cycles on the command line. I'm not sure if they use them with the "default" event list (i.e., no -e ... on the command line), but that's not very interesting either way since I don't think you can specify "default + extra events", so once you go non-default you are stuck listing everything. Using the fixed counters for perf isn't as straightforward as you might imagine, because even though the event is fixed they still have programmability (e.g., user vs kernel counting), so sharing is complex.Functionary
@BeeOnRope: With HT enabled on Skylake, I can count cycles, instructions, and 4 other events, without multiplexing, but adding one more introduces multiplexing. However, once there is statistical sampling, there's a % in all the HW counter fields including cycles and instructions, but not in task-clock or page-faults or other kernel software counters. Leaving out cycles and instructions seems to change the percentages listed for the other counters, more for some, less for others. I have perf 4.14 on Linux 4.14.11 (on Arch Linux), but it's been like this for years, IIRC.Ecumenicist
That code is in fact untouched since 2010. Linux treats cycles/instructions just like any other HW event - exceptions are only made for SW events.Nutt
@Functionary it doesn't seem to matter how the event is used, just that it is used regardless of other HW events causing conflicts. Pinning does that successfully for cycles and instructions.Nutt
@Nutt - when you say they are treated the same as any other HW counter, you mean that they cannot use the fixed counters, right? That's not consistent with my or Peter's observation above: it does seem that multiplexing doesn't occur if you record N events plus some of the fixed events, but if you record N+1 events with none fixed you get multiplexing immediately, where N is the number of programmable counters. For example on my system with HT disabled, N is 8, and if I use any combination of up to 8 events there is no multiplexing (as expected). If I use 9 or 10 events, there is still ...Functionary
... no multiplexing as long as 1 or 2 events are eligible for the fixed counters, strongly implying the fixed counters are used. That's the same thing @PeterCordes found above (I had never noticed it before). I also see the same thing as Peter when you do get multiplexing: the %active value shows even for the events that could use the fixed counters. Weird... Zulan I didn't understand your second comment above. I'm missing some context, I think...Functionary
@Nutt - I hadn't noticed you had a whole big answer mostly explaining everything. If I understood correctly, the system is aware of the fixed counters and will try to fit everything using the fixed and programmable counters; if it fails, it will use multiplexing, at which point the fixed-counter events aren't really special anymore (perhaps they are still using the fixed counter, but the multiplexing round-robin works at the granularity of the general-purpose counters).Functionary
@Functionary I haven't really dug into the x86 implementation of events. It mostly depends on the architecture agnostic stuff which just asks the arch-implementation to activate an event. If that fails, it assumes that no more HW events can be activated.Nutt
@Nutt - thanks, it makes sense with the observed behavior: the fixed counters are effectively used up until the point that multiplexing is needed, but once that occurs the (probably agnostic) implementation of multiplexing isn't going to know that the fixed counter events can be treated specially, I suppose.Functionary
