Why can't my ultraportable laptop CPU maintain peak performance in HPC

I have developed a high-performance Cholesky factorization routine, which should reach a peak performance of around 10.5 GFLOPS on a single CPU core (without hyperthreading). But there is some phenomenon I don't understand when I test its performance. In my experiment, I measured the performance with increasing matrix dimension N, from 250 up to 10000.
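
For reference, this GFLOPS figure comes from the standard operation count: the Cholesky factorization of an N x N matrix costs about N^3/3 floating-point operations, which is divided by the measured wall time. A minimal sketch in bash (the elapsed time here is a made-up example, not a measurement):

N=10000
elapsed=30   # hypothetical wall time in seconds, measured around the factorization call
# Cholesky needs ~N^3/3 FLOPs; divide by time and 10^9 to get GFLOPS
echo "scale=2; $N^3 / 3 / $elapsed / 10^9" | bc -l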

  • In my algorithm I use cache blocking (with a tuned blocking factor), and data are always accessed with unit stride during computation, so cache behavior is near-optimal and TLB and paging problems are eliminated;
  • I have 8 GB of RAM available, and the maximum memory footprint during the experiment is under 800 MB, so no swapping occurs;
  • During the experiment, no resource-demanding process like a web browser is running at the same time; only a really cheap background process runs, recording CPU frequency and CPU temperature every 2 s.

I would expect the performance (in GFLOPS) to stay at around 10.5 for every N I test. But a significant performance drop is observed partway through the experiment, as shown in the first figure.

CPU frequency and CPU temperature are shown in the 2nd and 3rd figures. The experiment finished in 400 s. The temperature was 51 °C when the experiment started, rose quickly to 72 °C once the CPU got busy, and then grew slowly to a peak of 78 °C. CPU frequency was basically stable, and it did not drop when the temperature got high.

So, my question is:

  • since CPU frequency did not drop, why does performance suffer?
  • how exactly does temperature affect CPU performance? Does the increase from 72 °C to 78 °C really make things worse?

[Figures 1-3: performance (GFLOPS) vs. N, CPU frequency over time, and CPU temperature over time]

CPU info

System: Ubuntu 14.04 LTS
Laptop model: Lenovo-YOGA-3-Pro-1370
Processor: Intel Core M-5Y71 CPU @ 1.20 GHz * 2

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                4
On-line CPU(s) list:   0,1
Off-line CPU(s) list:  2,3
Thread(s) per core:    1
Core(s) per socket:    2
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 61
Stepping:              4
CPU MHz:               1474.484
BogoMIPS:              2799.91
Virtualisation:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              4096K
NUMA node0 CPU(s):     0,1

CPU 0, 1
driver: intel_pstate
CPUs which run at the same hardware frequency: 0, 1
CPUs which need to have their frequency coordinated by software: 0, 1
maximum transition latency: 0.97 ms.
hardware limits: 500 MHz - 2.90 GHz
available cpufreq governors: performance, powersave
current policy: frequency should be within 500 MHz and 2.90 GHz.
                The governor "performance" may decide which speed to use
                within this range.
current CPU frequency is 1.40 GHz.
boost state support:
  Supported: yes
  Active: yes
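
Since the driver is intel_pstate, turbo and the hardware frequency limits can also be checked from sysfs. A sketch, assuming a kernel recent enough to expose the no_turbo knob (the exact paths can vary):

cat /sys/devices/system/cpu/intel_pstate/no_turbo     # 0 = turbo enabled, 1 = disabled
echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo   # disable turbo for a benchmark run
cat /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_min_freq         # hardware min, in kHz
cat /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq         # hardware max, in kHz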

update 1 (control experiment)

In my original experiment, the CPU is kept busy working from N = 250 to N = 10000. Many people (primarily those who saw this post before the re-edit) suspected that overheating of the CPU was the major reason for the performance hit. So I went back and installed the lm-sensors Linux package to track that information, and indeed the CPU temperature rose.

But to complete the picture, I did another control experiment. This time, I give the CPU a cooling period between each N, achieved by having the program pause for a number of seconds at the start of each iteration of the loop over N:

  • for N between 250 and 2500, the cooling time is 5s;
  • for N between 2750 and 5000, the cooling time is 20s;
  • for N between 5250 and 7500, the cooling time is 40s;
  • finally for N between 7750 and 10000, the cooling time is 60s.

Note that the cooling time is much longer than the time spent on computation: for N = 10000, only about 30 s are needed for the Cholesky factorization at peak performance, yet I allow a 60 s cooling time. (A sketch of this driver loop is shown below.)
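
A bash sketch of the driver loop with the cooling pauses ("./cholesky N" is a hypothetical stand-in for the actual benchmark binary):

for N in $(seq 250 250 10000); do
  if   [ "$N" -le 2500 ]; then pause=5
  elif [ "$N" -le 5000 ]; then pause=20
  elif [ "$N" -le 7500 ]; then pause=40
  else                         pause=60
  fi
  sleep "$pause"     # cooling time before this N
  ./cholesky "$N"    # could also be pinned to one core: taskset -c 0 ./cholesky "$N"
done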

This is certainly a very uninteresting setting for high-performance computing: we want the machine to work at peak performance all the time until a very large task completes, so this kind of halt makes no sense. But it does help to isolate the effect of temperature on performance.

This time, we see that peak performance is achieved for all N, just as theory predicts! The periodic pattern in CPU frequency and temperature is the result of the cooling pauses and turbo boost. Temperature still has an increasing trend, simply because the workload gets bigger as N increases; this also justifies the longer cooling times for a sufficient cool-down, as described above.

The achievement of peak performance seems to rule out all effects other than temperature. But this is really annoying: basically it says that the computer gets tired in HPC, so we can't get the expected performance gain. Then what is the point of developing HPC algorithms?


OK, here is the new set of plots:

[Figures 4-5: performance (GFLOPS) vs. N and CPU temperature for the control experiment]

I don't know why I could not upload the 6th figure: SO simply does not allow me to submit the edit when I add it. So I am sorry I can't attach the figure for CPU frequency.


update 2 (how I measure CPU frequency and temperature)

Thanks to Zboson for adding the x86 tag. The following bash commands are what I used for measurement:

while true
do
  # scaling_cur_freq reports the current frequency in kHz
  cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq >> cpu0_freq.txt  ## parameter "freq0"
  cat /sys/devices/system/cpu/cpu1/cpufreq/scaling_cur_freq >> cpu1_freq.txt  ## parameter "freq1"
  # lm-sensors reports per-core temperatures
  sensors | grep "Core 0" >> cpu0_temp.txt  ## parameter "temp0"
  sensors | grep "Core 1" >> cpu1_temp.txt  ## parameter "temp1"
  sleep 2
done

Since I did not pin the computation to one core, the operating system schedules it alternately on the two cores, so it makes more sense to take

freq[i] <- max (freq0[i], freq1[i])
temp[i] <- max (temp0[i], temp1[i])

as the overall measurement.
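
This element-wise max can be computed with paste and awk; a sketch, assuming the frequency logs hold one number per line (the temperature logs would first need the numeric field extracted from the sensors output):

paste cpu0_freq.txt cpu1_freq.txt | awk '{ print ($1 > $2 ? $1 : $2) }' > freq_max.txt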

Bleary answered 1/4, 2016 at 18:41 Comment(8)
Very much guessing: power saving settings? Battery? Cooling? Monitor the physical parameters of the laptop while doing this (CPU temp etc.). If you can rule out the hardware limits, that would be useful. Paging? – Selfassurance
Even more guessing: I have used programs like these (internet search: monitor laptop hardware temperatures), e.g. openhardwaremonitor.org, also cpuid.com/softwares/hwmonitor.html. Search for your specific laptop. IMO, I suspect hardware limits, as running CPUs flat-out for long periods will tax the hardware and it will 'throttle'. It may be worthwhile increasing the priority of the matrix tasks. Please be aware, I really am guessing; you need to do some data collection. – Selfassurance
The drop in performance with increasing matrix size is probably due to cache utilization. The second one suspiciously looks like your CPU getting hot and therefore reducing the clock frequency. But it could just as well be other processes running on that machine. Also, you should pin the task to a specific core. Time measurements are a tricky thing to do. How exactly do you determine the FLOPS? – Unbearable
There are programs for working the machine really hard; they will tell you the limits of your hardware. – Selfassurance
TLB misses when the matrix gets large? Can you try a different, larger page size? – Typographer
Well, you typically don't do HPC on your laptop :) That's why you have air-conditioned server rooms, plan the air streams through the servers, etc. In such an environment, the CPU temperature is much better controlled than in your laptop, which gets rid of most of these effects. – Unbearable
@AlphaBetaGamma, how are you measuring the CPU frequency? When benchmarking I usually disable frequency scaling in the BIOS (if possible). I'm not sure this is 100% possible, because the CPU may take some safety measure to throttle the frequency if it gets too hot, but in my tests I never saw this happen. In any case, most of the apps for Linux don't measure frequency correctly. The only one I found that did it correctly was powertop. On Windows, cpuz worked well. – Lax
@Zboson: I usually run grep MHz /proc/cpuinfo, but the OP's /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq is probably good, if it knows about turbo. If not, you probably need to use .../cpuinfo_cur_freq (which requires root even to read, implying it might be a more expensive operation than reading the scaling governor's current decision. That would make sense if it queries the hardware about turbo, but /proc/cpuinfo's current frequency can be in the turbo range.) – Stringfellow

TL;DR: Your conclusion is correct. Your CPU's sustained performance is nowhere near its peak. This is normal: peak performance is only available as a short-term "bonus" for bursty interactive workloads, above the rated sustained performance, given the lightweight heat sink, fans, and power delivery.

You can develop / test on this machine, but benchmarking will be hard. You'll want to run on a cluster, server, or desktop, or at least a gaming / workstation laptop.


From the CPU info you posted, you have a dual-core-with-hyperthreading Intel Core M of the Broadwell generation, with a rated sustainable frequency of 1.20 GHz. Its max turbo is 2.9 GHz, and its TDP-up sustainable frequency is 1.4 GHz (at 6 W).

For short bursts, it can run much faster and produce much more heat than its cooling system is required to handle. This is what Intel's "turbo" feature is all about: it lets low-power ultraportable laptops like yours have snappy UI performance in things like web browsers, because interactive CPU load is almost always bursty.

Desktop/server CPUs (Xeon and i5/i7, but not i3) do still have turbo, but the sustained frequency is much closer to the max turbo. e.g. a Haswell i7-4790k has a sustained "rated" frequency of 4.0 GHz. At that frequency and below, it won't use (and convert to heat) more than its rated TDP of 88 W, so it needs a cooling system that can handle 88 W. When power/current/temperature allow, it can clock up to 4.4 GHz and use more than 88 W of power. (The sliding window for calculating the power history, used to keep the sustained power within 88 W, is sometimes configurable in the BIOS, e.g. 20 sec or 5 sec. Depending on what code is running, 4.4 GHz might not raise the electrical current demand anywhere near the peak: e.g. code with lots of branch mispredicts that's still limited by CPU frequency doesn't come anywhere near saturating the 256b AVX FP units the way Prime95 would.)

Your laptop's max turbo is a factor of 2.4x above its rated frequency, while that high-end Haswell desktop CPU can only clock up by 1.1x. Its max sustained frequency is already pretty close to the peak limit, because it's rated on the assumption of a good cooling system that can keep up with that kind of heat production, and a solid power supply that can deliver that much current.

The purpose of Core M is to have a CPU that can limit itself to ultra-low power levels (rated TDP of 4.5 W at 1.2 GHz, 6 W at 1.4 GHz), so the laptop manufacturer can safely design a cooling and power-delivery system that's small and light and only handles that much power. The "Scenario Design Power" is only 3.5 W, and that's supposed to represent the thermal requirements of real-world code, not max-power stuff like Prime95.
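
These limits (and the sliding window mentioned above) are visible on Linux if the kernel exposes the intel_rapl powercap interface; a sketch, assuming powercap support and the usual domain numbering (both can differ across machines):

d=/sys/class/powercap/intel-rapl/intel-rapl:0
cat "$d/constraint_0_name"             # usually "long_term"
cat "$d/constraint_0_power_limit_uw"   # package power limit in microwatts, e.g. 4500000 for 4.5 W
cat "$d/constraint_0_time_window_us"   # the averaging window, in microseconds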

Even a "normal" ULV laptop CPU is rated for 15W sustained, and high power gaming/workstation laptop CPUs at 45W. And of course laptop vendors put those CPUs into machines with beefier heat-sinks and fans. See a table on wikipedia, and compare desktop / server CPUs (also on the same page).


The achievement of peak performance seems to rule out all effects other than temperature. But this is really annoying: basically it says that the computer gets tired in HPC, so we can't get the expected performance gain. Then what is the point of developing HPC algorithms?

The point is to run them on hardware that isn't so badly thermally limited! An ultra-low-power CPU like a Core M makes a decent development platform, but not a good HPC compute platform.

Even a laptop with an xxxxM CPU, rather than an xxxxU CPU, will do OK (e.g. a "gaming" or "workstation" laptop designed to run CPU-intensive workloads for sustained periods). Or, in the Skylake family, "xxxxH" or "HK" are the 45 W mobile CPUs, at least quad-core.


Stringfellow answered 3/4, 2016 at 20:16 Comment(6)
@AlphaBetaGamma, I am a little surprised that somebody upvoted your comment that it was not necessary to disable turbo in the BIOS because the frequency is stable. Doesn't Peter's answer argue that it's not stable, that it goes in bursts? I wrote to some of the authors of Eigen about GEMM and they told me that in benchmarking, turbo is disabled. When I do my tests on my Haswell Intel NUC, I disable the turbo. The base frequency of its xxxxU CPU is sadly much lower (like half), but I mostly develop on the NUC anyway, so I don't care. – Lax
@Zboson: Reducing frequency a lot can make something CPU-bound instead of memory-bound. There's no really safe way to extrapolate from a laptop CPU to a high-power CPU if memory bandwidth/latency is a factor. If you're sure it's CPU-bound, just using perf counters to count core clock cycles should be pretty reasonable. (I've mostly looked at microbenchmarks where timing the whole program was not a problem, so I didn't have to worry about only counting time spent in some code in a process.) – Stringfellow
@PeterCordes, that's an interesting point. I had not thought about lowering the frequency biasing the result because it does not change the memory bandwidth. – Lax
@Zboson: It comes up when people compare ARM benchmarks against x86 and then argue about how good ARM would be if anyone made a chip clocked as high as x86 desktop CPUs. You can't always just linearly scale benchmark results by frequency. There are other effects in that case, because the ARM designs might need longer pipelines to reach those clock speeds, so branch-mispredict penalties would be worse as well. That's not a problem for Intel chips, because it's the exact same pipeline downclocked, so it's pretty much just memory latency/bandwidth, and possibly L3. – Stringfellow
@PeterCordes, that goes for GPUs as well. Two cores running at half the nominal frequency use 40% of the power of a single core running at the nominal frequency. Assuming you can get the same performance with the two cores, it obviously pays to lower the frequency and scale up the number of cores. I never thought of this in terms of memory bandwidth, but that's another win! That's really interesting: scale the core frequency down and scale up the number of cores. Of course, it often turns out to be hard to parallelize algorithms that well, even disregarding memory bandwidth. – Lax
@PeterCordes, here is where I discussed the 40% reference. No wonder GPUs beat CPUs in many cases. My ray tracer still runs much faster on my six-year-old GPU than on every Intel processor I have tried (including a 24-core IVB dual-socket Xeon server). – Lax
