Significant performance difference of std clock between different machines

While testing something else, I stumbled across something that I haven't managed to figure out yet.

Let's look at this snippet:

#include <iostream>
#include <chrono>

int main () {
  int i = 0;
  using namespace std::chrono_literals;

  auto const end = std::chrono::system_clock::now() + 5s;
  while (std::chrono::system_clock::now() < end) {
    ++i;
  }
  std::cout << i;
}

I've noticed that the count depends heavily on the machine I execute it on.
I've compiled with gcc 7.3, gcc 8.2, and clang 6.0, using -std=c++17 -O3.

On an i7-4790 (kernel 4.17.14-arch1-1-ARCH): ~3e8,
but on a Xeon E5-2630 v4 (kernel 3.10.0-514.el7.x86_64): ~8e6.
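
Dividing the 5 s window by the loop count gives a rough per-call cost for std::chrono::system_clock::now(), since the loop body is nothing but an increment plus that single call:

  i7-4790:         5 s / ~3e8 iterations ≈  ~17 ns per now() call
  Xeon E5-2630 v4: 5 s / ~8e6 iterations ≈ ~625 ns per now() call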

Now this is a difference that I would like to understand, so I checked with perf stat -d.

On the i7:

       4999.419546      task-clock:u (msec)       #    0.999 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
               120      page-faults:u             #    0.024 K/sec                  
    19,605,598,394      cycles:u                  #    3.922 GHz                      (49.94%)
    33,601,884,120      instructions:u            #    1.71  insn per cycle           (62.48%)
     7,397,994,820      branches:u                # 1479.771 M/sec                    (62.53%)
            34,788      branch-misses:u           #    0.00% of all branches          (62.58%)
    10,809,601,166      L1-dcache-loads:u         # 2162.171 M/sec                    (62.41%)
            13,632      L1-dcache-load-misses:u   #    0.00% of all L1-dcache hits    (24.95%)
             3,944      LLC-loads:u               #    0.789 K/sec                    (24.95%)
             1,034      LLC-load-misses:u         #   26.22% of all LL-cache hits     (37.42%)

       5.003180401 seconds time elapsed

       4.969048000 seconds user
       0.016557000 seconds sys

On the Xeon:

       5001.000000      task-clock (msec)         #    0.999 CPUs utilized          
                42      context-switches          #    0.008 K/sec                  
                 2      cpu-migrations            #    0.000 K/sec                  
               412      page-faults               #    0.082 K/sec                  
    15,100,238,798      cycles                    #    3.019 GHz                      (50.01%)
       794,184,899      instructions              #    0.05  insn per cycle           (62.51%)
       188,083,219      branches                  #   37.609 M/sec                    (62.49%)
            85,924      branch-misses             #    0.05% of all branches          (62.51%)
       269,848,346      L1-dcache-loads           #   53.959 M/sec                    (62.49%)
           246,532      L1-dcache-load-misses     #    0.09% of all L1-dcache hits    (62.51%)
            13,327      LLC-loads                 #    0.003 M/sec                    (49.99%)
             7,417      LLC-load-misses           #   55.65% of all LL-cache hits     (50.02%)

       5.006139971 seconds time elapsed

What stands out is the low number of instructions per cycle on the Xeon, as well as the nonzero context switches, which I don't understand. However, I wasn't able to use these diagnostics to come up with an explanation.

And to add a bit more weirdness to the problem, while trying to debug I also compiled statically on one machine and executed it on the other.

On the Xeon, the statically compiled executable gives a ~10% lower count, with no difference between compiling on the Xeon or the i7.
Doing the same thing on the i7, the counter actually drops from ~3e8 to ~2e7.

So in the end I'm now left with two questions:

  • Why do I see such a significant difference between the two machines?
  • Why does a statically linked executable perform worse, when I would expect the opposite?

Edit: after updating the kernel on the CentOS 7 machine to 4.18, we actually see an additional drop from ~8e6 to ~5e6.

Interestingly, perf shows different numbers though:

   5002.000000      task-clock:u (msec)       #    0.999 CPUs utilized          
             0      context-switches:u        #    0.000 K/sec                  
             0      cpu-migrations:u          #    0.000 K/sec                  
           119      page-faults:u             #    0.024 K/sec                  
   409,723,790      cycles:u                  #    0.082 GHz                      (50.00%)
   392,228,592      instructions:u            #    0.96  insn per cycle           (62.51%)
   115,475,503      branches:u                #   23.086 M/sec                    (62.51%)
        26,355      branch-misses:u           #    0.02% of all branches          (62.53%)
   115,799,571      L1-dcache-loads:u         #   23.151 M/sec                    (62.51%)
        42,327      L1-dcache-load-misses:u   #    0.04% of all L1-dcache hits    (62.50%)
            88      LLC-loads:u               #    0.018 K/sec                    (49.96%)
             2      LLC-load-misses:u         #    2.27% of all LL-cache hits     (49.98%)

   5.005940327 seconds time elapsed

   0.533000000 seconds user
   4.469000000 seconds sys

It's interesting that there are no more context switches and that instructions per cycle went up significantly, but the cycle count, and therefore the reported clock, is extremely low!
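
To check whether the cost sits in the timer hardware itself or in the kernel's clock_gettime path, one can compare a raw rdtsc read against system_clock::now(). A minimal sketch of such a comparison, assuming an x86 CPU and GCC or Clang (for the __rdtsc intrinsic from <x86intrin.h>):

#include <chrono>
#include <cstdint>
#include <iostream>
#include <x86intrin.h>  // __rdtsc(), x86-specific, GCC/Clang

// Time n back-to-back reads of a time source and report the average ns per read.
template <class Read>
double ns_per_call (Read read, long n = 10'000'000) {
  volatile std::uint64_t sink = 0;  // keeps the reads from being optimized away
  auto const start = std::chrono::steady_clock::now();
  for (long i = 0; i < n; ++i) {
    sink = read();
  }
  auto const stop = std::chrono::steady_clock::now();
  (void)sink;
  return std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count()
         / static_cast<double>(n);
}

int main () {
  std::cout << "rdtsc:              "
            << ns_per_call([] { return __rdtsc(); }) << " ns/call\n";
  std::cout << "system_clock::now:  "
            << ns_per_call([] {
                 return static_cast<std::uint64_t>(
                     std::chrono::system_clock::now().time_since_epoch().count());
               })
            << " ns/call\n";
}

If the raw rdtsc read stays in the tens of nanoseconds while system_clock::now() takes hundreds, the bottleneck is most likely in how the kernel serves the clock read rather than in the CPU itself.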

Lorusso answered 16/8, 2018 at 8:15 Comment(12)
The timing source for the high-resolution timer is a very strong hardware dependency; motherboards are not created equal. Xeon is relevant because it is often used in multi-processor designs. That makes the NOTE for SMP systems section in the man page relevant; you have good evidence that you don't have to worry about that :)Glume
I find no significant difference between i7 6700K (4.17.14-arch1-1-ARCH) which gets 2.9e8 with gcc 8.2.0 and 3.0e8 with icpc 18.0.3, Xeon Gold 6144 (which was a bit busy at the time too) (4.9.0-4-amd64 Debian) which gets 2.5e8 with gcc 6.3.0 and 2.4e8 with icpc 18.0.2 and Xeon E5-1620 v3 (4.9.0-6-amd64 Debian) which gets 2.7e8 with gcc 6.3.0 and 2.6e8 with icpc 18.0.2. The 0.1 difference is probably just fluctuation, I only did 1 run on each. On that note, is it possible the large difference in kernel versions could have some impact on this particular call?Vally
Thank you for checking on so many platforms! I've just had the chance to check on a different system (Xeon E5-2630 v3, but kernel 4.17.3-1.el7.elrepo.x86_64), which yields 2.7e8. As this is very similar to the hardware in my question, I guess this does point towards the kernel. However, I'm very surprised that the kernel has such a large effect on such a simple task.Lorusso
Regarding the static link, it reduces the count by a bit more than a factor of 10 on all machines. It seems possible that the non-static implementation makes use of some OS functionality (that might not be available in the old kernel) whereas the static creates a general implementation (which could of course be much slower).Vally
But should the function not remain the same? Isn't it just a matter of either dynamically calling a function in a library or pulling the exact same code for that function into your own executable?Lorusso
I'd say yes and no, the function might make use of things that are not available on all systems. So yes, it will take a function, but it might not be the same function, it might be a more general implementation that is more likely to run on other machines. But I do not know what happens in this case, it's just a thought.Vally
@Vally so apparently the newer kernel version doesn't help in this case, see the above edit.Lorusso
@Christoph I would be a little worried about that cycle value there; 0.082 GHz doesn't sound okay. But if you normalise to that clock (for whatever reason), it seems much better than before. I don't see how the code has anything to do with the clock speed though; that's probably some issue with the OS. What is the CPU governor set to?Vally
@Vally thanks, the clock value had me worried as well, which is why I did a sysbench run on this machine and compared it to a different one. In that test, I didn't see any clock throttling... Also, `cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor` shows that it's set to performanceLorusso
@Christoph Huh, interesting. I don't know what could be causing the low clock speed then, if you say it's fine elsewhere. It says it uses 0.999 of the CPU so I see no reason for it to clock down (assuming we trust perf).Vally
Have you looked at which clock source the machines are using? It should be in /sys/devices/system/clocksource/clocksource0/current_clocksource . You can change it to any listed in /sys/devices/system/clocksource/clocksource0/available_clocksource. It made a big difference in AWS for me as the default clocksource (xen) was much slower than tsc.Tamworth
@Tamworth The i7 spits out tsc and the Xeon is on hpet. Available ones on the Xeon are only hpet and acpi_pm though. I don't have the rights on the Xeon to change this right now, but setting the i7 to hpet drops the count to ~6e6, so I can reproduce the Xeon behaviour! Can you turn this into an answer?Lorusso

I've been able to reproduce the respective measurements on both machines, thanks to @Imran's comment above. (Posting this answer to close the question; if Imran posts one, I'm happy to accept his instead.)

It is indeed related to the available clocksource. The Xeon, unfortunately, had the notsc flag in its kernel parameters, which is why the tsc clocksource wasn't available and couldn't be selected.

Thus, for anyone running into this problem (a small sketch combining these checks with the counting loop follows the list):
1. check your clocksource in /sys/devices/system/clocksource/clocksource0/current_clocksource
2. check available clocksources in /sys/devices/system/clocksource/clocksource0/available_clocksource
3. If you can't find tsc, check dmesg | grep tsc to see whether your kernel parameters contain notsc
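
A minimal sketch that folds the sysfs checks into the benchmark itself (Linux-specific paths, otherwise standard C++; it simply reruns the counting loop from the question):

#include <chrono>
#include <fstream>
#include <iostream>
#include <string>

int main () {
  // Report which clocksource the kernel is using and which ones are available
  // (Linux-specific sysfs paths).
  std::ifstream cur("/sys/devices/system/clocksource/clocksource0/current_clocksource");
  std::string current;
  if (!std::getline(cur, current) || current.empty()) current = "unknown";
  std::cout << "current clocksource:    " << current << '\n';

  std::ifstream av("/sys/devices/system/clocksource/clocksource0/available_clocksource");
  std::string available;
  if (!std::getline(av, available) || available.empty()) available = "unknown";
  std::cout << "available clocksources: " << available << '\n';

  // Same busy-loop as in the question: count system_clock::now() calls in 5 seconds.
  using namespace std::chrono_literals;
  long i = 0;
  auto const end = std::chrono::system_clock::now() + 5s;
  while (std::chrono::system_clock::now() < end) {
    ++i;
  }
  std::cout << "iterations in 5 s:      " << i << '\n';
}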

Lorusso answered 22/2, 2019 at 8:16 Comment(0)
