While testing something else, I stumbled across a behavior that I haven't managed to figure out yet. Consider this snippet:
#include <iostream>
#include <chrono>

int main() {
    int i = 0;
    using namespace std::chrono_literals;
    auto const end = std::chrono::system_clock::now() + 5s;
    while (std::chrono::system_clock::now() < end) {
        ++i;
    }
    std::cout << i;
}
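Note that apart from ++i, each iteration does exactly one call to std::chrono::system_clock::now(), so the count is essentially a throughput benchmark of that one call. As a sanity check, the call can be timed directly; a minimal sketch (my own check, not part of the original snippet):

#include <chrono>
#include <cstdio>

int main() {
    using clock = std::chrono::system_clock;
    constexpr long N = 10'000'000;
    long long acc = 0; // accumulate the results so the calls can't be optimized away
    auto t0 = clock::now();
    for (long n = 0; n < N; ++n)
        acc += clock::now().time_since_epoch().count();
    auto t1 = clock::now();
    auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
    std::printf("avg %.1f ns per system_clock::now() call (acc=%lld)\n",
                double(ns) / N, acc);
}

On a machine where now() goes through the vDSO, this typically prints a few tens of nanoseconds per call; when it falls back to a real syscall, it is usually an order of magnitude more.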
I've noticed that the final count heavily depends on the machine I execute it on. I compiled with gcc 7.3, gcc 8.2, and clang 6.0, using -std=c++17 -O3.

On an i7-4790 (kernel 4.17.14-arch1-1-ARCH): ~3e8
On a Xeon E5-2630 v4 (kernel 3.10.0-514.el7.x86_64): ~8e6
Now this is a difference of roughly 40x that I would like to understand, so I checked both machines with perf stat -d.

On the i7:
4999.419546 task-clock:u (msec) # 0.999 CPUs utilized
0 context-switches:u # 0.000 K/sec
0 cpu-migrations:u # 0.000 K/sec
120 page-faults:u # 0.024 K/sec
19,605,598,394 cycles:u # 3.922 GHz (49.94%)
33,601,884,120 instructions:u # 1.71 insn per cycle (62.48%)
7,397,994,820 branches:u # 1479.771 M/sec (62.53%)
34,788 branch-misses:u # 0.00% of all branches (62.58%)
10,809,601,166 L1-dcache-loads:u # 2162.171 M/sec (62.41%)
13,632 L1-dcache-load-misses:u # 0.00% of all L1-dcache hits (24.95%)
3,944 LLC-loads:u # 0.789 K/sec (24.95%)
1,034 LLC-load-misses:u # 26.22% of all LL-cache hits (37.42%)
5.003180401 seconds time elapsed
4.969048000 seconds user
0.016557000 seconds sys
On the Xeon:
5001.000000 task-clock (msec) # 0.999 CPUs utilized
42 context-switches # 0.008 K/sec
2 cpu-migrations # 0.000 K/sec
412 page-faults # 0.082 K/sec
15,100,238,798 cycles # 3.019 GHz (50.01%)
794,184,899 instructions # 0.05 insn per cycle (62.51%)
188,083,219 branches # 37.609 M/sec (62.49%)
85,924 branch-misses # 0.05% of all branches (62.51%)
269,848,346 L1-dcache-loads # 53.959 M/sec (62.49%)
246,532 L1-dcache-load-misses # 0.09% of all L1-dcache hits (62.51%)
13,327 LLC-loads # 0.003 M/sec (49.99%)
7,417 LLC-load-misses # 55.65% of all LL-cache hits (50.02%)
5.006139971 seconds time elapsed
What pops out is the low number of instructions per cycle on the Xeon, as well as the nonzero context switches, which I don't understand. However, I wasn't able to use these diagnostics to come up with an explanation.
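One back-of-the-envelope figure that may help: ~3e8 iterations in 5 s is roughly 17 ns per iteration on the i7, in the range one would expect for a vDSO-based clock_gettime, while ~8e6 iterations in 5 s is roughly 625 ns per iteration on the Xeon, more in line with a real syscall plus a slow hardware clock read. (These are typical ballpark costs, not something the counters above prove by themselves.)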
And to add a bit more weirdness to the problem, while trying to debug this I also compiled statically on one machine and executed it on the other. On the Xeon, the statically compiled executable gives a ~10% lower count, with no difference between compiling on the Xeon or on the i7. Doing the same thing on the i7, the counter actually drops from ~3e8 to ~2e7.
So in the end I'm now left with two questions:
- Why do I see such a significant difference between the two machines?
- Why does a statically linked executable perform worse, when I would expect the opposite?
Edit: after updating the kernel on the CentOS 7 machine to 4.18, we actually see an additional drop from ~8e6 to ~5e6.
Interestingly, though, perf shows quite different numbers:
5002.000000 task-clock:u (msec) # 0.999 CPUs utilized
0 context-switches:u # 0.000 K/sec
0 cpu-migrations:u # 0.000 K/sec
119 page-faults:u # 0.024 K/sec
409,723,790 cycles:u # 0.082 GHz (50.00%)
392,228,592 instructions:u # 0.96 insn per cycle (62.51%)
115,475,503 branches:u # 23.086 M/sec (62.51%)
26,355 branch-misses:u # 0.02% of all branches (62.53%)
115,799,571 L1-dcache-loads:u # 23.151 M/sec (62.51%)
42,327 L1-dcache-load-misses:u # 0.04% of all L1-dcache hits (62.50%)
88 LLC-loads:u # 0.018 K/sec (49.96%)
2 LLC-load-misses:u # 2.27% of all LL-cache hits (49.98%)
5.005940327 seconds time elapsed
0.533000000 seconds user
4.469000000 seconds sys
It's interesting that there are no more context switches and that instructions per cycle went up significantly, but the cycle count, and therefore the clock rate, is super low!
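The user/sys split above (0.533 s user vs 4.469 s sys) suggests that almost all of the time is now spent inside the kernel, i.e. clock_gettime is taken as a real syscall instead of through the vDSO fast path. One way to compare the two paths directly is to benchmark the ordinary clock_gettime against syscall(SYS_clock_gettime, ...), which always enters the kernel; a rough sketch (the bench helper is my own):

#include <time.h>
#include <cstdio>
#include <chrono>
#include <sys/syscall.h>
#include <unistd.h>

// Times N calls through the given gettime function, returns ns per call.
template <typename F>
static double bench(F f, long N) {
    timespec ts{};
    auto t0 = std::chrono::steady_clock::now();
    for (long n = 0; n < N; ++n)
        f(&ts);
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::nano>(t1 - t0).count() / N;
}

int main() {
    constexpr long N = 2'000'000;
    // Ordinary libc call: uses the vDSO fast path when the clocksource allows it.
    double fast = bench([](timespec* ts) { clock_gettime(CLOCK_REALTIME, ts); }, N);
    // Forced syscall: always traps into the kernel.
    double slow = bench([](timespec* ts) { syscall(SYS_clock_gettime, CLOCK_REALTIME, ts); }, N);
    std::printf("clock_gettime: %.0f ns/call, forced syscall: %.0f ns/call\n", fast, slow);
}

If both numbers come out similar (and high), the vDSO fast path is effectively not being used on that machine.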
[…] 2.7e8. As this is very similar to the hardware in my question, I guess this does point towards the kernel. However, I'm very surprised that the kernel has such a large effect on such a simple task. – Lorusso

The current clocksource is shown in /sys/devices/system/clocksource/clocksource0/current_clocksource. You can change it to any listed in /sys/devices/system/clocksource/clocksource0/available_clocksource. It made a big difference in AWS for me, as the default clocksource (xen) was much slower than tsc. – Tamworth
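For completeness, the two clocksource files from the comment above can also be read programmatically; a small sketch (paths exactly as given in the comment):

#include <fstream>
#include <iostream>
#include <string>

// Prints the active and the available Linux clocksources.
int main() {
    const std::string base = "/sys/devices/system/clocksource/clocksource0/";
    for (const char* file : {"current_clocksource", "available_clocksource"}) {
        std::ifstream in(base + file);
        std::string line;
        std::getline(in, line);
        std::cout << file << ": " << line << '\n';
    }
}

Switching to tsc (where it appears in available_clocksource) is done by writing it to current_clocksource as root.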