Each physical core has its own TSC; the microcode doesn't have to go off-core, so there's no shared resource that the logical cores compete for. Going off-core at all would make it much slower and would make the implementation more complex. Having a counter physically inside each core is a simpler implementation: it just counts ticks of a reference-clock signal that's distributed to all cores.
With HyperThreading, the logical cores sharing a physical core always compete for execution resources. From Agner Fog's instruction tables, we know that RDTSC on Skylake is 20 uops for the front-end, with one per 25 cycle throughput. That's only 20 / 25 = 0.8 uops per clock while executing nothing but RDTSC instructions, so competing for the front-end (which can issue 4 uops per clock) is probably not a problem, even with both logical cores doing it.
Probably most of those uops can run on any execution port, so it's quite possible that both logical threads can run `rdtsc` with that throughput. But maybe there's a not-fully-pipelined execution unit that they'd compete for.
You can test it by putting `times 20 rdtsc` inside a loop that runs a few tens of millions of iterations, running that microbenchmark on a core by itself, and then running it twice, pinned to the two logical cores of one physical core.
I got curious and did that myself on Linux with `perf` on a Skylake i7-6700k, using `taskset -c 3` and `taskset -c 7`. (The way Linux enumerates the cores on this CPU, those numbers are the two logical cores of the 4th physical core. You can check `/proc/cpuinfo` to find out on your system.)
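One generic way to check (not specific to this machine): logical CPUs that report the same `core id` in `/proc/cpuinfo` are hyperthread siblings of one physical core.

```bash
# Logical CPUs (the "processor" number) sharing a "core id" are
# HT siblings on the same physical core.
grep -E '^(processor|core id)' /proc/cpuinfo
```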
To avoid interleaving the output lines if they both finish nearly simultaneously, I used bash process substitution with `cat <(cmd1) <(cmd2)` to run them both simultaneously and get the output printed in a fixed order. For each logical core, the command was

```
taskset -c 3 perf stat -etask-clock:u,context-switches,cpu-migrations,page-faults,cycles:u,instructions:u,branches:u,branch-misses:u,uops_issued.any:u,uops_executed.thread:u,cpu_clk_thread_unhalted.one_thread_active:u -r2 ./testloop
```

counting core clock cycles (not reference cycles), so I don't have to be paranoid about turbo / idle clock frequencies.
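Putting both halves together, the combined invocation would have been something like this (my reconstruction; only the `-c 3` half is spelled out above). Note that `perf stat` writes its report to stderr, hence the `2>&1` inside each substitution so `cat` actually collects it:

```bash
#!/bin/bash
# Same event list as the perf stat command above.
events=task-clock:u,context-switches,cpu-migrations,page-faults,cycles:u,instructions:u,branches:u,branch-misses:u,uops_issued.any:u,uops_executed.thread:u,cpu_clk_thread_unhalted.one_thread_active:u

# Process substitution starts both pinned runs at (nearly) the same time;
# cat prints core 3's full report before core 7's, with no interleaving.
cat <(taskset -c 3 perf stat -e "$events" -r2 ./testloop 2>&1) \
    <(taskset -c 7 perf stat -e "$events" -r2 ./testloop 2>&1)
```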
`testloop` is a static executable with a hand-written asm loop containing `times 20 rdtsc` (the NASM repeat operator) and `dec ebp` / `jnz`, with the top of the loop aligned by 64 in case that ever matters. Before the loop, `mov ebp, 10000000` initializes the counter. (See Can x86's MOV really be "free"? Why can't I reproduce this at all? for details on how I do microbenchmarks this way. Or see Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths for another example of a simple NASM program with a loop using `times` to repeat instructions.)
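The exact source isn't shown here, but based on that description it would look something like this minimal NASM sketch (assemble with `nasm -felf64` and link with `ld` to make a static executable):

```nasm
global _start
_start:
    mov ebp, 10000000        ; iteration count: 10M
align 64                     ; align the top of the loop by 64
.loop:
    times 20 rdtsc           ; NASM's times prefix assembles 20 copies of rdtsc
    dec ebp
    jnz .loop

    xor edi, edi             ; exit status 0
    mov eax, 231             ; Linux 64-bit exit_group system call number
    syscall
```

Output from the two instances pinned to logical cores 3 and 7 of the same physical core: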
```
 Performance counter stats for './testloop' (2 runs):

          1,278.19 msec task-clock:u                                #    1.000 CPUs utilized            ( +-  0.19% )
                 4      context-switches                            #    0.004 K/sec                    ( +- 11.11% )
                 0      cpu-migrations                              #    0.000 K/sec
                 2      page-faults                                 #    0.002 K/sec
     5,243,270,118      cycles:u                                    #    4.102 GHz                      ( +-  0.01% )  (71.37%)
       219,949,542      instructions:u                              #    0.04  insn per cycle           ( +-  0.01% )  (85.68%)
        10,000,692      branches:u                                  #    7.824 M/sec                    ( +-  0.03% )  (85.68%)
                32      branch-misses:u                             #    0.00% of all branches          ( +- 93.65% )  (85.68%)
     4,010,798,914      uops_issued.any:u                           # 3137.885 M/sec                    ( +-  0.01% )  (85.68%)
     4,010,969,168      uops_executed.thread:u                      # 3138.018 M/sec                    ( +-  0.00% )  (85.78%)
                 0      cpu_clk_thread_unhalted.one_thread_active:u #    0.000 K/sec                    (57.17%)

           1.27854 +- 0.00256 seconds time elapsed  ( +-  0.20% )


 Performance counter stats for './testloop' (2 runs):

          1,278.26 msec task-clock:u                                #    1.000 CPUs utilized            ( +-  0.18% )
                 6      context-switches                            #    0.004 K/sec                    ( +-  9.09% )
                 0      cpu-migrations                              #    0.000 K/sec
                 2      page-faults                                 #    0.002 K/sec                    ( +- 20.00% )
     5,245,894,686      cycles:u                                    #    4.104 GHz                      ( +-  0.02% )  (71.27%)
       220,011,812      instructions:u                              #    0.04  insn per cycle           ( +-  0.02% )  (85.68%)
         9,998,783      branches:u                                  #    7.822 M/sec                    ( +-  0.01% )  (85.68%)
                23      branch-misses:u                             #    0.00% of all branches          ( +- 91.30% )  (85.69%)
     4,010,860,476      uops_issued.any:u                           # 3137.746 M/sec                    ( +-  0.01% )  (85.68%)
     4,012,085,938      uops_executed.thread:u                      # 3138.704 M/sec                    ( +-  0.02% )  (85.79%)
             4,174      cpu_clk_thread_unhalted.one_thread_active:u #    0.003 M/sec                    ( +-  9.91% )  (57.15%)

           1.27876 +- 0.00265 seconds time elapsed  ( +-  0.21% )
```
vs. running alone:
```
 Performance counter stats for './testloop' (2 runs):

          1,223.55 msec task-clock:u                                #    1.000 CPUs utilized            ( +-  0.52% )
                 4      context-switches                            #    0.004 K/sec                    ( +- 11.11% )
                 0      cpu-migrations                              #    0.000 K/sec
                 2      page-faults                                 #    0.002 K/sec
     5,003,825,966      cycles:u                                    #    4.090 GHz                      ( +-  0.00% )  (71.31%)
       219,905,884      instructions:u                              #    0.04  insn per cycle           ( +-  0.04% )  (85.66%)
        10,001,852      branches:u                                  #    8.174 M/sec                    ( +-  0.04% )  (85.66%)
                17      branch-misses:u                             #    0.00% of all branches          ( +- 52.94% )  (85.78%)
     4,012,165,560      uops_issued.any:u                           # 3279.113 M/sec                    ( +-  0.03% )  (85.78%)
     4,010,429,819      uops_executed.thread:u                      # 3277.694 M/sec                    ( +-  0.01% )  (85.78%)
        28,452,608      cpu_clk_thread_unhalted.one_thread_active:u #   23.254 M/sec                    ( +-  0.20% )  (57.01%)

           1.22396 +- 0.00660 seconds time elapsed  ( +-  0.54% )
```
(The `cpu_clk_thread_unhalted.one_thread_active:u` counter only counts at some slow rate; the system was fairly idle during this test, so the process should have had the core to itself the whole time. I.e., that ~23.2 M counts / sec does represent single-thread mode.)

That, vs. the 0 and near-0 counts for the runs together, shows that I succeeded in having these tasks run simultaneously on the same core, with hyperthreading, for basically the whole time (~1.2 seconds repeated twice, or 2.4 seconds).
So 5.0038G cycles / 10M iters / 20 rdtsc/iter = 25.019 cycles per RDTSC single-threaded, pretty much what Agner Fog measured.

Averaging across both processes for the HT test: 5.244G cycles / 10M iters / 20 rdtsc/iter = 26.22 cycles per RDTSC.
So running RDTSC on both logical cores simultaneously on Skylake gives a nearly linear speedup, with very minimal competition for throughput resources. Whatever RDTSC bottlenecks on, it's not something that both threads compete for or slow each other down with.
Having the other logical core busy running high-throughput code (code that could sustain 4 uops per clock if it had the core to itself) would probably hurt an RDTSC thread more than another thread that's also just running RDTSC. Maybe we could even figure out whether there's one specific port that RDTSC needs more than the others; e.g. port 1 is easy to saturate because it's the only port that can run integer multiply instructions.
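If someone wanted to try that follow-up experiment, a hypothetical port-1-saturating workload for the sibling thread might look like this (my sketch in the same style as `testloop`, not tested here; on Skylake, `imul r32, r32` has 3-cycle latency and runs only on port 1, so interleaving four independent dependency chains gives enough ILP to keep port 1 busy every cycle):

```nasm
global _start
_start:
    mov ebp, 100000000       ; plenty of iterations
    mov r8d, 123             ; arbitrary multiplier source
align 64
.loop:
%rep 5
    imul eax, r8d            ; four interleaved chains of dependent multiplies:
    imul ecx, r8d            ; 20 imuls per iteration want 20 cycles of port 1,
    imul edx, r8d            ; but each chain only needs 5 * 3 = 15 cycles,
    imul esi, r8d            ; so port-1 throughput is the bottleneck
%endrep
    dec ebp                  ; loop overhead runs on other ports
    jnz .loop

    xor edi, edi
    mov eax, 231             ; exit_group(0)
    syscall
```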
> Could you share the `testloop` code? And what is the `times` instruction? I cannot find anything, probably because of its ambiguous name. – Snowblind