Can different processes run RDTSC at the same time?

Can different processes run RDTSC at the same time? Or is this a resource that only one core can operate on at a time? The TSC exists in every core (at least you can adjust it separately for every core), so it should be possible. But what about Hyper-Threading?

How can I test this?

Asked by Snowblind on 4/6, 2019 at 8:02

Each physical core has its own TSC; the microcode doesn't have to go off-core, so there's no shared resource that they compete for. Going off-core at all would make it much slower and would make the implementation more complex. Having a counter physically inside each core is a simpler implementation, just counting ticks of a reference-clock signal that's distributed to all cores.

With HyperThreading, the two logical cores sharing a physical core always compete for execution resources. From Agner Fog's instruction tables, we know that RDTSC on Skylake is 20 uops for the front-end and has one per 25 cycles throughput. That's 20 uops / 25 cycles ≈ 0.8 uops per clock while executing nothing but RDTSC instructions, well below the 4-wide issue width, so competing for the front-end is probably not a problem.

Probably most of those uops can run on any execution port, so it's quite possible that both logical threads can run rdtsc with that throughput.

But maybe there's a not-fully-pipelined execution unit that they'd compete for.

You can test it by putting times 20 rdtsc inside a loop that runs a few tens of millions of iterations, running that microbenchmark on a core by itself, and then running it twice, pinned to the two logical cores of one physical core.

I got curious and did that myself on Linux with perf on a Skylake i7-6700k, with taskset -c 3 and taskset -c 7 (the way Linux enumerates the cores on this CPU, those numbers are the logical cores of the 4th physical core. You can check /proc/cpuinfo to find out on your system.)
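
(As an aside, not from the original answer: a scriptable way to find which logical CPUs are hyperthread siblings on Linux is to read the topology files under sysfs, something like this:)

# Hypothetical helper, assuming the usual Linux sysfs layout:
# print each logical CPU together with its hyperthread sibling(s).
for c in /sys/devices/system/cpu/cpu[0-9]*; do
    echo "$(basename "$c"): $(cat "$c"/topology/thread_siblings_list)"
done
# On the i7-6700k described here, cpu3 and cpu7 should both report "3,7".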

To avoid interleaving the output lines if both runs finish nearly simultaneously, I used bash process substitution, cat <(cmd1) <(cmd2), to start them at the same time and print their output in a fixed order. Each command was taskset -c 3 perf stat -etask-clock:u,context-switches,cpu-migrations,page-faults,cycles:u,instructions:u,branches:u,branch-misses:u,uops_issued.any:u,uops_executed.thread:u,cpu_clk_thread_unhalted.one_thread_active:u -r2 ./testloop (and the same with -c 7), counting core clock cycles rather than reference cycles so I don't have to be paranoid about turbo / idle clock frequencies.
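
Concretely, the composed invocation looks something like this (my sketch of the setup just described; perf stat writes its report to stderr, hence the 2>&1 redirections so cat sees it):

# Paired run: the same event list for both processes, pinned to HT siblings 3 and 7.
EV=task-clock:u,context-switches,cpu-migrations,page-faults,cycles:u,instructions:u
EV=$EV,branches:u,branch-misses:u,uops_issued.any:u,uops_executed.thread:u
EV=$EV,cpu_clk_thread_unhalted.one_thread_active:u
cat <(taskset -c 3 perf stat -e "$EV" -r2 ./testloop 2>&1) \
    <(taskset -c 7 perf stat -e "$EV" -r2 ./testloop 2>&1)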

testloop is a static executable with a hand-written asm loop containing times 20 rdtsc (NASM repeat operator) and dec ebp/jnz, with the top of the loop aligned by 64 in case that ever matters. Before the loop, mov ebp, 10000000 initializes the counter. (See Can x86's MOV really be "free"? Why can't I reproduce this at all? for details on how I do microbenchmarks this way. Or Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths, for another example of a simple NASM program with a loop using times to repeat instructions.)
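
For reference, a minimal sketch of how such a testloop can be built (my reconstruction from the description above, not the exact source used for the numbers below):

# Reconstructed sketch: 10M iterations of 20 back-to-back RDTSCs,
# assembled with NASM into a static Linux executable.
cat > testloop.asm <<'EOF'
global _start
section .text
_start:
    mov     ebp, 10000000       ; loop counter: 10M iterations
    align   64, nop             ; align the top of the loop, padding with NOPs
.loop:
    times 20 rdtsc              ; NASM 'times' repeats the instruction 20 times
    dec     ebp
    jnz     .loop
    mov     eax, 231            ; exit_group(0) system call
    xor     edi, edi
    syscall
EOF
nasm -felf64 testloop.asm
ld -o testloop testloop.o

Results for the two hyperthread-paired runs: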

 Performance counter stats for './testloop' (2 runs):

          1,278.19 msec task-clock:u              #    1.000 CPUs utilized            ( +-  0.19% )
                 4      context-switches          #    0.004 K/sec                    ( +- 11.11% )
                 0      cpu-migrations            #    0.000 K/sec                  
                 2      page-faults               #    0.002 K/sec                  
     5,243,270,118      cycles:u                  #    4.102 GHz                      ( +-  0.01% )  (71.37%)
       219,949,542      instructions:u            #    0.04  insn per cycle           ( +-  0.01% )  (85.68%)
        10,000,692      branches:u                #    7.824 M/sec                    ( +-  0.03% )  (85.68%)
                32      branch-misses:u           #    0.00% of all branches          ( +- 93.65% )  (85.68%)
     4,010,798,914      uops_issued.any:u         # 3137.885 M/sec                    ( +-  0.01% )  (85.68%)
     4,010,969,168      uops_executed.thread:u    # 3138.018 M/sec                    ( +-  0.00% )  (85.78%)
                 0      cpu_clk_thread_unhalted.one_thread_active:u #    0.000 K/sec                    (57.17%)

           1.27854 +- 0.00256 seconds time elapsed  ( +-  0.20% )


 Performance counter stats for './testloop' (2 runs):

          1,278.26 msec task-clock:u              #    1.000 CPUs utilized            ( +-  0.18% )
                 6      context-switches          #    0.004 K/sec                    ( +-  9.09% )
                 0      cpu-migrations            #    0.000 K/sec                  
                 2      page-faults               #    0.002 K/sec                    ( +- 20.00% )
     5,245,894,686      cycles:u                  #    4.104 GHz                      ( +-  0.02% )  (71.27%)
       220,011,812      instructions:u            #    0.04  insn per cycle           ( +-  0.02% )  (85.68%)
         9,998,783      branches:u                #    7.822 M/sec                    ( +-  0.01% )  (85.68%)
                23      branch-misses:u           #    0.00% of all branches          ( +- 91.30% )  (85.69%)
     4,010,860,476      uops_issued.any:u         # 3137.746 M/sec                    ( +-  0.01% )  (85.68%)
     4,012,085,938      uops_executed.thread:u    # 3138.704 M/sec                    ( +-  0.02% )  (85.79%)
             4,174      cpu_clk_thread_unhalted.one_thread_active:u #    0.003 M/sec                    ( +-  9.91% )  (57.15%)

           1.27876 +- 0.00265 seconds time elapsed  ( +-  0.21% )

vs. running alone:

 Performance counter stats for './testloop' (2 runs):

          1,223.55 msec task-clock:u              #    1.000 CPUs utilized            ( +-  0.52% )
                 4      context-switches          #    0.004 K/sec                    ( +- 11.11% )
                 0      cpu-migrations            #    0.000 K/sec                  
                 2      page-faults               #    0.002 K/sec                  
     5,003,825,966      cycles:u                  #    4.090 GHz                      ( +-  0.00% )  (71.31%)
       219,905,884      instructions:u            #    0.04  insn per cycle           ( +-  0.04% )  (85.66%)
        10,001,852      branches:u                #    8.174 M/sec                    ( +-  0.04% )  (85.66%)
                17      branch-misses:u           #    0.00% of all branches          ( +- 52.94% )  (85.78%)
     4,012,165,560      uops_issued.any:u         # 3279.113 M/sec                    ( +-  0.03% )  (85.78%)
     4,010,429,819      uops_executed.thread:u    # 3277.694 M/sec                    ( +-  0.01% )  (85.78%)
        28,452,608      cpu_clk_thread_unhalted.one_thread_active:u #   23.254 M/sec                    ( +-  0.20% )  (57.01%)

           1.22396 +- 0.00660 seconds time elapsed  ( +-  0.54% )

(The cpu_clk_thread_unhalted.one_thread_active:u counter apparently only ticks at some slower rate; the system was fairly idle during this test, so the process should have had the physical core to itself the whole time, i.e. that ~23.2 M counts/sec does represent single-thread mode.)

By contrast, the 0 and near-0 counts for the paired runs show that I succeeded in having these tasks run simultaneously on the same physical core, with hyperthreading, for basically the whole time (~1.2 seconds repeated twice, i.e. about 2.4 seconds).

So 5.0038G cycles / 10M iters / 20 rdtsc/iter = 25.019 cycles per RDTSC single-threaded, pretty much what Agner Fog measured.

Averaging across both processes for the HT test, that's about 5.244G cycles / 10M iters / 20 rdtsc/iter = 26.22 cycles per RDTSC.

So running RDTSC on both logical cores of the same physical core simultaneously on Skylake scales almost perfectly: each thread is only about 5% slower than when running alone (26.2 vs. 25.0 cycles per RDTSC), so the combined throughput nearly doubles. Whatever RDTSC bottlenecks on, it's not a resource that the two logical cores seriously compete for or slow each other down with.

Having the other logical core busy running high-throughput code (code that could sustain 4 uops per clock if it had the core to itself) would probably hurt an RDTSC thread more than another thread that's also just running RDTSC does. Maybe we could even figure out whether there's one specific port that RDTSC needs more than others; e.g. port 1 is easy to saturate because it's the only port that can run integer multiply instructions.
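
To poke at that, a companion microbenchmark could keep port 1 saturated on the sibling logical core while the RDTSC loop runs; this is my sketch (untested here, assuming Skylake port assignments):

# Sketch: a loop of independent integer multiplies to keep port 1 busy,
# intended to run on the hyperthread sibling of the RDTSC loop.
cat > imul_loop.asm <<'EOF'
global _start
section .text
_start:
    mov     ebp, 100000000      ; iteration count
    mov     eax, 1
    mov     ecx, 1
    mov     esi, 1
    align   64, nop
.loop:
    imul    eax, eax            ; three independent dependency chains (3c latency,
    imul    ecx, ecx            ; 1c throughput each) keep port 1 busy every cycle
    imul    esi, esi
    dec     ebp
    jnz     .loop
    mov     eax, 231            ; exit_group(0)
    xor     edi, edi
    syscall
EOF
nasm -felf64 imul_loop.asm && ld -o imul_loop imul_loop.o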

Answered by Emlen on 4/6, 2019 at 8:41

Comments (8):
Can you post the full testloop code? And what is the times instruction? I can't find anything about it, probably because of its ambiguous name. – Snowblind
times is a NASM operator that repeats the instruction that many times. Like I said in my answer, Can x86's MOV really be "free"? Why can't I reproduce this at all? has the full source code; just replace the loop body with times 20 rdtsc. – Emlen
It's amazing that something as superficially simple as rdtsc takes 20 μops. Anyone have an idea why this is the case? I would have expected it to just be reading some timestamp register. – Goethite
@BrennanVincent: I was surprised, too. Maybe it has something to do with virtualization being able to scale and offset it for guest VMs? (And even when not running inside a VM, it always decodes the same way.) – Emlen
Don't know how recent a CPU model this requires, but my laptop's Intel Pentium T4300 with 2 cores increments the RDTSC value at different rates: 0.48 ns/cycle most of the time, and sometimes 0.96 ns/cycle (rarely, when overheated). I measured this with a C++ program using the __rdtsc() intrinsic. I bought my laptop around 2008. – Giuliana
Do you happen to know what else besides RDTSC I can use to measure precise time? I need an operation that 1) is very fast, taking at most a few CPU cycles; 2) is very precise, with a resolution of 5-10 nanoseconds or better; 3) has the same (unchangeable) frequency on all CPUs (even old ones); 4) can measure in any unit of time (nanoseconds, cycles, ticks, etc.), as long as I can convert it to nanoseconds. – Giuliana
@Arty: First part answered in the comments on How to get the CPU cycle count in x86_64 from C++?, where you asked the same thing. Besides RDTSC, there's also RDPMC, which can be lower overhead, but its "cycles" event counts core cycles (and is thus variable with frequency). RDTSC is the best you can get; most systems never throttle, so constant_tsc is sufficient for fine-grained offsets from the last timer increment, as long as the CPU doesn't go to sleep. – Emlen
@Arty: Before constant_tsc, there was no good option (both precise and low overhead); that's why CPU vendors repurposed RDTSC to be usable as a time source even with variable CPU frequency: software demand for such a thing. Newer CPUs with an invariant TSC make it even better, not halting during sleep states. – Emlen
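
(For reference, not from the original comments: on Linux you can see which of these TSC properties the kernel detected from the CPU feature flags; constant_tsc plus nonstop_tsc together are what Intel documents as an invariant TSC.)

# Print the TSC-related feature flags this CPU advertises (if any).
grep -oE 'constant_tsc|nonstop_tsc' /proc/cpuinfo | sort -u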
