How to count clock cycles with RDTSC in GCC x86? [duplicate]

R

4

22

With Visual Studio I can read the clock cycle count from the processor as shown below. How do I do the same thing with GCC?

#ifdef _MSC_VER             // Compiler: Microsoft Visual Studio

    #ifdef _M_IX86                      // Processor: x86

        inline uint64_t clockCycleCount()
        {
            uint64_t c;
            __asm {
                cpuid       // serialize processor
                rdtsc       // read time stamp counter
                mov dword ptr [c + 0], eax
                mov dword ptr [c + 4], edx
            }
            return c;
        }

    #elif defined(_M_X64)               // Processor: x64

        extern "C" unsigned __int64 __rdtsc();
        #pragma intrinsic(__rdtsc)
        inline uint64_t clockCycleCount()
        {
            return __rdtsc();
        }

    #endif

#endif

Radiochemical answered 27/3, 2012 at 10:32 Comment(1)

arm: #40454657 – Flamethrower 15/4, 2018 at 11:46

T

24

Update: reposted and updated this answer on a more canonical question. I'll probably delete this at some point once we sort out which question to use as the duplicate target for closing all the similar rdtsc questions.

You don't need and shouldn't use inline asm for this. There's no benefit; compilers have built-ins for rdtsc and rdtscp, and (at least these days) all define a __rdtsc intrinsic if you include the right headers. https://gcc.gnu.org/wiki/DontUseInlineAsm

Unfortunately MSVC disagrees with everyone else about which header to use for non-SIMD intrinsics. (Intel's intrinsics guide says #include <immintrin.h> for this, but with gcc and clang the non-SIMD intrinsics are mostly in x86intrin.h.)

#ifdef _MSC_VER
#include <intrin.h>
#else
#include <x86intrin.h>
#endif

// optional wrapper if you don't want to just use __rdtsc() everywhere
inline
unsigned long long readTSC() {
    // _mm_lfence();  // optionally wait for earlier insns to retire before reading the clock
    return __rdtsc();
    // _mm_lfence();  // optionally block later instructions until rdtsc retires
}

Compiles with all 4 of the major compilers: gcc/clang/ICC/MSVC, for 32 or 64-bit. See the results on the Godbolt compiler explorer.

For more about using lfence to improve repeatability of rdtsc, see @HadiBrais' answer on clflush to invalidate cache line via C function.

See also Is LFENCE serializing on AMD processors? (TL;DR: yes with Spectre mitigation enabled; otherwise kernels leave the relevant MSR unset.)
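To make the lfence suggestion concrete, here is a minimal sketch of a fenced read, assuming GCC or clang on x86-64 (where SSE2, and therefore _mm_lfence, is always available). The name fenced_rdtsc is made up for illustration; it is not part of any header.

```c
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc and _mm_lfence with GCC/clang; MSVC uses <intrin.h> */

/* Hypothetical helper (not from any library): a TSC read with lfence on both sides. */
static inline uint64_t fenced_rdtsc(void) {
    _mm_lfence();             /* earlier instructions complete before rdtsc executes */
    uint64_t t = __rdtsc();
    _mm_lfence();             /* later instructions don't start until rdtsc has executed */
    return t;
}
```

Call it once before and once after the region you're timing and subtract. The fences have a cost of their own, so this mostly pays off when the timed region is short enough that out-of-order overlap would otherwise distort the result.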

`rdtsc` counts reference cycles, not CPU core clock cycles

On modern CPUs (those with an invariant TSC), it counts at a fixed frequency regardless of turbo / power-saving, so if you want uops-per-clock analysis, use performance counters. rdtsc is exactly correlated with wall-clock time (except for system clock adjustments, so it's basically steady_clock). It typically ticks at the CPU's rated frequency, i.e. the advertised sticker frequency.

If you use it for microbenchmarking, include a warm-up period first to make sure your CPU is already at max clock speed before you start timing. Or better, use a library that gives you access to hardware performance counters, or a trick like perf stat for part of a program: if your timed region is long enough, you can attach to the running process with perf stat -p PID. You will usually still want to avoid CPU frequency shifts during your microbenchmark, though.
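Because the TSC ticks at a fixed rate, tick counts can be converted to time once you know that rate. Here is a rough sketch of calibrating it against CLOCK_MONOTONIC, assuming Linux (or another POSIX system) with GCC or clang. monotonic_ns and estimate_tsc_hz are hypothetical names, and the result is only an estimate.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* for clock_gettime when compiling with -std=c11 etc. */
#endif
#include <stdint.h>
#include <time.h>
#include <x86intrin.h>            /* __rdtsc with GCC/clang */

/* Hypothetical helper: CLOCK_MONOTONIC as a nanosecond count. */
static uint64_t monotonic_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Estimate TSC ticks per second by busy-waiting for at least interval_ns. */
static double estimate_tsc_hz(uint64_t interval_ns) {
    uint64_t t0 = monotonic_ns();
    uint64_t c0 = __rdtsc();
    uint64_t t1;
    do {
        t1 = monotonic_ns();
    } while (t1 - t0 < interval_ns);
    uint64_t c1 = __rdtsc();
    /* ticks / elapsed nanoseconds, scaled to ticks per second */
    return (double)(c1 - c0) * 1e9 / (double)(t1 - t0);
}
```

For example, estimate_tsc_hz(20000000) spins for about 20 ms. Dividing a tick delta by the returned value gives seconds, but it still says nothing about core clock cycles.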

It's also not guaranteed that the TSCs of all cores are in sync. So if your thread migrates to another CPU core between two __rdtsc() calls, there can be extra skew. (Most OSes attempt to sync the TSCs of all cores, though.) If you're using rdtsc directly, you probably want to pin your program or thread to a core, e.g. with taskset -c 0 ./myprogram on Linux.

How good is the asm from using the intrinsic?

It's at least as good as anything you could do with inline asm.

A non-inline version of it compiles with MSVC for x86-64 like this:

unsigned __int64 readTSC(void) PROC                             ; readTSC
    rdtsc
    shl     rdx, 32                             ; 00000020H
    or      rax, rdx
    ret     0
  ; return in RAX

For 32-bit calling conventions that return 64-bit integers in edx:eax, it's just rdtsc/ret. Not that it matters, you always want this to inline.

In a test caller that uses it twice and subtracts to time an interval:

uint64_t time_something() {
    uint64_t start = readTSC();
    // even when empty, back-to-back __rdtsc() don't optimize away
    return readTSC() - start;
}

All 4 compilers make pretty similar code. This is GCC's 32-bit output:

# gcc8.2 -O3 -m32
time_something():
    push    ebx               # save a call-preserved reg: 32-bit only has 3 scratch regs
    rdtsc
    mov     ecx, eax
    mov     ebx, edx          # start in ebx:ecx
      # timed region (empty)

    rdtsc
    sub     eax, ecx
    sbb     edx, ebx          # edx:eax -= ebx:ecx

    pop     ebx
    ret                       # return value in edx:eax

This is MSVC's x86-64 output (with name-demangling applied). gcc/clang/ICC all emit identical code.

# MSVC 19  2017  -Ox
unsigned __int64 time_something(void) PROC                            ; time_something
    rdtsc
    shl     rdx, 32                  ; high <<= 32
    or      rax, rdx
    mov     rcx, rax                 ; missed optimization: lea rcx, [rdx+rax]
                                     ; rcx = start
     ;; timed region (empty)

    rdtsc
    shl     rdx, 32
    or      rax, rdx                 ; rax = end

    sub     rax, rcx                 ; end -= start
    ret     0
unsigned __int64 time_something(void) ENDP                            ; time_something

All 4 compilers use or+mov instead of lea to combine the low and high halves into a different register. I guess it's kind of a canned sequence that they fail to optimize.

But writing it in inline asm yourself is hardly better. You'd deprive the compiler of the opportunity to ignore the high 32 bits of the result in EDX, if you're timing such a short interval that you only keep a 32-bit result. Or if the compiler decides to store the start time to memory, it could just use two 32-bit stores instead of shift/or / mov. If 1 extra uop as part of your timing bothers you, you'd better write your whole microbenchmark in pure asm.
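As a small illustration of the 32-bit case, here is a sketch assuming GCC or clang and an interval shorter than 2^32 ticks (roughly a second at a few GHz). The helper names are hypothetical.

```c
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc with GCC/clang; MSVC uses <intrin.h> */

/* Hypothetical helper: keep only the low 32 bits of the TSC. */
static inline uint32_t read_tsc32(void) {
    return (uint32_t)__rdtsc();
}

/* Ticks between two 32-bit samples. Unsigned subtraction wraps modulo 2^32,
   so the result is correct as long as the real interval is below 2^32 ticks,
   even if the low half of the counter wrapped in between. */
static inline uint32_t tsc32_elapsed(uint32_t start, uint32_t end) {
    return end - start;
}
```

Since the truncation is visible to the compiler, it is free to leave the high half in EDX unused, which hand-written asm that always combines both halves would not allow.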

Tait answered 18/8, 2018 at 10:3 Comment(1)

Although I agree with the DontUseInlineAsm advice in general, it seems like a call to rdtsc (just that single instruction, with proper input and output dependencies: seems like it will solve the "ignore edx problem") is pretty much a case where it never is going to be a problem. I'm mostly just annoyed that x86intrin.h is a giant header taking 300ms just to parse on my system. – Rosauraroscius 26/4, 2020 at 23:26

J

33

The other answers work, but you can avoid inline assembly by using GCC's __rdtsc intrinsic, available by including x86intrin.h.

It is defined at: gcc/config/i386/ia32intrin.h:

/* rdtsc */
extern __inline unsigned long long
__attribute__((__gnu_inline__, __always_inline__, __artificial__))
__rdtsc (void)
{
  return __builtin_ia32_rdtsc ();
}

Julianajuliane answered 2/12, 2014 at 19:56 Comment(2)

It should be noted that the effect will be pretty much the same (but much more readable!), since this intrinsic typically has the signature extern __inline unsigned long long __attribute__((__gnu_inline__, __always_inline__, __artificial__)) __rdtsc (void), i.e. it will still be inlined in the resulting binary. – Ashby 4/2, 2016 at 13:34

I was using __rdtsc() with gcc, but then I switched to g++ and __rdtsc no longer works. – Brachypterous 9/9, 2019 at 16:26

P

28

On recent versions of Linux, gettimeofday is backed by a high-resolution clock source, although it only reports microseconds; clock_gettime reports nanoseconds.

If you really want to call RDTSC you can use the following inline assembly:

http://www.mcs.anl.gov/~kazutomo/rdtsc.html

#if defined(__i386__)

static __inline__ unsigned long long rdtsc(void)
{
    unsigned long long int x;
    __asm__ volatile (".byte 0x0f, 0x31" : "=A" (x));
    return x;
}

#elif defined(__x86_64__)

static __inline__ unsigned long long rdtsc(void)
{
    unsigned hi, lo;
    __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
    return ( (unsigned long long)lo)|( ((unsigned long long)hi)<<32 );
}

#endif

Pau answered 27/3, 2012 at 10:36 Comment(5)

Yes, I really do need RDTSC, and now I have it. Thank you. – Garold 27/3, 2012 at 10:42

this code lacks a serializing instruction, so on any modern processor (which is out-of-order), it'll yield incorrect results. usually cpuid is used. – Forfend 23/11, 2016 at 16:1

The 64-bit version generates poor assembly with gcc. To improve it, shift rdx 32 bits to the left and or it with rax manually. The result is in rax. – Footworn 9/3, 2017 at 10:49

@Forfend - incorrect is pretty strong here. It's probably more accurate to say that without cpuid the actual moment in time at which the timestamp is returned will be spread over a number of instructions before and after where the actual rdtsc call occurs. If you are trying to time a small section of code this may be a bad thing, but if you are generating, say, a kind of timestamp it might be fine. For example, the Linux kernel uses rdtsc as part of its time-calculation flow without cpuid. – Rosauraroscius 2/8, 2017 at 22:47

You don't need inline asm for this at all. I added a modern answer using __rdtsc() which compiled on all 4 major x86 compilers. – Tait 18/8, 2018 at 10:4

C

8

On Linux with gcc, I use the following:

/* define this somewhere */
#include <stdint.h>
#ifdef __i386
static __inline__ uint64_t rdtsc(void) {
  uint64_t x;
  __asm__ volatile ("rdtsc" : "=A" (x));   /* "=A" is the edx:eax pair on 32-bit */
  return x;
}
#elif defined(__amd64)
static __inline__ uint64_t rdtsc(void) {
  uint64_t a, d;
  __asm__ volatile ("rdtsc" : "=a" (a), "=d" (d));   /* low half in rax, high half in rdx */
  return (d<<32) | a;
}
#endif

/* now, in your function, do the following */
uint64_t t;
t = rdtsc();
// ... the stuff that you want to time ...
t = rdtsc() - t;
// t now contains the number of TSC ticks (reference cycles) elapsed

Coset answered 27/3, 2012 at 10:41 Comment(0)
