RDTSC always writes its 64-bit result split into hi/lo halves in EDX and EAX, even in 64-bit mode (see the manual), unfortunately not packing the 64-bit TSC into just RAX. That's why extra work is needed after the asm statement.
To make a single 64-bit integer from it, you need to shift `hi` to the place it belongs as part of an `unsigned long`. `lo` is already in the right place, and writing those 32-bit registers zeroed the upper halves of both full registers, so we can just OR the (shifted) halves together without having to AND the low half.
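A minimal sketch of that pattern in GNU C (the function name `rdtsc_u64` and the `lo`/`hi` output names are illustrative, not the kernel's):

```c
#include <stdint.h>

static inline uint64_t rdtsc_u64(void)
{
    uint32_t lo, hi;
    /* RDTSC leaves TSC[31:0] in EAX and TSC[63:32] in EDX; writing the
       32-bit registers zero-extends into RAX/RDX in 64-bit mode. */
    __asm__ volatile("rdtsc" : "=a"(lo), "=d"(hi));
    /* Shift the high half into place and OR; no AND mask needed on lo. */
    return ((uint64_t)hi << 32) | lo;
}
```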
In x86-64 Linux, `unsigned long` is a 64-bit type, so the kernel actually uses both halves of the RDTSC return value.
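(A one-line sanity check of that LP64 assumption, just for illustration:)

```c
#include <limits.h>
/* x86-64 Linux uses the LP64 model: unsigned long is 64 bits wide. */
_Static_assert(sizeof(unsigned long) * CHAR_BIT == 64, "expected LP64");
```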
The only reason the 32-bit version is simpler is that the kernel truncates the result to 32 bits by throwing away the high half. If you do want a 64-bit TSC in 32-bit mode, the same C source works there, too (with `uint64_t` or `unsigned long long`), although it wouldn't compile to shift and OR instructions. The compiler would just know that it has a 64-bit integer whose halves are in EDX and EAX.
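As an aside (not part of the answer above), in code that only targets 32-bit mode you can also let the constraint itself name the pair: GNU C's `"=A"` constraint means the EDX:EAX register pair on i386, so the shift/OR never appears in the source at all. Note that `"=A"` is wrong in 64-bit code, where it can pick just RAX or RDX.

```c
#include <stdint.h>

/* 32-bit-only sketch (compile with -m32). */
static inline uint64_t rdtsc32(void)
{
    uint64_t tsc;
    __asm__ volatile("rdtsc" : "=A"(tsc)); /* "=A" = EDX:EAX pair on i386 */
    return tsc;
}
```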
See also How to get the CPU cycle count in x86_64 from C++? - and for real use, don't forget to make these `asm volatile`. Otherwise the compiler can assume that repeated executions of this produce the same output, e.g. `end-start` = 0 after optimization.
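To make that failure mode concrete, here's a hypothetical timing harness (using the `rdtsc_u64` sketch from above; the loop is placeholder work). Without `volatile` on the asm, the compiler could CSE the two reads and fold `end - start` to 0:

```c
#include <stdint.h>
#include <stdio.h>

static inline uint64_t rdtsc_u64(void)
{
    uint32_t lo, hi;
    __asm__ volatile("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
    uint64_t start = rdtsc_u64();
    volatile unsigned sink = 0;
    for (unsigned i = 0; i < 100000; i++)
        sink += i;                      /* placeholder work being timed */
    uint64_t end = rdtsc_u64();
    printf("%llu reference cycles\n", (unsigned long long)(end - start));
}
```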
Comments:

"… `volatile` on the asm; they're not safe for timing if the compiler can see the start and end. I wonder if that's intentional in Linux because they're never using it for microbenchmarking inside the kernel? But IDK where in Linux it could usefully CSE." – Maugham

"… `__rdtsc()` intrinsic instead)." – Maugham
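For what that last (truncated) comment is pointing at: rather than hand-written asm, GCC and clang provide an `__rdtsc()` intrinsic in `<x86intrin.h>` (MSVC declares it in `<intrin.h>`) that returns the full 64-bit TSC directly. A minimal sketch:

```c
#include <stdio.h>
#include <x86intrin.h>

int main(void)
{
    unsigned long long tsc = __rdtsc(); /* full 64-bit TSC, no shift/OR */
    printf("%llu\n", tsc);
}
```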