I'm looking into measuring benchmark performance using the time-stamp register (TSR) found in x86 CPUs, more commonly called the time-stamp counter (TSC). It's a useful register, since on modern CPUs with an invariant TSC it ticks at a constant rate that is unaffected by changes in clock speed. Very cool.
Here is an Intel document showing asm snippets for reliably benchmarking using the TSR, including using cpuid for pipeline synchronisation. See page 16:
To read the start time, it says (I annotated a bit):
    __asm volatile (
        "cpuid\n\t"           // writes e[abcd]x
        "rdtsc\n\t"           // writes edx, eax
        "mov %%edx, %0\n\t"
        "mov %%eax, %1\n\t"
        //
        : "=r" (cycles_high), "=r" (cycles_low)  // outputs
        :                                        // inputs
        : "%rax", "%rbx", "%rcx", "%rdx");       // clobber
I'm wondering why scratch registers are used to take the values of edx and eax. Why not remove the movs and read the TSR value right out of edx and eax? Like this:
    __asm volatile(
        "cpuid\n\t"
        "rdtsc\n\t"
        //
        : "=d" (cycles_high), "=a" (cycles_low)  // outputs
        :                                        // inputs
        : "%rbx", "%rcx");                       // clobber
By doing this, you save two registers, reducing the likelihood of the C compiler needing to spill.
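To make the comparison concrete, here is roughly how I'd wrap that second version in a function. This is only a sketch: it assumes GCC or Clang on x86-64, and tsc_start is a name I made up for illustration, not something from the Intel document.

```c
#include <stdint.h>

/* Sketch only: assumes GCC/Clang inline asm on x86-64.
   tsc_start is a hypothetical helper name. */
static inline uint64_t tsc_start(void)
{
    uint32_t cycles_high, cycles_low;

    __asm__ volatile(
        "cpuid\n\t"   /* serialises; writes eax, ebx, ecx, edx */
        "rdtsc\n\t"   /* writes edx (high half) and eax (low half) */
        : "=d" (cycles_high), "=a" (cycles_low)  /* read edx/eax directly */
        :                                        /* no inputs */
        : "rbx", "rcx");                         /* cpuid also writes these */

    /* Combine the two 32-bit halves into one 64-bit tick count. */
    return ((uint64_t)cycles_high << 32) | cycles_low;
}
```

Since edx and eax are declared as outputs, they no longer need to appear in the clobber list; only the two registers that cpuid writes and that are not outputs remain there.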
Am I right? Or are those MOVs somehow strategic?
(I agree that you do need scratch registers to read the stop time, as in that scenario the order of the instructions is reversed: you have rdtscp, ..., cpuid. The cpuid instruction destroys the result of rdtscp).
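For clarity, this is the kind of stop-time sequence I have in mind. It is my own sketch of that ordering, assuming GCC or Clang on x86-64 and a CPU that supports rdtscp; tsc_stop is a hypothetical name.

```c
#include <stdint.h>

/* Sketch only: assumes GCC/Clang inline asm on x86-64 with rdtscp support.
   tsc_stop is a hypothetical helper name. */
static inline uint64_t tsc_stop(void)
{
    uint32_t cycles_high, cycles_low;

    __asm__ volatile(
        "rdtscp\n\t"          /* writes edx, eax and ecx */
        "mov %%edx, %0\n\t"   /* save the high half before cpuid overwrites edx */
        "mov %%eax, %1\n\t"   /* save the low half before cpuid overwrites eax */
        "cpuid\n\t"           /* writes e[abcd]x */
        : "=r" (cycles_high), "=r" (cycles_low)  /* scratch registers */
        :                                        /* no inputs */
        : "rax", "rbx", "rcx", "rdx");           /* all four are overwritten */

    /* Combine the two 32-bit halves into one 64-bit tick count. */
    return ((uint64_t)cycles_high << 32) | cycles_low;
}
```

Here the movs are unavoidable, because the values have to be copied out of edx and eax before cpuid runs.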
Thanks
… rdtsc be surrounded by serializing instructions, not just before? I usually use lfence in favor of CPUID since it is locally serializing and doesn't clobber any register. – Contradictory
… lfence, do you have a source which demonstrates? – Curch
… lfence, what demonstration are you looking for? lfence can be found in the Intel Manual 2, where it is said to be locally serializing. – Contradictory
… lfence … it in the context of benchmarking with the TSR. I wonder if the cpuid calls serve the same purpose... – Curch
… rdtsc to measure CPU time (since context switches can occur at any moment). Use OS specific functions (on Linux, see time(7) then use clock_gettime(2)...) – Bookbindery
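A minimal sketch of the clock_gettime(2) approach mentioned in the last comment, assuming a POSIX system such as Linux. The choice of CLOCK_PROCESS_CPUTIME_ID and the helper name cpu_time_ns are assumptions made for illustration; the comment itself does not name a clock.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L  /* expose clock_gettime under strict -std modes */
#endif
#include <stdint.h>
#include <time.h>

/* Sketch only: CPU time consumed by this process, in nanoseconds.
   cpu_time_ns is a hypothetical helper name. */
static uint64_t cpu_time_ns(void)
{
    struct timespec ts;

    /* CLOCK_PROCESS_CPUTIME_ID counts only time this process spent on a CPU,
       so time spent switched out is not included. */
    if (clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts) != 0)
        return 0;  /* sketch: report failure as 0 rather than handling errno */

    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}
```

Calling cpu_time_ns() before and after the code under test and subtracting gives the elapsed CPU time, at a coarser granularity than reading the counter directly.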