Is Intel's timestamp reading asm code example using two more registers than are necessary?

Asked 17/8, 2016 at 10:49 Answered 17/8, 2016 at 17:31

Solved c assembly benchmarking inline-assembly rdtsc

I'm looking into measuring benchmark performance using the time-stamp register (TSR) found in x86 CPUs. It's a useful register, since it measures in a monotonic unit of time which is immune to the clock speed changing. Very cool.

Here is an Intel document showing asm snippets for reliably benchmarking using the TSR, including using cpuid for pipeline synchronisation. See page 16:

http://www.intel.com/content/www/us/en/embedded/training/ia-32-ia-64-benchmark-code-execution-paper.html

To read the start time, it says (I annotated a bit):

__asm volatile (
    "cpuid\n\t"             // writes e[abcd]x
    "rdtsc\n\t"             // writes edx, eax
    "mov %%edx, %0\n\t" 
    "mov %%eax, %1\n\t"
    //
    :"=r" (cycles_high), "=r" (cycles_low)  // outputs
    :                                       // inputs
    :"%rax", "%rbx", "%rcx", "%rdx");       // clobber

I'm wondering why scratch registers are used to take the values of edx and eax. Why not remove the movs and read the TSR value right out of edx and eax? Like this:

__asm volatile(                                                             
    "cpuid\n\t"
    "rdtsc\n\t"
    //
    : "=d" (cycles_high), "=a" (cycles_low) // outputs
    :                                       // inputs
    : "%rbx", "%rcx");                      // clobber

By doing this, you save two registers, reducing the likelihood of the C compiler needing to spill.

Am I right? Or are those MOVs somehow strategic?

(I agree that you do need scratch registers to read the stop time, as in that scenario the order of the instructions is reversed: you have rdtscp, ..., cpuid. The cpuid instruction destroys the result of rdtscp).
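For reference, the stop-time case described above might look roughly like the following. This is only a sketch, assuming GCC or Clang targeting x86-64 and a CPU that supports rdtscp; the function name `tsc_stop` and the added "memory" clobber are illustrative choices, not something taken from the Intel document.

```c
#include <stdint.h>

/* Hypothetical helper: read the stop timestamp with rdtscp, then use
   cpuid as a barrier. cpuid overwrites eax and edx, so both halves of
   the counter are copied into compiler-chosen registers first. */
static inline uint64_t tsc_stop(void)
{
    uint32_t hi, lo;
    __asm__ volatile(
        "rdtscp\n\t"            /* writes edx:eax, and ecx */
        "mov %%edx, %0\n\t"     /* save the high half before cpuid */
        "mov %%eax, %1\n\t"     /* save the low half before cpuid */
        "cpuid\n\t"             /* writes eax, ebx, ecx, edx */
        : "=r"(hi), "=r"(lo)
        :
        : "rax", "rbx", "rcx", "rdx", "memory");
    return ((uint64_t)hi << 32) | lo;
}
```

Because all four registers are listed as clobbers, the compiler will not place `hi` or `lo` in any of them, which is why the movs are genuinely needed here.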

Thanks

Curch answered 17/8, 2016 at 10:49 Comment(7)

I'm no expert on GCC inline syntax, but I'd guess that in the second version GCC will generate the movs by itself, so it is a matter of readability. Side note: shouldn't rdtsc be surrounded by serializing instructions, not just preceded by one? I usually use lfence in preference to CPUID, since it is locally serializing and doesn't clobber any registers. – Contradictory 17/8, 2016 at 11:7

I would expect a semi-clever compiler to re-use the output register for the local variable, but I might be wrong. – Curch 17/8, 2016 at 11:9

Regarding the lfence, do you have a source which demonstrates? – Curch 17/8, 2016 at 11:10

True, indeed. Regarding lfence, what demonstration are you looking for? lfence is described in volume 2 of the Intel manual, which says it is locally serializing. – Contradictory 17/8, 2016 at 11:16

I was wondering if you had seen lfence used in the context of benchmarking with the TSR. I wonder if the cpuid calls serve the same purpose... – Curch 17/8, 2016 at 11:21

There is something interesting here. – Contradictory 17/8, 2016 at 12:2

Don't use rdtsc to measure CPU time (since context switches can occur at any moment). Use OS specific functions (on Linux, see time(7) then use clock_gettime(2)...) – Bookbindery 18/8, 2018 at 15:28

You're correct, the example is clunky. Usually if mov is the first or last instruction in an inline-asm statement, you're doing it wrong, and should have used a constraint to tell the compiler where you want the input, or where the output is.

See my GNU C inline asm guides / links collection, and other links in the inline-assembly tag wiki. (The x86 tag wiki is full of good stuff for asm in general, too.)

Or for rdtsc specifically, see Get CPU cycle count? for the __rdtsc() intrinsic, and good inline asm in @Mysticial's answer.
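As a rough illustration of the intrinsic route, here is a minimal sketch assuming GCC or Clang on x86, where `<x86intrin.h>` provides `__rdtsc()` (MSVC declares it in `<intrin.h>` instead). The wrapper name `read_tsc` is made up for this example.

```c
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc() on GCC/Clang */

/* Hypothetical wrapper: return the 64-bit time-stamp counter value.
   The compiler handles edx:eax itself, so no inline asm, constraints,
   or clobber lists are needed. */
static inline uint64_t read_tsc(void)
{
    return __rdtsc();
}
```

Note that this does no serialization at all, so the read can be reordered with surrounding instructions by the out-of-order core.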

it measures in a monotonic unit of time which is immune to the clock speed changing.

Yes, on CPUs made within the last 10 years or so.

For profiling, it's often more useful to have times in core clock cycles, not wall-clock time, so your microbenchmark results don't depend on power-saving / turbo. Performance counters can do this and much more.

Still, if real time is what you want, rdtsc is the lowest-overhead way to get it.

And re: discussion in comments: yes cpuid is there to serialize, making sure that rdtsc and following instructions can't begin executing until after CPUID. You could put another CPUID after RDTSC, but that would increase measurement overhead, and I think give near-zero gain in accuracy / precision.

LFENCE is a cheaper alternative that's useful with RDTSC. The instruction ref manual entry documents the fact that it doesn't let later instructions start executing until it and previous instructions have retired (from the ROB/RS in the out-of-order part of the core). See Are loads and stores the only instructions that gets reordered?, and for a specific example of using it, see clflush to invalidate cache line via C function. Unlike true serializing instructions like cpuid, it doesn't flush the store buffer.

(On recent AMD CPUs without Spectre mitigation enabled, lfence is not even partially serializing, and runs at 4 per clock according to Agner Fog's testing. Is LFENCE serializing on AMD processors?)

Margaret Bloom dug up this useful link, which also confirms that LFENCE serializes RDTSC according to Intel's SDM, and has some other stuff about how to do serialization around RDTSC.
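A sketch of the LFENCE approach, assuming GCC or Clang on x86-64 and an Intel CPU (or an AMD CPU where lfence is configured to be dispatch-serializing). The function name `read_tsc_fenced` is hypothetical; `_mm_lfence()` and `__rdtsc()` are the standard intrinsics.

```c
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc(), _mm_lfence() */

/* Hypothetical helper: read the TSC with an lfence on each side. */
static inline uint64_t read_tsc_fenced(void)
{
    _mm_lfence();            /* earlier instructions complete before rdtsc runs */
    uint64_t t = __rdtsc();
    _mm_lfence();            /* later instructions don't start until rdtsc has run */
    return t;
}
```

Unlike cpuid, this clobbers no registers, which keeps the overhead of the measurement itself low.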

Lithesome answered 17/8, 2016 at 17:28 Comment(4)

Thanks for your answer! Actually, we did not want time at all! We wanted a measure of work independent from time, so that frequency changes cannot skew the result. I've found a few performance counters which may help, now I will be looking into a lightweight way to access them without using the sledgehammer that is perf. Hopefully you can program the counters from user-space asm code. – Curch 18/8, 2016 at 10:37

Or maybe you can't: stackoverflow.com/questions/39021662/… – Curch 18/8, 2016 at 15:12

You can program counters from user space, but you probably want to pin your threads to cores because PMCs aren't saved/restored on context switches. See agner.org/optimize for an existing kernel module that gives you PMC access, and also stackoverflow.com/questions/38848914/… for some discussion of using them. – Lithesome 18/8, 2016 at 15:19

(clarification to previous comment: you can program PMU counters from user-space only via system calls, not directly. Privileged instructions are required. Once the counters are programmed, rdpmc can work in user-space if the kernel allows it, reading those counters even more cheaply than rdtsc) – Lithesome 20/2, 2023 at 7:56

No, there doesn't seem to be a good reason for the redundant MOV instructions in the inline assembly. The paper first introduces inline assembly with the following statement:

asm volatile (
    "RDTSC\n\t"
    "mov %%edx, %0\n\t"
    "mov %%eax, %1\n\t": "=r" (cycles_high1), "=r" (cycles_low1));

This has the obvious problem that it doesn't tell the compiler that EAX and EDX have been modified by the RDTSC instruction. The paper points out this mistake and corrects it using clobbers:

asm volatile ("RDTSC\n\t"
    "mov %%edx, %0\n\t"
    "mov %%eax, %1\n\t": "=r" (cycles_high), "=r" (cycles_low)::
    "%eax", "%edx");

No justification is given for writing it this way, other than correcting the mistake in the previous example. It appears that the paper's author is simply unaware that it could be written more simply as:

asm volatile ("RDTSC\n\t"
    : "=d" (cycles_high), "=a" (cycles_low));

Similarly the author is apparently unaware that there's a simpler version of the improved asm statement that uses RDTSC in combination with CPUID, as you demonstrate in your post.
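Put together as a complete function, the simplified CPUID + RDTSC version might look like this. It is a sketch assuming GCC or Clang on x86-64; the name `tsc_start` is illustrative, and passing leaf 0 to cpuid via the `"a"(0)` input is an addition of mine (neither the paper nor the question sets eax), made so that cpuid runs the same leaf every time.

```c
#include <stdint.h>

/* Hypothetical helper: serialize with cpuid, then read the TSC.
   The outputs are bound directly to edx and eax, so no movs are needed. */
static inline uint64_t tsc_start(void)
{
    uint32_t hi, lo;
    __asm__ volatile(
        "cpuid\n\t"          /* writes eax, ebx, ecx, edx */
        "rdtsc"              /* writes edx:eax */
        : "=d"(hi), "=a"(lo) /* outputs: read straight from edx and eax */
        : "a"(0)             /* input: cpuid leaf 0 (assumption, see above) */
        : "rbx", "rcx");     /* cpuid also overwrites these */
    return ((uint64_t)hi << 32) | lo;
}
```

Only rbx and rcx need to be listed as clobbers, because rax and rdx are already described to the compiler as outputs.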

Note that the author of the paper repeatedly misuses the term "IA64" to refer to the 64-bit x86 instruction set and architecture (variously referred to as x86_64, AMD64 and Intel 64). The IA-64 architecture is actually something completely different: it's the one used by Intel's Itanium CPUs. It has no EAX or RAX registers, and no RDTSC instruction.

While it doesn't really matter that the author's inline assembly is more complex than it needs to be, this fact combined with the misuse of IA64, something that should've been caught by Intel's editors, makes me doubt the credibility of this paper.

Oedipus answered 17/8, 2016 at 17:31 Comment(4)

Thanks for your answer. If I could mark two answers correct, I would! They do use cpuid in the document, that's where I got it from, see page 16. – Curch 18/8, 2016 at 10:35

@EddBarrett Yah, I know, I'm saying that the author also doesn't know that the CPUID version of the asm statement in the paper can also be simplified in the same way. – Oedipus 18/8, 2016 at 15:28

Are you sure they misused the term IA64? Or did they actually mean Itanium? – Imhoff 29/7, 2018 at 21:35

@Imhoff I'm sure. An Itanium CPU has no EAX or RAX registers, and no RDTSC instruction. – Oedipus 30/7, 2018 at 0:11
