Difference between rdtscp, rdtsc : memory and cpuid / rdtsc?
Assume we're trying to use the TSC for performance monitoring and we want to prevent instruction reordering.

These are our options:

1: rdtscp is a serializing call. It prevents reordering around the call to rdtscp.

__asm__ __volatile__("rdtscp; "         // serializing read of tsc
                     "shl $32,%%rdx; "  // shift higher 32 bits stored in rdx up
                     "or %%rdx,%%rax"   // and or onto rax
                     : "=a"(tsc)        // output to tsc variable
                     :
                     : "%rcx", "%rdx"); // rcx and rdx are clobbered
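
The shl/or pair in the asm above just combines the two 32-bit halves that rdtsc[p] returns in edx:eax into one 64-bit value. The same combine can be written in plain C; a minimal sketch (the function name is mine, not from the question):

```c
#include <stdint.h>

/* Combine the edx:eax halves returned by rdtsc/rdtscp into one 64-bit count. */
static inline uint64_t tsc_combine(uint32_t lo, uint32_t hi)
{
    return ((uint64_t)hi << 32) | lo;   /* equivalent of shl $32,%rdx; or %rdx,%rax */
}
```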

However, rdtscp is only available on newer CPUs. So in this case we have to use rdtsc. But rdtsc is non-serializing, so using it alone will not prevent the CPU from reordering it.

So we can use either of these two options to prevent reordering:

2: This is a call to cpuid and then rdtsc. cpuid is a serializing call.

volatile int dont_remove __attribute__((unused)); // volatile to stop optimizing
unsigned tmp;
__cpuid(0, tmp, tmp, tmp, tmp);                   // cpuid is a serialising call
dont_remove = tmp;                                // prevent optimizing out cpuid

__asm__ __volatile__("rdtsc; "          // read of tsc
                     "shl $32,%%rdx; "  // shift higher 32 bits stored in rdx up
                     "or %%rdx,%%rax"   // and or onto rax
                     : "=a"(tsc)        // output to tsc
                     :
                     : "%rcx", "%rdx"); // rcx and rdx are clobbered
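
Option 2 is often wrapped in a small helper. A sketch assuming GCC/Clang on x86-64, using the compiler's cpuid.h instead of a hand-rolled cpuid (the helper name is mine):

```c
#include <stdint.h>
#include <cpuid.h>   /* GCC/Clang __cpuid macro */

/* Serialize with cpuid, then read the TSC. GCC/Clang, x86-64 only. */
static inline uint64_t cpuid_rdtsc(void)
{
    unsigned a, b, c, d;
    __cpuid(0, a, b, c, d);     /* serializing barrier; outputs unused */
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}
```

Letting the compiler pick up the results via "=a"/"=d" output constraints avoids the manual shift/or and the explicit clobber list.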

3: This is a call to rdtsc with memory in the clobber list, which prevents reordering

__asm__ __volatile__("rdtsc; "          // read of tsc
                     "shl $32,%%rdx; "  // shift higher 32 bits stored in rdx up
                     "or %%rdx,%%rax"   // and or onto rax
                     : "=a"(tsc)        // output to tsc
                     :
                     : "%rcx", "%rdx", "memory"); // rcx and rdx are clobbered
                                                  // memory to prevent reordering

My understanding for the 3rd option is as follows:

Making the call __volatile__ prevents the optimizer from removing the asm or moving it across any instructions that could need the results (or change the inputs) of the asm. However, it can still move the asm with respect to unrelated operations, so __volatile__ alone is not enough.

Tell the compiler memory is being clobbered by adding : "memory" to the clobber list. The "memory" clobber means that GCC cannot make any assumptions about memory contents remaining the same across the asm, and thus will not reorder memory accesses around it.
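
The same mechanism is what the classic compiler-barrier idiom relies on: an empty asm with a "memory" clobber forces the compiler to complete pending stores before the asm and reload memory after it. A minimal sketch (variable and function names are mine):

```c
/* A "memory" clobber makes the compiler assume g may have changed across
 * the asm, so it cannot cache g in a register or sink/hoist accesses to it. */
static int g;

int barrier_demo(void)
{
    g = 1;                                  /* store must be emitted ...   */
    __asm__ __volatile__("" ::: "memory");  /* ... before this barrier ... */
    return g;                               /* ... and g reloaded here     */
}
```

Note this is purely a compiler barrier; it emits no instructions and does not constrain the processor.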

So my questions are:

  • 1: Is my understanding of __volatile__ and "memory" correct?
  • 2: Do the second two calls do the same thing?
  • 3: Using "memory" looks much simpler than using another serializing instruction. Why would anyone use the 3rd option over the 2nd option?
Illegality answered 28/9, 2012 at 0:7 Comment(6)
You seem to confuse reordering of instructions generated by the compiler, which you can avoid by using volatile and memory, with reordering of instructions executed by the processor (aka out-of-order execution), which you avoid by using cpuid.Prince
@hirschhornsalz but won't having memory in the clobber list prevent the processor reordering the instructions? Doesn't memory act like a memory fence?Illegality
or perhaps the memory in the clobber list is only emitted to gcc, and the resulting machine code doesn't expose this to the processor?Illegality
No, memory fences are a different thing, and the compiler will not insert those if you use a "memory" clobber. These are about reordering loads/stores by the processors and are used in conjunction with instructions with weak memory ordering in respect to multithreaded environments, like movntdq. Most of the time you do not need a memory fence on Intel/AMD processors, as these processors have strong memory ordering by default. And yes, memory only affects the order in which instructions are emitted by the compiler, it does not make the compiler emit additional instructions.Prince
rdtscp doesn't prevent reordering, it only ensures all previous instructions have finished executing: The RDTSCP instruction waits until all previous instructions have been executed before reading the counter. However, subsequent instructions may begin execution before the read operation is performed., I suggest you read this whitepaper from intel if you are considering using this for benchmarking etc: download.intel.com/embedded/software/IA/324264.pdf (it actually shows that you need both rdtsc + cpuid and rdtscp + cpuid for correct measurements)Mcnalley
@Mcnalley Very interesting paperPrince

As mentioned in a comment, there's a difference between a compiler barrier and a processor barrier. volatile and memory in the asm statement act as a compiler barrier, but the processor is still free to reorder instructions.

Processor barriers are special instructions that must be explicitly given, e.g. rdtscp, cpuid, memory fence instructions (mfence, lfence, ...) etc. lfence is also an execution barrier (on Intel, and more recently AMD), so it's interesting in combination with rdtsc (which isn't a memory operation, and is only ordered by *fence instructions if something in a manual says so). Fun fact: x86's strongly-ordered memory model makes lfence basically useless for memory ordering, leaving execution ordering as its main use-case.

As an aside, while using cpuid as a barrier before rdtsc is common, it can also be very bad from a performance perspective, since virtual machine platforms often trap and emulate the cpuid instruction in order to impose a common set of CPU features across multiple machines in a cluster (to ensure that live migration works). Thus it's better to use a cheaper execution fence instruction like lfence, or serialize on very recent CPUs (which is also a memory barrier and fully serializes the pipeline like cpuid but without a vmexit, so putting it before rdtsc would wait for stores to commit as well, unlike lfence which just waits for instructions to finish executing.)
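
With GCC/Clang, the lfence; rdtsc pair suggested above can be written with intrinsics instead of inline asm; a sketch (x86-64 only, helper name is mine):

```c
#include <stdint.h>
#include <x86intrin.h>   /* _mm_lfence, __rdtsc */

/* lfence blocks dispatch of rdtsc until earlier instructions have completed
 * locally: an execution barrier, not a store-visibility barrier. */
static inline uint64_t rdtsc_ordered(void)
{
    _mm_lfence();
    return __rdtsc();    /* intrinsic combines edx:eax into a 64-bit value */
}
```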

The Linux kernel used to use mfence;rdtsc on AMD platforms and lfence;rdtsc on Intel. As of Linux kernel 5.4, lfence is used to serialize rdtsc on both Intel and AMD. See this commit "x86: Remove X86_FEATURE_MFENCE_RDTSC": https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=be261ffce6f13229dad50f59c5e491f933d3167f

Hollow answered 28/9, 2012 at 6:41 Comment(26)
The cpuid; rdtsc is not about memory fences, it's about serializing the instruction stream. Usually it is used for benchmarking purposes to make sure no "old" instructions remain in the reorder buffer/reservation station. The execution time of cpuid (which is quite long, I remember >200 cycles) is then to be subtracted. Whether the result is more "exact" this way is not quite clear to me; I experimented with and without, and the differences seem less than the natural error of measurement, even in single user mode with nothing else running at all.Prince
I am not sure, but possibly the fence instructions used this way in the kernel are not useful at all ^^Prince
@hirschhornsalz: According to the git commit logs, AMD and Intel confirmed that the m/lfence will serialize rdtsc on currently available CPU's. I suppose Andi Kleen can provide more details on what exactly was said, if you're interested and ask him.Hollow
@hirschhornsalz: ... IIRC the argument basically goes that while the fence instructions only serialize wrt. instructions that read/write memory, in practice there's no point in reordering non-mem instructions wrt rdtsc and thus it's not done. Although per the architecture manual it's in principle allowed.Hollow
That's exactly what I think, in practice (=non-benchmarking code) there is no point in avoiding the reordering of instructions. I would even go one step further and argue that there isn't even a point in avoiding the reordering of memory instructions, since rdtsc is only used as a non memory depended timer source here and so drop the fences. But I should really ask Andy :-)Prince
Is the memory clobber part of the asm still necessary? I notice the code in Intel's white paper makes no mention of it: intel.com/content/dam/www/public/us/en/documents/white-papers/…Opponent
Are you sure that mfence; rdtsc on Intel really serializes the instruction stream? lfence is now officially / more-clearly documented as serializing, (so it can be used to mitigate Spectre mis-speculation of bounds-check branches). But I'm not sure mfence serializes the instruction stream on Intel. (Maybe it does, but it's not clearly documented). Fun fact: on Core2, mfence has better throughput than lfence (when that's all the machine is running, no other instructions mixed in. source: Agner Fog's tests).Andryc
It's probably important to use lfence on Intel and mfence on AMD; any argument about "stronger barrier" is totally inapplicable because we're talking about the instruction stream and additional micro-architectural effects, not the well-documented memory-ordering effects. For example, LFENCE isn't fully serializing on AMD: it has 4-per-clock throughput Bulldozer-family / Ryzen! Maybe it does serialize rdtsc but not itself or some other instructions? Or more likely it's very cheap on AMD because their memory-ordering implementation works differently.Andryc
@JosephGarvin: A "memory clobber" is an explicit notice to a compiler that a piece of code may be dependent upon memory ordering in ways the compiler should not expect to understand. Some compilers are prone to assume that memory order only matters in situations where they can see explicit reasons why it might; others assume it may matter in cases where they can't prove it doesn't. Such considerations are orthogonal to anything a processor might do with memory ordering.Willettewilley
@supercat: I understand that, but confusingly the kernel code linked does not use the memory constraint. Maybe because GCC understands it is implied by lfence?Opponent
@JosephGarvin: If the kernel code uses lfence and mfence without memory clobbers, it likely does so because the authors thought it obvious that any quality compiler should recognize them as including an implied memory clobber; whether a gratuitously clever compiler would regard them likewise, or instead exploit the fact that more "optimizations" would be possible without a memory clobber, is a separate issue.Willettewilley
@supercat: GNU C Extended asm("template" ::: clobbers) do not have an implicit "memory" clobber. Some version of GCC like maybe 7 or 8 made Basic asm statements with non-empty template strings have an implicit "memory" clobber as a sop to code that uses Basic Asm (no constraints). Inside the body of a non-naked function, asm statements should always be Extended asm so you can specify a memory clobber or not as appropriate. (gcc.gnu.org/wiki/ConvertBasicAsmToExtended).Andryc
supercat and @JosephGarvin: The Linux kernel definitely relies on GNU C Extended asm in other places, so they don't need to care about compilers that only accept simpler syntax like asm("lfence"). (e.g. ARMCC?) If any kernel devs have engaged in the kind of wishful thinking you describe about how you think it should work, that's a clear bug. godbolt.org/z/oGYYnWaha demonstrates that GCC will reorder stores across an asm statement that lacks a "memory" clobber, e.g. for dead store elimination or keeping a global in a register.Andryc
Anyway, this code doesn't use a "memory" clobber because it's only trying to wait for earlier instructions to retire from the ROB before taking a timestamp, not for stores to become globally visible if we did happen to use mfence on AMD. And not to interfere with optimization or actually to order source accesses. lfence is basically a no-op as far as memory order is concerned anyway, only waiting for anything if there are NT loads from WC memory in flight, e.g. from video RAM. They're trying to make a drop-in replacement for rdtscp for CPUs that lack it, which does no mem ordering.Andryc
Anyway, this answer is still wrong; mfence doesn't order rdtsc on Intel CPUs (except maybe Skylake and others where mfence was strengthened by a microcode update to include blocking reordering of NT loads from WC memory.) mfence was only ever a (bad) option on AMD. OoO exec of ALU instruction across mfence can happen on Haswell, IIRC from discussion on earlier questions. The strength of its memory-ordering is irrelevant for instruction ordering.Andryc
Related: solution to rdtsc out of order execution? discusses rdtsc and execution barriers.Andryc
@PeterCordes: What disadvantage would there be to having a compiler treat a single-clause ASM statement as having memory clobbers be default, and saying that programmers who know both that none is necessary and that avoiding the performance cost of a memory clobber is more important than compatibility with other tools, should use the extended syntax? The only "downside" I can see is that more programmers would write code in a manner that is compatible with commercial compilers, rather than only with clang and gcc.Willettewilley
@supercat: Not really any downside, since nobody should use GNU C basic asm inside a non-naked function anyway. Given real compiler behaviour, it's not portable to rely on the undocumented implicit "memory" clobber in an asm("cli"), since it doesn't have one in GCC6 and earlier; only GCC7 and later added training wheels. godbolt.org/z/GMG3zxj85 (Surprisingly, clang misses dead-store elimination around even an Extended asm that explicitly omits a "memory" clobber, so IDK about it). Irrelevant for Linux's rdtsc[p] asm since it has register outputs and needs Extended asm.Andryc
@PeterCordes: On compilers that treat inline ASM as a memory clobber, there would be no problem using a simple asm directive for "cli", or for an empty string [to force compiler memory sequencing]. If there's no real disadvantage to compilers treating things in that fashion, why behave differently other than to be deliberately incompatible?Willettewilley
@supercat: You'd have to ask GCC devs about the design decisions made decades ago which only recently changed to what you're suggesting. Early GCC was less capable and presumably less aggressive, so maybe it wasn't a problem in practice most of the time. I'm not suggesting removing the implicit memory clobber default from Basic asm, I'm just saying that old compilers don't have it, so it's not safe to rely on in code that might be compiled by GCC6 or earlier (and IDK about clang). BTW, no, an empty string is special and asm("") doesn't get a "memory" clobber even in GCC7 and later.Andryc
@PeterCordes: I removed the mention of using mfence;rdtsc on Intel; is that better?Hollow
Better, yes. But your answer still had a mention of memory barriers being relevant for RDTSC. They aren't. Only the execution barrier effect matters if you're timing something. I made an edit, you might want to review it and see if I made anything too verbose, e.g. you might trim out the mention of the new serialize instruction. (IDK why Linux uses lfence at all when it just wants a timestamp from around now to get the current time. Unless the lfence;rdtsc or rdtscp version is mostly intended for Spectre mitigation. Anyway, the question has benchmarking use-cases in mind.)Andryc
@PeterCordes: Your edits look fine, thanks. (I wasn't aware of the new serialize instruction). I don't think the Linux use of (l/m)fence before rdtsc has anything to do with Spectre, this was used at least back in 2012 when I first answered this question, long before Spectre popped up on the radar.Hollow
Perhaps for calibrating rdtsc or the jiffies loop? Or for short delay loops using a TSC deadline to make sure later stuff actually waited long enough? I don't remember an lfence in the clock_gettime code exported in the VDSO for user-space to execute, so hopefully it's just doing bare rdtsc without lfence when it wants to know what time it is, e.g. for a file timestamp in a rename system call.Andryc
@PeterCordes: One problem with C's development is that compilers generally didn't document any way of blocking optimizations that they wouldn't even consider performing, but people who wanted to perform such optimizations were unwilling to recognize the legitimacy of code that relied upon its absence. The simple solution should have been to introduce new syntactic forms both for the "invite optimization" and "block optimization" forms, and deprecate code in the old form while acknowledging its legitimacy. If C89 had been willing to adopt such an approach, decades of technical debt...Willettewilley
...could have been avoided, but I don't fault the authors of C89 for not doing so. I do fault the authors of C99 for failing to acknowledge the need for such constructs when they could have been added easily, and every subsequent Committee for failing to recognize a problem that was obvious well before 2011.Willettewilley

You can use it as shown below:

uint32_t cycles_high, cycles_low, cycles_high1, cycles_low1;

asm volatile (
    "CPUID\n\t"            /* serialize */
    "RDTSC\n\t"            /* read the clock */
    "mov %%edx, %0\n\t"
    "mov %%eax, %1\n\t"
    : "=r" (cycles_high), "=r" (cycles_low)
    :
    : "%rax", "%rbx", "%rcx", "%rdx");
/*
Call the function to benchmark
*/
asm volatile (
    "RDTSCP\n\t"           /* read the clock */
    "mov %%edx, %0\n\t"
    "mov %%eax, %1\n\t"
    "CPUID\n\t"
    : "=r" (cycles_high1), "=r" (cycles_low1)
    :
    : "%rax", "%rbx", "%rcx", "%rdx");

In the code above, the first CPUID call implements a barrier to avoid out-of-order execution of the instructions above and below the RDTSC instruction. With this method we avoid calling a CPUID instruction in between the two reads of the time-stamp register.

The first RDTSC then reads the time-stamp register and the value is stored in memory. Then the code that we want to measure is executed. The RDTSCP instruction reads the time-stamp register for the second time and guarantees that the execution of all the code we wanted to measure is completed. The two "mov" instructions coming afterwards store the edx and eax register values into memory. Finally, a CPUID call implements a barrier again, so that no instruction coming afterwards can begin execution before CPUID itself.
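
To get the elapsed cycle count from the stored halves, each edx:eax pair is combined into a 64-bit value and subtracted; a small sketch following the variable names above (the helper name is mine):

```c
#include <stdint.h>

/* Combine the saved high/low halves of the two timestamps and return end - start. */
static inline uint64_t tsc_elapsed(uint32_t cycles_high,  uint32_t cycles_low,
                                   uint32_t cycles_high1, uint32_t cycles_low1)
{
    uint64_t start = ((uint64_t)cycles_high  << 32) | cycles_low;
    uint64_t end   = ((uint64_t)cycles_high1 << 32) | cycles_low1;
    return end - start;
}
```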

Arst answered 8/1, 2013 at 11:44 Comment(5)
Hi, it appears that you copied this answer from Gabriele Paolinis white paper "How to Benchmark Code Execution Times on Intel® IA-32 and IA-64 Instruction Set Architectures" (you missed a line break though). You're using someone else's work without giving the author credit. Why not add an attribution?War
Yes, indeed, it is copied. I'm also wondering if the two movs in reading the start time are necessary: stackoverflow.com/questions/38994549/…Ulbricht
Is there a specific reason to have two variables high and low?Priebe
Yes, @ExOfDe, there is a reason. The RDTSC[P] instruction returns a 64-bit value, but it returns it in two 32-bit halves: the upper half in the EDX register and the lower half in the EAX register (as is the common convention for returning 64-bit values on 32-bit x86 systems). You can, of course, combine those two 32-bit halves into a single 64-bit value if you want, but that requires either (A) a 64-bit processor (and the RDTSC[P] instruction was introduced to the ISA long before 64-bit integers were natively supported), or (B) compiler/library support for 64-bit ints.Dalessio
If you're going to use your own inline asm instead of a builtin/intrinsic, at least write efficient inline asm that uses constraints to tell the compiler which registers to look at, instead of using mov instructions.Andryc
