TL;DR

rdtscp and lfence/rdtsc have exactly the same upstream serialization properties on Intel processors. On AMD processors with a dispatch-serializing lfence, both sequences also have the same upstream serialization properties. With respect to later instructions, rdtsc in the lfence/rdtsc sequence may be dispatched for execution simultaneously with later instructions. This behavior may not be desirable if you also want to precisely time these later instructions. It is generally not a problem because the reservation station (RS) scheduler prioritizes older uops for dispatching as long as there are no structural hazards. After lfence retires, the rdtsc uops would be the oldest in the RS with probably no structural hazards, so they will be dispatched immediately (possibly together with some later uops). You could also put an lfence after rdtsc.

The Intel manual V2 says the following about rdtscp (emphasis mine):

The RDTSCP instruction is not a serializing instruction, but it does wait until all previous instructions have executed and all previous loads are globally visible. But it does not wait for previous stores to be globally visible, and subsequent instructions may begin execution before the read operation is performed.

The "read operation" part here refers to reading the time-stamp counter. This suggests that rdtscp internally works like lfence followed by rdtsc plus a read of IA32_TSC_AUX. That is, lfence is performed first, then the two register reads are executed (possibly at the same time).

On most Intel and AMD processors that support these instructions, lfence/rdtsc has a slightly larger number of uops than rdtscp. The number of lfence uops mentioned in Agner's tables is for the case where the lfence instructions are executed back-to-back, which makes it appear that lfence is decoded into a smaller number of uops (1 or 2) than what a single lfence is actually decoded into (5 or 6 uops). Usually, lfence is used without other back-to-back lfences. That's why lfence/rdtsc contains more uops than rdtscp. Agner's tables also show that on some processors, rdtsc and rdtscp have the same number of uops, which I'm not sure is correct. It makes more sense for rdtscp to have at least one more uop than rdtsc. That said, the latency may be more important than the difference in the number of uops because that's what directly impacts the measurement overhead.

In terms of portability, rdtsc is older than rdtscp; rdtsc was first supported on the Pentium processors while the first processors that support rdtscp were released in 2005-2006 (See: What is the gcc cpu-type that includes support for RDTSCP?). But most Intel and AMD processors that are in use today support rdtscp. Another dimension for comparing between the two sequences is that rdtscp pollutes one more register (i.e., ECX) than rdtsc.

In summary, if you don't care about reading the IA32_TSC_AUX MSR, there is no particularly big reason why you should choose one over the other. I would use rdtscp and fall back to lfence/rdtsc (or lfence/rdtsc/lfence) on processors that don't support it. If you want maximum timing precision, use the method discussed in Memory latency measurement with time stamp counter.
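The "use rdtscp and fall back to lfence/rdtsc" choice can be sketched in C roughly as follows. This is only a minimal sketch, assuming GCC or Clang on x86-64: `has_rdtscp` and `read_tsc_ordered` are hypothetical names I made up, and the trailing lfence is the optional lfence/rdtsc/lfence variant. Support is detected through CPUID.80000001H:EDX bit 27.

```c
#include <cpuid.h>      /* __get_cpuid (GCC/Clang header) */
#include <stdint.h>
#include <x86intrin.h>  /* __rdtsc, __rdtscp, _mm_lfence */

/* CPUID.80000001H:EDX bit 27 reports RDTSCP support. */
static int has_rdtscp(void)
{
    unsigned int eax, ebx, ecx, edx;

    if (!__get_cpuid(0x80000001u, &eax, &ebx, &ecx, &edx))
        return 0;               /* extended leaf not available */
    return (int)((edx >> 27) & 1u);
}

/* Hypothetical helper: read the TSC once earlier instructions have executed.
 * Pass the cached result of has_rdtscp() so CPUID stays out of the timed path. */
static uint64_t read_tsc_ordered(int use_rdtscp)
{
    uint64_t tsc;

    if (use_rdtscp) {
        unsigned int aux;       /* IA32_TSC_AUX, ignored here */
        tsc = __rdtscp(&aux);
        (void)aux;
    } else {
        _mm_lfence();           /* fallback: lfence/rdtsc */
        tsc = __rdtsc();
    }
    _mm_lfence();               /* optional: keep later code behind the read */
    return tsc;
}
```

Call `has_rdtscp()` once at startup and reuse the result. The branch inside the helper adds a little overhead of its own, so for very short regions you may prefer two separate functions.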

As Andreas Abel pointed out, you still need an lfence after the last rdtsc(p) as it is not ordered w.r.t. subsequent instructions:

lfence                    lfence
rdtsc      -- ALLOWED --> B
B                         rdtsc

rdtscp     -- ALLOWED --> B
B                         rdtscp

This is also addressed in the manuals.
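Here is a rough C sketch of a measurement that keeps B behind the final read. It assumes GCC or Clang on x86-64 and a CPU with rdtscp; `time_region` and `busy_loop` are hypothetical names used only for illustration. The intrinsics only constrain the CPU, so the timed work is called through a function pointer to make it harder for the compiler to move it across the reads.

```c
#include <stdint.h>
#include <x86intrin.h>  /* __rdtsc, __rdtscp, _mm_lfence */

/* Hypothetical workload: spin *arg times on a volatile counter. */
static void busy_loop(void *arg)
{
    volatile unsigned int i;
    unsigned int n = *(unsigned int *)arg;

    for (i = 0; i < n; i++)
        ;
}

/* Hypothetical helper: time fn(arg) in TSC ticks. */
static uint64_t time_region(void (*fn)(void *), void *arg)
{
    uint64_t start, end;
    unsigned int aux;

    _mm_lfence();           /* earlier work executes before the first read */
    start = __rdtsc();
    _mm_lfence();           /* the region does not start before the read */

    fn(arg);

    end = __rdtscp(&aux);   /* waits for the region's instructions to execute */
    _mm_lfence();           /* later instructions (B) stay behind the read */
    (void)aux;
    return end - start;
}
```

For example, `unsigned int n = 1000; uint64_t ticks = time_region(busy_loop, &n);` gives the elapsed reference ticks, including the overhead of the fences and the indirect call.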

Regarding the use of rdtscp, it seems correct to me to think of it as a compact lfence + rdtsc.
The manuals use different terminology for the two instructions (e.g. "completed locally" vs "globally visible" for loads) but the behavior described seems to be the same.
I'm assuming so in the rest of this answer.

However rdtscp is a single instruction, while lfence + rdtsc are two, making the lfence part of the profiled code.
Granted that lfence should be lightweight in terms of backend execution resources (it is just a marker) it still occupies front-end resources (two uops?) and a slot in the ROB.
rdtscp is decoded into a greater number of uops due to its ability to read IA32_TSC_AUX, so while it saves some front-end resources, it occupies the backend more.
If the read of the TSC is done before (or concurrently with) the read of the processor ID, then these extra uops are only relevant for the subsequent code.
This could be a reason why it is used at the end but not at the start of the benchmark (where the extra uops would affect the code). This is enough to bias/complicate some micro-architectural benchmarks.

You cannot avoid the lfence after an rdtsc(p) but you can avoid the one before with rdtscp.
This seems unnecessary for the first rdtsc as the preceding lfence is not profiled anyway.

Another reason to use rdtscp at the end is that it was (according to Intel) meant to detect a migration to a different CPU (that's why it also atomically loads IA32_TSC_AUX), so at the end of the profiled code you may want to check that the code has not been scheduled to another CPU.

User mode software can use RDTSCP to detect if CPU migration has occurred between successive reads of the TSC.

This, of course, requires having read IA32_TSC_AUX before (to have something to compare to), so one should have an rdpid or rdtscp before the profiled code.
If one can afford not to use ecx, the first rdtsc can be an rdtscp too (but see above); otherwise (rather than storing the processor ID while in the profiled code), rdpid can be used first (thus having an rdtsc + rdtscp pair around the profiled code).

This is open to the ABA problem, so I don't think Intel has a strong point on this (unless we restrict ourselves to code short enough to be rescheduled at most once).

EDIT: As PeterCordes pointed out, from the point of view of the elapsed-time measurement, a migration A->B->A is not an issue, as the reference clock is the same.
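To make the migration check concrete, here is a small C sketch using rdtscp at both ends. It assumes GCC or Clang on x86-64 with rdtscp available; `struct tsc_sample`, `sample_tsc` and `same_cpu` are hypothetical names. What IA32_TSC_AUX contains is up to the OS, so the code only compares the two values for equality and does not interpret them.

```c
#include <stdint.h>
#include <x86intrin.h>  /* __rdtscp, _mm_lfence */

/* One TSC reading together with the IA32_TSC_AUX value read with it. */
struct tsc_sample {
    uint64_t     tsc;
    unsigned int aux;
};

/* Hypothetical helper: rdtscp returns both values from one instruction. */
static struct tsc_sample sample_tsc(void)
{
    struct tsc_sample s;

    s.tsc = __rdtscp(&s.aux);
    _mm_lfence();           /* keep later instructions behind the read */
    return s;
}

/* Returns 1 when both samples carry the same IA32_TSC_AUX value.
 * Equal values do not rule out an A->B->A migration in between. */
static int same_cpu(struct tsc_sample a, struct tsc_sample b)
{
    return a.aux == b.aux;
}
```

Usage would be along the lines of `struct tsc_sample t0 = sample_tsc(); /* profiled code */ struct tsc_sample t1 = sample_tsc();`, discarding the run when `same_cpu(t0, t1)` is 0. Using rdtscp for the first sample puts its extra uops in front of the profiled code, which is the trade-off mentioned above.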

More information on why rdtsc(p) is not fully serializing: Why isn't RDTSC a serializing instruction?
