For many years x86 CPUs have supported the rdtsc instruction, which reads the "time stamp counter" of the current CPU. The exact definition of this counter has changed over time, but on recent CPUs it is a counter that increments at a fixed frequency with respect to wall clock time, so it is very useful as a building block for a fast, accurate clock or for measuring the time taken by small segments of code.
One important fact about the rdtsc instruction is that it isn't ordered in any special way with respect to the surrounding code. Like most instructions, it can be freely reordered with respect to other instructions which it isn't in a dependency relationship with. This is actually "normal", and for most instructions it's just a mostly invisible way of making the CPU faster (it's just a long-winded way of saying out-of-order execution).
For rdtsc it is important because it means you might not be timing the code you expect to be timing. For example, given the following sequence [1]:
rdtsc
mov ecx, eax
mov rdi, [rdi]
mov rdi, [rdi]
rdtsc
You might expect the rdtsc pair to measure the latency of the two pointer-chasing loads mov rdi, [rdi]. In practice, however, even if both of these loads take a long time (100s of cycles if they miss in the cache), you'll get a fairly small reading for the rdtsc pair. The problem is that the second rdtsc doesn't wait for the loads to finish; it just executes out of order, so you aren't timing the interval you think you are. Perhaps both rdtsc instructions even execute before the first load starts, depending on how rdi was calculated in the code prior to this example.
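To make the hazard concrete, here is a rough C sketch of the same naive measurement. It assumes GCC or Clang on x86 and the __rdtsc() intrinsic from <x86intrin.h>; the node type and the chase2_unfenced name are made up for illustration. Unlike the assembly above, the intrinsic returns the full 64-bit counter. Note that the compiler may also move the loads around the reads, which is a separate problem from the CPU-level reordering discussed here.

```c
#include <stdint.h>
#include <x86intrin.h>  /* __rdtsc (GCC/Clang on x86, an assumption) */

/* Hypothetical node type: each node stores the address of the next one. */
struct node { struct node *next; };

/* Naive, unfenced timing of two dependent loads (p = p->next, twice).
 * Nothing forces the second rdtsc to wait for the loads to complete,
 * so *ticks may be much smaller than the real load latency. */
static inline struct node *chase2_unfenced(struct node *p, uint64_t *ticks)
{
    uint64_t t0 = __rdtsc();
    p = p->next;
    p = p->next;
    uint64_t t1 = __rdtsc();
    *ticks = t1 - t0;
    return p;
}
```

Calling this on a small ring of nodes returns the node two hops ahead, but the tick count it reports should not be trusted as the latency of those two loads.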
So far, this is sounding more like an answer to a question nobody asked than a real question, but I'm getting there.
You have two basic use-cases for rdtsc:
- As a quick timestamp, in which case you usually don't care exactly how it reorders with the surrounding code, since you probably don't have an instruction-level concept of where the timestamp should be taken, anyway.
- As a precise timing mechanism, e.g., in a micro-benchmark. In this case you'll usually protect your rdtsc from reordering with the lfence instruction. For the example above, you might do something like:
    lfence
    rdtsc
    lfence
    mov ecx, eax
    ...
    lfence
    rdtsc
  This ensures that the timed instructions (...) don't escape outside of the timed region, and also that instructions from outside the timed region don't come in (probably less of a problem, but they may compete for resources with the code you want to measure).
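For the second use-case, the fencing pattern might look roughly like this in C. This is only a sketch, assuming GCC or Clang on x86 with the __rdtsc() and _mm_lfence() intrinsics from <x86intrin.h>; tsc_fenced, time_region and bump_counter are hypothetical names, not part of any library. It puts an lfence on both sides of every reading, which is slightly more than the assembly sequence above does after the final rdtsc.

```c
#include <stdint.h>
#include <x86intrin.h>  /* __rdtsc, _mm_lfence (GCC/Clang on x86, an assumption) */

/* Hypothetical helper: read the TSC with an lfence on each side. */
static inline uint64_t tsc_fenced(void)
{
    _mm_lfence();               /* earlier instructions complete first */
    uint64_t t = __rdtsc();
    _mm_lfence();               /* later instructions wait for the read */
    return t;
}

/* Hypothetical wrapper: time a single call of fn(arg), in TSC ticks. */
static inline uint64_t time_region(void (*fn)(void *), void *arg)
{
    uint64_t start = tsc_fenced();
    fn(arg);
    uint64_t end = tsc_fenced();
    return end - start;
}

/* Hypothetical example workload: increment the int that p points to. */
static inline void bump_counter(void *p)
{
    ++*(int *)p;
}
```

Something like time_region(bump_counter, &calls) then gives a tick count for one call, including the overhead of the indirect call and of the fences themselves.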
Years later, Intel looked down upon us poor programmers and came up with a new instruction: rdtscp. Like rdtsc it returns a reading of the time stamp counter, and this guy does something more: it reads a core-specific MSR value atomically with the timestamp reading. On most OSes this contains a core ID value. I think the idea is that this value can be used to properly adjust the returned value to real time on CPUs that may have different TSC offsets per core.
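As a sketch of how the extra value might be used, the following C reads the timestamp and the MSR value (IA32_TSC_AUX) together via the __rdtscp() intrinsic, assuming GCC or Clang on x86. The tsc_sample struct and the read_tscp and same_core helpers are hypothetical names. How the OS encodes the core ID in that value is OS-specific, so the sketch only compares two readings for equality rather than decoding them.

```c
#include <stdint.h>
#include <x86intrin.h>  /* __rdtscp (GCC/Clang on x86, an assumption) */

/* Hypothetical pair: a timestamp plus the MSR value read along with it. */
struct tsc_sample {
    uint64_t tsc;
    uint32_t aux;   /* IA32_TSC_AUX as set by the OS, often a core ID */
};

/* Read the counter and the auxiliary value with a single rdtscp. */
static inline struct tsc_sample read_tscp(void)
{
    unsigned int aux;
    struct tsc_sample s;
    s.tsc = __rdtscp(&aux);
    s.aux = aux;
    return s;
}

/* Nonzero if both samples carry the same auxiliary value, i.e. they were
 * probably taken on the same core (assuming the OS stores a core ID). */
static inline int same_core(struct tsc_sample a, struct tsc_sample b)
{
    return a.aux == b.aux;
}
```

If same_core() reports a mismatch between the start and end samples, the thread probably migrated in between, and the tick difference may need adjusting or discarding.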
Great.
The other thing rdtscp introduced was half-fencing in terms of out-of-order execution. From the manual:
The RDTSCP instruction is not a serializing instruction, but it does wait until all previous instructions have executed and all previous loads are globally visible. But it does not wait for previous stores to be globally visible, and subsequent instructions may begin execution before the read operation is performed.
So it's like putting an lfence before the rdtscp, but not after. What is the point of this half-fencing behavior? If you want a general timestamp and don't care about instruction ordering, the unfenced behavior is what you want. If you want to use this for timing short code sections, the half-fencing behavior is useful only for the second (final) reading, but not for the initial reading, since the fence is on the "wrong" side (in practice you want fences on both sides, but having them on the inside is probably the most important).
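In C, one way to take advantage of the half-fence for the final reading might look like the sketch below: lfence; rdtsc; lfence at the top of the region, and rdtscp followed by an lfence at the bottom. It assumes GCC or Clang on x86 and the intrinsics from <x86intrin.h>; region_begin, region_end and sum_array are hypothetical names, and the only ordering relied on is what the manual text quoted above describes.

```c
#include <stddef.h>
#include <stdint.h>
#include <x86intrin.h>  /* __rdtsc, __rdtscp, _mm_lfence (GCC/Clang, x86) */

/* Start of a timed region: fences on both sides of the reading. */
static inline uint64_t region_begin(void)
{
    _mm_lfence();
    uint64_t t = __rdtsc();
    _mm_lfence();
    return t;
}

/* End of a timed region: rdtscp waits for earlier instructions by itself,
 * so only the trailing side needs an explicit lfence. */
static inline uint64_t region_end(void)
{
    unsigned int aux;            /* auxiliary MSR value, unused here */
    uint64_t t = __rdtscp(&aux);
    _mm_lfence();
    return t;
}

/* Hypothetical workload to place between the two readings. */
static inline uint64_t sum_array(const uint32_t *v, size_t n)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += v[i];
    return sum;
}
```

The measured interval would be region_end() minus region_begin(), with the workload called in between.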
What purpose does such half-fencing serve?
[1] I'm ignoring the upper 32 bits of the counter in this case.
lfence's instruction-serializing behaviour, and lfence; rdtsc, wasn't as widely known when RDTSCP was designed as it is now (after Spectre)? IDK if Intel cared about a portable (to AMD) guarantee that rdtscp could be used at the end of timed regions without including a heavy serializing instruction like cpuid, once AMD implemented it. That seems unlikely, but maybe making sure people could avoid cpuid;rdtsc with Intel CPUs was a goal. (The cpuid;rdtsc at the top of a timed region is "fine" because cpuid is outside the timed region.) – Bierce
lfence;rdtsc;lfence is usually a good thing at the top of a timed region so it samples the time before letting the timed region start. – Bierce
rdtscp can be used to determine whether two critical sections or parts of some transactions overlapped, and do something about it if that happened. I believe that the ability to serialize previous loads and to determine core migrations is required for that purpose. Serializing later loads is not needed. See this and this. In addition, this can also be useful for crash consistency. But I don't know much about this area, so I'd rather not write an answer. – Coiffeur
rdtscp on modern implementations does not actually have "half fence" behavior, but rather full fence behavior like lfence: instructions on either side of an rdtscp call seem not to be able to overlap. – Moncrief