What's up with the "half fence" behavior of rdtscp?

For many years x86 CPUs have supported the rdtsc instruction, which reads the "time stamp counter" of the current CPU. The exact definition of this counter has changed over time, but on recent CPUs it is a counter that increments at a fixed frequency with respect to wall clock time, so it is very useful as a building block for a fast, accurate clock or for measuring the time taken by small segments of code.

One important fact about the rdtsc instruction is that it isn't ordered in any special way with respect to the surrounding code. Like most instructions, it can be freely reordered with respect to other instructions with which it has no dependency relationship. This is actually "normal", and for most instructions it's just a mostly invisible way of making the CPU faster (it's just a long-winded way of saying out-of-order execution).

For rdtsc it is important because it means you might not be timing the code you expect to be timing. For example, given the following sequence [1]:

rdtsc
mov ecx, eax
mov rdi, [rdi]
mov rdi, [rdi]
rdtsc

You might expect the rdtsc pair to measure the latency of the two pointer-chasing loads mov rdi, [rdi]. In practice, however, even if both of these loads take a long time (100s of cycles if they miss in the cache), you'll get a fairly small reading for the rdtsc pair. The problem is that the second rdtsc doesn't wait for the loads to finish; it just executes out of order, so you aren't timing the interval you think you are. Perhaps both rdtsc instructions even execute before the first load starts, depending on how rdi was calculated in the code prior to this example.

So far, this is sounding more like an answer to a question nobody asked than a real question, but I'm getting there.

You have two basic use-cases for rdtsc:

  • As a quick timestamp, in which case you usually don't care exactly how it reorders with the surrounding code, since you probably don't have an instruction-level concept of where the timestamp should be taken, anyway.
  • As a precise timing mechanism, e.g., in a micro-benchmark. In this case you'll usually protect your rdtsc from re-ordering with the lfence instruction. For the example above, you might do something like:

    lfence
    rdtsc
    lfence
    mov ecx, eax
    ...
    lfence
    rdtsc
    

    This ensures that the timed instructions (...) don't escape outside of the timed region, and also that instructions from outside the timed region don't come in (probably less of a problem, but they may compete for resources with the code you want to measure).
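For reference, the fenced pattern above might look roughly like this in C. This is only a sketch, assuming GCC or Clang on x86-64 and the intrinsics from <x86intrin.h>; the names tsc_begin, tsc_end, struct node and time_two_loads are made up for illustration and are not from any library.

```c
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc, _mm_lfence (assumes GCC/Clang on x86) */

/* Opening timestamp: lfence; rdtsc; lfence */
static inline uint64_t tsc_begin(void) {
    _mm_lfence();               /* earlier instructions finish before rdtsc */
    uint64_t t = __rdtsc();
    _mm_lfence();               /* timed code does not start before rdtsc */
    return t;
}

/* Closing timestamp: lfence; rdtsc */
static inline uint64_t tsc_end(void) {
    _mm_lfence();               /* timed code finishes before rdtsc */
    return __rdtsc();
}

/* Hypothetical stand-in for the two "mov rdi, [rdi]" loads. */
struct node { struct node *next; };

/* Times two dependent loads; the final pointer is written to *out so the
   loads cannot be optimized away. The compiler may still schedule code
   differently from what the source suggests, so check the generated asm. */
static uint64_t time_two_loads(struct node *p, struct node **out) {
    uint64_t t0 = tsc_begin();
    p = p->next;
    p = p->next;
    *out = p;
    uint64_t t1 = tsc_end();
    return t1 - t0;
}
```

The result is a raw TSC delta, which includes the overhead of the fences and of rdtsc itself, so for very short regions you would probably want to subtract the reading obtained from an empty region.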

Years later, Intel looked down upon us poor programmers and came up with a new instruction: rdtscp. Like rdtsc it returns a reading of the time stamp counter, and this guy does something more: it reads a core-specific MSR value atomically with the timestamp reading. On most OSes this contains a core ID value. I think the idea is that this value can be used to properly adjust the returned value to real time on CPUs that may have different TSC offsets per core.
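To make the "two values at once" part concrete, here is a small C sketch, again assuming GCC or Clang on x86-64 and a CPU that supports rdtscp. The struct tsc_sample and read_tscp names are hypothetical; what the aux value means depends on what the OS wrote into the MSR, so it is treated as an opaque number here.

```c
#include <stdint.h>
#include <x86intrin.h>   /* __rdtscp (assumes GCC/Clang on x86) */

/* Hypothetical container for one rdtscp reading. */
struct tsc_sample {
    uint64_t tsc;   /* time stamp counter value */
    uint32_t aux;   /* MSR value read together with the TSC (OS-defined) */
};

static inline struct tsc_sample read_tscp(void) {
    struct tsc_sample s;
    unsigned int aux;
    s.tsc = __rdtscp(&aux);   /* one instruction returns both values */
    s.aux = aux;
    return s;
}
```

If two samples carry different aux values, the thread presumably moved between cores in the interval, and the difference of the two tsc fields may not be meaningful without further adjustment.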

Great.

The other thing rdtscp introduced was half-fencing in terms of out-of-order execution:

From the manual:

The RDTSCP instruction is not a serializing instruction, but it does wait until all previous instructions have executed and all previous loads are globally visible. But it does not wait for previous stores to be globally visible, and subsequent instructions may begin execution before the read operation is performed.

So it's like putting an lfence before the rdtscp, but not after. What is the point of this half-fencing behavior? If you want a general timestamp and don't care about instruction ordering, the unfenced behavior is what you want. If you want to use this for timing short code sections, the half-fencing behavior is useful only for the second (final) reading, but not for the initial reading, since the fence is on the "wrong" side (in practice you want fences on both sides, but having them on the inside is probably the most important).
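In other words, the one place the documented behavior fits is the closing reading. A sketch of that usage in C, assuming GCC or Clang on x86-64 with <x86intrin.h>; region_begin and region_end are names I made up, and the trailing lfence is my own addition to cover the "after" side that the manual leaves open:

```c
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc, __rdtscp, _mm_lfence (GCC/Clang, x86) */

/* Opening reading: rdtsc needs explicit fences on both sides. */
static inline uint64_t region_begin(void) {
    _mm_lfence();
    uint64_t t = __rdtsc();
    _mm_lfence();
    return t;
}

/* Closing reading: per the manual, rdtscp itself waits for earlier
   instructions, so no lfence is placed before it. */
static inline uint64_t region_end(unsigned int *aux) {
    uint64_t t = __rdtscp(aux);
    _mm_lfence();   /* optional: keeps later code from starting early */
    return t;
}
```

So rdtscp saves one lfence at the end of the region, but the start of the region still looks exactly like the plain rdtsc version.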

What purpose does such half-fencing serve?


[1] I'm ignoring the upper 32 bits of the counter in this case.

Moncrief answered 4/9, 2018 at 3:53 Comment(5)
I wonder if lfence's instruction-serializing behaviour, and lfence; rdtsc, wasn't as widely known when RDTSCP was designed as it is now (after Spectre)? IDK if Intel cared about a portable (to AMD) guarantee that rdtscp could be used at the end of timed regions without including a heavy serializing instruction like cpuid, once AMD implemented it. That seems unlikely, but maybe making sure people could avoid cpuid;rdtsc with Intel CPUs was a goal. (The cpuid;rdtsc at the top of a timed region is "fine" because cpuid is outside the timed region.) - Bierce
And yes, I know that lfence;rdtsc;lfence is usually a good thing at the top of a timed region so it samples the time before letting the timed region start. - Bierce
rdtscp can be used to determine whether two critical sections or parts of some transactions overlapped, and do something about it if that happened. I believe that the ability to serialize previous loads and to determine core migrations are required for that purpose. Serializing later loads is not needed. See this and this. In addition, this can also be useful for crash consistency. But I don't know much about this area, so I'd rather not write an answer. - Coiffeur
@HadiBrais - excellent references, thanks. Still looking at them. - Moncrief
FWIW, based on my testing, rdtscp on modern implementations does not actually have "half fence" behavior, but rather full-fence behavior like lfence: instructions on either side of an rdtscp call seem not to be able to overlap. - Moncrief
