Is there any difference in between (rdtsc + lfence + rdtsc) and (rdtsc + rdtscp) in measuring execution time?
As far as I know, the main difference between the rdtsc and rdtscp instructions in terms of runtime ordering is whether execution waits until all previous instructions have executed locally.

In other words, lfence + rdtsc = rdtscp, because an lfence preceding rdtsc ensures that the rdtsc executes only after all previous instructions have finished locally.

However, I've seen some example code that uses rdtsc at the start of the measurement and rdtscp at the end. Is there any difference between using two rdtsc instructions and using rdtsc + rdtscp?

    lfence
    rdtsc
    lfence
    ...
    ...
    ...
    lfence
    rdtsc
    lfence

versus

    lfence
    rdtsc
    lfence
    ...
    ...
    ...
    rdtscp
    lfence
Slipsheet answered 15/1, 2020 at 21:10 Comment(2)
To get meaningful results, there should also be an lfence after the last rdtsc(p).Laudanum
Yeah, you are right: that prevents the last rdtsc(p) instruction from being reordered with the following instructions.Slipsheet

TL;DR

rdtscp and lfence/rdtsc have exactly the same upstream serialization properties on Intel processors. On AMD processors with a dispatch-serializing lfence, both sequences also have the same upstream serialization properties. With respect to later instructions, rdtsc in the lfence/rdtsc sequence may be dispatched for execution simultaneously with later instructions. This behavior may be undesirable if you also want to time these later instructions precisely. It is generally not a problem, because the reservation station (RS) scheduler prioritizes older uops for dispatch as long as there are no structural hazards. After lfence retires, the rdtsc uops would be the oldest in the RS, probably with no structural hazards, so they are dispatched immediately (possibly together with some later uops). You could also put an lfence after rdtsc.

The Intel manual V2 says the following about rdtscp (emphasis mine):

The RDTSCP instruction is not a serializing instruction, but it does wait until all previous instructions have executed and all previous loads are globally visible. But it does not wait for previous stores to be globally visible, and subsequent instructions may begin execution before the read operation is performed.

The "read operation" part here refers to reading the time-stamp counter. This suggests that rdtscp internally works like lfence followed by rdtsc + reading IA32_TSC_AUX. That is, lfence is performed first then the two reads from the registers are executed (possibly at the same time).
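To make the comparison concrete, here is a minimal C sketch of the two ways of taking a timestamp. It assumes GCC or Clang on x86-64 and the intrinsics from `<x86intrin.h>`; the helper names `tsc_lfence_rdtsc` and `tsc_rdtscp` are hypothetical and only serve the illustration.

```c
#include <stdint.h>
#include <x86intrin.h>  /* __rdtsc, __rdtscp, _mm_lfence (GCC/Clang, x86-64 assumed) */

/* lfence + rdtsc: the lfence makes rdtsc wait until earlier
   instructions have completed locally. */
static inline uint64_t tsc_lfence_rdtsc(void)
{
    _mm_lfence();
    return __rdtsc();
}

/* rdtscp: waits for earlier instructions on its own and additionally
   returns the value of IA32_TSC_AUX (delivered in ECX) through *aux. */
static inline uint64_t tsc_rdtscp(uint32_t *aux)
{
    unsigned int a;
    uint64_t t = __rdtscp(&a);
    *aux = a;
    return t;
}
```

Neither helper stops later instructions from starting early; that still takes an lfence after the read.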

On most Intel and AMD processors that support these instructions, lfence/rdtsc has slightly more uops than rdtscp. The number of lfence uops given in Agner's tables is for the case where lfence instructions are executed back-to-back, which makes it appear that lfence decodes into fewer uops (1 or 2) than a single lfence actually decodes into (5 or 6 uops). Usually, lfence is used without other back-to-back lfences. That's why lfence/rdtsc contains more uops than rdtscp. Agner's tables also show that on some processors rdtsc and rdtscp have the same number of uops, which I'm not sure is correct. It makes more sense for rdtscp to have at least one more uop than rdtsc. That said, the latency may be more important than the difference in the number of uops, because latency is what directly impacts the measurement overhead.

In terms of portability, rdtsc is older than rdtscp; rdtsc was first supported on the Pentium processors while the first processors that support rdtscp were released in 2005-2006 (See: What is the gcc cpu-type that includes support for RDTSCP?). But most Intel and AMD processors that are in use today support rdtscp. Another dimension for comparing between the two sequences is that rdtscp pollutes one more register (i.e., ECX) than rdtsc.

In summary, if you don't care about reading the IA32_TSC_AUX MSR, there is no particularly big reason why you should choose one over the other. I would use rdtscp and fall back to lfence/rdtsc (or lfence/rdtsc/lfence) on processors that don't support it. If you want maximum timing precision, use the method discussed in Memory latency measurement with time stamp counter.
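The fallback strategy could be sketched roughly as follows. This assumes GCC or Clang on x86-64 (`<cpuid.h>` and `<x86intrin.h>`); `have_rdtscp` and `read_tsc_ordered` are hypothetical names. RDTSCP support is reported in CPUID leaf 80000001H, EDX bit 27.

```c
#include <stdint.h>
#include <cpuid.h>      /* __get_cpuid (GCC/Clang) */
#include <x86intrin.h>  /* __rdtsc, __rdtscp, _mm_lfence */

/* Returns 1 if CPUID.80000001H:EDX[27] (RDTSCP) is set, else 0. */
static int have_rdtscp(void)
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(0x80000001u, &eax, &ebx, &ecx, &edx))
        return 0;                 /* extended leaf not available */
    return (int)((edx >> 27) & 1u);
}

/* Hypothetical helper: a TSC read that waits for earlier instructions. */
static uint64_t read_tsc_ordered(void)
{
    static int use_rdtscp = -1;   /* -1: not probed yet (not thread-safe) */
    unsigned int aux;             /* IA32_TSC_AUX, ignored here */

    if (use_rdtscp < 0)
        use_rdtscp = have_rdtscp();

    if (use_rdtscp)
        return __rdtscp(&aux);

    _mm_lfence();                 /* fallback: lfence + rdtsc */
    return __rdtsc();
}
```

The runtime branch adds a little overhead of its own, so for very short regions it may be preferable to select the variant once, outside the timed code.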


As Andreas Abel pointed out, you still need an lfence after the last rdtsc(p) as it is not ordered w.r.t. subsequent instructions:

    lfence                    lfence
    rdtsc      -- ALLOWED --> B
    B                         rdtsc

    rdtscp     -- ALLOWED --> B
    B                         rdtscp

This is also addressed in the manuals.
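As an illustration, a timed region with a fence after the final read might look like the sketch below. It assumes GCC or Clang on x86-64; `time_region` and `busy_loop` are hypothetical names, and the workload exists only so there is something to time.

```c
#include <stdint.h>
#include <x86intrin.h>  /* __rdtsc, __rdtscp, _mm_lfence (GCC/Clang, x86-64 assumed) */

/* Hypothetical workload used only for illustration. */
static void busy_loop(void *arg)
{
    volatile unsigned int *n = arg;
    for (unsigned int i = 0; i < 1000; i++)
        (*n)++;
}

/* Hypothetical helper: times one call of fn(arg) in TSC ticks. */
static uint64_t time_region(void (*fn)(void *), void *arg)
{
    unsigned int aux;               /* IA32_TSC_AUX, unused here */

    _mm_lfence();                   /* earlier work finishes before the first read */
    uint64_t start = __rdtsc();
    _mm_lfence();                   /* fn does not start before the first read */

    fn(arg);

    uint64_t end = __rdtscp(&aux);  /* waits for fn's instructions to execute */
    _mm_lfence();                   /* later code does not start before the last read */

    return end - start;
}
```

A call such as `time_region(busy_loop, &counter)` returns the tick count including the fixed overhead of the fences and the TSC reads, which would normally be measured separately with an empty region and subtracted.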


Regarding the use of rdtscp, it seems correct to me to think of it as a compact lfence + rdtsc.
The manuals use different terminology for the two instructions (e.g. "completed locally" vs "globally visible" for loads) but the behavior described seems to be the same.
I'm assuming so in the rest of this answer.

However, rdtscp is a single instruction, while lfence + rdtsc are two, which makes the lfence part of the profiled code.
Granted, lfence should be lightweight in terms of back-end execution resources (it is just a marker), but it still occupies front-end resources (two uops?) and a slot in the ROB.
rdtscp is decoded into a greater number of uops due to its ability to read IA32_TSC_AUX, so while it saves (part of) the front-end resources, it occupies the back end more.
If the read of the TSC is done before (or concurrently with) the read of the processor ID, then these extra uops are only relevant for the subsequent code.
This could be a reason why it is used at the end but not at the start of the benchmark (where the extra uops would affect the profiled code). This is enough to bias/complicate some micro-architectural benchmarks.

You cannot avoid the lfence after an rdtsc(p), but you can avoid the one before it by using rdtscp.
This saving seems unnecessary for the first rdtsc, as the preceding lfence is not profiled anyway.


Another reason to use rdtscp at the end is that it was (according to Intel) meant to detect a migration to a different CPU (that's why it also atomically loads IA32_TSC_AUX), so at the end of the profiled code you may want to check that the code has not been scheduled onto another CPU.

User mode software can use RDTSCP to detect if CPU migration has occurred between successive reads of the TSC.

This, of course, requires having read IA32_TSC_AUX beforehand (to have something to compare against), so one should have an rdpid or rdtscp before the profiled code.
If one can afford to clobber ecx, the first rdtsc can be an rdtscp too (but see above); otherwise (rather than storing the processor ID while in the profiled code), rdpid can be used first (thus having an rdtsc + rdtscp pair around the profiled code).

This is open to the ABA problem, so I don't think Intel has a strong point here (unless we restrict ourselves to code short enough to be rescheduled at most once).

EDIT As PeterCordes pointed out, from the point of view of the elapsed time measure, having a migration A->B->A is not an issue as the reference clock is the same.
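A minimal sketch of the migration check, using rdtscp at both ends for simplicity (rather than rdpid at the start). It assumes GCC or Clang on x86-64 and an OS that writes a distinct value per logical CPU into IA32_TSC_AUX, as Linux does; `struct tsc_sample`, `time_with_migration_check`, and `sample_work` are hypothetical names.

```c
#include <stdint.h>
#include <x86intrin.h>  /* __rdtscp, _mm_lfence (GCC/Clang, x86-64 assumed) */

/* Hypothetical result type for one timed region. */
struct tsc_sample {
    uint64_t ticks;     /* end - start */
    int migrated;       /* 1 if IA32_TSC_AUX differed between the two reads */
};

/* Hypothetical workload used only for illustration. */
static void sample_work(void *arg)
{
    volatile unsigned int *n = arg;
    for (unsigned int i = 0; i < 1000; i++)
        (*n)++;
}

static struct tsc_sample time_with_migration_check(void (*fn)(void *), void *arg)
{
    unsigned int aux_start, aux_end;
    struct tsc_sample s;

    uint64_t start = __rdtscp(&aux_start);  /* TSC and IA32_TSC_AUX read together */
    _mm_lfence();

    fn(arg);

    uint64_t end = __rdtscp(&aux_end);
    _mm_lfence();

    s.ticks = end - start;
    /* Different values mean the two reads ran on different CPUs. An
       A->B->A migration is not detected, but then both reads used the
       same clock anyway. */
    s.migrated = (aux_start != aux_end);
    return s;
}
```

When `migrated` is set, `ticks` is only trustworthy if the TSCs of the two CPUs are synchronized, so a conservative caller could simply discard that sample and repeat the measurement.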


More information on why rdtsc(p) is not fully serializing: Why isn't RDTSC a serializing instruction?

Liegeman answered 15/1, 2020 at 21:11 Comment(13)
I think at the bottom of a timed region, you really do want lfence;rdtsc;lfence, or rdtscp;lfence. I'm not sure exactly why stopping later instructions from running while the final TSC read is happening matters, but it does give more consistent results. (e.g. Hadi recommended it for measuring cache miss latency). Oh, I think I just understood your "valid" arrow diagram: you're showing reordering allowed by the CPU which you don't want. CPUs normally execute oldest-ready-first, thoughBeckerman
If you do manage to have an ABA migration within one timed region (e.g. another interrupt a few instructions after entering user-space after first migration), you'll still be measuring elapsed time accurately because you're looking at the same clock for start and end times. RDTSCP lets you detect the case of an apparently-reasonable time interval when actually you were subtracting times from two non-synced clocks. (Usually TSC is synced between cores because they all power up at the same time, and CPUs have constant_tsc / nonstop_tsc. But software can modify the TSC MSR and desync them.)Beckerman
I made some changes to the TL;DR section. I'm not sure whether I should have posted my own answer instead of editing your answer, but I think you were on the right direction. (Let me know if you'd like me to remove my edit and post it as a separate answer.) Anyway, note that some changes may also be needed in the other parts of the answer to make the whole thing consistent.Nicko
@PeterCordes, good point about the elapsed time! I also changed VALID with ALLOWED. It should be clearer.Liegeman
@HadiBrais Thank you very much Hadi! I've made the answer community wiki as your edit was significant. Feel free to grab anything you need if you want to post an answer of yours.Liegeman
What is "upstream serialization"?Divest
@Divest I think it means "serialization of all earlier, in program order, instructions".Liegeman
@MargaretBloom - thanks. I would say this term is not clear, at least to me. Clearer might be "earlier" or "older". I can't really tell if the first paragraph is trying to draw a distinction between lfence; rdtsc and rdtscp or not: since the serialization property is limited to upstream serialization, I wonder then if there is a corresponding downstream serialization which is different (but I think the answer kind of goes on to say there isn't, but then why bring upstream into the picture at all?).Divest
@Divest Maybe a more correct interpretation of "upstream" and "downstream" serialization could be "no reordering with earlier/older instructions" and "no reordering with later/younger instructions" respectively (both wrt program order). The "downstream" serialization of an instruction after lfence prevents at most concurrent execution (still a form of reordering, IMO) with later independent uops since the scheduler scans in program order. I would not have used the terms "upstream" and "downstream" but they still make sense to me. You should probably ping HadiBrais for more details.Liegeman
Thanks @MargaretBloom. I understand now that this answer has been written by more than one person, which explains further questions I had.Divest
I am still confused because of Hadi's comment on the other answer. /* rdtscp is not suitable for measuring very small sections of code because the write to its parameter occurs after sampling the TSC and it impacts compiler optimizations and code gen, thereby perturbing the measurement */ Does it mean that, because the RDTSCP instruction is converted to micro-ops such as lfence + rdtsc + reading IA32_TSC_AUX, the converted micro-ops can be reordered at runtime such that the read of IA32_TSC_AUX is executed before the TSC is read?Slipsheet
It might be wrong because there is no load-load reordering at runtime on Intel cores. Then, does it refer only to the case when the code uses rdtscp with no proper barrier? Is the intrinsic (__rdtscp()) implemented with no barrier following the RDTSCP? That is, does it store the TSC to the address pointed to by its parameter without any barrier between rdtscp and the store instruction? I am not quite sure why using RDTSCP instead of RDTSC results in less precise timing measurement.Slipsheet
@JaehyukLee Yeah that's not accurate. I've updated that answer. Thank you for pointing that out.Nicko
