Why is CPUID + RDTSC unreliable?

I am trying to profile code for execution time on an x86-64 processor. I am referring to this Intel white paper and have also gone through other SO threads discussing the topic of RDTSCP vs. CPUID+RDTSC here and here.

In the above-mentioned white paper, the CPUID+RDTSC method is deemed unreliable, and this is demonstrated with statistics.

What might be the reason for CPUID+RDTSC being unreliable?

Also, the graphs in Figure 1 (minimum value behavior) and Figure 2 (variance behavior) in the same white paper show a "square wave" pattern. What explains such a pattern?

Neal answered 24/12, 2018 at 0:46 Comment(0)

I think they're finding that CPUID inside the measurement interval causes extra variability in the total time. Their proposed fix in 3.2 Improvements Using RDTSCP Instruction highlights the fact that there's no CPUID inside the timed interval when they use CPUID / RDTSC to start, and RDTSCP/CPUID to stop.
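For reference, here is a minimal sketch of that pattern in GNU C inline asm for x86-64 (the function names are mine, not from the white paper): CPUID+RDTSC to start, RDTSCP followed by CPUID to stop, so no CPUID runs inside the timed interval.

```c
#include <stdint.h>

// Start: CPUID serializes (nothing older is still in flight), then RDTSC
// reads the timestamp.  With "=a"/"=d" constraints the compiler picks up
// EAX/EDX directly.
static inline uint64_t tsc_start(void)
{
    uint32_t lo, hi;
    asm volatile("cpuid\n\t"
                 "rdtsc"
                 : "=a"(lo), "=d"(hi)
                 : "a"(0)              // keep the CPUID leaf constant (EAX=0),
                                       // as the next paragraph suggests
                 : "rbx", "rcx", "memory");
    return ((uint64_t)hi << 32) | lo;
}

// Stop: RDTSCP waits for prior instructions to finish before reading the
// TSC; the trailing CPUID keeps later instructions from starting early.
// Here the movs really are needed, because CPUID clobbers EAX/EDX.
static inline uint64_t tsc_stop(void)
{
    uint32_t lo, hi;
    asm volatile("rdtscp\n\t"
                 "mov %%eax, %0\n\t"   // save TSC low half before CPUID
                 "mov %%edx, %1\n\t"   // save TSC high half
                 "xor %%eax, %%eax\n\t" // fixed leaf 0 for the fence CPUID
                 "cpuid"
                 : "=r"(lo), "=r"(hi)
                 :
                 : "rax", "rbx", "rcx", "rdx", "memory");
    return ((uint64_t)hi << 32) | lo;
}
```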

Perhaps they could have ensured EAX=0 or EAX=1 before executing CPUID, to choose which CPUID leaf of data to read (http://www.sandpile.org/x86/cpuid.htm#level_0000_0000h), in case the time CPUID takes depends on which query you make. Other than that, I'm not sure why it would be unreliable.

Or better, use lfence instead of cpuid to serialize out-of-order execution without a fully serializing operation.


Note that the inline asm in Intel's whitepaper sucks: there's no need for those mov instructions if you use proper output constraints like "=a"(low), "=d"(high). See How to get the CPU cycle count in x86_64 from C++? for better ways.
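For instance, here is a minimal sketch of my own combining both suggestions, under the assumption that lfence acts as an execution barrier (guaranteed on Intel; on AMD only with Spectre mitigations enabled):

```c
#include <stdint.h>

// lfence blocks rdtsc from executing until earlier instructions have
// completed, without the cost of a fully serializing cpuid.  The "=a"/"=d"
// constraints let the compiler read EAX/EDX directly: no mov needed.
static inline uint64_t rdtsc_fenced(void)
{
    uint32_t low, high;
    asm volatile("lfence\n\t"
                 "rdtsc"
                 : "=a"(low), "=d"(high)
                 :: "memory");
    return ((uint64_t)high << 32) | low;
}
```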

Longcloth answered 24/12, 2018 at 1:22 Comment(5)
The "extra" CPUID meddling with the measurement is understandable. However, the second part of your answer is not clear yet.Neal
@Pramod: I'm suggesting that including a CPUID in the measurement interval might be less bad if you make sure EAX=0 when it runs (sandpile.org/x86/cpuid.htm#level_0000_0000h), in case some leaves take longer to query than others.Longcloth
ah ok. Thanks! What might be the reason for the "square wave" pattern? Initially, I thought that it might be something related to Cache. But in section 1.2, it is assumed that every factor of nondeterminism is removed. Caching also counts towards nondeterminism I believe. Do you have any thoughts on that?Neal
@Pramod: I don't have an explanation for that. I'm curious, too. I skimmed through the paper, but I don't see what code they say they're measuring. They may just be measuring an empty block to evaluate measurement overhead, so rdtsc and rdtscp are back-to-back? I don't think there's any memory (cache) access inside the timed region. I haven't tried to replicate the experiment on my Skylake CPU. I'd expect that user-space on a mostly idle system should be close enough to what they're doing (in kernel with interrupts disabled.)Longcloth
yes. You're right. They are measuring an empty block, line 46 of the code in Appendix. By cache, I meant Instruction cache. It is a good idea to try it on some platform (on PC CPU is not that interesting and also I'm dreadful to alter the kernel code, atleast for now :))Neal

Another reason CPUID+RDTSC can be unreliable is VM detection via a timing side channel.

Executing a CPUID instruction inside a VM causes a VM exit, which happens so the hypervisor can intercept CPUID and manipulate the values it returns however it wants. That manipulation adds extra time, so RDTSC will report a "high" value, since the entire VM-exit handling of CPUID is executed inside the measured interval. This value can then be used to detect that the code is running inside a VM.

A hypervisor can counter this by scaling or virtualizing the TSC, which in turn makes RDTSC itself unreliable.
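A rough sketch of that detection idea (the threshold, names, and cycle figures below are illustrative assumptions, not values from the linked article):

```c
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>   // __rdtsc() in GCC/Clang

// Bare-metal CPUID typically costs on the order of 100-200 cycles, while a
// VM-exit round trip usually costs well over 1000.  The threshold below is
// an illustrative assumption; real detectors average many samples.
static int cpuid_looks_trapped(void)
{
    uint32_t eax = 0, ebx, ecx = 0, edx;
    uint64_t t0 = __rdtsc();
    asm volatile("cpuid"
                 : "+a"(eax), "=b"(ebx), "+c"(ecx), "=d"(edx));
    uint64_t t1 = __rdtsc();
    return (t1 - t0) > 1000;   // assumed threshold; tune per CPU
}

int main(void)
{
    printf("CPUID timing suggests %s\n",
           cpuid_looks_trapped() ? "a hypervisor (or noise)" : "bare metal");
    return 0;
}
```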

Detecting VM Exit Overhead

Laurustinus answered 4/1, 2019 at 0:52 Comment(0)
