Variance in RDTSC overhead

I'm constructing a micro-benchmark to measure performance changes as I experiment with the use of SIMD instruction intrinsics in some primitive image processing operations. However, writing useful micro-benchmarks is difficult, so I'd like to first understand (and if possible eliminate) as many sources of variation and error as possible.

One factor that I have to account for is the overhead of the measurement code itself. I'm measuring with RDTSC, and I'm using the following code to find the measurement overhead:

// Serialize with CPUID (leaf 0), then read the time-stamp counter.
extern inline unsigned long long __attribute__((always_inline)) rdtsc64() {
    unsigned int hi, lo;
    __asm__ __volatile__(
        "xorl %%eax, %%eax\n\t"   // select CPUID leaf 0
        "cpuid\n\t"               // serializing: completes all prior instructions
        "rdtsc"                   // read TSC into edx:eax
        : "=a"(lo), "=d"(hi)
        : /* no inputs */
        : "rbx", "rcx");          // cpuid also clobbers rbx and rcx
    return ((unsigned long long)hi << 32ull) | (unsigned long long)lo;
}

#include <vector>

void find_rdtsc_overhead() {
    const int trials = 1000000;

    std::vector<unsigned long long> times;
    times.resize(trials, 0);

    for (int i = 0; i < trials; ++i) {
        unsigned long long t_begin = rdtsc64();
        unsigned long long t_end = rdtsc64();
        times[i] = (t_end - t_begin);
    }

    // print frequencies of cycle counts
}
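
The printing that the comment elides can be a simple histogram; a minimal sketch of it, continuing inside find_rdtsc_overhead (the includes belong at the top of the file; this is illustrative, not my exact code):

#include <cstdio>
#include <map>

// Count how often each delta occurs, then print in ascending order.
std::map<unsigned long long, unsigned long long> freq;
for (int i = 0; i < trials; ++i)
    ++freq[times[i]];

std::printf("Frequency of occurrence (for %d trials):\n", trials);
for (std::map<unsigned long long, unsigned long long>::const_iterator
         it = freq.begin(); it != freq.end(); ++it)
    std::printf("%llu cycles (counted %llu times)\n", it->first, it->second);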

When running this code, I get output like this:

Frequency of occurrence (for 1000000 trials):
234 cycles (counted 28 times)
243 cycles (counted 875703 times)
252 cycles (counted 124194 times)
261 cycles (counted 37 times)
270 cycles (counted 2 times)
693 cycles (counted 1 times)
1611 cycles (counted 1 times)
1665 cycles (counted 1 times)
... (a bunch of larger times each only seen once)

My questions are these:

  1. What are the possible causes of the bimodal distribution of cycle counts generated by the code above?
  2. Why does the fastest time (234 cycles) occur only a handful of times? What highly unusual circumstance could reduce the count?

Further Information

Platform:

  • Linux 2.6.32 (Ubuntu 10.04)
  • g++ 4.4.3
  • Core 2 Duo (E6600); this has constant rate TSC.

SpeedStep has been turned off (processor is set to performance mode and is running at 2.4GHz); if running in 'ondemand' mode, I get two peaks at 243 and 252 cycles, and two (presumably corresponding) peaks at 360 and 369 cycles.

I'm using sched_setaffinity to lock the process to one core. If I run the test on each core in turn (i.e., lock to core 0 and run, then lock to core 1 and run), I get similar results for the two cores, except that the fastest time of 234 cycles tends to occur slightly fewer times on core 1 than on core 0.
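
For reference, a minimal sketch of that pinning (Linux-specific; error handling omitted):

#define _GNU_SOURCE   /* g++ defines this already; needed for sched_setaffinity */
#include <sched.h>

// Pin the calling process to one core so that both rdtsc readings
// come from the same core's TSC.
void pin_to_core(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    sched_setaffinity(0 /* current process */, sizeof(set), &set);
}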

Compile command is:

g++ -Wall -mssse3 -mtune=core2 -O3 -o test.bin test.cpp

The code that GCC generates for the core loop is:

.L105:
#APP
# 27 "test.cpp" 1
    xorl %eax, %eax
    cpuid
    rdtsc
# 0 "" 2
#NO_APP
    movl    %edx, %ebp
    movl    %eax, %edi
#APP
# 27 "test.cpp" 1
    xorl %eax, %eax
    cpuid
    rdtsc
# 0 "" 2
#NO_APP
    salq    $32, %rdx
    salq    $32, %rbp
    mov %eax, %eax
    mov %edi, %edi
    orq %rax, %rdx
    orq %rdi, %rbp
    subq    %rbp, %rdx
    movq    %rdx, (%r8,%rsi)
    addq    $8, %rsi
    cmpq    $8000000, %rsi
    jne .L105
Potable answered 21/6, 2011 at 22:15

Comments:

I got exactly the same variance with my own measuring code. – Trichology
@drhirsch: That's good to know. Could you say what CPU you're measuring on, and whether there was any major difference in methodology? – Potable
A Core 2 Q9550, the 45 nm version of the 6600. I accepted a value of 243, knowing full well I was measuring a lot of crap. – Trichology
According to ccsl.carleton.ca/~jamuir/rdtscpm1.pdf, CPUID has annoyingly variable latency. You could try repeating without CPUID. There's also apparently RDTSCP, which is serializing, if your processor supports it. – Preemie
@tc: Thanks for the link -- I think I found a reference to that article somewhere else, but the link it had was broken. – Potable
@tc: As far as I can see, that article only says that CPUID has a warm-up effect (it's slower the first couple of times it's used). Of course, that effect may have become more complex since the Pentium II came out. – Potable
Given that the OS and other things probably call CPUID several times before your app gets a chance to run, it's likely that CPUID also has a "cool-down" effect. It would also be interesting to see what performance without CPUID was like. I don't have an i7 handy to test RDTSCP. – Preemie
@tc: Indeed. Unfortunately, I don't have an i7 available for testing either. Perhaps the slower values are those measured immediately after a context switch back to the process. – Potable
The risk of disabling CPUID is that the out-of-order execution engine could reorder your measurements... Please take a look at: github.com/petersenna/rdtscbench – Afterclap
You can use lfence; rdtsc to serialize execution (drain the ROB) before RDTSC executes. It's much lower overhead than cpuid. See this Q&A. See also: How to get the CPU cycle count in x86_64 from C++? – Sphygmo

RDTSC can return inconsistent results for a number of reasons:

  • On some CPUs (especially certain older Opterons), the TSC isn't synchronized between cores. It sounds like you're already handling this by using sched_setaffinity -- good!
  • If the OS timer interrupt fires while your code is running, there'll be a delay introduced while it runs. There's no practical way to avoid this; just throw out unusually high values (a sketch of such filtering follows this list).
  • Pipelining artifacts in the CPU can sometimes throw you off by a few cycles in either direction in tight loops. It's perfectly possible to have some loops that run in a non-integer number of clock cycles.
  • Cache! Depending on the vagaries of the CPU cache, memory operations (like the write to times[]) can vary in speed. In this case, you're fortunate that std::vector's storage is just a flat array; even so, that write can throw things off. This is probably the most significant factor for this code.
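
A minimal sketch of that kind of filtering (the cutoff is a hypothetical choice; any threshold well above the mode will do):

#include <algorithm>
#include <cstddef>
#include <vector>

// Drop samples far above the median; interrupts and context switches
// produce the rare, very large readings we want to discard.
std::vector<unsigned long long> drop_outliers(
        const std::vector<unsigned long long>& samples) {
    std::vector<unsigned long long> sorted(samples);
    std::sort(sorted.begin(), sorted.end());
    unsigned long long cutoff = sorted[sorted.size() / 2] * 2;  // hypothetical: 2x median

    std::vector<unsigned long long> kept;
    for (std::size_t i = 0; i < samples.size(); ++i)
        if (samples[i] <= cutoff)
            kept.push_back(samples[i]);
    return kept;
}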

I'm not enough of a guru on the Core2 microarchitecture to say exactly why you're getting this bimodal distribution, or how your code ran faster those 28 times, but it probably has something to do with one of the reasons above.

Anastaciaanastas answered 21/6, 2011 at 22:39

Comments:

Primarily just experience. Intel isn't particularly forthcoming with fine details of their architecture, especially when it comes to certain aspects of memory and cache access. – Anastaciaanastas
Looks like I'll just have to accept the variance without knowing its root cause (CPUID or cache effects being the most likely explanations so far). – Potable
@John Bartholomew: look at the generated assembly code and try to strip away as much as possible by modifying the source code. Personally, I don't think your xor and cpuid instructions are necessary. In addition, the rdtsc instruction places its results in edx:eax, which is where the compiler itself would return long long values. So the rbx:rcx clobbers only create unnecessary extra work for the compiler. – Philibeg
Intel's documentation says CPUID takes longer the first few times it's called; they recommend calling it 3 times beforehand. Also, cache is almost certainly a factor here: if you want a value of rdtsc to subtract from other benchmarks, you should just be touching one cache line -- use separate runs to get an average, a min, and a max (only gathering one each time, and use a branchless min/max implementation). Lastly, pinning to a core isn't enough to get rid of the scheduler; you also need to prevent other processes from running on that core. – Cummerbund
@OlofForshell: The xor and the cpuid are both required. The cpuid is needed because it's a serializing instruction (it flushes the pipeline). Without it, the rdtsc may execute before instructions ahead of it. The xor is needed to zero out eax because cpuid puts different results in eax based on its current contents, which may affect its running time and cause more variance. – Cummerbund
@Joseph Garvin: I guess I disagree with you, and I'll explain why. First of all, concerning shorter sequences: these are tricky to measure because of cache and pipelining effects which are difficult to predict. By using serializing instructions these effects are neutralized, which will typically result in measurements that indicate slower execution than in real life, because overlapping/parallel execution of the sequence and neighbouring instructions is forced to complete before the measurement is performed. My advice is to skip the serializing instructions and time longer sequences instead. – Philibeg
@OlofForshell: You will still have pipelining between the cpuid+rdtsc calls. The performance may be slower than 'real life', but typically you're doing these kinds of measurements so you can compare two alternatives, and what you care about is whether one is relatively faster than the other. You don't want one to look relatively faster than the other because, by some artifact of the microarchitecture, rdtsc gets executed before the last instructions in your first alternative and after the last instructions in your second alternative (the second will be handicapped). – Cummerbund
@Joseph Garvin: I think you've got it wrong. Typically you measure performance to get an indication of actual elapsed clock cycles rather than some artificial lab environment which, as you agree, may produce slower (and therefore misleading) results. So, in effect, you're introducing an unquantifiable unknown into the process (I prefer to leave them out if I can). Synchronizing instructions could also lead to a sequence being slower than the alternative when synchronized and faster when it isn't. How do you suggest such results should be handled? – Philibeg
@OlofForshell: Nobody ever cares about clock cycles unless they're optimizing (in which case they're comparing two alternatives to see which is faster) or they're trying to meet a real-time deadline (in which case the more pessimistic answer is better). My point is that when comparing the performance of two alternatives, using cpuid is less misleading -- without it, one may appear faster than the other because the second rdtsc gets executed before the work you're trying to measure. If you really want cycle accuracy, you can subtract out the cpuid/rdtsc overhead at the end, as Intel suggests. – Cummerbund
@Joseph Garvin: elapsed cycles divided by clock rate equals elapsed time. If you don't care about one, why should you care about the other? Instructions execute in a real world of caches, pipelining, out-of-order execution and the like. The sequence in question will most likely be executed in a more or less indistinguishable muddle with the sequence before it and the one after. Synchronizing will replace the real-world muddle with something you can measure, but something that obviously won't reflect real-world behaviour. I don't get it. – Philibeg
I also got the same behavior measuring a function's timings with rdtsc, but with larger measurement times (~2000 cycles). The effect seems to be the same, with a variance of thousands of cycles. Can such micro effects throw off timings by that number of cycles, considering that even memory accesses only cost around 400 cycles? Btw, this is after discounting outliers from the context switches. – Furthermost
@OlofForshell: stumbling on this 8 years later, I think what I was trying to say is this: if you only use 1 cache line to store a histogram with a branchless implementation, your variance will be smaller, because cache misses and branch prediction failures create variance. A real environment will have cache misses and branch prediction failures, yes, but they will be caused by your actual app, not your microbenchmark code. By having less of that in your microbenchmark, it makes things more realistic. At least that's what I think I meant. – Cummerbund

Intel's Software Developer's Manual recommends using lfence; rdtsc or rdtscp if you want to ensure that instructions prior to the rdtsc have actually executed, because rdtsc is not a serializing instruction by itself.
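
A minimal sketch of an lfence-based variant of the question's rdtsc64 (assuming a CPU where LFENCE waits for all prior instructions to complete locally, as Intel documents for its processors):

// Like rdtsc64() above, but fences with LFENCE instead of the much
// more expensive CPUID.
static inline unsigned long long rdtsc64_lfence(void) {
    unsigned int hi, lo;
    __asm__ __volatile__(
        "lfence\n\t"   // wait for preceding instructions to finish
        "rdtsc"        // read TSC into edx:eax
        : "=a"(lo), "=d"(hi));
    return ((unsigned long long)hi << 32) | (unsigned long long)lo;
}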

Argentite answered 14/12, 2011 at 8:46

You should make sure that frequency throttling and other power-saving ("green") functionality is disabled at the OS level, then restart the machine. Otherwise you may have a situation where the cores have unsynchronized time-stamp counter values.

The 243 reading is by far the most common, which is one reason for using it as the overhead. On the other hand, suppose you get an elapsed time < 243: when you subtract the overhead you get an underflow, and since the arithmetic is unsigned you end up with an enormous result. This fact speaks for using the lowest reading (234) instead. It is extremely difficult to accurately measure sequences that are only a few cycles long. On a typical x86 at a few GHz, I'd recommend against timing sequences shorter than 10 ns, and even at that length they would typically be far from rock-solid.
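
A minimal sketch of guarding against that underflow when adjusting for overhead (clamping to zero is one policy among several):

// Subtract the measured overhead, clamping at zero so unsigned
// wrap-around can never produce an enormous bogus result.
unsigned long long adjust_for_overhead(unsigned long long elapsed,
                                       unsigned long long overhead) {
    return (elapsed > overhead) ? (elapsed - overhead) : 0;
}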

The rest of my answer describes what I do, how I handle the results, and my reasoning on the subject matter.

As to overhead, the easiest way is to use code such as this:

unsigned long long rdtsc_inline (void);
unsigned long long rdtsc_function (void);

The first form emits the rdtsc instruction into the generated code (as in your code). The second causes the function to be called, the rdtsc to be executed, and a return instruction to follow; it may also generate a stack frame. Obviously, the second form is much slower than the first.

The (C) code for the overhead calculation can then be written:

unsigned long long start_cycle, end_cycle;  /* place these at the module level */
unsigned long long overhead;

/* place this code inside a function */
start_cycle = rdtsc_inline();
end_cycle   = rdtsc_inline();
overhead    = end_cycle - start_cycle;

If you are using the inline variant, you will get a low(er) overhead. You also run the risk of calculating an overhead which is greater than it "should" be (especially with the function form), which in turn means that if you measure very short/fast sequences, the previously calculated overhead may be greater than the measurement itself. When you attempt to adjust for the overhead, you will get an underflow, which will lead to messy conditions. The best way to handle this is to:

  1. time the overhead several times and always use the smallest value achieved (a minimal sketch follows this list),
  2. not measure really short code sequences, as you may run into pipelining effects which will require messy synchronising instructions before the rdtsc instruction, and
  3. if you must measure very short sequences, regard the results as indications rather than as facts.
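
Something along these lines, reusing the question's rdtsc64() (a sketch, not my exact code):

// Time the back-to-back overhead many times and keep the minimum;
// the smallest sample is the one least disturbed by interrupts.
unsigned long long min_overhead(int trials) {
    unsigned long long best = (unsigned long long)-1;
    for (int i = 0; i < trials; ++i) {
        unsigned long long t0 = rdtsc64();
        unsigned long long t1 = rdtsc64();
        if (t1 - t0 < best)
            best = t1 - t0;
    }
    return best;
}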

I've previously discussed what I do with the results in this thread.

Another thing that I do is to integrate the measurement code into the application itself; the overhead is insignificant. After a result has been computed, I send it to a special structure where I count the number of measurements, sum the x and x^2 values, and track the min and max measurements. Later on I can use the data to calculate the average and standard deviation. The structure itself is indexed, so I can measure different performance aspects, such as individual application functions ("functional performance") or time spent on CPU, disk reads/writes, and network reads/writes ("non-functional performance").
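
A minimal sketch of such an accumulator (hypothetical names; my real structure also carries the index described above):

#include <cmath>

// Running statistics: count, sum of x and x^2, and min/max are enough
// to recover the average and standard deviation later.
struct perf_stats {
    unsigned long long count;
    double sum_x, sum_x2;
    unsigned long long min, max;
};

void perf_record(perf_stats &s, unsigned long long x) {
    if (s.count == 0 || x < s.min) s.min = x;
    if (s.count == 0 || x > s.max) s.max = x;
    ++s.count;
    s.sum_x  += (double)x;
    s.sum_x2 += (double)x * (double)x;
}

double perf_mean(const perf_stats &s) { return s.sum_x / (double)s.count; }

double perf_stddev(const perf_stats &s) {
    double m = perf_mean(s);
    return std::sqrt(s.sum_x2 / (double)s.count - m * m);
}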

If an application is instrumented in this manner and monitored from the very beginning, I expect that the risk of it having performance problems during its lifetime will be greatly reduced.

Philibeg answered 17/8, 2011 at 12:49
