solution to rdtsc out of order execution?
I am trying to replace clock_gettime(CLOCK_REALTIME, &ts) with rdtsc to benchmark code execution time in terms of CPU cycles rather than wall-clock time. The execution time of the benchmarked code is critical for the software. I tried running the code on an x86_64 3.20 GHz Ubuntu machine on an isolated core and got the following numbers:

case 1 : clock_gettime : 24 ns

void gettime(Timespec &ts) {
        clock_gettime(CLOCK_REALTIME, &ts);
}

case 2 : rdtsc (without mfence and compiler barrier) : 10 ns

void rdtsc(uint64_t& tsc) {
        unsigned int lo,hi;
        __asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
        tsc = ((uint64_t)hi << 32) | lo;
}

case 3 : rdtsc (with mfence and compiler barrier) : 30 ns

void rdtsc(uint64_t& tsc) {
        unsigned int lo,hi;
        __asm__ __volatile__ ("mfence;rdtsc" : "=a" (lo), "=d" (hi) :: "memory");
        tsc = ((uint64_t)hi << 32) | lo;
}

The issue: I am aware that rdtsc is a non-serializing instruction and can be reordered by the CPU. An alternative is rdtscp, which is serializing, but instructions after rdtscp can still be reordered to before it. Using a memory barrier increases the execution time.

  • What is the best and most optimised way to benchmark latency-sensitive code?
  • Is there any way to optimise the cases I mentioned?
Restoration answered 14/2, 2019 at 12:43 Comment(1)
You might want to look at how Google Test does its profiling. – Oui
You want lfence;rdtsc to start the clock, and rdtscp;lfence to stop the clock, so the barriers are outside the timed interval.

(Or sometimes you want lfence;rdtsc;lfence to start the clock, for extra repeatability at the cost of more overhead.)

MFENCE is the wrong instruction for this; it's not guaranteed to serialize the instruction stream (though in practice it does on Skylake with up-to-date microcode, as a fix for an erratum). LFENCE serializes the instruction stream without waiting for the store buffer to drain, only for the ROB. This is always true on Intel, but on AMD only with Spectre mitigation enabled, which makes lfence more than just a NOP. (I guess AMD doesn't reorder movntdqa loads from WC memory, so lfence is meaningless as a memory barrier there, and is only useful as an execution barrier against speculative execution, or for RDTSC.)

See also How to get the CPU cycle count in x86_64 from C++? which has a section about serializing rdtsc. But also, you don't need inline asm for this; use __rdtsc() and _mm_lfence(). (But as usual with microbenchmarks, not a bad idea to check the compiler's asm output to make sure it did what you want.)


You can't avoid the overhead entirely; it's always going to be significant compared to the cost of a couple of instructions.

See also Clflush to invalidate cache line via C function for an example of subtracting the measurement overhead.

But also note that normally it's more useful to put the code under test in a loop, because execution latency before the result is ready is more meaningful than waiting until the instruction(s) actually retire from the ROB. See RDTSCP in NASM always returns the same value (timing a single instruction) for an example (in asm) of measuring a single insn for throughput / latency.

Recitativo answered 14/2, 2019 at 14:11 Comment(1)
Related: Difference between rdtscp, rdtsc : memory and cpuid / rdtsc? mentions that the Linux kernel used to use mfence to order execution of rdtsc on AMD CPUs, if rdtscp wasn't available. Apparently it did / does order execution as well as memory on AMD. It doesn't on Intel, except as a side-effect of being overly strong after a microcode update on Skylake for ordering NT stores from WC memory. – Recitativo
