I am trying to replace clock_gettime(CLOCK_REALTIME, &ts) with rdtsc so that I can benchmark code execution time in CPU cycles rather than server (wall-clock) time. The overhead of the timing call itself is critical for this software. I ran the following variants on an x86_64 3.20 GHz Ubuntu machine, pinned to an isolated core, and got these numbers (the measurement loop is sketched after the three cases):
case 1 : clock_gettime : 24 ns
void gettime(timespec &ts) {
    clock_gettime(CLOCK_REALTIME, &ts);
}
case 2 : rdtsc (without mfence or a compiler barrier) : 10 ns
void rdtsc(uint64_t &tsc) {
    unsigned int lo, hi;
    // RDTSC returns the timestamp counter in EDX:EAX
    __asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
    tsc = ((uint64_t)hi << 32) | lo;
}
case 3 : rdtsc (with mfence and a compiler barrier) : 30 ns
void rdtsc(uint64_t &tsc) {
    unsigned int lo, hi;
    // mfence orders prior loads/stores; the "memory" clobber acts as the compiler barrier
    __asm__ __volatile__ ("mfence;rdtsc" : "=a" (lo), "=d" (hi) :: "memory");
    tsc = ((uint64_t)hi << 32) | lo;
}
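The numbers above are the average cost of one timing call, measured roughly like the sketch below. The iteration count and simple averaging are illustrative rather than my exact harness, and it reuses the rdtsc() helper from case 2; dividing by 3.2 converts TSC ticks to nanoseconds at the nominal 3.20 GHz.

#include <cstdint>
#include <cstdio>

int main() {
    constexpr int kIters = 1000000;      // illustrative iteration count
    uint64_t start, end, tmp;
    rdtsc(start);
    for (int i = 0; i < kIters; ++i)
        rdtsc(tmp);                      // timing call under test (swap in gettime() for case 1)
    rdtsc(end);
    // 3.20 GHz nominal TSC => 3.2 ticks per nanosecond
    double ns_per_call = (double)(end - start) / 3.2 / kIters;
    printf("avg cost per timing call: %.1f ns\n", ns_per_call);
}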
The issue: I am aware that rdtsc is not a serializing instruction and can be reordered by the CPU. An alternative is rdtscp, which waits until all prior instructions have executed before reading the counter, but it is not fully serializing either: instructions that come after rdtscp can still be reordered to execute before it. And as case 3 shows, adding a memory barrier increases the cost of the timing call itself.
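For reference, this is a sketch of the rdtscp variant I am considering (untested in my setup). The trailing lfence is there to keep later instructions from starting before the timestamp is read, and the ecx clobber is needed because rdtscp also writes IA32_TSC_AUX into ECX:

void rdtscp(uint64_t &tsc) {
    unsigned int lo, hi;
    // rdtscp waits for prior instructions to retire; lfence stops later
    // instructions from starting before the counter has been read.
    __asm__ __volatile__ ("rdtscp; lfence"
                          : "=a" (lo), "=d" (hi)
                          :
                          : "ecx", "memory");
    tsc = ((uint64_t)hi << 32) | lo;
}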
- What is the most efficient and accurate way to benchmark latency-sensitive code?
- Is there any way to optimise the cases I mentioned above?