Assume we're trying to use the tsc for performance monitoring and we want to prevent instruction reordering.
These are our options:
1: rdtscp is a serializing call. It prevents reordering around the call to rdtscp.
__asm__ __volatile__("rdtscp; " // serializing read of tsc
"shl $32,%%rdx; " // shift higher 32 bits stored in rdx up
"or %%rdx,%%rax" // and or onto rax
: "=a"(tsc) // output to tsc variable
:
: "%rcx", "%rdx"); // rcx and rdx are clobbered
However, rdtscp is only available on newer CPUs. So in this case we have to use rdtsc. But rdtsc is non-serializing, so using it alone will not prevent the CPU from reordering it.
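Whether rdtscp is available can be checked at run time rather than guessed from the CPU's age. The following is a minimal sketch, assuming GCC or Clang and their <cpuid.h> header; has_rdtscp is a hypothetical helper name introduced only for this illustration.

```c
#include <cpuid.h>   /* GCC/Clang header that provides __get_cpuid */

/* Hypothetical helper: returns 1 if the CPU reports RDTSCP support,
 * 0 otherwise. */
static int has_rdtscp(void)
{
    unsigned int eax, ebx, ecx, edx;

    /* __get_cpuid returns 0 when the requested leaf is not supported. */
    if (!__get_cpuid(0x80000001u, &eax, &ebx, &ecx, &edx))
        return 0;

    /* CPUID.80000001H:EDX bit 27 is the RDTSCP feature flag. */
    return (int)((edx >> 27) & 1u);
}
```

A caller could use the result to choose between the rdtscp path and one of the rdtsc-based fallbacks.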
So we can use either of these two options to prevent reordering:
2: This is a call to cpuid and then rdtsc. cpuid is a serializing call.
volatile int dont_remove __attribute__((unused)); // volatile to stop optimizing
unsigned tmp;
__cpuid(0, tmp, tmp, tmp, tmp); // cpuid is a serialising call
dont_remove = tmp; // prevent optimizing out cpuid
__asm__ __volatile__("rdtsc; " // read of tsc
"shl $32,%%rdx; " // shift higher 32 bits stored in rdx up
"or %%rdx,%%rax" // and or onto rax
: "=a"(tsc) // output to tsc
:
: "%rdx"); // rdx is clobbered (rdtsc does not modify rcx)
3: This is a call to rdtsc with "memory" in the clobber list, which prevents reordering.
__asm__ __volatile__("rdtsc; " // read of tsc
"shl $32,%%rdx; " // shift higher 32 bits stored in rdx up
"or %%rdx,%%rax" // and or onto rax
: "=a"(tsc) // output to tsc
:
: "%rdx", "memory"); // rdx is clobbered (rdtsc does not modify rcx)
// "memory" to prevent reordering
My understanding for the 3rd option is as follows:
- Making the call __volatile__ prevents the optimizer from removing the asm or moving it across any instructions that could need the results (or change the inputs) of the asm. However, it could still move it with respect to unrelated operations. So __volatile__ is not enough.
- Tell the compiler memory is being clobbered: : "memory"). The "memory" clobber means that GCC cannot make any assumptions about memory contents remaining the same across the asm, and thus will not reorder around it.
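As a rough illustration of that compiler-level effect, here is a minimal sketch, assuming GCC or Clang on x86-64. The names compiler_barrier, read_tsc and time_fill are made up for this example. Nothing in it is a serializing instruction, so it only constrains the code the compiler emits.

```c
#include <stdint.h>

/* Hypothetical helper: a compiler-only barrier. The empty asm template
 * emits no machine instruction; the "memory" clobber tells the compiler
 * that memory may have changed, so it will not move or cache memory
 * accesses across this point. */
static inline void compiler_barrier(void)
{
    __asm__ __volatile__("" ::: "memory");
}

/* Hypothetical helper: plain rdtsc, combining edx:eax into 64 bits. */
static inline uint64_t read_tsc(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi) : : "memory");
    return ((uint64_t)hi << 32) | lo;
}

/* Fills buf with value and returns the tsc difference around the loop.
 * The barrier keeps the compiler from sinking the stores below the
 * second read; it says nothing about what the CPU does at run time. */
static uint64_t time_fill(int *buf, int n, int value)
{
    uint64_t start = read_tsc();
    for (int i = 0; i < n; i++)
        buf[i] = value;
    compiler_barrier();
    uint64_t end = read_tsc();
    return end - start;
}
```

Compiling this with -S and inspecting the output is one way to see that the barrier itself contributes no instruction.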
So my questions are:
- 1: Is my understanding of __volatile__ and "memory" correct?
- 2: Do the second two calls do the same thing?
- 3: Using "memory" looks much simpler than using another serializing instruction. Why would anyone use the 2nd option over the 3rd option?
[...] volatile and memory, and reordering of instructions executed by the processor (aka out-of-order execution), which you avoid by using cpuid. – Prince

[...] memory in the clobber list prevent the processor reordering the instructions? Doesn't memory act like a memory fence? – Illegality

[...] memory in the clobber list is only emitted to gcc, and the resulting machine code doesn't expose this to the processor? – Illegality

[...] movntdq. Most of the time you do not need a memory fence on Intel/AMD processors, as these processors have strong memory ordering by default. And yes, memory only affects the order in which instructions are emitted by the compiler, it does not make the compiler emit additional instructions. – Prince

rdtscp doesn't prevent reordering, it only ensures all previous instructions have finished executing: "The RDTSCP instruction waits until all previous instructions have been executed before reading the counter. However, subsequent instructions may begin execution before the read operation is performed." I suggest you read this whitepaper from Intel if you are considering using this for benchmarking etc.: download.intel.com/embedded/software/IA/324264.pdf (it actually shows that you need both rdtsc + cpuid and rdtscp + cpuid for correct measurements) – Mcnalley
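The pairing described in that last comment (cpuid followed by rdtsc at the start of the timed region, rdtscp followed by cpuid at the end) can be sketched roughly as follows. This is an illustration under assumptions, not the whitepaper's exact listing: it assumes GCC or Clang on x86-64 and a CPU that supports rdtscp, and tsc_begin and tsc_end are hypothetical names.

```c
#include <stdint.h>

/* Start of the timed region: cpuid serializes, then rdtsc reads the
 * counter. cpuid overwrites eax, ebx, ecx and edx, hence the clobbers. */
static inline uint64_t tsc_begin(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__(
        "xor %%eax, %%eax\n\t"   /* select cpuid leaf 0 */
        "cpuid\n\t"              /* serializing instruction */
        "rdtsc\n\t"              /* counter in edx:eax */
        "mov %%edx, %0\n\t"
        "mov %%eax, %1\n\t"
        : "=r"(hi), "=r"(lo)
        :
        : "%rax", "%rbx", "%rcx", "%rdx", "memory");
    return ((uint64_t)hi << 32) | lo;
}

/* End of the timed region: rdtscp waits for earlier instructions to
 * execute before reading the counter; the trailing cpuid keeps later
 * instructions from starting before the read. */
static inline uint64_t tsc_end(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__(
        "rdtscp\n\t"             /* counter in edx:eax, ecx overwritten */
        "mov %%edx, %0\n\t"
        "mov %%eax, %1\n\t"
        "xor %%eax, %%eax\n\t"   /* select cpuid leaf 0 */
        "cpuid\n\t"              /* serializing instruction */
        : "=r"(hi), "=r"(lo)
        :
        : "%rax", "%rbx", "%rcx", "%rdx", "memory");
    return ((uint64_t)hi << 32) | lo;
}
```

The code under test would sit between a call to tsc_begin() and a call to tsc_end(), and the difference of the two return values gives the elapsed tick count, including the overhead of the measurement itself.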