Suppose we have several repetitions of the same asm block containing RDTSC, such as:
volatile size_t tick1;
asm ( "rdtsc\n" // Returns the time in EDX:EAX.
"shl $32, %%rdx\n" // Shift the upper bits left.
"or %%rdx, %q0" // 'Or' in the lower bits.
: "=a" (tick1)
:
: "rdx");
this_thread::sleep_for(1s);
volatile size_t tick2;
asm ( "rdtsc\n" // clang's optimizer just thinks this asm yields
"shl $32, %%rdx\n" // the same bits as above, so it just loads
"or %%rdx, %q0" // the result to qword ptr [rsp + 8]
: "=a" (tick2) //
: // mov qword ptr [rsp + 8], rbx
: "rdx");
printf("tick2 - tick1 diff : %zu cycles\n", tick2 - tick1);
printf("CPU Clock Speed : %.2f GHz\n\n", (double) (tick2 - tick1) / 1'000'000'000.);
Clang++'s optimizer (even at `-O1`) assumes those two asm blocks yield the same value:
tick2 - tick1 diff : 0 cycles
CPU Clock Speed : 0.00 GHz
tick1 : bd806adf8b2
this_thread::sleep_for(1s)
tick2 : bd806adf8b2
With Clang's optimizer turned off, the second block yields advancing ticks as expected:
tick2 - tick1 diff : 2900160778 cycles
CPU Clock Speed : 2.90 GHz
tick1 : 14ab6ab3391c
this_thread::sleep_for(1s)
tick2 : 14ac17902a26
At first, GCC's g++ "seems" not to be affected by this:
tick2 - tick1 diff : 2900226898 cycles
CPU Clock Speed : 2.90 GHz
tick1 : 20e40010d8a8
this_thread::sleep_for(1s)
tick2 : 20e4aceecbfa
However, let's add `tick3` with the exact same asm right after `tick2`:
volatile size_t tick1;
asm ( "rdtsc\n" // Returns the time in EDX:EAX.
"shl $32, %%rdx\n" // Shift the upper bits left.
"or %%rdx, %q0" // 'Or' in the lower bits.
: "=a" (tick1)
:
: "rdx");
this_thread::sleep_for(1s);
volatile size_t tick2;
asm ( "rdtsc\n" // clang's optimizer just thinks this asm yields
"shl $32, %%rdx\n" // the same bits as above, so it just loads
"or %%rdx, %q0" // the result to qword ptr [rsp + 8]
: "=a" (tick2) //
: // mov qword ptr [rsp + 8], rbx
: "rdx");
volatile size_t tick3;
asm ( "rdtsc\n"
"shl $32, %%rdx\n"
"or %%rdx, %q0"
: "=a" (tick3)
:
: "rdx");
It turns out that GCC assumes the asm for `tick3` must produce the same value as the asm for `tick2`, because there are "obviously" no external side effects between them, so it just reuses the value produced for `tick2`. That is wrong for RDTSC, but the compiler's reasoning has a very strong point.
tick2 - tick1 diff : 2900209182 cycles
CPU Clock Speed : 2.90 GHz
tick1 : 5670bd15088e
this_thread::sleep_for(1s)
tick2 : 567169f2b6ac
tick3 : 567169f2b6ac
In C mode, the optimizers of both GCC and Clang are affected by this. In other words, even at `-O1` both optimize out the repeated asm blocks containing `rdtsc`:
tick2 - tick1 diff : 0 cycles
CPU Clock Speed : 0.00 GHz
tick1 : 324ab8f5dd2a
thrd_sleep(&(struct timespec){.tv_sec=1}, nullptr)
tick2 : 324ab8f5dd2a
tick3_rdx : 324b65d3368c
It turns out that all optimizers can do common-subexpression elimination on identical non-volatile `asm` statements, so an `asm` statement for RDTSC needs to be `volatile`.
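As a minimal sketch of that fix, assuming an x86-64 target and GCC or Clang, the read can be wrapped in a helper whose asm statement is marked `volatile`. The names `read_tsc` and `ticks_during` are hypothetical and not part of the original code. Splitting the result into two 32-bit outputs also removes the need for the `shl`/`or` pair and the `rdx` clobber.

```cpp
#include <chrono>
#include <cstdint>
#include <thread>

// Hypothetical helper (not in the original post): read the time-stamp counter.
// `volatile` tells the compiler the statement has effects beyond its operands,
// so two identical statements are not merged into a single read.
static inline std::uint64_t read_tsc() {
    std::uint32_t lo, hi;
    asm volatile("rdtsc" : "=a"(lo), "=d"(hi));  // counter is returned in EDX:EAX
    return (static_cast<std::uint64_t>(hi) << 32) | lo;
}

// Hypothetical helper: ticks elapsed across a sleep of the given length.
static inline std::uint64_t ticks_during(std::chrono::milliseconds pause) {
    const std::uint64_t start = read_tsc();
    std::this_thread::sleep_for(pause);
    const std::uint64_t stop = read_tsc();  // a second, separate rdtsc
    return stop - start;
}
```

Calling `ticks_during(std::chrono::milliseconds(1000))` should then give a nonzero difference at `-O1` and above, comparable to the unoptimized runs shown earlier. Note that `volatile` only keeps the statement from being removed or merged; it does not make `rdtsc` a serializing instruction.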
... `volatile`. It's not "a wrong assumption about `rdtsc`", it's an assumption about a non-volatile `asm` statement that happens to contain `rdtsc`. – Hindu
... `asm` statements based, so if you want to use `asm` to read `rdtsc`, you need `asm volatile`. – Hindu