rdtsc, too many cycles

Asked 30/11, 2011 at 8:1 Answered 31/10, 2012 at 5:5

#include <stdio.h>
static inline unsigned long long tick() 
{
        unsigned long long d;
        __asm__ __volatile__ ("rdtsc" : "=A" (d) );
        return d;
}

int main()
{
        long long res;
        res=tick();

        res=tick()-res;
        printf("%d",res);
        return 0;
}

I have compiled this code with gcc with -O0 -O1 -O2 -O3 optimizations. And I always get 2000-2500 cycles. Can anyone explain the reason for this output? How to spend these cycles?

~~First function "tick" is wrong. This is right~~.

Another version of function "tick"

static __inline__ unsigned long long tick()
{
  unsigned hi, lo;
  __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
  return ( (unsigned long long)lo)|( ((unsigned long long)hi)<<32 );
}

This is assembly code for -O3

 .file  "rdtsc.c"
.section    .rodata.str1.1,"aMS",@progbits,1
.LC0:
    .string "%d"
    .text
    .p2align 4,,15
.globl main
    .type   main, @function
main:
    leal    4(%esp), %ecx
    andl    $-16, %esp
    pushl   -4(%ecx)
    pushl   %ebp
    movl    %esp, %ebp
    subl    $40, %esp
    movl    %ecx, -16(%ebp)
    movl    %ebx, -12(%ebp)
    movl    %esi, -8(%ebp)
    movl    %edi, -4(%ebp)
#APP
# 6 "rdtsc.c" 1
    rdtsc
# 0 "" 2
#NO_APP
    movl    %edx, %edi
    movl    %eax, %esi
#APP
# 6 "rdtsc.c" 1
    rdtsc
# 0 "" 2
#NO_APP
    movl    %eax, %ecx
    movl    %edx, %ebx
    subl    %esi, %ecx
    sbbl    %edi, %ebx
    movl    %ecx, 4(%esp)
    movl    %ebx, 8(%esp)
    movl    $.LC0, (%esp)
    call    printf
    movl    -16(%ebp), %ecx
    xorl    %eax, %eax
    movl    -12(%ebp), %ebx
    movl    -8(%ebp), %esi
    movl    -4(%ebp), %edi
    movl    %ebp, %esp
    popl    %ebp
    leal    -4(%ecx), %esp
    ret
    .size   main, .-main
    .ident  "GCC: (Debian 4.3.2-1.1) 4.3.2"
    .section    .note.GNU-stack,"",@progbits

This is CPU

processor   : 0
vendor_id   : GenuineIntel
cpu family  : 15
model       : 4
model name  : Intel(R) Xeon(TM) CPU 3.00GHz
stepping    : 3
cpu MHz     : 3000.105
cache size  : 2048 KB
fdiv_bug    : no
hlt_bug     : no
f00f_bug    : no
coma_bug    : no
fpu     : yes
fpu_exception   : yes
cpuid level : 5
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss constant_tsc up pebs bts pni
bogomips    : 6036.62
clflush size    : 64

Unwarrantable answered 30/11, 2011 at 8:1 Comment(9)

Much implies you cant count it. Use many instead. – Cecelia 30/11, 2011 at 8:3

rdtsc writes to %edx:%eax`. Your tick() function is wrong. – Papillote 30/11, 2011 at 8:12

with second function the same output – Unwarrantable 30/11, 2011 at 8:23

of course, using =A will return lower 32 bits on x86_64 only, and lower 32 bits are enough for this test anyway – Sheaves 30/11, 2011 at 8:29

Make sure you understand why rdtsc is not a reliable timer. – Basting 30/11, 2011 at 8:44

@Thanatos: Surely you can help the OP on why you say this. – Cecelia 30/11, 2011 at 9:23

When I run your code, I almost always get 42. Douglas Adams would be happy. :-) – Aberrant 30/11, 2011 at 10:16

What OS are you using for this test (distro and kernel version)? – Aberrant 30/11, 2011 at 10:18

Linux version 2.6.26-2-686 (Debian 2.6.26-26lenny1) ([email protected]) (gcc version 4.1.3 20080704 (prerelease) (Debian 4.1.2-25)) #1 SMP Thu Nov 25 01:53:57 UTC 2010 – Unwarrantable 30/11, 2011 at 11:25

I've tried your code on several Linux distros running on different Intel CPUs (admittedly all more recent than the Pentium 4 HT 630 you appear to be using). In all those tests I got values between 25 and 50 cycles.

My only hypothesis that's consistent with all the evidence is that you're running your operating system inside a virtual machine rather than on bare metal, and TSC is getting virtualized.

Aberrant answered 30/11, 2011 at 10:50 Comment(0)

There are any number of reasons to get a large number:

The OS did a context switch, and your process got put to sleep.
A disk seek occurred, and your process got put to sleep.
…any of a slew of reasons as to why your process might get ignored.

Note that rdtsc is not particularly reliable for timing without work, because:

Processor speeds can change, and thus, the length of a cycle (when measured in seconds) changes.
Different processors may have different values for the TSC for a given instant in time.

Most operatings systems have a high-precision clock or timing method. clock_gettime on Linux for example, particularly the monotonic clocks. (Understand too the difference between a wall-clock and a monotonic clock: a wall clock can move backwards — even in UTC.) On Windows, I think the recommendation is QueryHighPerformanceCounter. Typically these clocks provide more than enough accuracy for most needs.

Also, looking at the assembly, it looks like you're only getting 32-bits of the answer: I don't see %edx getting saved after rdtsc.

Running your code, I get timings from 120-150 ns for clock_gettime using CLOCK_MONOTONIC, and 70-90 cycles for rdtsc (~20 ns at full speed, but I suspect the processor is clocked down, and that's really about 50 ns). (On a ~~laptop~~desktop (darn SSH, forgot which machine I was on!) that is at about a constant 20% CPU use) Sure your machine isn't bogged down?

Basting answered 30/11, 2011 at 9:31 Comment(8)

struct timespec ts1,ts2; clock_gettime(CLOCK_MONOTONIC,&ts1); clock_gettime(CLOCK_MONOTONIC,&ts2); ts2.tv_nsec-ts1.tv_nsec ~8000 it is too large. – Unwarrantable 30/11, 2011 at 9:44

=A copies values both from edx and eax. – Sheaves 30/11, 2011 at 9:46

@mifki: If you say so: perhaps OP is mis-compiling. Do you see %edx getting saved in the assembly? – Basting 30/11, 2011 at 9:48

@Basting again: if compiled for 32 bits, =A constraint will copy both eax and edx to specified location. – Sheaves 30/11, 2011 at 9:59

@eXXXXXXXXXXX: I get diffs in the ~130 range, on a Intel Core i7. – Basting 30/11, 2011 at 9:59

@mifki: Read the assembly that the OP posted. Previously, it had not included anything using %edx. – Basting 30/11, 2011 at 10:1

@Basting hard to tell now, maybe just wrong code pasted. =A surely produces correct assembly with both registers – Sheaves 30/11, 2011 at 10:12

Actually, on a modern Linux kernel clock_gettime is executed in user space (it's a VDSO rather than a proper syscall), so it's very efficient. The main reason why the plain rdtsc is faster is that it doesn't include the barriers which are necessary to prevent the processor from hoisting the rdtsc ahead of the instructions one is trying to time. – Opprobrium 30/11, 2011 at 11:38

It looks like your OS disabled execution of RDTSC in user space. And your application has to switch to kernel and back, which takes a lot of cycles.

This is from the Intel Software Developer’s Manual:

When in protected or virtual 8086 mode, the time stamp disable (TSD) flag in register CR4 restricts the use of the RDTSC instruction as follows. When the TSD flag is clear, the RDTSC instruction can be executed at any privilege level; when the flag is set, the instruction can only be executed at privilege level 0. (When in real-address mode, the RDTSC instruction is always enabled.)

Edit:

Answering aix's comment, I explain, why TSD is most likely the reason here.

I know only these possibilities for a program to perform a single instruction longer than usual:

Running under some emulator,
using self-modified code,
context switch,
kernel switch.

First 2 reasons cannot usually delay execution for more than a few hundred cycles. 2000-2500 cycles are more typical for context/kernel switch. But it is practically impossible to catch a context switch several times on the same place. So it should be kernel switch. Which means that either program is running under debugger or RDTSC is not allowed in user mode.

The most likely reason for OS to disable RDTSC may be security. There were attempts to use RDTSC to crack encryption programs.

Kokanee answered 30/11, 2011 at 9:46 Comment(15)

I didn't vote on the answer either way, but I think it would be improved if you stated why you thought TSD might be the culprit. – Aberrant 30/11, 2011 at 10:26

@aix I have explained this. Thanks. – Kokanee 30/11, 2011 at 10:53

@Evgeny Kluev and for clock_gettime similar explanations? I got 8000 nsec with using CLOCK_MONOTONIC id. It is huge=/ – Unwarrantable 30/11, 2011 at 11:4

@eXXXXXXXXXXX clock_gettime is most likely performed by the kernel, so at least 2000 cycles is required. May be more. But 8000 nsec seems too high - I cannot explain it. – Kokanee 30/11, 2011 at 11:13

@Evgeny Kluev i.e. I can't in this case to measure e.g. interval of 100 cycles. Probably there are ways to avoid this problem? – Unwarrantable 30/11, 2011 at 11:22

@eXXXXXXXXXXX It may be related to resolution (precision) of clock_gettime, but then you should see 8000 nsec intervals intermixed with zero intervals. – Kokanee 30/11, 2011 at 11:24

@Evgeny Kluev Run clock_gettime ~50 times results are 8000-11000 and 2 times was 2000 and 3000. This is strange. CPU was not loaded – Unwarrantable 30/11, 2011 at 11:29

@eXXXXXXXXXXX Probably the only way to measure short intervals is to convince your sysadmin to allow RDTSC (because system has only one core and therefore there is no security risk on this system). – Kokanee 30/11, 2011 at 11:32

@eXXXXXXXXXXX resolution is definitely affecting the results, still I can't explain why average interval is so high. – Kokanee 30/11, 2011 at 11:34

@EvgenyKluev don't you think that user-mode process will just be killed if it tries to execute instruction it's not allowed to? – Sheaves 30/11, 2011 at 11:46

@mifki Not necessary. In this particular case system may safely perform it in kernel mode. Hacker's program cannot guess what's happening on other hyperthreading thread if it is given such an unprecise timer. – Kokanee 30/11, 2011 at 11:59

@EvgenyKluev According to this test program included with kernel - mjmwired.net/kernel/Documentation/prctl/disable-tsc-test.c (and defines PR_TSC_ENABLE PR_TSC_SIGSEGV in prctl.h) - there will be SIGSEGV. – Sheaves 30/11, 2011 at 12:4

@mifki May be it gives SIGSEGV just for testing and performs it in kernel mode in real life... I don't know. It would be reasonable not to SIGSEGV in production code. Otherwise many useful applications will be ... unuseful :) – Kokanee 30/11, 2011 at 12:19

@EvgenyKluev what do you mean?? Kernel is the same always, it can't kill app in one case and do not in another. Of course there's some possibility that this program and names of defines are outdated and something changed.. Unfortunately I don't have non-virtualized linux box currently to try myself but I will try to find one. – Sheaves 30/11, 2011 at 12:45

@mifki I don't know the details. Never tried any security features. I just say what may be possible behavior. I may be wrong. – Kokanee 30/11, 2011 at 12:57

Instruction cache miss? (this is my guess)

Also, possibly,

Switch to hypervisor in a virtualized system? Remnants of program bootstrap (including network activity on same CPU)?

To Thanatos: On systems more recent than 2008, rdtsc() is a wall clock and does not vary with frequency steps.

Can you try this little code?

int main()
{   
    long long res;

    fflush(stdout);           // chnage the exact timing of stdout, in case there is something to write in a ssh connection, together with its interrupts

    for (int pass = 0; pass < 2; pass++)
    {
    res=tick();
    res=tick()-res;
    }
    printf("%d",res);     // ignore result on first pass, display the result on second pass.
    return 0;
}

Waddell answered 31/10, 2012 at 5:5 Comment(1)

This technically isn't an answer, it's more follow up engagement. However, you don't have enough rep to comment and it looks like you're actively engaging the OP (and seem to have the knowledge to answer this), so I'm not removing this post. If you don't get a reply from the OP, please turn this into an answer that simply assumes what you stated - or it will likely be removed. – Polash 31/10, 2012 at 5:29

Just an idea - maybe these two rdtsc instructions are executed on different cores? rdtsc values may slightly vary across cores.

Sheaves answered 30/11, 2011 at 9:36 Comment(1)

only core, judging by /sys/devices/system/cpu/ – Unwarrantable 30/11, 2011 at 9:55

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Recommended topics

Hot tags