In the past I have dealt with time-critical software development. The development of these applications basically proceeded as follows: "let's write the code, test latency and jitter, and optimize both until they are in the acceptable range." I find that highly frustrating; it isn't what I call proper engineering and I want to do better.
So I looked into the question: why do we have jitter at all? And the answers, of course, are:
- caching: getting a piece of code or data from main memory takes roughly two orders of magnitude longer than getting the same data from the L1 cache, so the physical execution time depends on what is in the cache. That, in turn, depends on several factors:
  - the code and data layout of the application: we all know the horrifying row- vs. column-major matrix traversal example (see the sketch after this list)
  - the caching strategy of the CPU, including speculative prefetching of cache lines
  - other processes on the same core doing stuff
- branch prediction: the CPU tries to guess which branch of a conditional jump will be taken. Even if the same conditional jump is executed twice, the prediction may differ, so a "pipeline bubble" might form on one execution but not on the other.
- interrupts: asynchronous behavior obviously leads to jitter
- frequency scaling: thankfully disabled in real-time systems
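To make the row- vs. column-major example from the caching bullet concrete, here is a minimal C sketch (the array size N is arbitrary): both functions compute the same sum over the same array, but the first walks memory linearly while the second jumps N * sizeof(double) bytes between consecutive accesses, so their cache behavior, and with it their execution time, differs dramatically.

```c
#include <stddef.h>
#include <stdio.h>

#define N 1024
static double m[N][N];

/* Row-major traversal: consecutive accesses touch adjacent memory,
 * so most of them hit in L1 once a cache line has been fetched. */
static double sum_row_major(void)
{
    double s = 0.0;
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            s += m[i][j];
    return s;
}

/* Column-major traversal of the same array: consecutive accesses are
 * N * sizeof(double) bytes apart, so nearly every access misses in the
 * cache and has to go further down the memory hierarchy. */
static double sum_col_major(void)
{
    double s = 0.0;
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            s += m[i][j];
    return s;
}

int main(void)
{
    printf("%f %f\n", sum_row_major(), sum_col_major());
    return 0;
}
```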
That's a lot of things that can interfere with the behavior of a piece of code. Nonetheless: if I have two instructions that are located on the same cache line, do not depend on any data, and contain no (conditional) jumps, then the jitter from caching and branch prediction should be eliminated and only interrupts should play a role. Right? Well, I wrote a small program that reads the Time Stamp Counter (TSC) twice and writes the difference to stdout, and I executed it on an rt-patched Linux kernel with frequency scaling disabled.
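In essence, the measurement part of that program looks like the following minimal sketch (using GCC inline assembly; the actual binary passes the difference to a small helper called print_rsi, as the disassembly below shows, but the measuring sequence is the same idea):

```c
#include <stdint.h>
#include <stdio.h>

/* rdtsc leaves the low 32 bits of the TSC in eax and the high 32 bits
 * in edx; combine them into a single 64-bit value. */
static inline uint64_t read_tsc(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
    uint64_t first  = read_tsc();
    uint64_t second = read_tsc();
    printf("%llu\n", (unsigned long long)(second - first));
    return 0;
}
```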
The code has glibc-based init and cleanup, and it calls printf, which I presume is sometimes in the cache and sometimes isn't. But between the two "rdtsc" instructions (each writing the TSC into edx:eax), everything should be deterministic across executions of the binary. Just to be sure, I disassembled the ELF file; here is the part with the two rdtsc instructions:
00000000000006b0 <main>:
6b0: 0f 31 rdtsc
6b2: 48 c1 e2 20 shl $0x20,%rdx
6b6: 48 09 d0 or %rdx,%rax
6b9: 48 89 c6 mov %rax,%rsi
6bc: 0f 31 rdtsc
6be: 48 c1 e2 20 shl $0x20,%rdx
6c2: 48 09 d0 or %rdx,%rax
6c5: 48 29 c6 sub %rax,%rsi
6c8: e8 01 00 00 00 callq 6ce <print_rsi>
6cd: c3 retq
No conditional jumps, and the instructions are located on the same cache line (although I am not 100% sure about that - where exactly does the ELF loader put the instructions? Do 64-byte boundaries in the file map to 64-byte boundaries in memory?)... so where is the jitter coming from? If I execute that code 1000 times (via zsh, restarting the program each time), I get values from 12 to 46, with several values in between. As frequency scaling is disabled in my kernel, that leaves interrupts. Now, I am willing to believe that one out of 1000 executions is interrupted. I am not prepared to believe that 90% of them are interrupted (we are talking about ns-scale intervals here! Where would the interrupts come from?!).
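One way to at least check the alignment question would be to have the program print the load address of its own main at run time; a minimal standalone sketch of the idea (casting a function pointer to an integer is not guaranteed by ISO C, but works on Linux):

```c
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* Offset of main within a 64-byte cache line. In the disassembly
     * above the two rdtsc instructions sit at offsets 0x0 and 0xc from
     * the start of main, so this reveals whether they can share a line
     * at run time. */
    uintptr_t addr = (uintptr_t)&main;
    printf("main at %#lx, offset within 64-byte line: %lu\n",
           (unsigned long)addr, (unsigned long)(addr % 64));
    return 0;
}
```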
So my questions are these:
- why is the code not deterministic, i.e., why don't I get the same number with every run?
- is it possible to reason about the running time, at least of this very simple piece of code, at all? Is there at least a bound on the running time that I can guarantee (using engineering principles, not measurement combined with hope)?
- if there isn't, what exactly is the source of the non-deterministic behavior? What component of the CPU (or the rest of the computer?) is rolling the dice here?