As a follow-up to my question The advantages of using 32bit registers/instructions in x86-64, I started to measure the costs of instructions. I'm aware that this has been done multiple times (e.g. Agner Fog), but I'm doing it for fun and self-education.
My testing code is pretty simple (shown here as pseudo code for simplicity; in reality it is written in assembler):
```
for(outer_loop=0; outer_loop<NO; outer_loop++){
    operation #1
    operation #2
    ...
    operation #NI
}
```
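For concreteness, a loop like the one above can be produced by a small generator script. This is a hypothetical helper (not the script actually used for the measurements); it emits GAS-syntax assembly with the inner operation unrolled `NI` times:

```python
def make_test_asm(op, ni, no=1000):
    """Emit a GAS-syntax test loop: `no` outer iterations, each
    containing `ni` copies of the instruction string `op`.
    (Hypothetical helper, not the author's actual benchmark script.)"""
    body = "\n".join("    " + op for _ in range(ni))
    # %r15 is used as the counter so the measured ops can clobber %rax
    return (f"    movq ${no}, %r15          # outer loop counter\n"
            ".Louter:\n"
            f"{body}\n"
            "    decq %r15\n"
            "    jnz .Louter\n")

asm = make_test_asm("xorl %eax, %eax", ni=4)
```

The generated text can then be fed to `as`/`gcc` and timed with `perf stat` or similar.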
Yet some things should be considered:
- If the inner part of the loop is large (`NI > 10^7`), the whole content of the loop does not fit into the instruction cache and thus must be loaded over and over again, so the speed of RAM defines the execution time. For example, for large inner parts, `xorl %eax, %eax` (2 bytes) is 33% faster than `xorq %rax, %rax` (3 bytes).
- If `NI` is small and the whole loop fits easily into the instruction cache, then `xorl %eax, %eax` and `xorq %rax, %rax` are equally fast and can be executed 4 times per clock cycle.
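The size difference behind that 33% figure comes from the REX.W prefix: as I understand the encodings, `xorl %eax, %eax` assembles to `31 c0` and `xorq %rax, %rax` to `48 31 c0`. A quick sketch of the arithmetic:

```python
# Machine-code encodings (from the Intel manuals; verify with your assembler):
XORL_EAX = bytes.fromhex("31c0")    # xorl %eax, %eax
XORQ_RAX = bytes.fromhex("4831c0")  # xorq %rax, %rax (REX.W prefix + same opcode)

# xorl needs 2/3 of the bytes of xorq, i.e. 33% fewer bytes per instruction,
# matching the measured speedup when fetch bandwidth is the bottleneck.
savings = 1 - len(XORL_EAX) / len(XORQ_RAX)
```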
However, this simple model does not hold water for the `jmp` instruction. For the `jmp` instruction my test code looks as follows:
```
for(outer_loop=0; outer_loop<NO; outer_loop++){
     jmp .L0
.L0: jmp .L1
.L1: jmp .L2
    ....
}
```
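Such a chain can be generated mechanically. A hypothetical sketch: each `jmp` targets the immediately following label, so every instruction is a taken branch with its own distinct address:

```python
def make_jmp_chain(ni):
    """Generate a chain of `ni` jmp instructions, each jumping to the
    very next label (hypothetical generator for the loop body above)."""
    lines = [f".L{i}: jmp .L{i + 1}" for i in range(ni)]
    lines.append(f".L{ni}:")  # final label closes the chain
    return "\n".join(lines)

chain = make_jmp_chain(3)
```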
And the results are:
- For "large" loop sizes (already for `NI > 10^4`) I measure 4.2 ns per `jmp` instruction (which would equate to 42 bytes loaded from RAM, or ca. 12 clock cycles on my machine).
- For small loop sizes (`NI < 10^3`) I measure 1 ns per `jmp` instruction (around 3 clock cycles, which sounds plausible; Agner Fog's tables show a cost of 2 clock cycles).
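The ns-to-cycles conversion above assumes a clock of roughly 3 GHz (my inference from the "1 ns ≈ 3 cycles" figure, not a stated spec); the arithmetic is just:

```python
def ns_to_cycles(ns_per_insn, clock_ghz=3.0):
    # cycles = seconds * Hz = (ns * 1e-9) * (GHz * 1e9) = ns * GHz
    return ns_per_insn * clock_ghz

small_loop = ns_to_cycles(1.0)  # ~3 cycles per jmp
large_loop = ns_to_cycles(4.2)  # ~12.6 cycles per jmp
```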
The instruction `jmp .LX` uses the 2-byte `eb 00` encoding.
Thus, my question: What could be the explanation for the high cost of the `jmp` instruction in the "large" loops?
PS: If you would like to try it out on your machine, you can download the scripts from here; just run `sh jmp_test.sh` in the src folder.
Edit: Experimental results confirming Peter's BTB size theory.
The following table shows cycles per instruction for different `NI` values (relative to `NI` = 1000):
| operations \ NI      | 1000 | 2000 | 3000 | 4000 | 5000 | 10000 |
|----------------------|------|------|------|------|------|-------|
| jmp                  | 1.0  | 1.0  | 1.0  | 1.2  | 1.9  | 3.8   |
| jmp+xor              | 1.0  | 1.2  | 1.3  | 1.6  | 2.8  | 5.3   |
| jmp+cmp+je (jump)    | 1.0  | 1.5  | 4.0  | 4.4  | 5.5  | 5.5   |
| jmp+cmp+je (no jump) | 1.0  | 1.2  | 1.3  | 1.5  | 3.8  | 7.6   |
It can be seen:
- For the `jmp` instruction, a (yet unknown) resource becomes scarce, which leads to performance degradation for `NI` larger than about 4000.
- This resource is not shared with instructions such as `xor`: the performance degradation still kicks in at `NI` of about 4000 when `jmp` and `xor` are executed after each other.
- But this resource is shared with `je` if the jump is taken: for `jmp` + `je` after each other, the resource becomes scarce at `NI` of about 2000.
- However, if `je` does not jump at all, the resource becomes scarce again only at `NI` of about 4000 (4th line of the table).
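These observations fit one simple accounting rule, assuming a branch target buffer of 4096 entries in which only taken branches occupy a slot (my reading of the data, not a verified microarchitectural model):

```python
BTB_ENTRIES = 4096  # capacity suggested by the reverse-engineering work cited below

def btb_overflows(taken_branches_in_loop, btb=BTB_ENTRIES):
    """True once the distinct taken-branch addresses in the inner loop
    no longer fit in the BTB (simplified accounting: one entry per
    taken branch, none for not-taken branches or non-branch insns)."""
    return taken_branches_in_loop > btb

# jmp-only loop of size NI: NI taken branches -> overflow just above 4096.
# Pairing each jmp with a taken je doubles the taken branches, roughly
#   halving the NI at which overflow starts (~2000, as measured).
# Not-taken je's and xor's add no entries, so those cases behave like jmp alone.
```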
Matt Godbolt's branch-prediction reverse-engineering articles establish that the branch target buffer capacity is 4096 entries. That is very strong evidence that BTB misses are the reason for the observed throughput difference between small and large `jmp` loops.
`xorq %rax,%rax` does exactly the same thing as `xorl %eax,%eax`, so there is almost never a reason to use the former (except perhaps to avoid having to insert a `nop` for alignment somewhere). – Schoening

For `mov` and `xor` I need to go as far as 10^7 instructions in the loop to see the "RAM speed". However, `jmp` becomes 4 times slower from 10^3 to 10^4. I'm not saying it is because of RAM; it is something different, but I don't quite know what it is. – Broomstick

The `jmp+cmp+je (no jump)` case doesn't hit resource scarcity until around 4,000 jumps because jumps that aren't taken don't consume a BTB entry (indeed, there would be nothing to put in the BTB!). – Aftmost