Your issue lies in the variable cost of the `jne` instruction.
First of all, to understand the impact of this effect, we need to analyze the full loop itself. The architecture of the Ryzen 9 3900X is Zen 2. We can retrieve information about it from the AMD website or from WikiChip.
This architecture has 4 ALUs and 3 AGUs. This roughly means that it can execute up to 4 instructions like add/and/cmp per cycle.
Here is the cost of each instruction of the loop (based on Agner Fog's instruction tables for Zen 1):
```asm
                                   # Reciprocal throughput
loop0:  add  rax, 1                # 0.25
        mov  rdx, rax              # 0.2
        and  rdx, 1                # 0.25
        add  rcx, rdx              # 0.25
        cmp  rcx, 1000000000       # 0.25 | Fused: 0.5-2 (2 if jumping)
        jne  loop0                 # 0.5-2 |
```
As you can see, the first 4 computing instructions of the loop can be executed in ~1 cycle. The last 2 instructions can be fused by your processor into a single, faster one.
Your main issue is that this last `jne` instruction can be quite slow compared to the rest of the loop, so you are very likely to be measuring only the overhead of this instruction. From this point on, things get complicated.
Engineers and researchers have worked hard over the last decades to reduce the cost of such instructions at (almost) any price. Nowadays, processors (like the Ryzen 9 3900X) use out-of-order instruction scheduling to execute the instructions the `jne` depends on as soon as possible. Most processors can also predict the address of the next instruction to execute after the `jne` and fetch new instructions (e.g. the ones of the next loop iteration) while the other instructions of the current loop iteration are being executed.
Performing the fetch as early as possible is important to prevent any stall in the execution pipeline of the processor (especially in your case).
From the AMD document "Software Optimization Guide for AMD Family 17h Models 30h and Greater Processors", we can read:
Thus, a conditional branch to instructions located in another cache line introduces an additional latency overhead, due to an op cache fetch (the op cache is an instruction cache faster than the L1) that is not required if the whole loop fits in 1 cache line. Indeed, if a loop crosses a cache line, 2 cache-line fetches are required, which takes no less than 2 cycles. If the whole loop fits in one cache line, only 1 cache-line fetch is needed, which takes only 1 cycle. As a result, since your loop iterations are very fast, paying 1 additional cycle introduces a significant slowdown. But how much?
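You can check whether this happens in your binary by disassembling it (e.g. with `objdump -d`) and looking at the addresses of the loop: if its 6 instructions (roughly 23 bytes) span a 64-byte boundary, the extra fetch applies. One way to avoid it is to align the loop entry. Below is a minimal sketch using a GAS-style `.p2align` directive with the same operand order as the listing above; the 32-byte alignment value is my assumption, chosen so that a body smaller than 32 bytes cannot cross a 64-byte line:

```asm
        .p2align 5                 # align the loop entry to 32 bytes: a
                                   # <32-byte body starting there cannot
                                   # cross a 64-byte instruction cache line
loop0:  add  rax, 1
        mov  rdx, rax
        and  rdx, 1
        add  rcx, rdx
        cmp  rcx, 1000000000
        jne  loop0
```

From C, GCC's `-falign-loops=32` option asks the compiler to apply a similar alignment for you.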
You say that the program takes about 450 ms.
Since the Ryzen 9 3900X turbo frequency is 4.6 GHz and your loop is executed 2e9 times, the number of cycles per loop iteration is about 1.035 (0.450 s × 4.6e9 Hz ≈ 2.07e9 cycles, divided by 2e9 iterations). Note that this is better than what we could expect from the table above (I guess this processor is able to eliminate the `mov` through register renaming and to execute the `jne` in parallel in only 1 cycle, while the other instructions of the loop are perfectly pipelined; which is astounding). This also shows that paying an additional fetch overhead of 1 cycle roughly doubles the number of cycles needed to execute each loop iteration, and thus the overall execution time.
If you do not want to pay this overhead, consider unrolling your loop to significantly reduce the number of conditional branches and non-sequential fetches.
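For illustration, here is a rough sketch of an 8x-unrolled version of the loop (hand-written, using the GAS `.rept` directive rather than actual compiler output). It relies on the fact that `rcx` grows by exactly 4 per unrolled iteration and that 1000000000 is a multiple of 4, so the exit test still triggers exactly:

```asm
loop0:
        .rept 8                    # repeat the body 8 times: only 1 cmp+jne
        add  rax, 1                # (and at most 1 non-sequential fetch) is
        mov  rdx, rax              # paid per 8 iterations of the original loop
        and  rdx, 1
        add  rcx, rdx
        .endr
        cmp  rcx, 1000000000       # rcx increases by 4 per unrolled iteration
        jne  loop0
```

In practice, you would rather unroll the source loop by hand or let the compiler do it (e.g. with GCC's `-funroll-loops`).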
This problem can occur on other architectures, such as Intel Skylake. Indeed, the same loop on an i5-9600KF takes 0.70 s with loop alignment and 0.90 s without (also due to the additional 1-cycle fetch latency). With an 8x unrolling, the result is 0.53 s (whatever the alignment).