Why is my loop much faster when it is contained in one cache line?
When I run this little assembly program on my Ryzen 9 3900X:

_start:   xor       rax, rax            ; rax = 0 (iteration counter)
          xor       rcx, rcx            ; rcx = 0 (accumulator)
loop0:    add       rax, 1              ; rax += 1
          mov       rdx, rax
          and       rdx, 1              ; rdx = rax & 1 (parity of rax)
          add       rcx, rdx            ; rcx += parity
          cmp       rcx, 1000000000
          jne       loop0               ; loop until rcx == 1e9 (2e9 iterations)

It completes in 450 ms if all the instructions from loop0 up to and including the jne are contained entirely in one cache line. That is, if:

floor((address of loop0)/64) == floor((address of end of jne instruction)/64)

However, if the above condition does not hold, the loop takes 900 ms instead.
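The alignment condition can be checked mechanically. Here is a minimal Python sketch (the addresses and the loop-body size are hypothetical, just for illustration) that tests whether a byte range falls within a single 64-byte cache line:

```python
CACHE_LINE = 64

def same_cache_line(start: int, end_inclusive: int) -> bool:
    """True if all bytes from start to end_inclusive share one 64-byte line."""
    return start // CACHE_LINE == end_inclusive // CACHE_LINE

# Hypothetical addresses: a 14-byte loop body starting at 0x1000 fits
# entirely in one line, but the same body starting at 0x103B crosses
# into the next 64-byte line.
print(same_cache_line(0x1000, 0x1000 + 13))  # True
print(same_cache_line(0x103B, 0x103B + 13))  # False
```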

I've made a repo with the code https://github.com/avl/strange_performance_repro .

Why is the inner loop much slower in some specific cases?

Edit: Removed a claim with a conclusion from a mistake in testing.

Caftan answered 2/3, 2020 at 22:13 Comment(0)

Your issue lies in the variable cost of the jne instruction.

First of all, to understand the impact of the effect, we need to analyze the full loop itself. The Ryzen 9 3900X implements the Zen 2 architecture. We can retrieve information about it on the AMD website or on WikiChip. This architecture has 4 ALUs and 3 AGUs, which roughly means it can execute up to 4 instructions like add/and/cmp per cycle.

Here is the cost of each instruction of the loop (based on the Agner Fog instruction table for Zen1):

                                        # Reciprocal throughput
loop0:    add       rax, 1              # 0.25
          mov       rdx, rax            # 0.2
          and       rdx, 1              # 0.25
          add       rcx, rdx            # 0.25
          cmp       rcx, 1000000000     # 0.25  | Fused  0.5-2 (2 if jumping)
          jne       loop0               # 0.5-2 |

As you can see, the first 4 computing instructions of the loop can execute in ~1 cycle. The last 2 instructions can be fused by your processor into a single faster macro-op. Your main issue is that this jne instruction can be quite slow compared to the rest of the loop, so you are very likely measuring only the overhead of this instruction. From this point, things start to get complicated.
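Under the assumption that the mov is eliminated at rename and that cmp/jne fuse into one macro-op (a back-of-the-envelope sketch, not a precise pipeline model), the best-case cycles per iteration is bounded by the ALU-uop count divided by the number of ALUs:

```python
# Zen 2 resources (per WikiChip): 4 integer ALUs.
NUM_ALUS = 4

# Per-iteration uops under the assumed optimizations:
#   add rax,1 / and rdx,1 / add rcx,rdx -> 3 ALU uops
#   cmp + jne fused                     -> 1 uop
#   mov rdx,rax eliminated at rename    -> 0 executed uops
alu_uops = 3 + 1

lower_bound_cycles = alu_uops / NUM_ALUS
print(lower_bound_cycles)  # 1.0 cycle per iteration at best
```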

Engineers and researchers have worked hard over the last decades to reduce the cost of such instructions at (almost) any price. Nowadays, processors (like the Ryzen 9 3900X) use out-of-order instruction scheduling to execute the instructions the jne depends on as soon as possible. Most processors can also predict the address of the next instruction to execute after the jne and fetch new instructions (e.g. those of the next loop iteration) while the other instructions of the current iteration are being executed. Performing the fetch as soon as possible is important to prevent any stall in the execution pipeline of the processor (especially in your case).

From the AMD document "Software Optimization Guide for AMD Family 17h Models 30h and Greater Processors", we can read:

  • 2.8.3 Loop Alignment:

    For the processor loop alignment is not usually a significant issue. However, for hot loops, some further knowledge of trade-offs can be helpful. Since the processor can read an aligned 64-byte fetch block every cycle, aligning the end of the loop to the last byte of a 64-byte cache line is the best thing to do, if possible.

  • 2.8.1.1 Next Address Logic

    The next-address logic determines addresses for instruction fetch. [...]. When branches are identified, the next-address logic is redirected by the branch target and branch direction prediction hardware to generate a non-sequential fetch block address. The processor facilities that are designed to predict the next instruction to be executed following a branch are detailed in the following sections.

Thus, a conditional branch to instructions located in another cache line introduces additional latency due to an extra Op Cache fetch (the Op Cache is an instruction cache faster than the L1) that is not required when the whole loop fits in one cache line. Indeed, if a loop crosses a cache-line boundary, 2 cache-line fetches are required, which takes no less than 2 cycles. If the whole loop fits in one cache line, a single fetch is needed, which takes only 1 cycle. As a result, since your loop iterations are very fast, paying 1 additional cycle introduces a significant slowdown. But how much?

You say that the program takes about 450 ms. Since the Ryzen 9 3900X turbo frequency is 4.6 GHz and your loop is executed 2e9 times, the number of cycles per loop iteration is about 1.035. Note that this is better than what we could expect beforehand (I guess this processor is able to rename registers, eliminate the mov instruction, and execute the jne instruction in parallel in only 1 cycle while the other instructions of the loop are perfectly pipelined; which is astounding). This also shows that paying an additional fetch overhead of 1 cycle roughly doubles the number of cycles needed to execute each loop iteration, and so the overall execution time.
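The arithmetic, assuming the 4.6 GHz turbo clock is sustained for the whole run:

```python
freq_hz = 4.6e9     # Ryzen 9 3900X boost clock (assumed sustained)
time_s = 0.450      # measured fast-case run time
iterations = 2e9    # rcx reaches 1e9 after 2e9 increments of rax

cycles_per_iter = time_s * freq_hz / iterations
print(round(cycles_per_iter, 3))  # 1.035

# Adding one extra fetch cycle per iteration predicts the slow case:
print(round((cycles_per_iter + 1) * iterations / freq_hz, 3))  # ~0.885 s
```

The predicted slow-case time of ~0.885 s is close to the ~900 ms you measured.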

If you do not want to pay this overhead, consider unrolling your loop to significantly reduce the number of conditional branches and non-sequential fetches.

This problem can occur on other architectures, such as Intel Skylake. Indeed, the same loop on an i5-9600KF takes 0.70 s with loop alignment and 0.90 s without (also due to the additional 1-cycle fetch latency). With 8x unrolling, the result is 0.53 s (whatever the alignment).
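As a sanity check (assuming the i5-9600KF sustains its 4.6 GHz maximum turbo, which is an assumption on my part), those timings translate to cycles per iteration:

```python
freq_hz = 4.6e9   # assumed i5-9600KF single-core max turbo
iters = 2e9       # same 2e9 loop iterations as on Zen 2

for label, t in [("aligned", 0.70), ("unaligned", 0.90), ("8x unrolled", 0.53)]:
    print(label, round(t * freq_hz / iters, 2))
# The unaligned case lands near 2 cycles/iter, consistent with a tiny
# loop needing two uop-cache fetches per iteration on Skylake-derived CPUs.
```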

Armington answered 8/3, 2020 at 13:27 Comment(13)
Oh man, thank you so much for this excellent answer! It makes perfect sense. I agree that the Ryzen 9 performance in this particular loop (when it is aligned) is extremely impressive. Doesn't each instruction have a data-dependency on the instruction before it? How on earth can it possibly run quicker than 1 cycle per instruction unless instructions are somehow merged? The reciprocal throughput of 0.25 normally only holds when data being operated on is already available, right?Caftan
Today's processors are insanely complex and "clever". They can execute multiple loop iterations speculatively. They can track dependencies between the instructions (of possibly different loop iterations) and execute independent instructions in parallel. Here, the processor succeeds in pipelining different loop iterations. For example, the instructions jne loop0 of iter 1, add rcx, rdx of iter 2, and rdx, 1 of iter 3, and add rax, 1 of iter 4 can be executed in parallel (with the rdx of iter 2 and 3 stored in different internal registers thanks to the renaming done by the processor itself).Alnico
For the last question: all instructions have a latency of at least 1 cycle, and yes, instructions must wait for data to be available. A reciprocal throughput of 0.25 means that 4 instructions of a given type can run simultaneously (working on different registers, possibly renamed beforehand).Alnico
Yes, Zen has mov-elimination for GPRs, not just vector registers. The loop is 5 uops for the front-end, but only needs 4 back-end ALU execution units per iteration. This allows 1/clock throughput, assuming perfect scheduling so the 1-cycle dependency chains don't ever miss a cycle. (Even on CPUs where taken branches in general are worse than 1/clock throughput, loop branches are usually special somehow for tight loops.)Insectile
On Skylake, having a tiny loop split across two 32-byte blocks means its instructions will go in two separate uop-cache lines, making best-case front-end throughput 2 cycles / iter. (Before the microcode update that disabled the loop buffer (LSD), Skylake didn't have this problem; it could just lock down the loop in the IDQ (queue that feeds the issue stage) and replay its uops from there until a mispredict, even if the loop spanned a 64-byte boundary.)Insectile
I wonder if the actual reason in Zen 2 is more related to uop-cache lines than jcc throughput? Performance counters should show if it's a front-end or back-end bottleneck (ROB and/or RS full vs. empty) en.wikichip.org/wiki/amd/microarchitectures/zen_2#Front_End doesn't mention a loop buffer, so it might well be 1 vs. 2 uop-cache fetches.Insectile
@PeterCordes I do not have a Zen 2 processor anymore to make tests (the post is 3 years old now). That being said, I still have my i5-9600KF processor and it still gives similar timings. My microcode version is sig=0x906ec, pf=0x2, revision=0xae. I do not expect to have a recent version (certainly the default one), but it is not clear to me how to get the latest version number for this processor. I am not convinced jcc can be the issue on Zen 2 but it might be on Coffee Lake. Do you know which performance counter can be used to see JCC issues on Coffee Lake?Alnico
@PeterCordes So far, only a few counters show a relevant value with a significant change between the slow and fast version. icache_64b.iftag_hit: it is twice as large in the slow version (1.0/cycle for the slow version and 0.62/cycle for the fast one -- consistent with more cache lines needing to be fetched because the loop spans 2 cache lines). The idq_uops_not_delivered.xxx counters show that the front-end clearly does not send enough uops half the time in the slow version (front-end bound) while it is nearly always fine with the fast version (mainly back-end bound).Alnico
Coffee Lake behaves the same as Skylake: LSD disabled. The JCC erratum shouldn't be a problem unless the jcc touches the last byte of a 32-byte chunk. (To detect that, look for idq.mite_uops being a significant fraction of total uops issued, instead of tiny.) To detect that your uops are coming from two separate cache lines, perhaps idq.all_dsb_cycles_4_uops vs. idq.all_dsb_cycles_any_uops - many cycles with fewer than 4 uops is a problem. (I guess there's buffering between the actual DSB fetch and delivering to the IDQ, since SKL can read all 6 uops in a cycle from a DSB line.)Insectile
On Intel Skylake-derived CPUs, it's a well-known fact that tiny loops split across 2 uop-cache lines (because they span a 32B boundary) are limited to 2c / iter for that reason. The hardware mechanism that was supposed to handle that case (the LSD) was disabled by microcode because of an erratum. So it's not surprising that leaves performance potholes. It's a bit surprising for Zen 2, since you'd guess the architects might have included something to avoid being slow on such loops, perhaps a loop buffer.Insectile
@PeterCordes idq.all_dsb_cycles_4_uops and idq.all_dsb_cycles_any_uops are roughly equal (~2e9) and the same for the two versions. idq.mite_uops is very small compared to cpu-cycles in both versions (1000x smaller: 1e6-2e6 -- though the slow version has ~30% higher values). cpu-cycles is 3.2e9 for the fast version and 4.0e9 for the slow version. What we can conclude is that there are cycles where the IDQ does not send uops and cycles where everything seems fine. I think we can also say that this is not a JCC problem, based on the idq.mite_uops value.Alnico
@PeterCordes pastebin.com/XR9WZ0AZ shows the results for the idq_uops_not_delivered.xxx counters, interesting to see why no uops are sent from the IDQ. That being said, I find these counters far from simple to understand/interpret... They give me headaches. Apparently, the RAT (and the backend) stalls on the front-end for 1.2e9 cycles in the fast version. The slow version gives very different values. Note that in both versions the DSB is active during the same number of cycles (~2e9) and there are 5 uops/cycle on average to execute during the reported active cycles of the DSB.Alnico
@PeterCordes I can confirm at least that the LSD is disabled (both versions). This is sad, since the erratum appears to matter only with hyper-threading and my processor does not even support it, so the LSD should work fine I guess... Regarding Zen 2, I didn't find anything related to an LSD, and the quoted sentences seem to show that there is no such unit on Zen 2. Not to mention, the observed behavior matches this hypothesis pretty well.Alnico

© 2022 - 2024 — McMap. All rights reserved.