Why does jnz require 2 cycles to complete in an inner loop?

I'm on an Ivy Bridge. I've found the performance behavior of jnz inconsistent between an inner loop and an outer loop.

The following simple program has an inner loop with fixed size 16:

global _start
_start:
    mov rcx, 100000000
.loop_outer:
    mov rax,    16

.loop_inner:
    dec rax
    jnz .loop_inner

    dec rcx
    jnz .loop_outer

    xor edi, edi
    mov eax, 60
    syscall
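
A typical way to assemble and link this (assuming NASM syntax, which the listing uses, and a 64-bit Linux target, which the syscall-based exit implies):

nasm -felf64 loop.asm -o loop.o
ld loop.o -o a.out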

The perf tool shows the outer loop runs at 32 cycles per iteration, which suggests each jnz requires 2 cycles to complete.

I then searched Agner Fog's instruction tables; a conditional jump is listed with a reciprocal throughput of 1-2, with the comment "fast if no jump".

At this point I started to believe the above behavior is somehow expected. But why does the jnz in the outer loop require only 1 cycle to complete?

If I remove the .loop_inner part altogether, the outer loop runs at 1 cycle per iteration. The behavior looks inconsistent.

What am I missing here?

Edit for more info:

The perf results for the above program, collected with the command:

perf stat -ecycles,branches,branch-misses,lsd.uops,uops_issued.any -r4 ./a.out

are:

 3,215,921,579      cycles                                                        ( +-  0.11% )  (79.83%)
 1,701,361,270      branches                                                      ( +-  0.02% )  (80.05%)
        19,212      branch-misses             #    0.00% of all branches          ( +- 17.72% )  (80.09%)
        31,052      lsd.uops                                                      ( +- 76.58% )  (80.09%)
 1,803,009,428      uops_issued.any                                               ( +-  0.08% )  (79.93%)
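
Dividing by the 100M outer iterations as a sanity check: 3,215,921,579 cycles / 100M ≈ 32 cycles per outer iteration; 1,701,361,270 branches / 100M ≈ 17 branches (16 inner + 1 outer); 1,803,009,428 uops_issued.any / 100M ≈ 18 fused-domain uops (the mov, 16 fused inner dec/jnz pairs, and 1 fused outer dec/jnz). lsd.uops is essentially zero, so virtually nothing is being issued from the LSD.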

The perf result of the reference case:

global _start
_start:
    mov rcx, 100000000
.loop_outer:
    mov rax,    16
    dec rcx
    jnz .loop_outer

    xor edi, edi
    mov eax, 60
    syscall

is:

   100,978,250      cycles                                                        ( +-  0.66% )  (75.75%)
   100,606,742      branches                                                      ( +-  0.59% )  (75.74%)
         1,825      branch-misses             #    0.00% of all branches          ( +- 13.15% )  (81.22%)
   199,698,873      lsd.uops                                                      ( +-  0.07% )  (87.87%)
   200,300,606      uops_issued.any                                               ( +-  0.12% )  (79.42%)
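
Per iteration here: 100,978,250 cycles / 100M ≈ 1.0 cycle; 200,300,606 uops_issued.any / 100M ≈ 2 fused-domain uops (the mov plus the macro-fused dec/jnz); and lsd.uops ≈ 2 per iteration, i.e. virtually all uops are issued from the LSD in this case.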

So the cause is now mostly clear: the LSD stops working for some reason in the nested case. Reducing the inner loop size mitigates the slowness slightly, but not completely.

Searching the Intel optimization manual, I found that the LSD won't work if the loop contains "more than eight taken branches". This somewhat explains the behavior.

Asserted answered 12/1, 2019 at 3:17 Comment(6)
16 iterations should be few enough that the loop exit of the inner loop predicts correctly (and you'd probably see much slower timing if it didn't), but you should check anyway. (~23 iterations is when it stops predicting correctly on Skylake, last time I tested.) Long-running tight loops are kind of a special case, handled specially by the front-end using the loop buffer. This might be defeating the loop buffer (LSD); check the counters for lsd.uops vs. uops_issued.any. (I don't think the LSD can handle nested loops, so at best all the inner-loop uops come from the LSD, but it could be less.)Allocution
Also worth trying: aligning your outer loop by 32. That should put the whole thing (inner+outer) in the same uop-cache line. The decoders won't macro-fuse back-to-back dec/jnz pairs on IvB if they hit the decoders in the same group of up-to-4 uops; only HSW and later can. So keep in mind that your outer loop probably has separate uops for dec and jnz. That's not the direct cause of anything you're seeing, though. BTW, how did you measure the cost of an outer-loop JNZ with an inner loop present? Or did you really mean "in a single long-running loop" with no nesting for the 1c/iter?Allocution
@PeterCordes Thanks, you're right, the LSD is the cause. See my edit. Alignment doesn't make a difference, and branch prediction works perfectly in both cases. I will accept if you write these comments up as an answer.Asserted
@PeterCordes I still have a doubt: is the LSD the same thing as the "loopback buffer" in Agner's book? It looks like the same thing, but if so, is Agner's statement "the loop buffer has no measurable effect in the cases where the uop cache is not a bottleneck..." wrong? Because this is certainly a measurable effect, and the uop cache isn't the bottleneck since it has ~1.5K-uop capacity.Asserted
Yes, Agner calls it the loopback buffer. His statement is that adding the LSD to the design doesn't speed up any code. But yes, it appears to be wrong for very tight loops; apparently SnB/IvB do need the loop buffer to issue or execute 1c/iter loops. Unless the microarchitectural bottleneck is in fetching uops from the uop cache after branching, in which case his caveat covers this.Allocution
@PeterCordes I see. And I guess it's not only SnB/IvB, but also Haswell and Skylake: his instruction table still shows a taken branch with "1-2 reciprocal throughput" for Haswell and Skylake.Asserted

TL;DR: The DSB seems to be able to deliver the inner loop's jump uop only every other cycle. In addition, DSB-MITE switches constitute up to 9% of the execution time.


Introduction - Part 1: Understanding the LSD performance events

I'll first discuss when the LSD.UOPS and LSD.CYCLES_ACTIVE performance events occur and some peculiarities of the LSD on the IvB and SnB microarchitectures. Once we establish this foundation, we can then answer the question. To do this, we can use small pieces of code that are especially designed to determine accurately when these events occur.

According to the documentation:

LSD.UOPS: Number of Uops delivered by the LSD.
LSD.CYCLES_ACTIVE: Cycles Uops delivered by the LSD, but didn't come from the decoder.

These definitions are useful but, as you'll see later, not precise enough to answer your question, so it's important to develop a better understanding of these events. Some of the information presented here is not documented by Intel; it's just my best interpretation of the empirical results and of some related patents I went through, although I was not able to find the specific patent that describes the LSD implementation in SnB or later microarchitectures.

Each of the following benchmarks starts with a comment that contains the name of the benchmark. All numbers are normalized per iteration, unless otherwise mentioned.
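
The numbers below can be collected with something like the following invocation (a sketch, not necessarily the exact command used; the event names are the Linux perf spellings of these counters on SnB/IvB, and ./b1 is a hypothetical binary built from the B1 listing):

perf stat -e cycles,lsd.uops,lsd.cycles_active,cycle_activity.cycles_no_execute,uops_issued.stall_cycles,resource_stalls.any,resource_stalls.rs -r4 ./b1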

; B1
----------------------------------------------------
    mov rax, 100000000
.loop:
    dec rax
    jnz .loop
----------------------------------------------------
Metric                             |  IvB   |  SnB
----------------------------------------------------
cycles                             |  0.90  |  1.00
LSD.UOPS                           |  0.99  |  1.99
LSD.CYCLES_ACTIVE                  |  0.49  |  0.99
CYCLE_ACTIVITY.CYCLES_NO_EXECUTE   |  0.00  |  0.00
UOPS_ISSUED.STALL_CYCLES           |  0.43  |  0.50

Both instructions in the loop body are macro-fused into a single uop. There is only one execution port on IvB and SnB that can execute jump instructions, so the maximum throughput should be 1c/iter. IvB is 10% faster, though, for some reason.

According to Is performance reduced when executing loops whose uop count is not a multiple of processor width?, the LSD in IvB and SnB cannot issue uops across loop-body boundaries even if there are available issue slots. Since the loop contains a single uop, we expect the LSD to issue a single uop per cycle and LSD.CYCLES_ACTIVE to be about equal to the total number of cycles.

On IvB, LSD.UOPS is as expected: the LSD issues one uop per cycle. Note that since the number of cycles is equal to the number of iterations, which is equal to the number of uops, we can equivalently say that the LSD issues one uop per iteration. Essentially, most of the uops that were executed were issued from the LSD. However, LSD.CYCLES_ACTIVE is only about half the number of cycles. How is this possible? Shouldn't then only half of the total number of uops have been issued from the LSD? I think what is happening here is that the loop is essentially being unrolled twice and two uops are being issued per cycle. Nonetheless, only a single uop can be executed per cycle, yet RESOURCE_STALLS.RS is zero, indicating that the RS never gets full. However, RESOURCE_STALLS.ANY is about half of the cycle count. Putting all of this together, it seems that the LSD is actually issuing 2 uops every other cycle and that some structural limitation is being hit every other cycle. CYCLE_ACTIVITY.CYCLES_NO_EXECUTE confirms that there is always at least one ready uop in the RS in any given cycle. The following experiments will reveal the conditions under which the unrolling happens.

On SnB, LSD.UOPS shows twice the total number of uops as issued from the LSD, and LSD.CYCLES_ACTIVE indicates the LSD was active most of the time. CYCLE_ACTIVITY.CYCLES_NO_EXECUTE and UOPS_ISSUED.STALL_CYCLES are as on IvB. The following experiments help explain what's happening: the measured LSD.CYCLES_ACTIVE appears to equal the real LSD.CYCLES_ACTIVE plus RESOURCE_STALLS.ANY. Therefore, to get the real LSD.CYCLES_ACTIVE, RESOURCE_STALLS.ANY must be subtracted from the measured value. The same applies to LSD.CYCLES_4_UOPS. The real LSD.UOPS can be calculated as follows:

LSD.UOPS_measured = LSD.UOPS_real + (LSD.UOPS_measured / LSD.CYCLES_ACTIVE_measured) * RESOURCE_STALLS.ANY

Thus,

LSD.UOPS_real = LSD.UOPS_measured - (LSD.UOPS_measured / LSD.CYCLES_ACTIVE_measured) * RESOURCE_STALLS.ANY
              = LSD.UOPS_measured * (1 - RESOURCE_STALLS.ANY / LSD.CYCLES_ACTIVE_measured)

For all of the benchmarks I've run on SnB (including those not shown here), these adjustments are accurate.
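
As a worked check against B1 on SnB (where RESOURCE_STALLS.ANY ≈ 0.50/iter, as noted above): real LSD.CYCLES_ACTIVE ≈ 0.99 - 0.50 = 0.49, and real LSD.UOPS ≈ 1.99 * (1 - 0.50/0.99) ≈ 0.99, matching the IvB numbers.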

Note that RESOURCE_STALLS.RS and RESOURCE_STALLS.ANY on SnB are just like IvB. So it seems that the LSD works the same way, as far as this particular benchmark is concerned, on IvB and SnB, except that the events LSD.UOPS and LSD.CYCLES_ACTIVE are counted differently.

; B2
----------------------------------------------------
    mov rax, 100000000
    mov rbx, 0
.loop:
    dec rbx
    jz .loop
    dec rax
    jnz .loop
----------------------------------------------------
Metric                             |  IvB   |  SnB
----------------------------------------------------
cycles                             |  1.98  |  2.00
LSD.UOPS                           |  1.92  |  3.99
LSD.CYCLES_ACTIVE                  |  0.94  |  1.99
CYCLE_ACTIVITY.CYCLES_NO_EXECUTE   |  0.00  |  0.00
UOPS_ISSUED.STALL_CYCLES           |  1.00  |  1.00

In B2, there are 2 uops per iteration and both are jumps. The first one is never taken, so there is still only one loop. We expect it to run at 2c/iter, which is indeed the case. LSD.UOPS shows that most uops were issued from the LSD, but LSD.CYCLES_ACTIVE shows that the LSD was active only half of the time. This means that the loop was not unrolled. So it seems that unrolling only occurs when the loop consists of a single uop.

; B3
----------------------------------------------------
    mov rax, 100000000
.loop:
    dec rbx
    dec rax
    jnz .loop
----------------------------------------------------
Metric                             |  IvB   |  SnB
----------------------------------------------------
cycles                             |  0.90  |  1.00
LSD.UOPS                           |  1.99  |  1.99
LSD.CYCLES_ACTIVE                  |  0.99  |  0.99
CYCLE_ACTIVITY.CYCLES_NO_EXECUTE   |  0.00  |  0.00
UOPS_ISSUED.STALL_CYCLES           |  0.00  |  0.00

There are also 2 uops here, but the first one is a single-cycle ALU uop that is unrelated to the jump uop. B3 helps us answer the following two questions:

  • If the target of a jump is not a jump uop, will the LSD.UOPS and LSD.CYCLES_ACTIVE still count twice on SnB?
  • If the loop contains 2 uops where only one of them is a jump, will the LSD unroll the loop?

B3 shows that the answer to both questions is no.

UOPS_ISSUED.STALL_CYCLES suggests that the LSD stalls for one cycle only after it issues two jump uops in the same cycle. That never happens in B3, so there are no stalls.

; B4
----------------------------------------------------
    mov rax, 100000000
.loop:
    add rbx, qword [buf]
    dec rax
    jnz .loop
----------------------------------------------------
Metric                             |  IvB   |  SnB
----------------------------------------------------
cycles                             |  0.90  |  1.00
LSD.UOPS                           |  1.99  |  2.00
LSD.CYCLES_ACTIVE                  |  0.99  |  1.00
CYCLE_ACTIVITY.CYCLES_NO_EXECUTE   |  0.00  |  0.00
UOPS_ISSUED.STALL_CYCLES           |  0.00  |  0.00

B4 has an additional twist: it contains 2 uops in the fused domain but 3 uops in the unfused domain, because the load-ALU instruction's load gets unfused from the ALU uop in the RS. In the previous benchmarks there were no micro-fused uops, only macro-fused ones. The goal here is to see how micro-fused uops are treated by the LSD.

LSD.UOPS shows that the two uops of the load-ALU instruction consumed a single issue slot (the macro-fused jump uop likewise consumes only a single slot). Also, since LSD.CYCLES_ACTIVE is equal to cycles, no unrolling has occurred. The loop throughput is as expected.

; B5
----------------------------------------------------
    mov rax, 100000000
.loop:
    jmp .next
.next:
    dec rax
    jnz .loop
----------------------------------------------------
Metric                             |  IvB   |  SnB
----------------------------------------------------
cycles                             |  2.00  |  2.00
LSD.UOPS                           |  1.91  |  3.99
LSD.CYCLES_ACTIVE                  |  0.96  |  1.99
CYCLE_ACTIVITY.CYCLES_NO_EXECUTE   |  0.00  |  0.00
UOPS_ISSUED.STALL_CYCLES           |  1.00  |  1.00

B5 is the last benchmark we will need. It is similar to B2 in that it contains two branch uops, but one of the jump uops in B5 is a forward unconditional jump. The results are identical to B2, indicating that it doesn't matter whether a jump uop is conditional or not. This is also the case if the first jump uop is conditional and the second is not.

Introduction - Part 2: Branch prediction in the LSD

The LSD is a mechanism implemented in the uop queue (IDQ) that can improve performance and reduce power consumption (and, consequently, heat emission). It can improve performance because some of the limitations that exist in the frontend may be relaxed in the uop queue. In particular, on SnB and IvB, both the MITE and DSB paths have a maximum throughput of 4 uops/c, but in terms of bytes it's 16B/c and 32B/c, respectively. The uop queue bandwidth is also 4 uops/c, but it has no limitation on the number of bytes. As long as the LSD issues uops from the uop queue, the frontend (i.e., the fetch and decode units) and even unneeded logic downstream from the IDQ can be powered down. Prior to Nehalem, the LSD was implemented in the IQ unit. Starting with Haswell, the LSD supports loops that contain uops from the MSROM. The LSD in Skylake processors is disabled because, apparently, it's buggy.
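
As a concrete instance of the byte limits: four 5-byte instructions are only 4 uops but total 20 bytes, so the 16B/c MITE path cannot sustain them at 4 uops per cycle, while the 32B/c DSB and the byte-agnostic uop queue can.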

Loops usually contain at least one conditional branch. The LSD essentially monitors backward conditional branches and tries to identify a sequence of uops that constitutes a loop. If the LSD takes too long to detect a loop, performance may degrade and power may be wasted. On the other hand, if the LSD prematurely locks down a loop and attempts to replay it, the conditional jump of the loop may actually fall through. This can only be detected after executing the conditional jump, by which point later uops may already have been issued and dispatched for execution. All of these uops need to be flushed, and the frontend needs to be reactivated to fetch uops from the correct path. So there can be a significant performance penalty if the improvement from using the LSD does not exceed the degradation from potentially mispredicting the last execution of the conditional branch, where the loop is exited.

We already know that the branch prediction unit (BPU) on SnB and later can correctly predict when the conditional branch of a loop falls through, as long as the total number of iterations does not exceed some small number; beyond that, the BPU assumes the loop will iterate forever. If the LSD uses the sophisticated capabilities of the BPU to predict when a locked-down loop terminates, it should be able to correctly predict the same cases. It's also possible that the LSD uses its own, potentially much simpler, branch predictor. Let's find out.

mov rcx, 100000000/(IC+3)
.loop_outer:
    mov rax, IC
    mov rbx, 1 

.loop_inner:
    dec rax
    jnz .loop_inner

    dec rcx
    jnz .loop_outer

Let OC and IC denote the number of outer iterations and the number of inner iterations, respectively. These are related as follows:

OC = 100000000/(IC+3)       where IC > 0

For any given IC, the total number of uops retired is the same. In addition, the number of uops in the fused domain is equal to the number of uops in the unfused domain. This is nice because it really simplifies the analysis and allows us to make a fair performance comparison between different values of IC.
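
To see where the IC+3 comes from, count the fused-domain uops per outer iteration: mov rax (1) + mov rbx (1) + IC macro-fused inner dec/jnz pairs (IC) + one macro-fused outer dec/jnz (1) = IC + 3. Hence OC*(IC+3) keeps the total dynamic uop count at 100M for every IC.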

Compared to the code from the question, there is one additional instruction, mov rbx, 1, so that the outer loop body contains exactly 4 static uops. This enables us to make use of the LSD.CYCLES_4_UOPS performance event in addition to LSD.CYCLES_ACTIVE and BR_MISP_RETIRED.CONDITIONAL. Note that since there is only a single branch execution port, each outer-loop iteration takes at least 2 cycles (or, according to Agner's table, 1-2 cycles). See also: Can the LSD issue uOPs from the next iteration of the detected loop?.

The total number of jump uops is:

OC + IC*OC = 100M/(IC+3) + IC*100M/(IC+3)
           = 100M*(IC+1)/(IC+3)

Assuming that the maximum jump uop throughput is 1 per cycle, the optimal execution time is 100M(IC+1)/(IC+3) cycles. On IvB, we can instead use a maximum jump uop throughput of 0.9/c if we want to be strict. It would be useful to divide this by the number of inner iterations:

OPT = (100M*(IC+1)/(IC+3)) / (100M*IC/(IC+3))
    = (100M*(IC+1)*(IC+3)) / ((IC+3)*100M*IC)
    = (IC+1)/IC = 1 + 1/IC

Hence, 1 < OPT <= 1.5 for IC > 1. The person designing the LSD can use this to compare different designs of the LSD. We'll use this shortly also. Putting it another way, the optimal performance is achieved when the total number of cycles divided by the total number of jumps is 1 (or 0.9 on IvB).
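
For concreteness: OPT is 1.5 at IC=2, 1.25 at IC=4, 1.125 at IC=8, and 1.0625 at IC=16, approaching 1 as IC grows.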

Assuming that the predictions for the two jumps are independent, and given that jnz .loop_outer is easily predictable, the performance depends on the prediction of jnz .loop_inner. On a misprediction that transfers control to a uop outside the locked-down loop, the LSD terminates the loop and tries to detect another one. The LSD can be represented as a state machine with three states: in the first, the LSD is looking for looping behavior; in the second, it is learning the boundaries and the number of iterations of the loop; in the third, it is replaying the loop. When the loop exits, the state changes from the third back to the first.

As we learned from the previous set of experiments, there will be extra LSD event counts on SnB when there are backend-related issue stalls, so the numbers need to be interpreted accordingly. Note that the case IC=1 was not tested in the previous section; it will be discussed here. Recall also that, on both IvB and SnB, the inner loop may get unrolled; the outer loop never will, because it contains more than one uop. By the way, LSD.CYCLES_4_UOPS works as expected (no surprises there).

The following figures show the raw results. I've only shown the results up to IC=13 and IC=9 on IvB and SnB, respectively. I'll discuss in the next section what happens for larger values. Note that when the denominator is zero, the value cannot be computed and so it's not plotted.

[Figure: uop metrics]
[Figure: cycle metrics]

LSD.UOPS/100M is the ratio of the number of uops issued from the LSD to the total number of uops. LSD.UOPS/OC is the average number of uops issued from the LSD per outer iteration. LSD.UOPS/(OC*IC) is the average number of uops issued from the LSD per inner iteration. BR_MISP_RETIRED.CONDITIONAL/OC is the average number of retired conditional branches that were mispredicted per outer iteration, which is clearly zero on both IvB and SnB for all IC.

For IC=1 on IvB, all uops were issued from the LSD. The inner conditional branch is never taken. The LSD.CYCLES_4_UOPS/LSD.CYCLES_ACTIVE metric in the second figure shows that in all of the cycles in which the LSD is active, it issues 4 uops per cycle. We learned from the previous experiments that when the LSD issues 2 jump uops in the same cycle, it cannot issue jump uops in the next cycle due to some structural limitation, so it stalls. LSD.CYCLES_ACTIVE/cycles shows that the LSD stalls (almost) every other cycle. We'd expect an outer iteration to take about 2 cycles, but cycles shows that it takes about 1.8. This is probably related to the 0.9 jump-uop throughput on IvB we saw earlier.

The case IC=1 on SnB is similar, except for two things. First, an outer iteration actually takes 2 cycles as expected, not 1.8. Second, all three LSD event counts are double what is expected. They can be adjusted as discussed in the previous section.

Branch prediction is particularly interesting when IC>1. Let's analyze the IC=2 case in detail. LSD.CYCLES_ACTIVE and LSD.CYCLES_4_UOPS show that the LSD is active in about 32% of all cycles, and that in 50% of those cycles it issues 4 uops per cycle. So either there are mispredictions, or the LSD is spending a lot of time in the loop-detection or learning states. Nonetheless, cycles/(OC*IC) is about 1.6, or in other words, cycles/jumps is 1.07, which is close to optimal. It's difficult to figure out which uops are being issued in groups of 4 from the LSD and which in groups of fewer than 4. In fact, we don't know how the LSD events are counted in the presence of LSD mispredictions. Potential unrolling adds another level of complexity. The LSD event counts can be taken as upper bounds on the useful uops issued by the LSD and on the cycles in which the LSD issued useful uops.

As IC increases, both LSD.CYCLES_ACTIVE and LSD.CYCLES_4_UOPS decrease, and performance deteriorates slowly but consistently (remember that cycles/(OC*IC) should be compared against OPT). It is as if the last inner-loop iteration is being mispredicted, with a misprediction penalty that grows with IC. Note that the BPU always correctly predicts the number of inner-loop iterations.


The answer

I'll discuss what happens for any IC, why performance deteriorates for larger IC, and what the upper and lower bounds on performance are. The following code will be used in this section:

mov rcx, 100000000/(IC+2)
.loop_outer:
    mov rax, IC

.loop_inner:
    dec rax
    jnz .loop_inner

    dec rcx
    jnz .loop_outer

This is essentially the same as the code from the question; the only difference is that the number of outer iterations is adjusted to maintain the same number of dynamic uops. Note that LSD.CYCLES_4_UOPS is useless in this case because the LSD never has 4 uops to issue in any cycle. All of the following figures are for IvB only; how SnB differs will be mentioned in the text.

[Figure: LSD activity, stall cycles, cycles/jumps, and BPU misprediction rate vs. IC on IvB]

When IC=1, cycles/jumps is 0.7 (1.0 on SnB), which is even lower than 0.9; I don't know how this throughput is being achieved. Performance decreases for larger values of IC, which correlates with the decrease in LSD active cycles. For IC=13-27 (9-27 on SnB), zero uops get issued from the LSD. I think that in this range the LSD deems the performance impact of mispredicting the last inner iteration to be larger than some threshold, so it decides never to lock down the loop, and it remembers that decision. For IC<13, the LSD appears to be aggressive, perhaps because it considers the loop more predictable. For IC>27, the LSD active-cycle count slowly grows, which correlates with a gradual improvement in performance. Although not shown in the figure, as IC grows far beyond 64, most of the uops come from the LSD and cycles/jumps settles at 0.9.

The results for the range IC=13-27 are particularly useful. The issue-stall cycles are about half of the total cycle count and are also equal to the dispatch-stall cycles. This is precisely why the inner loop executes at 2.0c/iter: the jump uop of the inner loop is only being issued/dispatched every other cycle. When the LSD is not active, uops can come from the DSB, the MITE path, or the MSROM. Microcode assists are not required for our loop, so the limitation is probably in the DSB, the MITE path, or both. We can investigate further using the frontend performance events to determine where the limitation is. I've done this, and the results show that about 80-90% of all uops come from the DSB. The DSB itself has many limitations, and it seems that the loop is hitting one of them: apparently the DSB takes 2 cycles to deliver a jump uop that targets itself. In addition, over the full IC range, stalls due to MITE-DSB switching constitute up to 9% of all cycles. Again, the reason for these switches is limitations in the DSB itself. Note that up to 20% of uops are being delivered from the MITE path. Assuming the uops don't exceed the 16B/c bandwidth of the MITE path, I think the loop would have executed at 1c/iter if the DSB were not there.
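
For reference, a sketch of how the frontend attribution above can be reproduced with Linux perf (a hypothetical invocation; ./a.out stands for a binary built from the listing above, and the event names are the perf spellings on IvB):

perf stat -e cycles,idq.dsb_uops,idq.mite_uops,idq.ms_uops,dsb2mite_switches.penalty_cycles -r4 ./a.out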

The figure above also shows the BPU misprediction rate (per outer-loop iteration). On IvB, it's zero for IC=1-33 (except at IC=21), between 0 and 1 for IC=34-45, and exactly 1 for IC>46. On SnB, it's zero for IC=1-33 and 1 otherwise.

Ism answered 19/1, 2019 at 2:44 Comment(3)
With IC=1, we might not be getting macro-fusion because only HSW and later can make 2 macro-fusions in one decode group. But if the inner loop branch is taken at least once, then maybe the outer loop's dec/jnz is re-decoded when the inner loop finally exits, instead of the decode result from the non-executed path being saved. This still doesn't explain how IC=1 could achieve more than 1 jump per cycle, but it is a potential qualitative difference.Allocution
@PeterCordes I was surprised to see that the store uop in B4 is apparently using 2 issue slots, which means it's getting unlaminated, even though it's not using an indexed addressing mode. Do you know whether this is documented by Intel? Am I missing something?Ism
Instructions with an immediate and a RIP-relative operand can never micro-fuse in the first place, not even in the decoders. So there's no unlamination, and this is documented elsewhere in Intel's manual I think. This applies to mov-stores, and to test or cmp [rel foo], imm. See Micro fusion and addressing modes for examples. IIRC, the load+operate part of add [rel foo], imm can't micro-fuse either.Allocution

(Partial answer / speculation I didn't finish writing before Hadi posted a detailed analysis; some of this is continuing from comments)

Agner's statement "the loop buffer has no measurable effect in the cases where the uop cache is not a bottleneck..." is wrong? Because this is certainly a measurable effect, and the uop cache isn't the bottleneck because it has ~1.5K-uop capacity.

Yes, Agner calls it the loopback buffer. His statement is that adding the LSD to the design doesn't speed up any code. But yes, it appears to be wrong for very tight loops, at least for nested loops. Apparently SnB/IvB do need the loop buffer to issue or execute 1c/iter loops. Unless the microarchitectural bottleneck is in fetching uops from the uop cache after branching, in which case his caveat covers this.

There are cases other than uop-cache misses where reading the uop cache can be a bottleneck. e.g. if uops didn't pack very well because of alignment effects, or if they use large immediates and/or displacements that take extra cycles to read from the uop cache. See the Sandybridge section of Agner Fog's uarch guide for more details about these effects. Your assumption that capacity (up to 1.5k uops if they pack perfectly) is the only reason it could be slow is very wrong.
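
As an illustrative sketch of the large-immediate effect (my example, not Agner's): the 64-bit immediates below have to be stored alongside the uops in the uop cache, which per the effects Agner describes can take extra cycles to read, so even a tiny loop like this may fail to sustain 4 uops per clock from the uop cache:

    mov rcx, 100000000
.loop:
    mov rax, 0x0123456789abcdef    ; 64-bit immediate: extra data stored in the uop cache
    mov rbx, 0x0fedcba987654321    ; another 64-bit immediate
    dec rcx
    jnz .loop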

BTW, a microcode update for Skylake disabled the LSD entirely to fix a partial-register merging bug, erratum SKL150 (see footnote 1), and that did in fact have little effect except when a tiny loop spans a 32B boundary and needs 2 cache lines.

But Agner lists JMP rel8/32 and taken JCC throughput as 1-2 cycles on HSW/SKL, vs. just 2 on IvB. So something about taken branches, other than the LSD itself, may have sped up since IvB.

There might be some part of the CPU other than the LSD that also special-cases long-running tiny loops, letting them run at 1 taken jump per clock on Haswell and later. I haven't tested what conditions produce 1 vs. 2 cycle taken-branch throughput on HSW/SKL. Also note that Agner measured before the microcode update for erratum SKL150.


footnote 1: See How exactly do partial registers on Haswell/Skylake perform? Writing AL seems to have a false dependency on RAX, and AH is inconsistent. Note that SKX and Kaby Lake ship with microcode that already includes this update. The LSD is finally re-enabled in CPUs such as Cannon Lake / Ice Lake, which fixed the buggy hard-wired logic so that it can safely be enabled again.

(I previously thought Coffee Lake had re-enabled the LSD, but it seems not; wikichip explicitly says it's still disabled, so I think that's correcting some earlier reports that it was re-enabled. CFL did fix the L1TF and Meltdown vulnerabilities, though, making software mitigation unnecessary for those specific vulnerabilities.)

Allocution answered 19/1, 2019 at 3:16 Comment(8)
Upvoted. Your answer and Hadi's are both very good; I'd already accepted Hadi's before you posted, sorry about that. I've reread the uop-cache part in Agner's book, but I'm more confused about the uop cache organization. Agner writes: "the same piece of code can have multiple entries in the uop cache if it has multiple jump entries". How can a piece of code have multiple entries? In my code, the counter of the inner loop is 16; I had thought it only has one uop-cache entry -- the macro-fused jnz. Does Agner mean it has 16 entries?Asserted
@user10865622: No, this loop has the same entry-point jumped to 16 times. Imagine a loop body with a first-iteration entry point in the middle of the full body, like for the jump to the loop condition strategy that gcc -Os uses for loops that might need to run 0 times. (Why are loops always compiled into "do...while" style (tail jump)?). The CPU might end up re-decoding from the top of the loop and making a new uop cache line, maybe without evicting the original (right away / at all).Allocution
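A minimal sketch (my illustration, not from the comment) of that shape, where the first-iteration entry point is in the middle of the block:

    mov rcx, 100000000
    xor esi, esi
    jmp .cond            ; first-iteration entry point is in the middle
.body:
    add rsi, 1           ; hypothetical loop body
.cond:
    dec rcx
    jnz .body            ; backward branch re-enters at .body

Once the backward branch jumps to .body, the same machine code may get re-decoded and cached again from the earlier starting point -- one way the same bytes can end up with multiple uop-cache entries.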
@user10865622: I think Hadi's more directly answers the specific question, while mine is addressing extra stuff from comments, so you should keep the accept checkmark on Hadi's. (SO does let you move it, though, but there's no way to accept more than one which would be useful in some cases.)Allocution
If I understand your comment correctly, the CPU has the ability to distinguish an instruction following a non-branch from an instruction following a branch. In the former case a new entry will be created, while in the latter case the CPU will use the existing entry. Am I correct?Asserted
@user10865622: No, that's not exactly what I'm saying. I'm not sure exactly when you'd end up with multiple uop cache lines containing the same instruction. But to avoid crazy switching between decode and uop cache, I think if the CPU is already decoding from legacy decoders, it might only re-check for a uop cache hit when crossing a 32-byte boundary. Or at least it will decode+cache the group of up-to-4 uops it just decoded.Allocution
If, the first time some instructions from a block executed, execution started part way through, the first few instructions in the block haven't been decoded and aren't cached. But then if you jump to them, they and the later insns will be cached in a new uop-cache line. Decode happens 4 uops at a time. It makes sense to re-decode the instructions from an earlier starting point and cache that "view" once the CPU sees that execution does sometimes start there.Allocution
But if a block was already cached, a jump into the middle of it can hit in that uop cache line.Allocution
I see, thanks. I was generalizing your while-loop example from the wrong perspective. It's more about an alignment issue for uop-cache lines, and the actual implementation might be complicated.Asserted
