Is performance reduced when executing loops whose uop count is not a multiple of processor width?

Asked 3/9, 2016 at 22:28 Answered 16/5, 2020 at 9:10

Solved performance assembly x86 cpu-architecture micro-optimization

I'm wondering how loops of various sizes perform on recent x86 processors, as a function of number of uops.

Here's a quote from Peter Cordes who raised the issue of non-multiple-of-4 counts in another question:

I also found that the uop bandwidth out of the loop buffer isn't a constant 4 per cycle, if the loop isn't a multiple of 4 uops. (i.e. it's abc, abc, ...; not abca, bcab, ...). Agner Fog's microarch doc unfortunately wasn't clear on this limitation of the loop buffer.

The issue is about whether loops need to be a multiple of N uops to execute at maximum uop throughput, where N is the width of the processor. (i.e., 4 for recent Intel processors). There are a lot of complicating factors when talking about "width" and count uops, but I mostly want to ignore those. In particular, assume no micro or macro-fusion.

Peter gives the following example of a loop with 7 uops in its body:

A 7-uop loop will issue groups of 4|3|4|3|... I haven't tested larger loops (that don't fit in the loop buffer) to see if it's possible for the first instruction from the next iteration to issue in the same group as the taken-branch to it, but I assume not.

More generally, the claim is that each iteration of a loop with x uops in its body will take at least ceil(x / 4) iterations, rather than simply x / 4.

Is this true for some or all recent x86-compatible processors?

Hideaway answered 3/9, 2016 at 22:28 Comment(22)

There's also the question of whether a taken branch within a loop body can also produce an issue group of less than 4 uops. – Artery 3/9, 2016 at 22:39

I'll try to test that too. You basically that a jump within a loop may lower the throughput below what it would otherwise be, because of fetching issues. – Hideaway 3/9, 2016 at 22:59

This is assuming you can feed the processor to even see these differences, fetch line alignment and cache line alignment, branch prediction and where it aligns with its dependencies can each result in tens to thousands of percent difference in execution of the same sequence of instructions. Adding the loop count is only one more factor that can have affects on the performance of a loop assuming you can even time that. – Policyholder 4/9, 2016 at 1:14

if you are running on/with an operating system, out of painfully slow ram and layers of caching, I wouldnt expect you to be able to see this even in targeted tests. – Policyholder 4/9, 2016 at 1:15

@dwelch: To microbenchmark this, you simply write a loop with 2 NOPs vs. a loop with 3 NOPs (plus a non-macro-fused dec/jnz). The total cycles should double when you go from 4 uops in the loop to 5. Or just independent reg-reg ALU ops like ADD or OR, instead of NOP. Or were you talking about instruction fetch? The whole point of this experiment is to test the loop buffer in modern Intel CPUs, which, for tiny loops, recycles the contents of the queue between the rest of the frontend and the issue stage, using it as a loop buffer. So L1I and L0uop caches untouched. – Artery 4/9, 2016 at 4:56

Take a small loop test and change its alignment relative to the fetch size and alignment and relative to cache line alignment, even if in the L1 cache the alignment relative to the fetch can/will have a dramatic affect all by itself. You somehow have to erase/overcome that issue when trying to isolate this other feature/limit. – Policyholder 4/9, 2016 at 14:3

if you get it all in a single fetch, then you somehow have to get accurate timing in there as well, so no operating system calls for measuring a timer, and perhaps that is doable as well...Point is it is very very easy to get a result that is not related to the thing you were looking for, in particular just trying to feed the processor, it is very easy to have two back to back instructions that take hundreds of clock cycles to fetch the second one relative to the first. – Policyholder 4/9, 2016 at 14:6

dont know about intel but some processors shortcut nops, so have to be careful there too...the only way to really see a tiny feature like this is in a sim of the logic looking at the waveforms. I would expect to have some doubt hovering over the results otherwise. – Policyholder 4/9, 2016 at 14:9

What kind of debug/performance timers are available, for a cisc like this should be a single instruction and for a test like this needs to be in units of the processor clock not a divisor, and how do you insure the time measurements are not shortcut or sent parallel through one of the execution paths passing instructions under test? – Policyholder 4/9, 2016 at 14:17

@dwelch: This affects the long-term throughput of a loop by a factor of 25 to 100%, so you can just benchmark 100M iterations lasting ~1/10th of a second. Interrupts / multitasking overhead becomes a non issue. Measurement is easy: perf stat ./a.out gives you a cycle count from the precise HW perf counters. You have to know what you're doing to get this right, but x86 microarchitecture internals are known at this level of detail. There are far fewer different microarchitectures than for ARM. The same core design scales from 4W Core-M to 120W 20-core Xeon, just with different uncore/L3. – Artery 4/9, 2016 at 18:51

@dwelch Your comments here are completely unhelpful. This is a real question from someone who understands the complexity. Go read the Skylake section in Agner Fog's microarch pdf before making any more wrong guesses about why this effect might be hard to measure or alignment dependent. It's known more or less exactly how SnB-family microarchitectures shortcut NOPs, issuing them but not needing to dispatch them to an execution unit. (Still, it is something to double-check, and best avoided when possible). – Artery 4/9, 2016 at 18:59

Most of the time when someone who doesnt already know the answer (someone who doesnt need to ask SO for help) is able to accurately pull off a performance test. then of the folks that are able I tend to see many of those fail as well, or draw incorrect conclusions, so it is extremely relevant. Now had the poster shown results for various systems or any system, then we could have a constructive conversation. – Policyholder 4/9, 2016 at 20:32

As to the performing millions of loops, I demonstrate that failed assumption on a regular basis, the size and alignment of the fetch in particular and if you use branch prediction, etc, is as much of what you are timing as the instructions. L1 is also slow in keeping the processor fed, if running on an operating system the more loops you run the more likely to have an operating system or other event interfere with your timing. The only way to accurately time what they are talking about here is in a sim. – Policyholder 4/9, 2016 at 20:35

If the linked prior answer covers this topic then lets close this one, if there is an answer here the Peter in particular or someone just answer it then. Post the demo code results on your computer and how to run on others in case you dont have access to the core(s) being discussed. – Policyholder 4/9, 2016 at 20:43

@dwelch: I don't have a SKL, IDK why BeeOnRope doesn't just test it. BTW you're totally wrong about L1I being involved here. The whole point of the loop buffer is that it caches already-decoded instructions, and can just feed them to issue stage 4 uops at a time, without touching L1I$ or even the main L0 uop-cache. Overhead from OS interrupts is fractions of a %, and this effect produces easily-measurable differences of 25% to 100% in the total cycle count to run 100M iterations. I've done this on my SnB hardware, but it's broken ATM so I can't re-run the experiment myself. – Artery 5/9, 2016 at 13:1

There are a lot of complicating factors I'm not sure you can ignore these complicating factors if you're optimizing on such a low level. And when you have it right for one CPU another comes out with different alignment of optimal factors. – Youthen 5/9, 2016 at 15:35

@PeterCordes is right, my experiments on AMD confirm this. – Youthen 5/9, 2016 at 15:43

@dwelch - you are way off track here. This is a question by a knowledgeable user, for experts. I understand the complexities. As Peter already makes clear above this isn't some minor effect - it's the difference between a 5-op loop running in 2 cycles vs 1.25 cycles - i.e., close to a doubling in throughput. Changes of 1% or less can be measured with a properly configured environment, to say nothing of changes of 75%! Furthermore, even with the inherent complexity of x86, most small loops can be analyzed down to a cycle accurate level (see Agner's docs or IACA) - modulo cache misses... – Hideaway 9/9, 2016 at 23:24

... which themselves are well understood! @PeterCordes, well - I have been testing exactly this and I'll share my results soon. I'm traveling and had a 5 day stretch with zero internet access. My early results for Skylake show that the situation is definitely not as bad as "issues in groups of 4, ended by a jump". I.e., 5 uop loops execute at nearly 1.25 cycles/iteration, rather than the 2 cycles/iteration that would be the case if the simple model held. 7 uops/loop is the worst case. I have some theories and code that I'll share in an answer: perhaps you can help test on SB. – Hideaway 9/9, 2016 at 23:27

My Intel DZ68DB motherboard is bricked because of bad BIOS that Intel still has had up on their web site. I probably skipped it years ago, but then applied it because I thought it might fix affect a graphics-drivers problem. – Artery 9/9, 2016 at 23:47

Bad news... Hopefully you find a way to unbrick it. In any case other interested users may be able to chip in with numbers for other archs. – Hideaway 10/9, 2016 at 0:40

I got around to posting my results. Hopefully we can get others to chip in some results on other hardware. – Hideaway 9/10, 2016 at 7:29

I did some investigation with Linux perf to help answer this on my Skylake i7-6700HQ box, and Haswell results have been kindly provided by another user. The analysis below applies to Skylake, but it is followed by a comparison versus Haswell.

Other architectures may vary⁰, and to help sort it all out I welcome additional results. The source is available).

This question mostly deals with the front end, since on recent architectures it is the front end which imposes the hard limit of four fused-domain uops per cycle.

Summary of Rules for Loop Performance

First, I'll summarize the results in terms of a few "performance rules" to keep in mind when dealing with small loops. There are plenty of other performance rules as well - these are complementary to them (i.e., you probably don't break another rule to just to satisfy these ones). These rules apply most directly to Haswell and later architectures - see the other answer for an overview of the differences on earlier architectures.

First, count the number of macro-fused uops in your loop. You can use Agner's instruction tables to look this up directly for every instruction, except that an ALU uop and immediately follow branch will usually fuse together into a single uop. Then based on this count:

If the count is a multiple of 4, you're good: these loops execute optimally.
If the count is even and less than 32, you're good, except if it's 10 in which case you should unroll to another even number if you can.
For odd numbers you should try to unroll to an even number less than 32 or a multiple of 4, if you can.
For loops larger than 32 uops but less than 64, you might want to unroll if it isn't already a multiple of 4: with more than 64 uops you'll get efficient performance at any value on Sklyake and almost all values on Haswell (with a few deviations, possibly alignment related). The inefficiencies for these loops are still relatively small: the values to avoid most are 4N + 1 counts, followed by 4N + 2 counts.

Summary of Findings

For code served out of the uop cache, there are no apparent multiple-of-4 effects. Loops of any number of uops can be executed at a throughput of 4 fused-domain uops per cycle.

For code processed by the legacy decoders, the opposite is true: loop execution time is limited to integral number of cycles, and hence loops that are not a multiple of 4 uops cannot achieve 4 uops/cycle, as they waste some issue/execution slots.

For code issued from the loop stream detector (LSD), the situation is a mix of the two and is explained in more detail below. In general, loops less than 32 uops and with an even number of uops execute optimally, while odd-sized loops do not, and larger loops require a multiple-of-4 uop count to execute optimally.

What Intel Says

Intel actually has a note on this in their optimization manual, details in the other answer.

Details

As anyone well-versed recent x86-64 architectures knows, at any point the fetch and decode portion of the front end may be working in one several different modes, depending on the code size and other factors. As it turns out, these different modes all have different behaviors with respect to loop sizing. I'll cover them separately follow.

Legacy Decoder

The legacy decoder¹ is the full machine-code-to-uops decoder that is used² when the code doesn't fit in the uop caching mechanisms (LSD or DSB). The primary reason this would occur is if the code working set is larger than the uop cache (approximately ~1500 uops in the ideal case, less in practice). For this test though, we'll take advantage of the fact that the legacy decoder will also be used if an aligned 32-byte chunk contains more than 18 instructions³.

To test the legacy decoder behavior, we use a loop that looks like this:

short_nop:
    mov rax, 100_000_000
ALIGN 32
.top:
    dec rax
    nop
    ...
    jnz .top
    ret

Basically, a trivial loop that counts down until rax is zero. All instructions are a single uop⁴ and the number of nop instructions is varied (at the location shown as ...) to test different sizes of loops (so a 4-uop loop will have 2 nops, plus the two loop control instructions). There is no macro-fusion as we always separate the dec and jnz with at least one nop, and also no micro-fusion. Finally, there is no memory access at (outside of the implied icache access).

Note that this loop is very dense - about 1 byte per instruction (since the nop instructions are 1 byte each) - so we'll trigger the > 18 instructions in a 32B chunk condition as soon as hit 19 instructions in the loop. Based on examining the perf performance counters lsd.uops and idq.mite_uops that's exactly what we see: essentially 100% of the instructions come out of the LSD⁵ up until and including the 18 uop loop, but at 19 uops and up, 100% come from the legacy decoder.

In any case, here are the cycles/iteration for all loop sizes from 3 to 99 uops⁶:

The blue points are the loops that fit in the LSD, and show somewhat complex behavior. We'll look at these later.

The red points (starting at 19 uops/iteration), are handled by the legacy decoder, and show a very predictable pattern:

All loops with N uops take exactly ceiling(N/4) iterations

So, for the legacy decoder at least, Peter's observation holds exactly on Skylake: loops with a multiple of 4 uops may execute at an IPC of 4, but any other number of uops will waste 1, 2 or 3 execution slots (for loops with 4N+3, 4N+2, 4N+1 instructions, respectively).

It is not clear to me why this happens. Although it may seem obvious if you consider that decoding happens in contiguous 16B chunks, and so at a decoding rate of 4 uops/cycle loops not a multiple of 4 would always have some trailing (wasted) slots in the cycle the jnz instruction is encountered. However, the actual fetch & decode unit is composed of predecode and decode phases, with a queue in-between. The predecode phase actually has a throughput of 6 instructions, but only decodes to the end of the 16-byte boundary on each cycle. This seems to imply that the bubble that occurs at the end of the loop could be absorbed by the predecoder -> decode queue since the predecoder has an average throughput higher than 4.

So I can't fully explain this based on my understanding of how the predecoder works. It may be that there is some additional limitation in decoding or pre-decoding that prevents non-integral cycle counts. For example, perhaps the legacy decoders cannot decode instructions on both sides of a jump even if the instructions after the jump are available in the predecoded queue. Perhaps it is related to the need to handle macro-fusion.

The test above shows the behavior where the top of the loop is aligned on a 32-byte boundary. Below is the same graph, but with an added series that shows the effect when the top of loop is moved 2 bytes up (i.e, now misaligned at a 32N + 30 boundary):

Most loop sizes now suffer a 1 or 2 cycle penalty. The 1 penalty case makes sense when you consider decoding 16B boundaries and 4-instructions per cycle decoding, and the 2 cycle penalty cases occurs for loops where for some reason the DSB is used for 1 instruction in the loop (probably the dec instruction which appears in its own 32-byte chunk), and some DSB<->MITE switching penalties are incurred.

In some cases, the misalignment doesn't hurt when it ends up better aligning the end of the loop. I tested the misalignment and it persists in the same way up to 200 uop loops. If you take the description of the predecoders at face value, it would seem that, as above, they should be able to hide a fetch bubble for misalignment, but it doesn't happen (perhaps the queue is not big enough).

DSB (Uop Cache)

The uop cache (Intel likes to call it the DSB) is able to cache most loops of moderate amount of instructions. In a typical program, you'd hope that most of your instructions are served out of this cache⁷.

We can repeat the test above, but now serving uops out of the uop cache. This is a simple matter of increasing the size of our nops to 2 bytes, so we no longer hit the 18-instruction limit. We use the 2-byte nop xchg ax, ax in our loop:

long_nop_test:
    mov rax, iters
ALIGN 32
.top:
    dec eax
    xchg ax, ax  ; this is a 2-byte nop
    ...
    xchg ax, ax
    jnz .top
    ret

Here, there results are very straightforward. For all tested loop sizes delivered out of the DSB, the number of cycles required was N/4 - i.e., the loops executed at the maximum theoretical throughput, even if they didn't have a multiple of 4 uops. So in general, on Skylake, moderately sized loops served out of the DSB shouldn't need to worry about ensuring the uop count meets some particular multiple.

Here's a graph out to 1,000 uop loops. If you squint, you can see the sub-optimal behavior before 64-uops (when the loop is in the LSD). After that, it's a straight shot, 4 IPC the whole way to 1,000 uops (with a blip around 900 that was probably due to load on my box):

Next we look at performance for loops that are small enough to fit in the uop cache.

LSD (Loop steam detector)

Important note: Intel has apparently disabled the LSD on Skylake (SKL150 erratum) and Kaby Lake (KBL095, KBW095 erratum) chips via a microcode update and on Skylake-X out of the box, due to a bug related to the interaction between hyperthreading and the LSD. For those chips, the graph below will likely not have the interesting region up to 64 uops; rather, it will just look the same as the region after 64 uops.

The loop stream detector can cache small loops of up to 64 uops (on Skylake). In Intel's recent documentation it is positioned more as a power-saving mechanism than a performance feature - although there are certainly no performance downsides mentioned to using the LSD.

Running this for the loop sizes that should fit in the LSD, we get the following cycles/iteration behavior:

The red line here is the % of uops which are delivered from the LSD. It flatlines at 100% for all loop sizes from 5 to 56 uops.

For the 3 and 4 uop loops, we have the unusual behavior that 16% and 25% of the uops, respectively, are delivered from the legacy decoder. Huh? Luckily, it doesn't seem to affect the loop throughput as both cases achieve the maximum throughput of 1 loop/cycle - despite the fact that one could expect some MITE<->LSD transition penalties.

Between loop sizes of 57 and 62 uops, the number of uops delivered from LSD exhibits some weird behavior - approximately 70% of the uops are delivered from the LSD, and the rest from the DSB. Skylake nominally has a 64-uop LSD, so this is some kind of transition right before the LSD size is exceeded - perhaps there is some kind of internal alignment within the IDQ (on which the LSD is implemented) that causes only partial hits to the LSD in this phase. This phase is short and, performance-wise, seems mostly to be a linear combination of the full-in-LSD performance which precedes it, and the fully-in-DSB performance which follows it.

Let's look at the main body of results between 5 and 56 uops. We see three distinct regions:

Loops from 3 to 10 uops: Here, the behavior is complex. It is the only region where we see cycle counts that can't be explained by static behavior over a single loop iteration⁸. The range is short enough that it's hard to say if there is a pattern. Loops of 4, 6 and 8 uops all execute optimally, in N/4 cycles (that's the same pattern as the next region).

A loop of 10 uops, on the other hand, executes in 2.66 cycles per iteration, making it the only even loop size that doesn't execute optimally until you get to loop sizes of 34 uops or above (other than the outlier at 26). That corresponds to something like a repeated uop/cycle execution rate of 4, 4, 4, 3. For a loop of 5 uops, you get 1.33 cycles per iteration, very close but not the same as the ideal of 1.25. That corresponds to an execution rate of 4, 4, 4, 4, 3.

These results are hard to explain. The results are repeatable from run to run, and robust to changes such as swapping out the nop for an instruction that actually does something like mov ecx, 123. It might be something to do with the limit of 1 taken branch every 2 cycles, which applies to all loops except those that are "very small". It might be that the uops occasionally line up such that this limitation kicks in, leading to an extra cycle. Once you get to 12 uops or above, this never occurs since you are always taking at least three cycles per iteration.

Loops from 11 to 32-uops: We see a stair-step pattern, but with a period of two. Basically all loops with an even number of uops perform optimally - i.e., taking exactly N/4 cycles. Loops with odd number of uops waste one "issue slot", and take the same number of cycles as a loop with one more uops (i.e., a 17 uop loop takes the same 4.5 cycles as an 18 uop loop). So here we have behavior better than ceiling(N/4) for many uop counts, and we have the first evidence that Skylake at least can execute loops in a non-integral number of cycles.

The only outliers are N=25 and N=26, which both take about 1.5% longer than expected. It's small but reproducible, and robust to moving the function around in the file. That's too small to be explained by a per-iteration effect, unless it has a giant period, so it's probably something else.

The overall behavior here is exactly consistent (outside of the 25/26 anomaly) with the hardware unrolling the loop by a factor of 2.

Loops from 33 to ~64 uops: We see a stair-step pattern again, but with a period of 4, and worse average performance than the up-to 32 uop case. The behavior is exactly ceiling(N/4) - that is, the same as the legacy decoder case. So for loops of 32 to 64 uops, the LSD provides no apparent benefit over the legacy decoders, in terms of front end throughput for this particular limitation. Of course, there are many other ways the LSD is better - it avoids many of the potential decoding bottlenecks that occur for more complex or longer instructions, and it saves power, etc.

All of this is quite surprising, because it means that loops delivered from the uop cache generally perform better in the front end than loops delivered from the LSD, despite the LSD usually being positioned as a strictly better source of uops than the DSB (e.g., as part of advice to try to keep loops small enough to fit in the LSD).

Here's another way to look at the same data - in terms of the efficiency loss for a given uop count, versus the theoretical maximum throughput of 4 uops per cycle. A 10% efficiency hit means you only have 90% of the throughput that you'd calculate from the simple N/4 formula.

The overall behavior here is consistent with the hardware not doing any unrolling, which makes sense since a loop of more than 32 uops cannot be unrolled at all in a buffer of 64 uops.

The three regions discussed above are colored differently, and at least competing effects are visible:

Everything else being equal, the larger the number of uops involved, the lower the efficiency hit. The hit is a fixed cost only once per iteration, so larger loops pay a smaller relative cost.
There is a large jump in inefficiency when you cross to into the 33+ uop region: both the size of the throughput loss increases, and the number of affected uop counts doubles.
The first region is somewhat chaotic, and 7 uops is the worst overall uop count.

Alignment

The DSB and LSD analysis above is for loop entries aligned to a 32-byte boundary, but the unaligned case doesn't seem to suffer in either case: there isn't a material difference from the aligned case (other than perhaps some small variation for less than 10 uops that I didn't investigate further).

Here's the unaligned results for 32N-2 and 32N+2 (i.e., the loop top 2 bytes before and after the 32B boundary):

The ideal N/4 line is also shown for reference.

Haswell

Next next take a look at the prior microarchitecture: Haswell. The numbers here have been graciously provided by user Iwillnotexist Idonotexist.

LSD + Legacy Decode Pipeline

First, the results from the "dense code" test which tests the LSD (for small uop counts) and the legacy pipeline (for larger uop counts, since the loop "busts out" of the DSB due to instruction density.

Immediately we see a difference already in terms of when each architecture delivers uops from the LSD for a dense loop. Below we compare Skylake and Haswell for short loops of dense code (1 byte per instruction).

As described above, the Skylake loop stops being delivered from the LSD at exactly 19 uops, as expected from the 18-uop per 32-byte region of code limit. Haswell, on the other hand, seems to stop delivering reliably from the LSD for the 16-uop and 17-uop loops as well. I don't have any explanation for this. There is also a difference in the 3-uop case: oddly both processors only deliver some of the their uops out of the LSD in the 3 and 4 uop cases, but the exact amount is the same for 4 uops, and different from 3.

We mostly care about the actual performance though, right? So let's look at the cycles/iteration for the 32-byte aligned dense code case:

This is the same data as show above for Skylake (the misaligned series has been removed), with Haswell plotted alongside. Immediately you notice that the pattern is similar for Haswell, but not the same. As above, there are two regions here:

Legacy Decode

The loops larger than ~16-18 uops (the uncertainty is described above) are delivered from the legacy decoders. The pattern for Haswell is somewhat different from Skylake.

For the range from 19-30 uops they are identical, but after that Haswell breaks the pattern. Skylake took ceil(N/4) cycles for loops delivered from the legacy decoders. Haswell, on the other hand, seems to take something like ceil((N+1)/4) + ceil((N+2)/12) - ceil((N+1)/12). OK, that's messy (shorter form, anyone?) - but basically it means that while Skylake executes loops with 4*N cycles optimally (i.e,. at 4-uops/cycle), such loops are (locally) usually the least optimal count (at least locally) - it takes one more cycle to execute such loops than Skylake. So you are actually best off with loops of 4N-1 uops on Haswell, except that the 25% of such loops that are also of the form 16-1N (31, 47, 63, etc) take one additional cycle. It's starting to sound like a leap year calculation - but the pattern is probably best understood visually above.

I don't think this pattern is intrinsic to uop dispatch on Haswell, so we shouldn't read to much into it. It seems to be explained by

0000000000455a80 <short_nop_aligned35.top>:
16B cycle
  1     1 455a80:       ff c8   dec    eax
  1     1 455a82:       90      nop
  1     1 455a83:       90      nop
  1     1 455a84:       90      nop
  1     2 455a85:       90      nop
  1     2 455a86:       90      nop
  1     2 455a87:       90      nop
  1     2 455a88:       90      nop
  1     3 455a89:       90      nop
  1     3 455a8a:       90      nop
  1     3 455a8b:       90      nop
  1     3 455a8c:       90      nop
  1     4 455a8d:       90      nop
  1     4 455a8e:       90      nop
  1     4 455a8f:       90      nop
  2     5 455a90:       90      nop
  2     5 455a91:       90      nop
  2     5 455a92:       90      nop
  2     5 455a93:       90      nop
  2     6 455a94:       90      nop
  2     6 455a95:       90      nop
  2     6 455a96:       90      nop
  2     6 455a97:       90      nop
  2     7 455a98:       90      nop
  2     7 455a99:       90      nop
  2     7 455a9a:       90      nop
  2     7 455a9b:       90      nop
  2     8 455a9c:       90      nop
  2     8 455a9d:       90      nop
  2     8 455a9e:       90      nop
  2     8 455a9f:       90      nop
  3     9 455aa0:       90      nop
  3     9 455aa1:       90      nop
  3     9 455aa2:       90      nop
  3     9 455aa3:       75 db   jne    455a80 <short_nop_aligned35.top>

Here I've noted the 16B decode chunk (1-3) each instruction appears in, and the cycle in which it will be decoded. The rule is basically that up to the next 4 instructions are decoded, as long as they fall in the current 16B chunk. Otherwise they have to wait until the next cycle. For N=35, we see that there is a loss of 1 decode slot in cycle 4 (only 3 instruction are left in the 16B chunk), but that otherwise the loop lines up very well with the 16B boundaries and even the last cycle (9) can decode 4 instructions.

Here's a truncated look at N=36, which is identical except for the end of the loop:

0000000000455b20 <short_nop_aligned36.top>:
16B cycle
  1     1 455a80:       ff c8   dec    eax
  1     1 455b20:       ff c8   dec    eax
  1     1 455b22:       90      nop
  ... [29 lines omitted] ...
  2     8 455b3f:       90      nop
  3     9 455b40:       90      nop
  3     9 455b41:       90      nop
  3     9 455b42:       90      nop
  3     9 455b43:       90      nop
  3    10 455b44:       75 da   jne    455b20 <short_nop_aligned36.top>

There are now 5 instructions to decode in the 3rd and final 16B chunk, so one additional cycle is needed. Basically 35 instructions, for this particular pattern of instructions happens to line up better with the 16B bit boundaries and saves one cycle when decoding. This doesn't mean that N=35 is better than N=36 in general! Different instructions will have different numbers of bytes and will line up differently. A similar alignment issue explains also the additional cycle that is required every 16 bytes:

16B cycle
...
  2     7 45581b:       90      nop
  2     8 45581c:       90      nop
  2     8 45581d:       90      nop
  2     8 45581e:       90      nop
  3     8 45581f:       75 df   jne    455800 <short_nop_aligned31.top>

Here the final jne has slipped into the next 16B chunk (if an instruction spans a 16B boundary it is effectively in the latter chunk), causing an extra cycle loss. This occurs only every 16 bytes.

So the Haswell legacy decoder results are explained perfectly by a legacy decoder that behaves as described, for example, in Agner Fog's microarchitecture doc. In fact, it also seems to explain Skylake results if you assume Skylake can decode 5 instructions per cycle (delivering up to 5 uops)⁹. Assuming it can, the asymptotic legacy decode throughput on this code for Skylake is still 4-uops, since a block of 16 nops decodes 5-5-5-1, versus 4-4-4-4 on Haswell, so you only get benefits at the edges: in the N=36 case above, for example, Skylake can decode all remaining 5 instructions, versus 4-1 for Haswell, saving a cycle.

The upshot is that it seems to be that the legacy decoder behavior can be understood in a fairly straightforward manner, and the main optimization advice is to continue to massage code so that it falls "smartly" into the 16B aligned chunks (perhaps that's NP-hard like bin packing?).

DSB (and LSD again)

Next let's take a look at the scenario where the code is served out of the LSD or DSB - by using the "long nop" test which avoids breaking the 18-uop per 32B chunk limit, and so stays in the DSB.

Haswell vs Skylake:

Note the LSD behavior - here Haswell stops serving out of the LSD at exactly 57 uops, which is completely consistent with the published size of the LSD of 57 uops. There is no weird "transition period" like we see on Skylake. Haswell also has the weird behavior for 3 and 4 uops where only ~0% and ~40% of the uops, respectively, come from the LSD.

Performance-wise, Haswell is normally in-line with Skylake with a few deviations, e.g., around 65, 77 and 97 uops where it rounds up to the next cycle, whereas Skylake is always able to sustain 4 uops/cycle even when that's results in a non-integer number of cycles. The slight deviation from expected at 25 and 26 uops has disappeared. Perhaps the 6-uop delivery rate of Skylake helps it avoid uop-cache alignment issues that Haswell suffers with its 4-uop delivery rate.

Other Architectures

Results for the following additional architectures were kindly provided by user Andreas Abel, but we'll have to use another answer for further analysis as we are at the character limit here.

Help Needed

Although results for many platforms have been kindly offered by the community, I'm still interested in results on chips older than Nehalem, and newer than Coffee Lake (in particular, Cannon Lake, which is a new uarch). The code to generate these results is public. Also, the results above are available in .ods format in GitHub as well.

⁰ In particular, the legacy decoder maximum throughput apparently increased from 4 to 5 uops in Skylake, and the maximum throughput for the uop cache increased from 4 to 6. Both of those could impact the results described here.

¹ Intel actually like to call the legacy decoder the MITE (Micro-instruction Translation Engine), perhaps because it's a faux-pas to actually tag any part of your architecture with the legacy connotation.

² Technically there is another, even slower, source of uops - the MS (microcode sequencing engine), which is used to implement any instruction with more than 4 uops, but we ignore this here since none of our loops contain microcoded instructions.

³ This works because any aligned 32-byte chunk can use at most 3-ways in its uop cache slot, and each slot holds up to 6 uops. So if you use more than 3 * 6 = 18 uops in a 32B chunk, the code can't be stored in the uop cache at all. It's probably rare to encounter this condition in practice, since the code needs to be very dense (less than 2 bytes per instruction) to trigger this.

⁴ The nop instructions decode to one uop, but don't are eliminated prior to execution (i.e., they don't use an execution port) - but still take up space in the front end and so count against the various limits that we are interested in.

⁵ The LSD is the loop stream detector, which caches small loops of up to 64 (Skylake) uops directly in the IDQ. On earlier architectures it can hold 28 uops (both logical cores active) or 56 uops (one logical core active).

⁶ We can't easily fit a 2 uop loop in this pattern, since that would mean zero nop instructions, meaning the dec and jnz instructions would macro-fuse, with a corresponding change in the uop count. Just take my word that all loops with 4 or less uops execute at best at 1 cycle/iteration.

⁷ For fun, I just ran perf stat against a short run of Firefox where I opened a tab and clicked around on a few Stack Overflow questions. For instructions delivered, I got 46% from DSB, 50% from legacy decoder and 4% for LSD. This shows that at least for big, branchy code like a browser the DSB still can't capture the large majority of the code (lucky the legacy decoders aren't too bad).

⁸ By this, I mean that all the other cycle counts can be explained by simply by taking an "effective" integral loop cost in uops (which might be higher than the actual size is uops) and dividing by 4. For these very short loops, this doesn't work - you can't get to 1.333 cycles per iteration by dividing any integer by 4. Said another way, in all other regions the costs have the form N/4 for some integer N.

⁹ In fact we know that Skylake can deliver 5 uops per cycle from the legacy decoder, but we don't know if those 5 uops can come from 5 different instructions, or only 4 or less. That is, we expect that Skylake can decode in the pattern 2-1-1-1, but I'm not sure if it can decode in the pattern 1-1-1-1-1. The above results give some evidence that it can indeed decode 1-1-1-1-1.

Hideaway answered 9/10, 2016 at 7:4 Comment(37)

I see SKL can't manage non-integer cycles per iteration for larger loops, > half LSD size. Perhaps SKL unrolls loops in the LSD until they're a multiple of 4 uops, size permitting? That would be plausible from a design perspective I think, and sounds like a good theory if it's consistent with the data. – Artery 9/10, 2016 at 7:20

Well it can, but just not for loops in the LSD, so the 0.5 * LSD -> 1.0 * LSD size range is where it can't. I never considered some type of hardware unrolling like that - what benefit would it bring? It's clear they can execute just fine out of the DSB, which still uses all the IDQ stuff, so in that more complex situation they don't need unrolling, so why would they do it in the LSD? Still, it kind of nicely explains the phase-change at ~32 bytes. It doesn't seem to much to explain the weirdness in region 1 or 2 though. – Hideaway 9/10, 2016 at 7:28

What benefit? Running a 5 uop loop in 1.25 cycles instead of 2. Without something new, it would be like SnB where the loop branch ends an issue group, if my data is correct. This is my theory for explaining the original question! It doesn't need any help to do this when running from the uop cache, because it's not looping back onto what's already in the IDQ, so SKL's improved uop-cache read bandwidth lets it keep the IDQ supplied with 4 uops per clock, and the natural operation of the queue lets the taken branch issue in the same cycle as the first insn of the next iteration. – Artery 9/10, 2016 at 7:34

Ah yeah, that makes sense. The IDQ doesn't work like a queue when it's being used as the LSD - perhaps it's essentially just a fixed buffer and the processor executes out of it by moving pointers out of it, and the loop-end causes an issue in that scenario. In the DSB case it operates like a buffer after all, and the loop end doens't cause an issue because by the time the RAT sees it, it is already followed by the beginning of the next loop thanks to the the uop cache fetching running ahead... – Hideaway 9/10, 2016 at 7:39

the bubble that occurs at the end of the loop could be absorbed by the predecoder: only if the pre-decoders / pre-decode queue know about branches and can stitch together the bytes required to feed the decoders a non-contiguous group of instructions. In a CPU with a uop cache and loop buffer, I'm not surprised they don't spend the transistors on that. I'll test tomorrow whether Core2 does it, since it might be worth the transistors for tight loops there. Keep in mind that most loops that run from the decoders won't bottlenck on the frontend not quite managing 4.0 uops / cycle! – Artery 9/10, 2016 at 7:40

@PeterCordes - FWIW the 5 uop loop runs in 1.3333.... cycles not 1.25. One of the weird mysteries of the LSD region 1. It's like it hiccups once every few iterations... – Hideaway 9/10, 2016 at 7:40

The predecoders must know something about branches or else they would be quite useless! The branch prediction logic needs to feed into fetch (i.e., even before predecoders) or else the whole thing falls apart because branches are very frequent and otherwise the predecoders would be getting the wrong 16B all the time. – Hideaway 9/10, 2016 at 7:43

Getting the right 16B is simpler than stitching together blocks of 4 instructions from separate sources. Doing a data-dependent operation like that would take extra transistors and extra gate-delays. – Artery 9/10, 2016 at 7:45

I expect the IDQ is normally a circular buffer with a read pointer and a write pointer, instead of having all the data actually move to different transistors every cycle. It's big enough that it might be more like SRAM than anything else, but I'm not a real CPU designer / analyst. – Artery 9/10, 2016 at 7:47

That said, I'm definitely not saying they stitch together any bytes! They don't need to here. Let's say they are looking at 17 nops followed by a jnz. They handle the first 6 + 6 + 4 = 16 nops in 3 cycles (all in the same 16B chunk), then they handle the final nop and the jnz, and the rest of the predecode bandwidth is wasted that cycle, but overall it has delivered 18 instructions in 4 cycles = 4.5 inst/cycle. – Hideaway 9/10, 2016 at 7:47

Where is the "blocks of 4" coming when discussing the decoder and pre-decoder? There is already a queue between those two things (20 instructions in SB), so I don't see where any stitching needs to happen - in some cycle the predecoder puts in 3 insts, in another 5 instrs - later on the decoders can pull those out as 4 and 4. The – Hideaway 9/10, 2016 at 7:51

Right, but you're hoping they can feed the decoders a non-contiguous block of x86 machine code in the same cycle. That's a big deal, I think. Or at least a moderate deal. I see your point about the queue between pre-decode and decode; maybe it's not a problem at all. – Artery 9/10, 2016 at 7:52

The decoders can't macro-fuse a test/jcc when the first byte of the JCC is 64B-aligned. They can in any other case of either the test or JCC spanning the 64B boundary. This is probably for uop-cache reasons more than pre-decode or decode reasons. (Invalidating a line of L1I cache also invalidates any uop cache entries for that range of address space, so it has to be able to invalidate just the JCC or just the test.) – Artery 9/10, 2016 at 8:1

Right, yes. I don't see why the decoders would care, because by the time they slurp it in off of the predecode queue I thought it has already been cracked up (i.e., the queue length is specified in "instructions" not "bytes"). Probably it's possible to test it directly with some small forward jumping code. In any case, the measurements for the legacy are consistent with your position and not really consistent with mine, although there could be another reason for it as well. – Hideaway 9/10, 2016 at 8:2

I never made any claims about how the legacy decoders behaved (until just now). I've been curious about how uop-cache-line effects affect frontend performance for loops that don't fit in the LSD, since while tuning real code on SnB I saw perf changes from reordering instructions or from alignment. I suspect that pre-SKL, uop-cache read bandwidth was sometimes the bottleneck, since IIRC the frontend can only read from one uop cache line per cycle. (And pre-SKL, maybe couldn't read all 6 uops in a single cycle, but if that's true, I would have expected bigger bottlenecks.) – Artery 9/10, 2016 at 8:18

Right, the other question was about LSD resident loops, I think. I was responding to your comments here though. So far, it seems that Skylake behaves as if it can only process 4 contiguous instructions in the legacy pipeline, but it just isn't clear to me why this is the case. – Hideaway 9/10, 2016 at 22:52

@IwillnotexistIdonotexist: perfect, both those links work for me. Hopefully BeeOnRope can get them, too, and turn them into the same kind of graphs. – Artery 28/10, 2016 at 5:40

Will do today! Is there a comment missing above? – Hideaway 28/10, 2016 at 14:17

Yeah I got them. It just seemed like @PeterCordes replied to a message above that no longer existed or something :) – Hideaway 28/10, 2016 at 18:58

@IwillnotexistIdonotexist - thanks very much for the Haswell numbers. I uploaded the first chunk of analysis above, covering mostly the legacy decode pipeline. It shed some light on the Skylake behavior actually - the legacy pipeline now seems like a simple case that can be explained (mostly?) by just looking at how the code falls on 16B boundaries, with the additional proviso that Skylake can decode 5 uops/cycle from 5 instructions, versus 4 from 4 in Haswell. – Hideaway 28/10, 2016 at 21:58

I can verify this with a new test that has 15 instructions per 16B chunk (good for Skylake) rather than 16 (good for Haswell). Then maybe the steps disapear in the Skylake legacy graph! – Hideaway 28/10, 2016 at 21:58

@Hideaway Any immediate differences b/ Haswell HT and no HT? – Bebe 28/10, 2016 at 22:4

I didn't look at HT at all. AFAIK nothing differs between HT and non-HT unless you are actually running something on both logical cores of the same physical core. For example, all the resources things that are "split when using HT" are split not when HT is enabled, but when a core is actually running two logical threads. Since I assume you didn't do any tricks to run two versions of my test at once on the same logical core (right?), I don't expect any difference in your data sets (but I will confirm). – Hideaway 28/10, 2016 at 22:14

... except of course that I found that HT-on is much noisier (less repeatable) likely due to an occasional other thread getting scheduled to the same core while the test runs. That's why I turn HT off. – Hideaway 28/10, 2016 at 22:15

@Hideaway I ran your tests while not touching the computer at all, and for HT off I completely offlined the sibling core before running. – Bebe 28/10, 2016 at 22:17

Also, I'll avow myself surprised that <100% of the uops in a 3-uop loop come from the LSD. In my quickie Haswell experiments with libpfc I get ~100%. I suspect that this is because you put the nop(s) between the dec rax and the jne. In the loop nop dec jne, 3 insns/i issue but just 2 uops/i, all served out of LSD, in a pattern 0-4-0-4. In the loop dec nop jne, 3 insns/i issue, 3 uops/i, all served out of LSD, in a pattern 0-4-4-4-0-4-4-4. – Bebe 28/10, 2016 at 22:25

Let us continue this discussion in chat. – Hideaway 28/10, 2016 at 22:45

I posted the second half of the Haswell results up. – Hideaway 29/10, 2016 at 16:45

@PeterCordes - FWIW I think your unrolling conclusion was correct. It explains most of the results (the region <= 10 uops is different though) pretty much exactly. It doesn't seem to fully unroll in the 11 to 31 region: just by 2x in all those cases. – Hideaway 1/11, 2017 at 2:48

Wild guess: unrolling by 2x happens naturally in the HW: the IDQ is probably a circular buffer, and maybe loop detection happens when a backward branch is taken twice in a row. (So the loop has already issued twice, so its uops are sitting there in the buffer.) If SnB is different and doesn't unroll, maybe it activates the LSD the first time it branches to a uop that hasn't been overwritten yet in the IDQ? – Artery 1/11, 2017 at 10:16

I created a pull request on GitHub with results for several microarchitectures, including Sandy Bridge. – Lampe 27/10, 2018 at 13:49

@AndreasAbel - I added another answer with some analysis of your results, so far only for the "dense nop" cases which exercises the legacy decoders and/or LSD. There is a weird outlier at 16 uops for Coffee Lake (see graph below) - is it consistent on your system? The Broadwell results are also a bit weird, they seem "noisy" in that they don't follow any strict linear or stair-step pattern. Are they consistent? – Hideaway 5/11, 2018 at 5:14

@PeterCordes - Intel has finally confirmed your "unrolling" theory in the newest optimization manual: Assume a loop that qualifies for LSD has 23 μops in the loop body. The hardware unrolls the loop such that it still fits into the μop-queue, in this case twice. The loop in the μop-queue thus takes 46 μops. from section 3.4.2.4. – Hideaway 6/10, 2019 at 0:45

@Andreas Abel mentioned in another comment (which I can't find now) that Skylake legacy decode (MITE) still only has 4 decoders, with only the number of uops they can produce being increased to 5. – Artery 13/12, 2021 at 6:49

@PeterCordes This was the comment: stackoverflow.com/questions/70131766/… You can test this with the IDQ.MITE_UOPS performance counter with a CMASK of 5. – Lampe 13/12, 2021 at 18:35

@PeterCordes - yeah I "discovered" this on my own when looking into Skylake decode behavior in the last year, and Andreas already knew this and confirmed it on Twitter (Andreas has a whole paper and simulator which goes into detail on the front-end among other things, if you haven't seen it). I tried to find where this "5 decoders" wisdom actually started, but it wasn't in the intel docs: they actually only say "5 uops". I actually found a lot of my own posts saying "5 decoders" so I definitely added to the problem. – Hideaway 14/12, 2021 at 1:7

There were also some diagrams on wikichip IIRC which showed 5 decoders, that might have been the source of some of it. (I just clicked on the link Andreas provided and I see it refers to uiCA, so nevermind that part, oops). – Hideaway 14/12, 2021 at 1:8

This is a follow-on to the original answer, to analyze the behavior for five additional architectures, based on test results provided by Andreas Abel:

Nehalem
Sandy Bridge
Ivy Bridge
Broadwell
Coffee Lake

We take a quick look at the results on these architectures in addition to Skylake and Haswell. It only needs to be a "quick" look since all the architectures except Nehalem follow one of the existing patterns discussed above.

First, the short nop case which exercises the legacy decoder (for loops that don't fit in the LSD) and the LSD. Here is the cycles/iteration for this scenario, for all 7 architectures.

Figure 2.1: All architectures dense nop performance:

This graph is really busy (click for a larger view) and a bit hard to read since the results for many architectures lie on top of each other, but I tried to ensure that a dedicated reader can track the line for any architecture.

First, let's discuss the big outlier: Nehalem. All of the other architectures have a slope that roughly follows the 4 uops/cycle line, but Nehalem is at almost exactly 3 uops per cycle, so quickly falls behind all of the other architectures. Outside of the initial LSD region, the line is also totally smooth, without the "stair step" appearance seen in the other architectures.

This is entirely consistent with Nehalem having a uop retirement limit of 3 uops/cycle. This is the bottleneck for uops outside of the LSD: they all execute at about exactly 3 uops per cycle, bottlenecked on retirement. The front-end isn't the bottleneck, so the exact uop count and decoding arrangement doens't matter and so the stair-step is absent.

Other than Nehalem, the other architectures, except Broadwell split fairly cleanly into groups: Haswell-like or Skylake-like. That is, all of Sandy Bridge, Ivy Bridge and Haswell behave like Haswell, for loops greater than about 15 uops (Haswell behavior is discussed in the other answer). Even though they are different micro-architectures, they behave largely the same since their legacy decoding capabilities are the same. Below about 15 uops we see Haswell as somewhat faster for any uop count not a multiple of 4. Perhaps it gets an additional unrolling in the LSD due to a larger LSD, or there are other "small loop" optimizations. For Sandy Bridge and Ivy Bridge, this means that small loops should definitely target a uop count which is a multiple of 4.

Coffee Lake behaves similarly to Skylake¹. This makes sense, since the micro-architecture is the same. Coffee Lake appears better than Skylake below about 16 uops, but this is just an effect of Coffee Lake's disabled LSD by default. Skylake was tested with an enabled LSD, before Intel disabled it via microcode update due to a security issue. Coffee Lake was released after this issue was known, so had the LSD disabled out-of-the-box. So for this test, Coffee Lake is using either the DSB (for loops below about 18 uops, which can still fit in the DSB) or the legacy decoder (for the remainder of the loops), which leads to better results for small uop count loops where the LSD imposes an overhead (interesting, for larger loops, the LSD and the legacy decoder happen to impose exactly the same overhead, for very different reasons).

Finally, we take a look at 2-byte NOPs, which aren't dense enough to prevent the use of the DSB (so this case is more reflective of typical code).

Figure 2.1: 2-byte nop performance:

Again, the result is along the same lines as the earlier chart. Nehalem is still the outlier bottlenecked at 3 uops per cycle. For the range up to about 60ish uops, all architectures other than Coffee Lake are using the LSD, and we see that Sandy Bridge and Ivy Bridge perform a bit worse here, rounding up to the next cycle and so only achieving the maximum throughput of 4 uops/cycle if the number of uops in the loop is a multiple of 4. Above 32 uops the "unrolling" feature of Haswell and new uarchs dosn't have any effect, so everything is roughly tied.

Sandy Bridge actually has a few uop ranges (e.g., from 36 through 44 uops) where it performs better than the newer architectures. This seems to occur because not all loops are detected by the LSD and in these ranges the loops are served from the DSB instead. Since the DSB is generally faster, so is Sandy Bridge in these cases.

What Intel Says

You can actually find a section specifically dealing with this topic in the Intel Optimization Manual, section 3.4.2.5, as pointed out by Andreas Abel in the comments. There, Intel says:

The LSD holds micro-ops that construct small “infinite” loops. Micro-ops from the LSD are allocated in the out-of-order engine. The loop in the LSD ends with a taken branch to the beginning of the loop. The taken branch at the end of the loop is always the last micro-op allocated in the cycle. The instruction at the beginning of the loop is always allocated at the next cycle. If the code performance is bound by front end bandwidth, unused allocation slots result in a bubble in allocation, and can cause performance degrada- tion. Allocation bandwidth in Intel microarchitecture code name Sandy Bridge is four micro-ops per cycle. Performance is best, when the number of micro-ops in the LSD result in the least number of unused allo- cation slots. You can use loop unrolling to control the number of micro-ops that are in the LSD.

They go on to show an example where unrolling a loop by a factor of two doesn't help performance due to LSD "rounding", but unrolling by three works. The example is a big confusing since it actually mixes two effects since unrolling more also reduces the loop overhead and hence the number of uops per iteration. A more interesting example would have been where unrolling the loop fewer times led to an increase in performance due to LSD rounding effects.

This section seems to accurately describe the behavior in Sandy Bridge and Ivy Bridge. The results above show that both of these architectures do as described, and you lose 1, 2 or 3 uop execution slots for loops with 4N+3, 4N+2, or 4N+1 uops respectively.

It hasn't been updated with the new performance for Haswell and later however. As described in the other answer, performance has improved from the simple model described above and the behavior is more complex.

¹ There is a weird outlier at 16 uops where Coffee Lake performs worse than all the other architectures, even Nehalem (a regression of about 50%), but maybe this measurement noise?

Hideaway answered 5/11, 2018 at 5:12 Comment(9)

Don't you mean Coffee Lake was using the uop cache, not the legacy decoder? Do your uops not fit in the uop cache? Anyway, interesting that my SnB results really were correct: a 5 to 7 uop loop runs at 2.0c per iter from the loop buffer, and it wasn't until Haswell that things improved. – Artery 5/11, 2018 at 5:26

@Peter this is the dense nop case, so in general the legacy decoder gets used since there are too many instructions per uop cache line. However for the small loops like under 18 hope one might imagine that the uop cache could still be used since there aren't enough nops to "break out" - which is what I saw on Sklyake with the LSD enabled. However for the coffee lake results it seems the DSB is not being used even for those small loops based on the perf counter results. – Hideaway 5/11, 2018 at 5:36

I should really just add a line for Skylake with microcode update, to check that it behaves the same as coffee lake. – Hideaway 5/11, 2018 at 5:38

I will run the test on Coffee lake again later to see if the outlier was a measurement error. – Lampe 5/11, 2018 at 16:27

I just came across section 3.4.2.5 of Intel's optimization manual. It hasn't been mentioned in the answers so far, but it seems relevant to the issue discussed here. – Lampe 5/11, 2018 at 16:29

@PeterCordes - a correction to the above: Coffee Lake indeed uses the DSB for small loops less than about 18 uops, even in the "dense" case, so all is as expected (I observed this also on Skylake pre-microcode patch except replace DSB with LSD). I just read the data wrong or mis-remembered it. Yeah, it seems the LSD strategy was perhaps improved in Haswell: maybe the whole "unrolling" thing was added then, so before that small loops especially suffered when they weren't of the form 4N. This makes unrolling somewhat more important for those architectures. – Hideaway 5/11, 2018 at 20:55

@AndreasAbel - huh, at least part of the answer was sitting there in the optimization manual the whole time! The behavior described there seems more or less in line with your Ivy Bridge and Sandy Bridge results. In Haswell, the LSD seems to have been made more sophisticated so the advice in the manual no longer applies as directly, per the results in the original answer. I should add what Intel says to the answer, hopefully I have enough remaining characters to do so. – Hideaway 5/11, 2018 at 22:21

@AndreasAbel - added a part about 3.4.2.5 to the end of this answer. – Hideaway 5/11, 2018 at 22:39

I added created a new pull request with additional results for Coffee Lake. The outlier at 16 uops was a measurement error, probably caused by hyperthreading. – Lampe 6/11, 2018 at 0:31

TL;DR: For tight loops consisting of exactly 7 uops it results in inefficient retirement bandwidth utilization. Consider manual loop unrolling so the loop will consist of 12 uops

I recently faced retirement bandwidth degradation with loops consisting of 7 uops. After doing some research by myself quick googling leads me to this topic. And here are my 2 cents applying to Kaby Lake i7-8550U CPU:

As @BeeOnRope noted, LSD is turned off on chips like KbL i7-8550U.

Consider the following NASM macro

;rdi = 1L << 31
%macro nops 1
    align 32:
    %%loop:
    times %1 nop
    dec rdi
    ja %%loop
%endmacro

Here is how the "average retirement rate" uops_retired.retire_slots/uops_retired.total_cycle looks like:

The thing to notice here is the retirement degradation when the loop consists of 7 uops. It results in 3.5 uops being retired per cycle.

The average idq delivery rate idq.all_dsb_cycles_any_uops / idq.dsb_cycles looks as

For loops of 7 uops it results in 3.5 uops being delivered to the idq per cycle. Judging by only this counter it is impossible to conclude whether uops cache delivers 4|3 or 6|1 groups.

For loops consisting of 6 uops it results in an efficient utilization of uops cache bandwidth - 6 uops/c. When IDQ gets overflowed the uops cache stays idle until it can deliver 6 uops again.

To check how the uops cache stays idle let's compare idq.all_dsb_cycles_any_uops and cycles

The number of cycles uops are delivered to the idq is equal to the number of total cycles for loops of 7 uops. By contrast the counters are noticeably different for the loop of 6 uops.

The key counters to check is idq_uops_not_delivered.*

As can be seen for the loop of 7 uops we have that the Renamer takes 4|3 groups which results in inefficient retirement bandwidth utilization.

Merle answered 16/5, 2020 at 9:10 Comment(10)

When looking for the bottleneck I would be careful about assumptions of causality when looking at the performance counters. From the start, you have some bottleneck which causes the sustained throughput to be 3.5 uops/cycle. By "bottleneck" here I just mean that you aren't running at the max theoretical 4.0 uops cycle. Even without knowing anything about the source of the bottleneck, it must be the case that every performance counter along the pipeline: front-end, allocation, dispatch, issue, retirement, will report exactly the same 3.5 sustained throughput. – Hideaway 16/5, 2020 at 17:8

... with a slight exception in this case since you used nop which doesn't execute. So every counter will report less than the maximum bandwidth, that has cycles or slots unused, etc. That doesn't tell you why there is a bottleneck. If you have an execution bottleneck, like a string of dependent multiply instructions, all the front-end counters are going to report really low numbers of delivered uops, and lots of idle cycles and so on, despite there being zero FE problem: it couldn't be otherwise: in the steady state, the throughput of every part of the pipeline must be equal. – Hideaway 16/5, 2020 at 17:10

So you can't use the DSB counters to conclude that the DSB is causing a bottleneck, generally. Same for most other counters. This is why methodologies for VTune need "conditional" counters: things like "cycles where no uops were delivered from the front end and allocation wasn't stalled". That is, if the RAT was able to accept ops but the FE couldn't provide them: in that case it's reasonable to think you might have a stall. – Hideaway 16/5, 2020 at 17:12

Anyways, the reason for the dip at 7 uops is fairly clear: the DSB can only deliver from one 6-uop line every cycle, and doesn't usefully deliver across a taken jump (uop cache is not a trace cache). So a 7 uop loop will always take at least 2 cycles: since you'll need 2 cycles to deliver 7 uops. – Hideaway 16/5, 2020 at 17:14

7 uops / 2 cycles = 3.5 / cycle. For 6 uops, there is no issue: all uops can come from a single way (if other restrictions are met), so you are limited elsewhere to 4 / cycle. For 8 uops you also need 2 cycles, but 8 / 4 = 2 so you don't really notice the bottleneck. BTW this is also a reason why it's useful to increase the DSB line size to 6 uops: so loops with 5 or 6 uops can execute in at 4 uops/cycle from the DSB. – Hideaway 16/5, 2020 at 17:17

@Hideaway Thanks for the detailed feedback. I noticed the problem for L1D bound computation in a loop consisting of 7 uops. As Intel Optimization Manual documents it in the Appendix B. to determine FE boundness the counter idq_uops_not_delivered.core should be analyzed. It counts exactly FE problems and not when the RAT cannot allocate the resource. I provided the counters on the last plot. It can be seen that 7 uops has noticeable peak which is due to delivery 7 uops per 2 cycles. – Merle 17/5, 2020 at 5:33

@Hideaway My point was that one of the possible solutions for the 7 uops loop problem maybe to manually unroll it. In the L1D bound computation I gained 28% performance improvement by unrolling the loop to contain 2 iterations. – Merle 17/5, 2020 at 5:40

@Hideaway so loops with 5 or 6 uops can execute in at 4 uops/cycle from the DSB. I would notice that not only 5 or 6, but also 9, 10, 11, etc till 18 since. 7 is the only problematic number in case of delivering from DSB. – Merle 17/5, 2020 at 5:40

@Hideaway the DSB can only deliver from one 6-uop line every cycle, and doesn't usefully deliver across a taken jump As far as I notice a branch predicted to be taken stops fetching uops from the DSB and the uops corresponding the branch's target will be delivered in the next cycle (if any). – Merle 17/5, 2020 at 5:51

Agree with everything. Sorry I hadn't noticed that one counter was idq_uops_not_delivered which doesn't have this "causality problem" (as much). Agreed about the jump: I think we are saying the same thing there? I am not 100% positive about special cases like very short forward jumps that could land in the same line though. – Hideaway 17/5, 2020 at 6:13

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Summary of Rules for Loop Performance

Summary of Findings

What Intel Says

Details

Legacy Decoder

DSB (Uop Cache)

LSD (Loop steam detector)

Alignment

Haswell

LSD + Legacy Decode Pipeline

Legacy Decode

DSB (and LSD again)

Other Architectures

Help Needed

What Intel Says

Recommended topics

Hot tags