Why is the loop instruction slow? Couldn't Intel have implemented it efficiently?

LOOP (Intel ref manual entry) decrements ecx / rcx, and then jumps if non-zero. It's slow, but couldn't Intel have cheaply made it fast? dec/jnz already macro-fuses into a single uop on Sandybridge-family; the only difference being that dec/jnz sets flags.
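
For reference, the idiom everyone writes instead (a trivial sketch; on SnB-family the pair macro-fuses into one dec-and-branch uop):

    .top:
        ; ... loop body ...
        dec   ecx      ; sets flags, unlike loop
        jnz   .top     ; macro-fuses with the dec into a single uop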

loop on various microarchitectures, from Agner Fog's instruction tables:

  • K8/K10: 7 m-ops

  • Bulldozer-family/Ryzen: 1 m-op (same cost as macro-fused test-and-branch, or jecxz)

  • P4: 4 uops (same as jecxz)

  • P6 (PII/PIII): 8 uops

  • Pentium M, Core2: 11 uops

  • Nehalem: 6 uops. (11 for loope / loopne). Throughput = 4c (loop) or 7c (loope/ne).

  • SnB-family: 7 uops. (11 for loope / loopne). Throughput = one per 5 cycles, as much of a bottleneck as keeping your loop counter in memory! jecxz is only 2 uops with the same throughput as regular jcc.

  • Silvermont: 7 uops

  • AMD Jaguar (low-power): 8 uops, 5c throughput

  • Via Nano3000: 2 uops


Couldn't the decoders just decode loop the same as lea rcx, [rcx-1] / jrcxz? That would be 3 uops. At least, that would be the case with no address-size prefix; otherwise it has to use ecx and truncate RIP to EIP if the jump is taken. Maybe the odd choice of address-size controlling the width of the decrement explains the many uops? (Fun fact: rep-string instructions have the same behaviour of using ecx with 32-bit address-size.)
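
Spelling out that decomposition (a hypothetical decoding, not what any real decoder emits; written as actual instructions the branch sense has to be inverted, since jrcxz jumps when RCX is zero):

    .top:
        ; ... loop body ...
        lea   rcx, [rcx-1]   ; decrement without writing flags (1 uop)
        jrcxz .done          ; tests rcx directly, not ZF (2 uops on SnB)
        jmp   .top
    .done:

Internally, a jrcxz-flavoured uop with the opposite condition could drop the jmp, giving the 3 uops counted above.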

Or better, just decode it as a fused dec-and-branch that doesn't set flags? dec ecx / jnz on SnB decodes to a single uop (which does set flags).

I know that real code doesn't use it (because it's been slow since at least P5 or something), but AMD decided it was worth it to make it fast for Bulldozer. Probably because it was easy.


  • Would it be easy for SnB-family uarch to have fast loop? If so, why don't they? If not, why is it hard? A lot of decoder transistors? Or extra bits in a fused dec&branch uop to record that it doesn't set flags? What could those 7 uops be doing? It's a really simple instruction.

  • What's special about Bulldozer that made a fast loop easy / worth it? Or did AMD waste a bunch of transistors on making loop fast? If so, presumably someone thought it was a good idea.


If loop was fast, it would be perfect for BigInteger arbitrary-precision adc loops, to avoid partial-flag stalls / slowdowns (see my comments on my answer), or any other case where you want to loop without touching flags. It also has a minor code-size advantage over dec/jnz. (And dec/jnz only macro-fuses on SnB-family).
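
For example, the inner loop of a multi-precision add could look like this if loop were fast (a sketch with made-up register assignments; the point is that none of the loop overhead writes flags, so CF survives from one adc to the next):

    ; rdi = dst, rsi = src, rcx = limb count (hypothetical)
        clc                  ; start with carry clear
    .addloop:
        mov   rax, [rsi]
        adc   [rdi], rax     ; dst += src + CF; sets CF for the next limb
        lea   rsi, [rsi+8]   ; pointer increments via lea: no flag writes
        lea   rdi, [rdi+8]
        loop  .addloop       ; dec rcx + branch, leaving CF alone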

On modern CPUs where dec/jnz is ok in an ADC loop, loop would still be nice for ADCX / ADOX loops (to preserve OF).
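
Something like this sketch (made-up operands, not tuned code): adcx carries one chain through CF and adox a second chain through OF, and loop would leave both alone:

    ; requires ADX (Broadwell and later)
    .top:
        adcx  rax, [rsi]     ; chain 1: reads/writes only CF
        adox  rbx, [rdx]     ; chain 2: reads/writes only OF
        lea   rsi, [rsi+8]
        lea   rdx, [rdx+8]
        loop  .top           ; touches neither CF nor OF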

If loop had been fast, compilers would already be using it as a peephole optimization for code-size + speed on CPUs without macro-fusion.
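
The size difference, from the instruction encodings (rel8 branch targets assumed in range):

    ; 32-bit mode:
        dec  ecx             ; 1 byte  (49)
        jnz  .top            ; 2 bytes (75 cb)  -> 3 bytes of loop overhead
        loop .top            ; 2 bytes (E2 cb)  -> 2 bytes of loop overhead
    ; in 64-bit mode dec ecx is FF C9 (2 bytes), so dec/jnz costs 4 bytes vs. loop's 2.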


It wouldn't stop me from getting annoyed at all the questions with bad 16-bit code that uses loop for every loop, even when they also need another counter inside the loop. But at least it wouldn't be as bad.
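
The antipattern looks something like this (hypothetical constants; since loop hardwires CX as its counter, a second loop level forces save/restore juggling):

        mov  cx, ROWS        ; ROWS / COLS are made-up constants
    row:
        push cx              ; LOOP owns CX, so spill the outer counter...
        mov  cx, COLS
    col:
        ; ... per-element work ...
        loop col
        pop  cx              ; ...then reload it for the outer LOOP
        loop row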

Overlying answered 2/3, 2016 at 9:1 Comment(28)
It's funny that AMD themselves recommends avoiding the LOOP instruction when optimizing for Bulldozer.Sadiesadira
@Michael: Maybe it doesn't branch-predict the same way? IDK. I found some speculation and plausible theories on groups.google.com/d/msg/comp.arch/5RN6EegUxE0/KETMqmKWVN4J. (Link to one of Paul Clayton's posts midway through. Scroll up for the start of the thread, which was an exact duplicate of my question). hurr durr google your questions >.<Overlying
One of the other answers says: "LOOP became slow on some of the earliest machines (circa 486) when significant pipelining started to happen, and running any but the simplest instruction down the pipeline efficiently was technologically impractical. So LOOP was slow for a number of generations. So nobody used it. So when it became possible to speed it up, there was no real incentive to do so, since nobody was actually using it. " So, if the compilers have stopped using the instruction, why bother to improve it now? It would not improve the benchmarks for a new CPU...Gowen
" it's not worth speeding it up, 'cause no one uses it 'cause it's slow? " that's genius :-)Cookbook
@BoPersson: If it had been efficient again on P6, compilers would already be using it, and saving a couple code bytes. (And before macro-fused dec-and-branch, saving uops too if it was single-uop). This only applies to the rare cases where a compiler can transform the loop counter into a count-down, since most programmers write their loops to count up. Even without loop, at the asm level, counting down to zero is slightly more efficient, because the decrement will set the zero flag without needing a compare. I still usually write my C loops from 0..n, for readability though.Overlying
Looping over a buffer forwards in one loop, then backwards in the next loop, is probably the ideal case for caching, though. In theory you'll always get a whole cache-size block of cache hits at the turn-around end of the buffer, instead of getting no hits when the array is slightly too big (and the beginning is evicted by the time you get to the end). Hardware prefetchers recognize forward and backward streams, so you're not missing out on that (I checked, and this is true for at least SnB-family. HW prefetchers might have fewer backwards slots on older CPUs, I forget.)Overlying
@PeterCordes I worked at Nexgen for a short while, then at AMD on the K6, K6-2, and Athlon processors. One problem I recall with the LOOP instruction is that fast implementations of it would cause certain existing software (more than one program) to malfunction that used LOOP for delay loops to implement micro-delays, e.g. in driver software. As I recall (but my memory is hazy and I don't have time to find references) both Nexgen and Cyrix fell into that trap, ca. 1995. Smart CPU architects only make the same mistake once, so subsequent CPUs kept LOOP slow on purpose.Jollenta
@njuffa: Ah, I hadn't thought of correctness problems with drivers. Timing problems have been mentioned as one of the reasons, but I had been thinking of games or something that would run too fast, and variable CPU speed makes that obsolete. But if driver delays can be shorter on faster CPUs, that makes sense. (Or if they calibrate their delay loops at startup, if fast loop made the necessary count overflow?) Given that AMD has once again tempted fate with fast loop, I think it's safe to assume that kind of delay loop is fully dead in the age of DVFS power-saving/turbo CPU clocks.Overlying
@PeterCordes Nexgen's Nx586 had patchable microcode, stored in the SBIOS, so fixing the issue with the fast LOOP instruction required nothing more than a BIOS update, as I recall. I am under the impression that patchable microcode is a standard feature on x86 processors these days, so it doesn't take much bravery to try a fast LOOP. Those delay loops probably died out with DOS and Win16 but for the Athlon processor we stuck with a slow LOOP implementation to avoid unnecessary risk: software has a tendency to live longer than hardware.Jollenta
@njuffa: IDK if Bulldozer's loop instruction could be changed with microcode. Yes, Intel and AMD have patchable microcode (and yes there are actual bugfixes in updates for Skylake, for example!). But not everything is microcoded. I suspect loop might be hard-wired. In AMD terminology, it's a "DirectPath Single" instruction, decodeable by any of the 4 decoders into a single macro-op. Only VectorPath instructions (more than 2 m-ops) get uops from a ucode ROM. (superuser.com/q/360456/20798). (Intel is similar, 4 uops and less are decoded directly).Overlying
@njuffa: I'm guessing NX586's LOOP was multiple uops and came from ROM anyway, so you could easily make it slower? Microcode updates can often only fix things by turning off whole features. e.g. Skylake has a bug with partial-register renaming and merging uops, and the update to fix that disables the loop buffer entirely (so even tiny loops have to fetch uops from the L0 uop cache, instead of recycling the buffer that feeds the issue stage). Fortunately Skylake just beefed up the front-end, so it's not a bottleneck, prob. just a minor power penalty.Overlying
@PeterCordes Nx586's LOOP instruction was microcoded, thus the ease of slowing it down. DirectPath is AMD terminology for an instruction implemented directly in hardware, while VectorPath refers to microcoded instructions (I was a microcoder for the Athlon processor, where that same terminology was used twenty years ago). Whether DirectPath instructions on modern AMD processors could be re-vectored to microcode for bug-fixing purposes, I do not know; generally speaking it is certainly technically feasible to design-in such a feature (for a small number of instructions).Jollenta
@PeterCordes, regarding and the update to fix that disables the loop buffer entirely - do you have a reference for that claim? It would be a big deal, but I didn't see any confirmation yet. Update: I found this.Bronchia
@BeeOnRope: perf counters on my desktop. I meant to mention that in an update to my SKL partial-regs answer. Everything I've profiled since actually enabling Arch Linux to update the ucode has shown exactly 0 counts for lsd.uops. Even non-microbench things (like ocperf.py -p some-PID) never have any counts. Either that perf counter is now broken, or they disabled the LSD. I've read that SKL-X doesn't use the LSD, and this discovery explains why: it shipped with new enough ucode to disable the LSD. (update: found the same link you did on wikichip).Overlying
IMO that's a big deal.Bronchia
@BeeOnRope: Yeah, it is, but I think the effect is small to nonexistent most of the time. The LSD only worked for uops that are contained in the uop cache, and SKL has excellent uop-cache read bandwidth. Unless your code fits very poorly in the uop cache and could otherwise sustain 4 uops per clock, it's not a real bottleneck.Overlying
@PeterCordes - right, I would guess that performance-wise it's actually a pessimization more often than a benefit, but it's there to save power, right? It seems like a non-trivial amount of complexity and validation effort, so I assume it must have some reasonable power benefit. With very high probability most people will never run into this bug (due to the specific high-reg use that triggers it), so paying any price is kind of unfortunate.Bronchia
@BeeOnRope: Yeah, I think the main benefit in SKL was power. On HSW, it could sometimes be a perf boost, I think. I still haven't tested exactly when uop-cache read can be a bottleneck on NHM (e.g. with 5 uops per line?), i.e. what kind of buffer there is ahead of that "4 uops per clock from the DSB" limit on HSW. They kept the LSD from NHM where it was definitely a big boost (no uop cache), but probably a lot of it had to be re-implemented for SnB. Still, IDK if they would have designed it from scratch for SnB if they didn't already have it from NHM.Overlying
As of KBY (Kaby Lake) and APL (Apollo Lake) nothing seems to have changed: uops.info/html-instr/LOOP-786.htmlElinaelinor
Not sure if it is relevant here, but Intel actually failed at macro-fusing jcc. See Mitigations for Jump Conditional Code Erratum on Intel website or /QIntel-jcc-erratum MSVC switch for example. I thought that loop would have been free from this failure.Marcmarcano
@AlexGuteniev: That's not a failure of macro-fusion, that's a failure of the uop cache whether the JCC is macro-fused or not. It also includes all types of jumps, including call, ret, and jmp, and presumably also jrcxz. loop is micro-coded on SnB-family (more than 4 uops, so it has to activate the ucode sequencer) so it might be different. But it's unlikely to be worth using for performance vs. padding with a long NOP so a dec/jcc doesn't touch a 32-byte boundary. That microcode update side-effect sucks a lot, making it much harder to tune for SnB-family than previously :(Overlying
I'm confused. stackoverflow.com/questions/62305998/…Marcmarcano
Is this still true for 16-bit real mode on modern hardware, or no?Pimiento
@puppydrum64: Yes, instructions decode to uops the same way regardless of mode (except for things like mov Sreg, reg of course, since real vs. protected mode includes a difference in meaning for that). Otherwise it only depends on the operand-size and address-size of the instruction. Real mode (or 16-bit protected / compat mode) imply a different default for those, with 66h / 67h setting the other, so add ax, cx has different machine code when assembled for real vs. long mode, but once decoded runs identically in the pipeline. Same for loop.Overlying
I'm sorry, I phrased that wrong. What I meant to ask was, is loop still slow in 16-bit real mode of a 32-bit or higher CPU, for the same reasons it is in long mode? Or does it behave identically to loop on an actual 8086?Pimiento
@puppydrum64: That's what I was answering. On a Skylake in 16-bit real mode, loop decodes to 7 uops, dec cx/jnz decodes to 1. Because (except for writing Sregs), instructions decode to the same uops as they would in other modes. And those uops run on the same out-of-order back-end machinery. 16-bit code does tend to have more false dependencies from writing 16-bit registers with mov, but loop itself is an RMW of CX (or ECX in real mode with a 67h prefix), so it already has a dependency on the register it modifies. (Unlike mov cx, dx)Overlying
Interesting. If I'm writing code to be run on an actual 8086 though, is it ok to use loop, or is it still slower?Pimiento
@puppydrum64: loop is more efficient on 8086 because it's smaller and not artificially slow. www2.math.uni-wuppertal.de/~fpf/Uebungen/GdR-SS02/opcode_i.html. See Increasing Efficiency of binary -> gray code for 8086 re: optimizing for 8086 and 8088 where memory access (including code-fetch) is the primary bottleneck for those CPUs without cache and slow narrow busses, especially 8088.Overlying

In 1988, IBM fellow Glenn Henry had just come on board at Dell, which had a few hundred employees at the time, and in his first month he gave a tech talk about 386 internals. A bunch of us BIOS programmers had been wondering why LOOP was slower than DEC/JNZ so during the question/answer section somebody posed the question.

His answer made sense. It had to do with paging.

LOOP consists of two parts: decrementing CX, then jumping if CX is not zero. The first part cannot cause a processor exception, whereas the jump part can. For one, you could jump (or fall through) to an address outside segment boundaries, causing a SEGFAULT. For two, you could jump to a page that is swapped out.

A SEGFAULT usually spells the end for a process, but page faults are different. When a page fault occurs, the processor throws an exception, and the OS does the housekeeping to swap in the page from disk into RAM. After that, it restarts the instruction that caused the fault.

Restarting means restoring the state of the process to what it was just before the offending instruction. In the case of the LOOP instruction in particular, it meant restoring the value of the CX register. One might think you could just add 1 to CX, since we know CX got decremented, but apparently, it's not that simple. For example, check out this erratum from Intel:

The protection violations involved usually indicate a probable software bug and restart is not desired if one of these violations occurs. In a Protected Mode 80286 system with wait states during any bus cycles, when certain protection violations are detected by the 80286 component, and the component transfers control to the exception handling routine, the contents of the CX register may be unreliable. (Whether CX contents are changed is a function of bus activity at the time internal microcode detects the protection violation.)

To be safe, they needed to save the value of CX on every iteration of a LOOP instruction, in order to reliably restore it if needed.

It's this extra burden of saving CX that made LOOP so slow.
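
So each iteration may have needed a micro-flow along these lines (pure guesswork, in the spirit of the pseudo-µop sketches elsewhere on this page, not Intel's actual microcode):

    saved_cx = rCX;   // snapshot so a faulting jump can restore the counter
    rCX = rCX - 1;    // the decrement itself can't fault
    branch            // the jump can fault (#GP, or a page fault on the target);
                      // on restart, rCX = saved_cx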

Intel, like everyone else at the time, was moving more and more toward RISC. The old CISC instructions (LOOP, ENTER, LEAVE, BOUND) were being phased out. We still used them in hand-coded assembly, but compilers ignored them completely.

Aman answered 25/10, 2018 at 2:26 Comment(7)
Thanks for the historical answer for 386; it obviously doesn't still apply to Sandybridge-family, where dec ecx / jnz decodes as a single uop that decrements and branches. Interesting that it wasn't purely intentionally slow to try to avoid breaking delay loops.Overlying
I'm surprised though; I thought code-fetch from an invalid page would give you a page fault with EIP = the jump target, so rerunning the jump instruction itself wouldn't happen. But maybe Intel built the check into the jump instruction? And if fall-through can do it, too, then any instruction has that potential problem at the end of a page. (Unless I'm mistaken, logically in x86 a jump to an invalid page succeeds and doesn't itself fault, but then code fetch from that new address can fault.) Still, +1 because the 286 erratum is some solid evidence that there's a real thing here.Overlying
The LOOP instruction itself can't cause a page fault. If the destination page isn't mapped, the page fault occurs with CS:EIP set to the destination and ECX updated. The LOOP instruction can however cause a general protection (#GP) fault if the destination is outside the CS segment limit, and in that case ECX needs to be left unmodified. However the easiest way to implement this is to jump only if (ECX - 1) != 0, check the segment bounds, and then decrement ECX. See the Intel Software Developer's Manual entry for LOOP to see the details of how this works.Atmospherics
Thanks @Ross, I wondered if segment limits might work differently from paging. That does explain needing multiple internal steps.Overlying
Actually, reading the manual more carefully, the Operation section suggests that ECX would be changed if the LOOP instruction causes a #GP fault, so I'm not sure what is actually the case.Atmospherics
Then why is call preferred over push $+X, call Y, where X is the return address for call? Since a call essentially does two things: pushing the return address to the stack, and jumping to an address, it should have the same problem as the loop instruction does, shouldn't it?Amalburga
Feels wrong. jmp into nilspace doesn't fault at the site of the jmp but at the nilspace.Enplane

Now that I googled after writing my question, it turns out to be an exact duplicate of one on comp.arch, which came up right away. I expected it to be hard to google (lots of "why is my loop slow" hits), but my first try (why is the x86 loop instruction slow) got results.

This is not a good or complete answer.

It might be the best we'll get, and will have to suffice unless someone can shed some more light on it. I didn't set out to write this as an answer-my-own-question post.


Good posts with different theories in that thread:

Robert

LOOP became slow on some of the earliest machines (circa 486) when significant pipelining started to happen, and running any but the simplest instruction down the pipeline efficiently was technologically impractical. So LOOP was slow for a number of generations. So nobody used it. So when it became possible to speed it up, there was no real incentive to do so, since nobody was actually using it.


Anton Ertl:

IIRC LOOP was used in some software for timing loops; there was (important) software that did not work on CPUs where LOOP was too fast (this was in the early 90s or so). So CPU makers learned to make LOOP slow.


(Paul, and anyone else: You're welcome to re-post your own writing as your own answer. I'll remove it from my answer and up-vote yours.)

@Paul A. Clayton (occasional SO poster and CPU architecture guy) took a guess at how you could use that many uops. (This looks like loope/loopne, which checks both the counter and ZF):

I could imagine a possibly sensible 6-µop version:

virtual_cc = cc; 
temp = test (cc); 
rCX = rCX - temp; // also setting cc 
cc = temp & cc; // assumes branch handling is not 
       // substantially changed for the sake of LOOP 
branch 
cc = virtual_cc 

(Note that this is 6 uops, not SnB's 11 for LOOPE/LOOPNE, and is a total guess not even trying to take into account anything known from SnB perf counters.)

Then Paul said:

I agree that a shorter sequence should be possible, but I was trying to think of a bloated sequence that might make sense if minimal microarchitectural adjustments were permitted.

summary: The designers wanted loop to be supported only via microcode, with no adjustments whatsoever to the hardware proper.

If a useless, compatibility-only instruction is handed to the microcode developers, they might reasonably not be able or willing to suggest minor changes to the internal microarchitecture to improve such an instruction. Not only would they rather use their "change suggestion capital" more productively but the suggestion of a change for a useless case would reduce the credibility of other suggestions.

(My opinion: Intel is probably still making it slow on purpose, and hasn't bothered to rewrite their microcode for it for a long time. Modern CPUs are probably too fast for anything using loop in a naive way to work correctly.)

... Paul continues:

The architects behind Nano may have found avoiding the special casing of LOOP simplified their design in terms of area or power. Or they may have had incentives from embedded users to provide a fast implementation (for code density benefits). Those are just WILD guesses.

If optimization of LOOP fell out of other optimizations (like fusion of compare and branch), it might be easier to tweak LOOP into a fast path instruction than to handle it in microcode even if the performance of LOOP was unimportant.

I suspect that such decisions are based on specific details of the implementation. Information about such details does not seem to be generally available and interpreting such information would be beyond the skill level of most people. (I am not a hardware designer--and have never played one on television or stayed at a Holiday Inn Express. :-)


The thread then went off-topic into the realm of AMD blowing our one chance to clean up the cruft in x86 instruction encoding. It's hard to blame them, since every change is a case where the decoders can't share transistors. And before Intel adopted x86-64, it wasn't even clear that it would catch on. AMD didn't want to burden their CPUs with hardware nobody used if AMD64 didn't catch on.

But still, there are so many small things: setcc could have been changed to 32 bits. (Usually you have to use xor-zero / test / setcc to avoid false dependencies, or because you need a zero-extended reg). Shift could have unconditionally written flags, even with zero shift count (removing the input data dependency on eflags for variable-count shift for OOO execution). Last time I typed this list of pet peeves, I think there was a third one... Oh yeah, bt / bts etc. with memory operands has the address dependent on the upper bits of the index (bit string, not just bit within a machine word).
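
(For reference, the xor-zero / setcc idiom I mean:)

    ; get a zero-extended 0/1 from a compare, working around 8-bit-only setcc:
        xor   eax, eax   ; zero the full register first; also avoids a false
                         ; dependency on the old value of rax
        cmp   edi, esi
        setl  al         ; writes only AL; the upper bits are already zero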

bts instructions are very useful for bit-field stuff, and are slower than they need to be, so you almost always want to load into a register and then use that. (It's usually faster to shift/mask to get an address yourself, instead of using the 10-uop bts [mem], reg on Skylake, but it does take extra instructions. So it made sense on 386, but not on K8). Atomic bit-manipulation has to use the memory-dest form, but the locked version needs lots of uops anyway. It's still slower than if it couldn't access outside the dword it's operating on.
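
The shift/mask replacement for memory-destination bts looks something like this (made-up registers: bit index in rcx, bit-array base in rdi):

    ; set bit rcx of a big bit-array at [rdi] without bts [mem], reg:
        mov   rax, rcx
        shr   rax, 6               ; qword index = bit / 64
        mov   edx, 1
        shl   rdx, cl              ; 1 << (bit & 63); shift counts mask mod 64 anyway
        or    [rdi + rax*8], rdx   ; plain RMW sets the bit
    ; with a register destination, bts itself is cheap (index taken mod 64):
        bts   rax, rcx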

Overlying answered 2/3, 2016 at 9:52 Comment(6)
My understanding is basically what Robert said. The LOOP instruction has been slower than DEC/JNZ since the '386. Even on the '86 and '286 it was only 2 and 1 cycles faster, which meant on those processors using the more restrictive LOOP instruction was often a mistake. I'm not sure if any of the common 16-bit compilers of the time ever generated the instruction. Even today I think it would be hard to write a compiler that would use it effectively. So no code uses it, and even if they did improve the instruction, it's not clear if it would actually start getting used.Atmospherics
@RossRidge and future readers: One case where it'd be great is for avoiding partial-flags problems in an adc loop. A cheap way to loop without touching flags is exactly what you want for arbitrary size BigInteger adc loops. So AMD Bulldozer-family has a solid advantage there, even vs. Intel Broadwell and later where adc is a 1-uop insn. Compilers can already put byte counts into ecx for rep stos and so on; I don't think it'd be that hard to use.Overlying
Yah, hand-optimized assembly code like that is where it could end up being used. Still, I'm not sure if assembly programmers would find enough of an opportunity to use it to make the engineering effort worthwhile.Atmospherics
@RossRidge: Good point that compilers rarely generate adc loops (usually just a single adc for __int128_t or int64_t). I assume Intel cares some about arbitrary-precision integers. gmplib.org has been around for a long time, and public-key crypto is a big deal. Math on large integers is not uncommon.Overlying
Actually, I'm overstating the case a bit. A dec/jcc unrolled by 2 or 4 on SnB-family microarchitectures does pretty well. Apparently it adds a single extra uop to merge the flags when the next adc reads them, so a 1uop loop would only save 1uop. But that's only if you're willing to use code that performs badly on pre-SnB (Nehalem). Otherwise, saving/restoring flags around cmp/jcc (with lahf/sahf) costs 2 extra uops. And looping with adcx / adox (new with broadwell) to do two dep chains in parallel requires a loop that doesn't affect flags. (lahf doesn't do OF.)Overlying
@PeterCordes This is interesting. I previously thought loop would decode into a single dec/jnz-style uop. If it is slower than dec/jnz then the only difference between the two that I can think of is that it doesn't macro-fuse. Right? By definition macro-fusion requires 2 instructions being directed to the same decoder. Perhaps the macro-fusion logic requires there to be 2 input instructions and cannot produce a macro-fused uop otherwise. Perhaps that logic is kept separate from single instructions, and single instructions just don't issue a macro-fused uop?Oniskey

Please see the nice article by Michael Abrash, published in Dr. Dobb's Journal, March 1991, v16 n3 p16(8): http://archive.gamedev.net/archive/reference/articles/article369.html

The summary of the article is the following:

Optimizing code for 8088, 80286, 80386 and 80486 microprocessors is difficult because the chips use significantly different memory architectures and instruction execution times. Code cannot be optimized for the 80x86 family; rather, code must be designed to produce good performance on a range of systems or optimized for particular combinations of processors and memory. Programmers must avoid the unusual instructions supported by the 8088, which have lost their performance edge in subsequent chips. String instructions should be used but not relied upon. Registers should be used rather than memory operations. Branching is also slow for all four processors. Memory accesses should be aligned to improve performance. Generally, optimizing an 80486 requires exactly the opposite steps as optimizing an 8088.

By "unusual instructions supported by the 8088" the author also means "loop":

Any 8088 programmer would instinctively replace:

    DEC CX
    JNZ LOOPTOP

with:

    LOOP LOOPTOP

because LOOP is significantly faster on the 8088. LOOP is also faster on the 286. On the 386, however, LOOP is actually two cycles slower than DEC/JNZ. The pendulum swings still further on the 486, where LOOP is about twice as slow as DEC/JNZ--and, mind you, we're talking about what was originally perhaps the most obvious optimization in the entire 80x86 instruction set.

This is a very good article, and I highly recommend it. Even though it was published in 1991, it is surprisingly highly relevant today.

But this article just gives advice; it encourages testing execution speed and choosing the faster variants. It doesn't explain WHY some instructions became very slow, so it doesn't fully address your question.

The answer is that earlier processors, like 80386 (released in 1985) and before, executed instructions one-by-one, sequentially.

Later processors started to use instruction pipelining – initially simple, with the 80486 – and finally the Pentium Pro (released in 1995) introduced a radically different internal pipeline, calling it the Out Of Order (OOO) core, where instructions were transformed into small fragments of operations called micro-ops or µops, and then all the micro-ops of different instructions were put into a large pool of micro-ops where they could execute simultaneously as long as they did not depend on one another. This OOO pipeline principle is still used, almost unchanged, on modern processors. You can find more information about instruction pipelining in this brilliant article: https://www.gamedev.net/resources/_/technical/general-programming/a-journey-through-the-cpu-pipeline-r3115

In order to simplify chip design, Intel decided to build processors in such a way that some instructions transform to micro-ops in a very efficient way, while others do not.

Efficient conversion from instructions to micro-ops requires more transistors, so Intel decided to save on transistors at the cost of slower decoding and execution of some “complex” or “rarely-used” instructions.

For example, the “Intel® Architecture Optimization Reference Manual” http://download.intel.com/design/PentiumII/manuals/24512701.pdf mentions the following: “Avoid using complex instructions (for example, enter, leave, or loop) that generally have more than four µops and require multiple cycles to decode. Use sequences of simple instructions instead.”

So, Intel somehow decided that the “loop” instruction is “complex”, and since then it has been very slow. However, there is no official Intel reference on instruction breakdowns: how many micro-ops each instruction produces, or how many cycles are required to decode it.

You can also read about The Out-of-Order Execution Engine in the "Intel® 64 and IA-32 Architectures Optimization Reference Manual" http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf section the 2.1.2.

Chromosome answered 8/5, 2017 at 11:51 Comment(5)
P6's decoding to uops explains why LOOP is slow in PPRO, but Sandybridge decodes dec rcx / jnz looptop as a single uop (macro-fusion). The question is why is LOOP still slow on Sandybridge when it's possible for a single uop to do everything that LOOP does (except for leave the flags unmodified).Overlying
The early part of this answer does give a good summary of why Intel didn't even try to make LOOP efficient on P6, though: it was already slow so nobody used it on 486 and 586, so it wasn't worth spending any transistors on making it fast. First-gen P6 had far fewer transistors to play with than Sandybridge.Overlying
As far as how many cycles it takes to decode and execute, Agner Fog's experimental testing tells us that it can execute with a throughput of one per 5 cycles on Skylake. It produces multiple uops, so it has to be decoded by the first (complex) decoder, but then decodes in a single cycle. Since it produces more than 4 uops (7 on Skylake), the uops are read from the microcode ROM. Switching from uop-cache to microcode can slow down the front-end (stackoverflow.com/questions/26907523/…).Overlying
@Peter Cordes - maybe LOOP just translates to two or even one micro-op, but my idea is not that these micro-ops execute slowly. The idea is that the process of translating the LOOP instruction to micro-ops is very slow, because Intel wanted to save on transistors.Chromosome
We know it decodes to 7 uops on SnB-family, and we also know how the decoders / uop-cache / microcode ROM work, in enough detail to rule out your theory. There are CPU performance counters for lots of events, and Intel has published some information on their CPU internals. Agner Fog has used this information + his own experiments to write up detailed descriptions of CPU microarchitectures. See his microarch.pdf at agner.org/optimize, and other stuff in the x86 tag wikiOverlying
