`LOOP` (Intel ref manual entry) decrements `ecx`/`rcx`, and then jumps if non-zero. It's slow, but couldn't Intel have cheaply made it fast? `dec`/`jnz` already macro-fuses into a single uop on Sandybridge-family; the only difference is that the fused pair sets flags.
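For reference, this is the idiom `loop` competes against; a minimal sketch in NASM syntax (the register choice and count are illustrative):

```nasm
    mov  ecx, 100          ; loop counter
top:
    ; ... loop body ...
    dec  ecx               ; writes ZF (and all flags except CF)
    jnz  top               ; macro-fuses with the dec into one uop on SnB-family
```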
`loop` on various microarchitectures, from Agner Fog's instruction tables:
- K8/K10: 7 m-ops
- Bulldozer-family/Ryzen: 1 m-op (same cost as a macro-fused test-and-branch, or as `jecxz`)
- P4: 4 uops (same as `jecxz`)
- P6 (PII/PIII): 8 uops
- Pentium M, Core2: 11 uops
- Nehalem: 6 uops (11 for `loope`/`loopne`). Throughput = 4c (`loop`) or 7c (`loope`/`loopne`).
- SnB-family: 7 uops (11 for `loope`/`loopne`). Throughput = one per 5 cycles, as much of a bottleneck as keeping your loop counter in memory! `jecxz` is only 2 uops, with the same throughput as a regular `jcc`.
- Silvermont: 7 uops
- AMD Jaguar (low-power): 8 uops, 5c throughput
- Via Nano3000: 2 uops
Couldn't the decoders just decode the same as `lea rcx, [rcx-1]` / `jrcxz`? That would be 3 uops. At least that would be the case with no address-size prefix; otherwise it has to use `ecx` and truncate `RIP` to `EIP` if the jump is taken. Maybe the odd choice of address-size controlling the width of the decrement explains the many uops? (Fun fact: `rep`-string instructions have the same behaviour, using `ecx` with 32-bit address-size.)
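Written out by hand, the flag-preserving equivalent looks like this (NASM syntax; internally the decoder could invert the `jrcxz` condition, so the extra `jmp` wouldn't be needed):

```nasm
top:
    ; ... loop body ...
    lea   rcx, [rcx-1]     ; decrement without writing any flags (1 uop)
    jrcxz done             ; branch only once RCX reaches zero (2 uops)
    jmp   top              ; otherwise keep looping
done:
```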
Or better, just decode it as a fused dec-and-branch that doesn't set flags? `dec ecx` / `jnz` on SnB decodes to a single uop (which does set flags).
I know that real code doesn't use it (because it's been slow since at least P5 or something), but AMD decided it was worth it to make it fast for Bulldozer. Probably because it was easy.
Would it be easy for the SnB-family uarch to have a fast `loop`? If so, why don't they? If not, why is it hard? A lot of decoder transistors? Or extra bits in a fused dec&branch uop to record that it doesn't set flags? What could those 7 uops be doing? It's a really simple instruction.

What's special about Bulldozer that made a fast `loop` easy / worth it? Or did AMD waste a bunch of transistors on making `loop` fast? If so, presumably someone thought it was a good idea.
If `loop` was fast, it would be perfect for BigInteger arbitrary-precision `adc` loops, to avoid partial-flag stalls / slowdowns (see my comments on my answer), or any other case where you want to loop without touching flags. It also has a minor code-size advantage over `dec`/`jnz`. (And `dec`/`jnz` only macro-fuses on SnB-family.)
On modern CPUs where `dec`/`jnz` is ok in an ADC loop, `loop` would still be nice for ADCX / ADOX loops (to preserve OF, which `dec` writes). A sketch of the flag-preservation point follows below.
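As a minimal sketch, assuming `loop` were fast enough to use: an n-limb add where CF has to survive from one `adc` to the next (NASM syntax; the register/memory layout here is illustrative, not from the original post):

```nasm
; add the n-limb integer at [rsi] into [rdi]; limb count in rcx
    clc                    ; start with carry clear
add_limb:
    mov  rax, [rsi]
    adc  [rdi], rax        ; [rdi] += [rsi] + CF
    lea  rsi, [rsi+8]      ; lea pointer updates don't touch flags
    lea  rdi, [rdi+8]
    loop add_limb          ; decrement-and-branch without writing flags
```

With `dec`/`jnz` instead of `loop`, the next iteration's `adc` reads a CF that's older than the flags `dec` just wrote; as I understand it, that split is what triggers the partial-flag stalls or merging uops on older Intel CPUs.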
If `loop` had been fast, compilers would already be using it as a peephole optimization for code-size + speed on CPUs without macro-fusion.
It wouldn't stop me from getting annoyed at all the questions with bad 16-bit code that uses `loop` for every loop, even when they also need another counter inside the loop. But at least it wouldn't be as bad.
`LOOP` instruction when optimizing for Bulldozer. – Sadiesadira

Regarding `loop`: at the asm level, counting down to zero is slightly more efficient, because the decrement will set the zero flag without needing a compare. I still usually write my C loops from 0..n, for readability though. – Overlying
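To illustrate the count-down point from the comment above, a quick NASM sketch (labels and registers are illustrative):

```nasm
count_up:
    ; ... body indexed by esi ...
    inc  esi
    cmp  esi, edi          ; explicit compare against the limit
    jb   count_up

count_down:
    ; ... body ...
    dec  ecx               ; sets ZF itself when the count hits zero
    jnz  count_down        ; no separate cmp needed
```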
A problem with the `LOOP` instruction is that fast implementations of it would cause certain existing software (more than one program) to malfunction: software that used `LOOP` for delay loops to implement micro-delays, e.g. in driver software. As I recall (but my memory is hazy and I don't have time to find references), both Nexgen and Cyrix fell into that trap, ca. 1995. Smart CPU architects only make the same mistake once, so subsequent CPUs kept `LOOP` slow on purpose. – Jollenta

(Until a fast `loop` made the necessary count overflow?) Given that AMD has once again tempted fate with a fast `loop`, I think it's safe to assume that kind of delay loop is fully dead in the age of DVFS power-saving/turbo CPU clocks. – Overlying
Slowing the `LOOP` instruction required nothing more than a BIOS update, as I recall. I am under the impression that patchable microcode is a standard feature on x86 processors these days, so it doesn't take much bravery to try a fast `LOOP`. Those delay loops probably died out with DOS and Win16, but for the Athlon processor we stuck with a slow `LOOP` implementation to avoid unnecessary risk: software has a tendency to live longer than hardware. – Jollenta

I'm not sure the `loop` instruction could be changed with microcode. Yes, Intel and AMD have patchable microcode (and yes, there are actual bugfixes in updates for Skylake, for example!). But not everything is microcoded. I suspect `loop` might be hard-wired. In AMD terminology, it's a "DirectPath Single" instruction, decodable by any of the 4 decoders into a single macro-op. Only VectorPath instructions (more than 2 m-ops) get uops from a ucode ROM (superuser.com/q/360456/20798). (Intel is similar: 4 uops and fewer are decoded directly.) – Overlying
Or was `LOOP` multiple uops that came from ROM anyway, so you could easily make it slower? Microcode updates can often only fix things by turning off whole features, e.g. Skylake has a bug with partial-register renaming and merging uops, and the update to fix that disables the loop buffer entirely (so even tiny loops have to fetch uops from the L0 uop cache, instead of recycling the buffer that feeds the issue stage). Fortunately Skylake just beefed up the front-end, so it's not a bottleneck, prob. just a minor power penalty. – Overlying

The `LOOP` instruction was microcoded, thus the ease of slowing it down. DirectPath is AMD terminology for an instruction implemented directly in hardware, while VectorPath refers to microcoded instructions (I was a microcoder for the Athlon processor, where that same terminology was used twenty years ago). Whether DirectPath instructions on modern AMD processors could be re-vectored to microcode for bug-fixing purposes, I do not know; generally speaking, it is certainly technically feasible to design in such a feature (for a small number of instructions). – Jollenta
I get 0 counts for `lsd.uops`. Even non-microbench things (like `ocperf.py -p some-PID`) never have any counts. Either that perf counter is now broken, or they disabled the LSD. I've read that SKL-X doesn't use the LSD, and this discovery explains why: it shipped with new enough ucode to disable the LSD. (Update: found the same link you did on wikichip.) – Overlying

There has been a slowdown for `jcc`. See "Mitigations for Jump Conditional Code Erratum" on the Intel website, or the `/QIntel-jcc-erratum` MSVC switch, for example. I thought that `loop` would have been free from this failure. – Marcmarcano
The JCC erratum also affects `call`, `ret`, and `jmp`, and presumably also `jrcxz`. `loop` is micro-coded on SnB-family (more than 4 uops, so it has to activate the ucode sequencer), so it might be different. But it's unlikely to be worth using for performance vs. padding with a long NOP so a `dec`/`jcc` doesn't touch a 32-byte boundary. That microcode update side-effect sucks a lot, making it much harder to tune for SnB-family than previously :( – Overlying

(Except for `mov Sreg, reg`, of course, since real vs. protected mode includes a difference in meaning for that.) Otherwise it only depends on the operand-size and address-size of the instruction. Real mode (or 16-bit protected / compat mode) implies a different default for those, with `66h`/`67h` prefixes setting the other, so `add ax, cx` has different machine code when assembled for real vs. long mode, but once decoded it runs identically in the pipeline. Same for `loop`. – Overlying
Is `loop` still slow in 16-bit real mode on a 32-bit or higher CPU, for the same reasons it is in long mode? Or does it behave identically to `loop` on an actual 8086? – Pimiento
`loop` decodes to 7 uops; `dec cx`/`jnz` decodes to 1. Because (except for writing Sregs) instructions decode to the same uops as they would in other modes, and those uops run on the same out-of-order back-end machinery. 16-bit code does tend to have more false dependencies from writing 16-bit registers with `mov`, but `loop` itself is an RMW of CX (or ECX in real mode with a `67h` prefix), so it already has a dependency on the register it modifies. (Unlike `mov cx, dx`.) – Overlying
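A small NASM illustration of that dependency point (my own sketch, not from the comment):

```nasm
; a 16-bit partial-register write: the new CX merges with the old
; upper bits of the full register, so this mov also depends on the
; register's previous value (a "false" dependency):
    mov  cx, dx

; loop is a true read-modify-write of the count register, so its
; dependency on the old value is inherent, not a merging artifact:
back:
    ; ... loop body ...
    loop back
```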
So is `loop` efficient to use on an actual 8086, or is it still slower? – Pimiento

`loop` is more efficient on 8086 because it's smaller and not artificially slow (www2.math.uni-wuppertal.de/~fpf/Uebungen/GdR-SS02/opcode_i.html). See "Increasing Efficiency of binary -> gray code for 8086" re: optimizing for 8086 and 8088, where memory access (including code-fetch) is the primary bottleneck for those CPUs without cache and with slow, narrow busses, especially the 8088. – Overlying