Why is LOOP faster than DEC,JNZ on 8086?

My professor claimed that LOOP is faster on 8086 because only one instruction is fetched instead of two, like in dec cx, jnz. So I think we are saving time by avoiding the extra fetch and decode per iteration.

But earlier in the lecture, he also mentioned that LOOP does the same stuff as DEC, JNZ under the hood, and I presume that its decoding should also be more complex, so the speed difference should kind of balance out. Then, why is the LOOP instruction faster? I went through this post, and the answers there pertain to processors more modern than 8086, although one of the answers (and the page it links) does point out that on 8088 (closely related to 8086), LOOP is faster.

Later, the professor used the same reasoning to explain why rep string operations might be faster than LOOP + individual movement instructions, but since I was not entirely convinced with the previous approach, I asked this question here.
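For concreteness, here are the two forms being compared, as a minimal 8086 sketch (the labels and iteration count are my own; note that loop, unlike dec, leaves FLAGS untouched):

```nasm
; Variant 1: single 2-byte instruction (opcode E2 + rel8 displacement)
        mov  cx, 100        ; iteration count
top1:
        ; ... loop body ...
        loop top1           ; decrement CX, branch while CX != 0; FLAGS untouched

; Variant 2: the 3-byte discrete replacement
        mov  cx, 100
top2:
        ; ... loop body ...
        dec  cx             ; 1 byte; updates ZF/SF/PF/AF/OF (not CF)
        jnz  top2           ; 2 bytes (opcode 75 + rel8); taken while ZF is clear
```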

Adversary answered 14/2, 2022 at 19:17 Comment(6)
Do you mean dec ecx / jnz, in that order? That's the drop-in replacement for loop, other than modifying FLAGS. JMP is an unconditional jump.Peneus
@PeterCordes Yes, fixed it.Adversary
You do realize that there would be no single-instruction LOOP if it weren't "better" (smaller/faster) on the generation where it was introduced, right? If the Intel engineers hadn't found a way to have a faster single instruction than the DEC, JNZ pair, why would they have bothered to make an encoding for LOOP, taking up valuable space in the encoding table?Diagonal
@BenVoigt: It's still smaller, whether it was faster or not. Code-size was an even more valuable optimization goal more of the time back then. (Although that's a significant part of why it actually is faster, especially on 8088 with its 1-byte bus.)Peneus
@PeterCordes: Agreed. But the answer to the question is still "LOOP exists because some Intel engineer saw an opportunity to shave 1 byte and 2 cycles off a loop". It may be a surprise that such an opportunity existed, but the fact that someone went to the trouble to make an instruction for it only makes sense when the specialized instruction is smaller/faster/both.Diagonal
@BenVoigt: Oh right, your first comment said "smaller/faster" which could be an either or, not necessarily both. But this question is specifically asking why it's faster, without considering or caring about it being smaller in machine code. I guess that's why you commented instead of answering, since it's highly related but doesn't 100% answer the question :P And BTW, the 8086 ISA was almost entirely architected on paper by Stephen Morse, apparently before silicon design was done. He probably expected speed benefits, too, though.Peneus

It's not decode that's the problem, it's usually fetch on 8086.

Starting two separate instruction-decode operations probably is more expensive than just fetching more microcode for one loop instruction. I'd guess that's what accounts for the numbers in the table below that don't include code-fetch bottlenecks.

Equally or more importantly, 8086 is often bottlenecked by memory access, including code-fetch. (8088 almost always is, breathing through a straw with its 8-bit bus, unlike 8086's 16-bit bus.)

dec cx is 1 byte, jnz rel8 is 2 bytes.
So 3 bytes total, vs. 2 for loop rel8.

8086 performance can be approximated by counting memory accesses and multiplying by four, since its 6-byte instruction prefetch buffer allows it to overlap code-fetch with decode and execution of other instructions. (Except for very slow instructions like mul that would let the buffer fill up after at most three 2-byte fetches.)
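As a rough illustration of that rule of thumb, here is a toy Python model (my own sketch, not an Intel formula: it treats sustained throughput as the max of execution cycles and fetch cycles, and ignores the prefetch-buffer flush on taken branches):

```python
def per_iteration_cycles(code_bytes, exec_cycles, bytes_per_fetch=2):
    """Crude 8086/8088 throughput model: each code fetch costs 4 bus cycles.

    bytes_per_fetch is 2 on 8086 (16-bit bus) and 1 on 8088 (8-bit bus).
    """
    fetches = -(-code_bytes // bytes_per_fetch)  # ceiling division
    return max(exec_cycles, 4 * fetches)

# Loop overhead per iteration (taken branch), using the execution
# timings from the tables in this answer:
print(per_iteration_cycles(2, 17))  # loop rel8 (2 bytes, 17 cycles) -> 17
print(per_iteration_cycles(3, 19))  # dec cx + jnz (3 bytes, 3+16 cycles) -> 19
# For cheap instructions the fetch cost dominates: a 2-byte, 3-cycle
# add reg,reg sustains only one per 4 cycles on 8086:
print(per_iteration_cycles(2, 3))   # -> 4
```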

See also Increasing Efficiency of binary -> gray code for 8086 for an example of optimizing something for 8086, with links to more resources like tables of instruction timings.

https://www2.math.uni-wuppertal.de/~fpf/Uebungen/GdR-SS02/opcode_i.html has instruction timings for 8086 (taken from Intel manuals I think, as cited in njuffa's answer), but those are only execution, when fetch isn't a bottleneck. (i.e. just decoding from the prefetch buffer.)

Decode / execute timings, not including fetch:

DEC     Decrement

    operand     bytes   8088    186     286     386     486     Pentium
    r8           2       3       3       2       2       1       1   UV
    r16          1       3       3       2       2       1       1   UV
    r32          1       3       3       2       2       1       1   UV
    mem       2+d(0,2)  23+EA   15       7       6       3       3   UV

Jcc     Jump on condition code

    operand     bytes   8088    186     286     386     486     Pentium
    near8        2      4/16    4/13    3/7+m   3/7+m   1/3     1    PV
    near16       3       -       -       -      3/7+m   1/3     1    PV
LOOP    Loop control with CX counter

      operand   bytes   8088    186     286     386     486     Pentium
      short      2      5/17    5/15    4/8+m   11+m    6/7     5/6  NP

So even ignoring code-fetch differences:

  • dec + taken jnz takes 3 + 16 = 19 cycles to decode / exec on 8086 / 8088.
  • taken loop takes 17 cycles to decode / exec on 8086 / 8088.

(Taken branches are slow on 8086, and discard the prefetch buffer; there's no branch prediction. IDK if those timings include any of that penalty, since they apparently don't for other instructions and non-taken branches.)

8088/8086 are not pipelined except for the code-prefetch buffer. Finishing execution of one instruction and starting decode / exec of the next takes some time; even the cheapest instructions (like mov reg,reg / shift / rotate / stc/std / etc.) take 2 cycles. Bizarrely, that's less than nop (3 cycles).

Peneus answered 14/2, 2022 at 20:15 Comment(2)
I assumed the 3 cycles for nop is because it's really xchg ax, ax, and xchg ax, reg takes 3 cycles in all other cases.Tuppence
@NateEldredge: Oh right, not special-casing 0x90 makes perfect sense, and isn't necessary for correctness until x86-64. :PPeneus

I presume that its decoding should also be more complex

There's no reason the decoding should be more complex for the loop instruction. The instruction has to do multiple things, but decoding is not at issue: there's just the opcode and one operand, the branch target, so it decodes as easily as an unconditional jmp.

Saving one instruction's fetch & decode probably accounts for the speed improvement, since in execution they are effectively equivalent.

Saks answered 14/2, 2022 at 19:26 Comment(0)

Looking at the "8086/8088 User's Manual: Programmer's and Hardware Reference" (Intel 1989) confirms that LOOP is marginally faster than the combination DEC CX; JNZ. DEC takes 3 clock cycles, JNZ takes 4 (not taken) or 16 (taken) cycles. So the combination requires 7 or 19 cycles. LOOP on the other hand requires 5 cycles (not taken) or 17 cycles (taken), for a saving of 2 cycles.
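The arithmetic, spelled out (a trivial Python check of the figures just quoted):

```python
# Per-iteration cycle counts from the 8086/8088 User's Manual, as quoted above.
dec_cx = 3
jnz_taken, jnz_not_taken = 16, 4
loop_taken, loop_not_taken = 17, 5

combo_taken = dec_cx + jnz_taken          # 19 cycles
combo_not_taken = dec_cx + jnz_not_taken  # 7 cycles

print(combo_taken - loop_taken)           # -> 2
print(combo_not_taken - loop_not_taken)   # -> 2
```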

I do not see anything in the manual that describes why LOOP is faster. The faster instruction fetch due to the reduced number of opcode bytes seems like a reasonable hypothesis.

According to the "80286 and 80287 Programmer's Reference Manual" (Intel 1987), LOOP still has a slight advantage over the discrete replacement, in that it requires 8 cycles when taken and 4 cycles when not taken, while the combo requires 1 cycle more in both cases (DEC 2 cycles; JNZ 7 or 3 cycles).

The 8086 microcode has been disassembled, so one could theoretically take a look at the internal sequence of operations for both of these cases to establish exactly why LOOP is faster, if one is so inclined.

Framework answered 14/2, 2022 at 20:15 Comment(7)
"without having to take a detour through the CX register which in the combo case is first written to, then read from again" what? JNZ looks at the FLAGS, not the contents of CX. And the flags are definitely produced from ALU output, not by a read dependent on a write of the register.Diagonal
@BenVoigt: Thanks, will fix. I am still on my first coffee, my brain doesn't work correctly yet. I was probably thinking of the REP mechanism, which needs to update CX.Framework
Those timing numbers don't fully (or at all) include possible code-fetch bottleneck; e.g. every memory-fetch takes 4 cycles, so a 2-byte add reg,reg that takes 3 cycles according to that manual can actually only execute at one per 4 cycles on 8086, or one per 8 cycles on 8088 sustained throughput.Peneus
@PeterCordes I last dealt with the 8086 thirty years ago. I collected my own instruction timings at the time using proper microbenchmarks. I don't have that data any more, and my recollection is way too vague to convincingly confirm or refute such an assertion. I therefore thought it best to quote from official documentation and leave it at that.Framework
Sure, but you might want to link When specifying Intel 80x86 instruction execution time, what is included in the cycle count? on retrocomputing.Peneus
@PeterCordes I can only link to things I am aware of. Since you are aware, I think I can safely assume you included that link in your own answer.Framework
Yes, code-fetch is a major point of my answer. But you're aware now, too, so I thought you might want to edit your answer. I'm not saying you shouldn't have posted your answer that way in the first place, just that it turns out to be slightly misleading.Peneus

© 2022 - 2024 — McMap. All rights reserved.