It's not decode that's the problem, it's usually fetch on 8086.
Starting two separate instruction-decode operations probably is more expensive than just fetching more microcode for one loop
instruction. I'd guess that's what accounts for the numbers in the table below that don't include code-fetch bottlenecks.
Equally or more importantly, 8086 is often bottlenecked by memory access, including code-fetch. (8088 almost always is, breathing through a straw with it's 8-bit bus, unlike 8086's 16-bit bus).
dec cx
is 1 byte, jnz rel8
is 2 bytes.
So 3 bytes total, vs. 2 for loop rel8
.
8086 performance can be approximated by counting memory accesses and multiply by four, since its 6-byte instruction prefetch buffer allows it to overlap code-fetch with decode and execution of other instructions. (Except for very slow instructions like mul
that would let the buffer fill up after at most three 2-byte fetches.)
See also Increasing Efficiency of binary -> gray code for 8086 for an example of optimizing something for 8086, with links to more resources like tables of instruction timings.
https://www2.math.uni-wuppertal.de/~fpf/Uebungen/GdR-SS02/opcode_i.html has instruction timings for 8086 (taken from Intel manuals I think, as cited in njuffa's answer), but those are only execution, when fetch isn't a bottleneck. (i.e. just decoding from the prefetch buffer.)
Decode / execute timings, not including fetch:
DEC Decrement
operand bytes 8088 186 286 386 486 Pentium
r8 2 3 3 2 2 1 1 UV
r16 1 3 3 2 2 1 1 UV
r32 1 3 3 2 2 1 1 UV
mem 2+d(0,2) 23+EA 15 7 6 3 3 UV
Jcc Jump on condition code
operand bytes 8088 186 286 386 486 Pentium
near8 2 4/16 4/13 3/7+m 3/7+m 1/3 1 PV
near16 3 - - - 3/7+m 1/3 1 PV
LOOP Loop control with CX counter
operand bytes 8088 186 286 386 486 Pentium
short 2 5/17 5/15 4/8+m 11+m 6/7 5/6 NP
So even ignoring code-fetch differences:
dec
+ taken jnz
takes 3 + 16 = 19 cycles to decode / exec on 8086 / 8088.
- taken
loop
takes 17 cycles to decode / exec on 8086 / 8088.
(Taken branches are slow on 8086, and discard the prefetch buffer; there's no branch prediction. IDK if those timings include any of that penalty, since they apparently don't for other instructions and non-taken branches.)
8088/8086 are not pipelined except for the code-prefetch buffer. Finishing execution of one instruction and starting decode / exec of the next take it some time; even the cheapest instructions (like mov reg,reg
/ shift / rotate / stc
/std
/ etc.) take 2 cycles. Bizarrely more than nop
(3 cycles).
dec ecx
/jnz
, in that order? That's the drop-in replacement forloop
, other than modifying FLAGS. JMP is an unconditional jump. – PeneusLOOP
if it weren't "better" (smaller/faster) on the generation where it was introduced, right? If the Intel engineers hadn't found a way to have a faster single instruction than theDEC, JNZ
pair, why would they have bothered to make an encoding forLOOP
, taking up valuable space in the encoding table? – Diagonal