80286: Which is the fastest way to multiply by 10?

To multiply a number by a power of 2, I shift it left by the exponent.

Is there a similar technique to multiply a number by 10 in fewer cycles?

Simian answered 4/4, 2020 at 18:46 Comment(12)
Specifically on 80286, so immediate shifts are available, but imul reg,reg,10 is slow, and 32-bit addressing modes like lea ax, [eax + eax*4] aren't available for cheap x * 5? Do you care about performance of the code on any later or earlier CPUs, in case something that's optimal for 286 isn't optimal elsewhere? Do you have a link for 80286 instruction timings? – Alby
Shift, add, shift? 10*x = (4*x + x) * 2 = ((x << 2) + x) << 1. This is the same way you do "long multiplication" by hand. – Burbot
Yes, my old friend, I am currently coding only for the 80286 (16-bit). – Simian
@NateEldredge How would the original value of x still be available to add once the bits have been shifted? – Simian
You save it in another register. mov bx, ax ; shl ax, 2 ; add ax, bx ; shl ax, 1. – Burbot
@NateEldredge: Yes, I think we're stuck with something like that. But is add same,same faster or slower than shl reg,1 on 286 for that last step? It probably doesn't matter what order you do anything in; 286 can't exploit the ILP in x*2 + x*8, and I think we need 1 mov. Unless you happened to already have the value in both SI|DI and BX|BP; then you could lea ax, [bx + si] or something to start with x*2. – Alby
Will it be more efficient than MUL? – Simian
@ProjectZero: On 286, yes, vastly. The threshold for doing shifts/adds instead of a mul by a constant is at least a few set bits even on P5 Pentium; 10 only has 2 set bits. On modern Nehalem or later, yes, better than 1-operand mul, but not better than imul ax, bx, 10 (3 cycle latency, 1/clock throughput, 1 uop). – Alby
Could either of you please post an answer so that I can accept it? – Simian
I'm not sure how shifts and adds compare, but you can also do it with four adds: mov bx, ax ; add ax, ax ; add ax, ax ; add ax, bx ; add ax, ax. – Burbot
Without knowing where to find a 286 instruction timing table, I don't know what the fastest version would be, so I don't know the answer. The generic method of breaking a multiply down into shifts and add/sub is well known and wouldn't be new. (And BTW, I mentioned P5 Pentium earlier because you can see how GCC optimizes multiplies by constants when tuning for it with gcc -O3 -march=pentium. Or even -march=i386. godbolt.org/z/qjD-a3. Oh, you could compile for MIPS to limit GCC to just using shifts and add/sub, not x86 LEA. Or maybe MSP430 as a 2-operand machine.) – Alby
Wow, 80286. It's borderline whether this question should be shifted over to retrocomputing.stackexchange.com/questions :-) – Tomcat

The 80286 did not have a barrel shifter; that was introduced with the 80386. According to the timing tables in the Microsoft Macro Assembler 5.0 documentation (1987), SHL reg, immed8 takes 5+n cycles, whereas SHL reg, 1 takes 2 cycles, so two single-bit shifts (4 cycles) beat one SHL reg, 2 (7 cycles). ADD reg, reg takes 2 cycles, as does MOV reg, reg. IMUL reg16, immed takes 21 cycles. Therefore, the fastest way to multiply by ten would appear to be:

           ; ax =   // cumulative cycles
shl ax, 1  ; x*2    // 2
mov bx, ax ; x*2    // 4   (bx = x*2)
shl ax, 1  ; x*4    // 6
shl ax, 1  ; x*8    // 8
add ax, bx ; x*10   // 10

or, alternatively:

           ; ax =   // cumulative cycles
mov bx, ax ; x*1    // 2   (bx = x)
shl ax, 1  ; x*2    // 4
shl ax, 1  ; x*4    // 6
add ax, bx ; x*5    // 8
shl ax, 1  ; x*10   // 10

Ten cycles either way.
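
If you want to convince yourself that both orderings really compute x*10 for every 16-bit input (including values where the result wraps), here is a rough Python model of the two sequences. It is only a sketch: the function and table names are made up for illustration, the registers are modelled as integers masked to 16 bits, and the cycle costs are simply the figures quoted above, not anything measured on hardware.

```python
# Model of the two 80286 sequences above using 16-bit wraparound arithmetic.
# Names here are hypothetical; cycle costs are the figures quoted in the
# answer (MASM 5.0 timing tables), not measurements.

MASK = 0xFFFF  # AX and BX are 16-bit registers

CYCLES = {"shl1": 2, "add": 2, "mov": 2}  # SHL reg,1 / ADD reg,reg / MOV reg,reg
IMUL_IMM_CYCLES = 21                      # IMUL reg16, immed

SEQ_A = ["shl1", "mov", "shl1", "shl1", "add"]  # first listing
SEQ_B = ["mov", "shl1", "shl1", "add", "shl1"]  # second listing


def times10_double_first(x):
    """First listing: x*8 + x*2."""
    ax = x & MASK
    ax = (ax << 1) & MASK   # shl ax, 1   ; ax = x*2
    bx = ax                 # mov bx, ax  ; bx = x*2
    ax = (ax << 1) & MASK   # shl ax, 1   ; ax = x*4
    ax = (ax << 1) & MASK   # shl ax, 1   ; ax = x*8
    ax = (ax + bx) & MASK   # add ax, bx  ; ax = x*10
    return ax


def times10_copy_first(x):
    """Second listing: (x*4 + x) * 2."""
    ax = x & MASK
    bx = ax                 # mov bx, ax  ; bx = x
    ax = (ax << 1) & MASK   # shl ax, 1   ; ax = x*2
    ax = (ax << 1) & MASK   # shl ax, 1   ; ax = x*4
    ax = (ax + bx) & MASK   # add ax, bx  ; ax = x*5
    ax = (ax << 1) & MASK   # shl ax, 1   ; ax = x*10
    return ax


def sequence_cycles(ops):
    """Sum the quoted per-instruction costs for a list of op names."""
    return sum(CYCLES[op] for op in ops)


def shl_imm_cycles(n):
    """Quoted cost of SHL reg, immed8 with a shift count of n: 5+n."""
    return 5 + n


if __name__ == "__main__":
    # Exhaustive check against (x * 10) mod 2**16.
    for x in range(0x10000):
        expected = (x * 10) & MASK
        assert times10_double_first(x) == expected
        assert times10_copy_first(x) == expected
    print("first listing :", sequence_cycles(SEQ_A), "cycles")
    print("second listing:", sequence_cycles(SEQ_B), "cycles")
    print("imul immediate:", IMUL_IMM_CYCLES, "cycles")
```

By the same figures, the mov bx, ax ; shl ax, 2 ; add ax, bx ; shl ax, 1 variant from the comments would come to 2 + (5+2) + 2 + 2 = 13 cycles, which is why the single-bit shifts are used here.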

Gosh answered 4/4, 2020 at 21:12 Comment(0)
