80286: Which is the fastest way to multiply by 10?

Asked 4/4, 2020 at 18:46 Answered 4/4, 2020 at 21:12

Solved assembly x86-16 micro-optimization

To multiply a number by a power of 2 (2^n), I shift it left n times.

Is there any such technique to multiply a number by 10 in fewer cycles?

Simian answered 4/4, 2020 at 18:46 Comment(12)

Specifically on 80286, so immediate shifts are available, but imul reg,reg,10 is slow, and 32-bit addressing modes like lea ax, [eax + eax*4] aren't available for cheap x * 5? Do you care about performance of the code on any later or earlier CPUs, in case something that's optimal for 286 isn't optimal elsewhere? Do you have a link for 80286 instruction timings? – Alby 4/4, 2020 at 18:50

Shift, add, shift? 10*x = (4*x + x) * 2 = ((x << 2) + x) << 1. This is the same way you do "long multiplication" by hand. – Burbot 4/4, 2020 at 18:52

Yes my old friend, I am currently coding only for 80286 (16-bit) – Simian 4/4, 2020 at 18:52

@NateEldredge How would the value of x remain persistent while adding it once the bits are shifted? – Simian 4/4, 2020 at 18:54

You save it in another register. mov bx, ax ; shl ax, 2 ; add ax, bx ; shl ax, 1. – Burbot 4/4, 2020 at 18:55

@NateEldredge: Yes, I think we're stuck with something like that. But is add same,same faster or slower than shl reg,1 on 286 for that last step? It probably doesn't matter what order you do anything in; 286 can't exploit the ILP in x*2 + x*8, and I think we need 1 mov. Unless you happened to already have the value in SI|DI and BX|BP, then you could lea ax, [bx + si] or something to start with x*2 – Alby 4/4, 2020 at 18:56

Will it be more efficient than MUL? – Simian 4/4, 2020 at 18:58

@ProjectZero: On 286, yes vastly. The threshold for doing shifts/adds instead of a mul by a constant is at least a few set bits even on P5 Pentium; 10 only has 2 set bits. On modern Nehalem or later, yes better than 1-operand mul, but not better than imul ax, bx, 10. (3 cycle latency, 1/clock throughput, 1 uop) – Alby 4/4, 2020 at 19:0

Could either of you please post an answer so that I can accept it? – Simian 4/4, 2020 at 19:3

I'm not sure how shifts and adds compare, but you can also do it with four adds: mov bx, ax ; add ax, ax ; add ax, ax ; add ax, bx ; add ax, ax. – Burbot 4/4, 2020 at 19:14

Without knowing where to find a 286 instruction timing table, I don't know what the fastest version would be, so I don't know the answer. The generic method of breaking a multiply down into shifts and add/sub is well known and wouldn't be new. (And BTW, I mentioned P5 Pentium earlier because you can see how GCC optimizes multiplies by constants when tuning for it with gcc -O3 -march=pentium. Or even -march=i386. godbolt.org/z/qjD-a3. Oh, you could compile for MIPS to limit GCC to just using shifts and add/sub, not x86 LEA. Or maybe MSP430 as a 2-operand machine.) – Alby 4/4, 2020 at 19:17

Wow, 80286. It's borderline whether this question should be shifted over to retrocomputing.stackexchange.com/questions :-) – Tomcat 2/3 at 8:54

The 80286 did not have a barrel shifter; that was introduced with the 80386. According to the timing tables in the Microsoft Macro Assembler 5.0 documentation (1987), SHL reg, immed8 takes 5+n cycles, whereas SHL reg, 1 takes 2 cycles. ADD reg, reg takes 2 cycles, as does MOV reg, reg. IMUL reg16, immed takes 21 cycles. By those figures, SHL reg, 2 would cost 7 cycles against 4 for two single-bit shifts, so the fastest way to multiply by ten would appear to be:

           ;       // cycles
shl ax, 1  ; *2    // 2
mov bx, ax ; *2    // 4
shl ax, 1  ; *4    // 6
shl ax, 1  ; *8    // 8
add ax, bx ; *10   // 10

or, alternatively:

           ;      // cycles
mov bx, ax ; *1   // 2
shl ax, 1  ; *2   // 4
shl ax, 1  ; *4   // 6
add ax, bx ; *5   // 8
shl ax, 1  ; *10  // 10

Ten cycles either way.
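As a sanity check, here is a minimal C sketch that models both sequences on a 16-bit value and compares them with a plain multiply modulo 2^16. The function names (mul10_a, mul10_b, mul10_mismatches) are made up for illustration and are not from any library. The sketch only checks the arithmetic; it says nothing about 286 timings.

```c
#include <stdint.h>

/* Hypothetical model of the first sequence: save x*2, then x*8 + x*2. */
static uint16_t mul10_a(uint16_t ax) {
    uint16_t bx;
    ax = (uint16_t)(ax << 1);   /* shl ax, 1  ; ax = x*2  */
    bx = ax;                    /* mov bx, ax ; bx = x*2  */
    ax = (uint16_t)(ax << 1);   /* shl ax, 1  ; ax = x*4  */
    ax = (uint16_t)(ax << 1);   /* shl ax, 1  ; ax = x*8  */
    ax = (uint16_t)(ax + bx);   /* add ax, bx ; ax = x*10 */
    return ax;
}

/* Hypothetical model of the second sequence: (x*4 + x) * 2. */
static uint16_t mul10_b(uint16_t ax) {
    uint16_t bx = ax;           /* mov bx, ax ; bx = x    */
    ax = (uint16_t)(ax << 1);   /* shl ax, 1  ; ax = x*2  */
    ax = (uint16_t)(ax << 1);   /* shl ax, 1  ; ax = x*4  */
    ax = (uint16_t)(ax + bx);   /* add ax, bx ; ax = x*5  */
    ax = (uint16_t)(ax << 1);   /* shl ax, 1  ; ax = x*10 */
    return ax;
}

/* Count inputs where either sequence disagrees with x*10 truncated to
   16 bits, which is what AX would hold after the real instructions. */
static unsigned long mul10_mismatches(void) {
    unsigned long bad = 0;
    for (uint32_t x = 0; x <= 0xFFFFu; x++) {
        uint16_t want = (uint16_t)(x * 10u);
        if (mul10_a((uint16_t)x) != want) bad++;
        if (mul10_b((uint16_t)x) != want) bad++;
    }
    return bad;
}
```

Calling mul10_mismatches() from a small test program should return 0 if both sequences are equivalent to a wrapping 16-bit multiply by 10. Note that results above 65535 wrap, just as they would in AX.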

Gosh answered 4/4, 2020 at 21:12 Comment(0)
