I would like to know if there is a way to perform multiplication or division without using the MUL or DIV instructions, because they require a lot of CPU cycles. Can I exploit the SHL or SHR instructions for this purpose? How can I implement the assembly code?
Just like everything else in assembly, there are many ways to do multiplication and division.
- Do division by multiplying by the reciprocal value.
- Use shifts and adds/subs instead of multiplication.
- Use the address calculation options of lea (multiplication only).
Myth busting
"because they require a lot of CPU cycles"

MUL and IMUL are blazingly fast on modern CPUs; see: http://www.agner.org/optimize/instruction_tables.pdf
DIV and IDIV are and always have been exceedingly slow.
An example for Intel Skylake (page 217):
MUL, IMUL r64: Latency 3 cycles, reciprocal throughput 1 cycle.
Note that this is the maximum latency to multiply two 64-bit values.
The CPU can complete one of these multiplications every CPU cycle if all it's doing is multiplications.
Considering that the example below using shifts and adds to multiply by 7 has a latency of 4 cycles (3 using lea), there is no real way to beat a plain multiply on a modern CPU.
Multiplication by the reciprocal
According to Agner Fog's optimization manual (page 12):
Division is slow on most microprocessors. In floating point calculations, we can do multiple divisions with the same divisor faster by multiplying with the reciprocal, for example:
float a, b, d; a /= d; b /= d;
can be changed to:
float a, b, d, r; r = 1.0f / d; a *= r; b *= r;
If we want to do something similar with integers then we have to scale the reciprocal divisor by 2^n and then shift n places to the right after the multiplication.
Multiplying by the reciprocal works well when you need to divide by a constant or if you divide by the same variable many times in a row.
You can find really cool assembly code demonstrating the concept in Agner Fog's assembly library.
Shifts and adds/subs
A shift right (shr) is a divide by two (R for Reduce).
A shift left (shl) is a multiply by two (L for Larger).
You can add and subtract to correct for non-powers of two along the way.
//Multiply by 7
mov ecx,eax
shl eax,3 //*8
sub eax,ecx //*7
Division other than by powers of 2 using this method gets complex quickly.
You may wonder why I'm doing the operations in a weird order, but I'm trying to make the dependency chain as short as possible to maximize the number of instructions that can be executed in parallel.
Using Lea
Lea is an instruction to calculate address offsets.
It can calculate multiples of 2, 3, 4, 5, 8, and 9 in a single instruction.
Like so:
//Latency on AMD CPUs (K10 and later, including Jaguar and Zen)
//On Intel all take 1 cycle.
lea eax,[eax+eax] //*2 1 cycle
lea eax,[eax*2+eax] //*3 2 cycles
lea eax,[eax*4] //*4 2 cycles more efficient: shl eax,2 (1 cycle)
lea eax,[eax*4+eax] //*5 2 cycles
lea eax,[eax*8] //*8 2 cycles more efficient: shl eax,3 (1 cycle)
lea eax,[eax*8+eax] //*9 2 cycles
Note however that lea with a multiplier (scale factor) is considered a 'complex' instruction on AMD CPUs from K10 to Zen and has a latency of 2 CPU cycles. On earlier AMD CPUs (K8), lea always has 2-cycle latency even with a simple [reg+reg] or [reg+disp8] addressing mode.
AMD
Agner Fog's instruction tables are wrong for AMD Zen: 3-component or scaled-index LEA is still 2 cycles on Zen (with only 2 per clock throughput instead of 4) according to InstLatx64 (http://instlatx64.atw.hu/). Also, like earlier CPUs, in 64-bit mode lea r32, [r64 + whatever] has 2-cycle latency. So it's actually faster to use lea rdx, [rax+rax] instead of lea edx, [rax+rax] on AMD CPUs, unlike Intel where truncating the result to 32 bits is free.
The *4 and *8 can be done faster using shl because a simple shift takes only a single cycle. On the plus side, lea does not alter the flags and it allows a free move to another destination register. Because lea can only shift left by 0, 1, 2, or 3 bits (aka multiply by 1, 2, 4, or 8), these are the only breaks you get.
Intel
On Intel CPUs (Sandybridge-family), any 2-component LEA (only one +) has single-cycle latency. So lea edx, [rax + rax*4] has single-cycle latency, but lea edx, [rax + rax + 12] has 3-cycle latency (and worse throughput). An example of this tradeoff is discussed in detail in "C++ code for testing the Collatz conjecture faster than hand-written assembly - why?".
lea eax, [eax*4] would be more efficient as shl eax,2, because a scaled index with no base address is only available as [disp32 + idx*scale] (so it requires a 4-byte zero constant). But if it saves you a mov instruction to copy before shifting, then use lea. (Same for replacing the first case with add eax,eax.) – Wringer

So if I substitute lea for shr/shl, it kind of defeats the point. – Salsala

lea eax,[eax*4] is only good if you actually copy-and-shift with a different destination register. It's useful to point out different ways, but bringing it all together with the optimal instruction for each thing would be good. – Wringer

Things like SHL/SHR, SAL/SAR, ADD/SUB are faster than MUL and DIV, but MUL and DIV work better for dynamic numbers. For example, if you know that you just need to divide by two, then it's a single-bit shift right. But if you don't know the number in advance, you might be tempted to repeatedly SUB the values: to determine AX divided by BX, you could just keep subtracting BX from AX until BX > AX, keeping track of the count. But if you were dividing 200 by 1, that would mean 200 loops and SUB operations.
MUL and DIV will work better in most cases when the numbers involved aren't hard-coded and known in advance. The only exceptions I can think of is when you know it's something like a multiple/divide by 2, 4, 8, etc. where the Shift operators will work fine.
Implementing multiplication is easier if you remember that a shl operation performs the same operation as multiplying the specified operand by two. Shifting to the left two bit positions multiplies the operand by four. Shifting to the left three bit positions multiplies the operand by eight. In general, shifting an operand to the left n bits multiplies it by 2^n. Any value can be multiplied by some constant using a series of shifts and adds or shifts and subtractions. For example, to multiply the ax register by ten, you need only multiply it by eight and then add in two times the original value. That is, 10*ax = 8*ax + 2*ax. The code to accomplish this is:
shl ax, 1 ;Multiply AX by two
mov bx, ax ;Save 2*AX for later
shl ax, 1 ;Multiply AX by four
shl ax, 1 ;Multiply AX by eight
add ax, bx ;Add in 2*AX to get 10*AX
The ax register (or just about any register, for that matter) can be multiplied by most constant values much faster using shl than by using the mul instruction. This may seem hard to believe since it only takes two instructions to compute this product:
mov bx, 10
mul bx
However, if you look at the timings, the shift-and-add example above requires fewer clock cycles on most processors in the 80x86 family than the mul instruction. The code is somewhat larger (by a few bytes), but the performance improvement is usually worth it. On the later 80x86 processors the mul instruction is quite a bit faster than on the earlier processors, but the shift-and-add scheme is generally faster on these processors as well.
You can also use subtraction with shifts to perform a multiplication operation. Consider the following multiplication by seven:
mov bx, ax ;Save AX*1
shl ax, 1 ;AX := AX*2
shl ax, 1 ;AX := AX*4
shl ax, 1 ;AX := AX*8
sub ax, bx ;AX := AX*7
This follows directly from the fact that ax*7 = (ax*8)-ax.
A common error made by beginning assembly language students is subtracting or adding one or two rather than ax*1 or ax*2. The following does not compute ax*7:
shl ax, 1
shl ax, 1
shl ax, 1
sub ax, 1
It computes (8*ax)-1, something entirely different (unless, of course, ax = 1). Beware of this pitfall when using shifts, additions, and subtractions to perform multiplication operations.
Division is a bit harder, need to think...
mul reg,reg is almost always faster than shifts and adds, especially because shifts and adds lengthen the dependency chain. Also, x86 has a barrel shifter: shr eax,3 happens in a single cycle and can be paired with 3 other instructions for 1/4 of a cycle cost. There is no need to create a long dependency chain for shifts by a single bit. Finally, you forgot about lea, but that's a minor issue. – Salsala

shl ax,3 works on the 286 or later, which added the shl r/m16,imm8 encoding. And of course with 32-bit (or 64-bit) addressing modes, you can use LEA. Even PPro / Pentium II has 4-cycle latency for imul eax, eax, 10 according to agner.org/optimize, so I dispute your claim that the 5-instruction sequence is faster than mul on "most processors in the 80x86 family", since that includes all modern ones (at least up to Nehalem, the last member of the 80686 family before the Sandybridge family). – Wringer
Wringer Here is an example:
mov bx, 1000b
shl bx, 5
mov cx, bx
shr cx, 2
add bx, cx
add bx, 1000b