Why is imul used for multiplying unsigned numbers?
Asked Answered
I

2

24

I compiled the following program:

#include <stdint.h>

uint64_t usquare(uint32_t x) {
  return (uint64_t)x * (uint64_t)x;
}

This disassembles to:

 0: 89 f8                   mov    eax,edi
 2: 48 0f af c0             imul   rax,rax
 6: c3                      ret  

But imul is the instruction for multiplying signed numbers. Why is it used by gcc then?

/edit: when using uint64_t the assembly is similar:

0:  48 0f af ff             imul   rdi,rdi
4:  48 89 f8                mov    rax,rdi
7:  c3                      ret  
Intine answered 3/3, 2017 at 20:10 Comment(3)
Two-operand imul only returns the lower half of the product, so signedness is not a concern.Emblematize
Which bit is the sign bit?Intine
The most significant bit, which would have a weight of -2^(k-1) instead of +2^(k-1)Emblematize
K
52

TL:DR: because it's a faster way of getting the correct result when we don't care about the high half (i.e. the output is only as wide as the 2 inputs). And more flexible register-allocation instead of forced use of RAX and RDX.

If it wasn't usable for this, Intel probably would have added two-operand versions of mul as well. But that wasn't necessary, as this answer explains.


WARNING This answer is long!

... and it's full of unneeded explanations - but I have always wanted to write something more lengthy about the multiplication.

A bit of theory

When multiplying two numbers a and b of length n the result is of length 2 n and, most importantly, the k-th digit only depends on the lowest k digits (a proof is given in Appendix A).

x86 imul's two forms

The x86 multiplication instruction imul comes in two form: the full form and the partial form.

The first form is of the kind n×n→2 n, meaning that it produces a result twice the size of the operands - we know from the theory why this makes sense.
For example

imul ax         ;16x16->32, Result is dx:ax
imul rax        ;64x64->128, Result is rdx:rax 

The second form is of the kind n×nn, this necessarily cuts out some information.
Particularly, this form takes only the lower n bits of the result.

imul ax, ax          ;16x16->16, Lower WORD of the result is ax
imul rax, rax        ;64x64->64, Lower QWORD of the result is rax 

Only the single-operand version is of the first form.

(There's also a 3-operand form, imul r64, r/m64, imm8/32, which allows you to copy-and-multiply by a constant in one instruction. It has no implicit operands and again doesn't write the high half anywhere so we can just treat it as equivalent to the imul r64, r/m64 dst *= src form.)

The two instructions: imul vs mul

Regardless of the form used, the processor always calculates the result with a size twice the operands' (i.e. like the first form).
In order to be able to do that, the operands are first converted from their size n to size 2 n (e.g. from 64 to 128 bits).
See Appendix B for more on this.

The multiplication is done and the full, or partial, result is stored in the destination.

The difference between imul and mul is in how the operands are converted.
Since the size is extended, this particular type of conversion is called extension.

The mul instruction simply fills the upper part with zeros - it zero extends.
The imul instructions replicate the high-order bit (the first from the left) - this is called sign extension and it has the interesting property of transforming a two's complement signed number of n bits into a signed number of 2 n bits with the same sign and modulus (i.e. it does the right thing, it is left to the reader to find a counter-example for the zero-extension case).

     How mul extends              How imul extends       
       an operand                   an operand

     +----+       +----+          +----+       +----+
     |0...|       |1...|          |0...|       |1...|
     +----+       +----+          +----+       +----+  

+----+----+  +----+----+     +----+----+  +----+----+
|0000|0...|  |0000|1...|     |0000|0...|  |1111|1...|
+----+----+  +----+----+     +----+----+  +----+----+

The thesis

The difference between imul and mul is noticeable only from the (n+1)-th bit onward.
For a 32-bit operand, it means that only the upper 32-bit part of the full result will eventually be different.

This is easy to see as the lower n bits are the same for both instructions and as we know from the theory the first n bits of the result only depend on the first n bits of the operands.

Thus the thesis: The result of the partial form of imul is identical to that of mul.

Then why does imul exist?

Original 8086 only had one-operand versions of mul and imul. Later versions of x86 added more flexible two and three operand versions of imul only, intended for the common use-case where you don't want the double-width result.

They only write one output register, which for modern x86 means they can decode to a single uop: https://agner.org/optimize/. (In modern x86 microarchitectures, each uop can write at most 1 register.) One-operand imul r32 is 3 uops on Intel CPUs: presumably one to multiply, another to split the 64-bit product into 2 halves and write the low half, and another to do the same for the high half. imul r64 is 2 uops; presumably the 128-bit result comes out of the multiplier already split in 64-bit halves.

mul still only exists in the very ancient one-operand form with fixed registers as part of the interface.

imul sets the flags according to a signed multiplication - CF and OF are set if the partial result has discarded any significant information (the technical condition being: the sign extension of the partial result is different from the full result) such as in case of overflow. This is also why the two and three operand forms are not called mul, which otherwise would have been a perfectly fit name.

The practice

To test all this in practice we can ask a compiler[live] for the assembly of the following program

#include <stdint.h>

uint64_t foo(uint32_t a)
{
    return a*(uint64_t)a;
}

While we know that for 64-bit target the code generated uses imul because a unint64_t fits a register and thus a 64×64→64 multiplication is available as imul <reg64>, <reg64>

foo(unsigned int):
        mov     eax, edi        ;edi = a
        imul    rax, rax        ;64x64->64
        ret

in 32-bit code there is no such multiplication using imul.
An imul <reg32> or imul <reg32>, <reg32>, <reg32> is necessary but that would produce a full result! And a full signed result is not generally equal to a full unsigned result.
In fact, the compiler reverts back to mul:

foo(unsigned int):
        mov     eax, DWORD PTR [esp+4]
        mul     eax
        ret

Appendix A

Without loss of generality, we can assume base 2 and that the numbers are n + 1 bits long (so that the indices run from 0 to n) - then

c = a·b = ∑i=0..n (ai·2i) · ∑j=0..n(bj·2j) = ∑i=0..n [ai·∑j=0..n (bj·2i+j)] (by the distributive property)

We see that the k-th digit of the result is the sum of all the addends such that i + j = k plus an eventual carry

ck = ∑i,j=0..n; i+j=k ai·bj·2i+j + Ck

The term Ck is the carry and, as it propagates towards higher bits, it depends only on the lower bits.
The second term cannot have a ai or bj with i or j > k as if the first were true then i = k + e, for a positive, non null, e and thus j = k - i = k - k -e = -e
But j cannot be negative!
The second case is similar and left to the reader.

Appendix B

As BeeOnRope pointed out in the comments the processor probably doesn't compute a full result if only the partial result is needed.

You probably means that this is only a way of thinking about it, conceptually. The processor does not necessarily do a full 128-bit multiplication when you use the 64x64 -> 64 form. Indeed, the truncated form takes only 1 uop on recent Intel, but the full form takes 2 uops, so some extra work is being done

Comment from BeeOnRope

Also, the sign extension is probably conceptually too.

Similarly the sign extension may happens "conceptually", but probably not in hardware. They won't have the extra wires and transistors just to do the sign or zero extension, which would add a lot of bulk to an already huge multiplier, but will use some other tricks to do the multiplication "as if" that had happened.

Comment from BeeOnRope


Binary numbers of length n are in the order of magnitude of 2n, thus the multiplication of two such numbers is in the order of magnitude 2n · 2n = 2n+n = 22 n. Just like a number of length 2 n.

Kirwin answered 3/3, 2017 at 22:28 Comment(5)
Great answer! I think you might want to clarify when you say things like Regardless of the form used, the processor always calculates the result with a size twice the operands' (i.e. like the first form) you probably means that this is only a way of thinking about it, conceptually. The processor does not necessarily do a full 128-bit multiplication when you use the 64x64 -> 64 form. Indeed, the truncated form takes only 1 uop on recent Intel, but the full form takes 2 uops, so some extra work is being done.Voltaism
Similarly the sign extension may happens "conceptually", but probably not in hardware. They won't have the extra wires and transistors just to do the sign or zero extension, which would add a lot of bulk to an already huge multiplier, but will use some other tricks to do the multiplication "as if" that had happened.Voltaism
@BeeOnRope: I think the extra uop in imul r64 is just to write the high half; uops can have at most one non-flag output in Intel uarches. And imul r32 has to split and/or zero-extend the multiplier output in a way that it isn't naturally split (at a 64-bit boundary) so it's 3 uops. Updated Margaret's answer with a TL:DR and some mention of uops.Depth
@PeterCordes Yeah I think you you definitely need an extra uop to write the high half, but it isn't clear to me if the entire 64->128 circuitry is invoked and the high output simply ignored. A wide multiplication is definitely more expensive than a narrow one, but it isn't clear if it's worth suppressing the high part of the result just to save a bit of power (or if that is even possible in the design of the multiplier).Voltaism
@BeeOnRope: I wouldn't be at all surprised if the bit-flipping of transistors that only lead to the high half is suppressed somehow to save power, especially for vpmullw vs. vpmulhw but also for integer. Maybe not power gating the high half because it's all connected. Maybe the high 64 of 128 not flipping happens naturally for narrow inputs, e.g. for mul r32, but further gating for imul r32,r32 or imul r64, r64 is possible. But that's all just power considerations, not clock-for-clock / uop performance which is the point I was trying to make in my edit.Depth
D
0
#include <stdint.h>

uint64_t fun0 ( uint32_t x )
{
    return (uint64_t)x * (uint64_t)x;
}
uint64_t fun1 ( uint32_t x )
{
    return ((uint64_t)x) * ((uint64_t)x);
}
uint64_t fun2 ( uint64_t x )
{
    return (x * x);
}



0000000000000000 <fun0>:
   0:   89 f8                   mov    %edi,%eax
   2:   48 0f af c0             imul   %rax,%rax
   6:   c3                      retq   
   7:   66 0f 1f 84 00 00 00    nopw   0x0(%rax,%rax,1)
   e:   00 00 

0000000000000010 <fun1>:
  10:   89 f8                   mov    %edi,%eax
  12:   48 0f af c0             imul   %rax,%rax
  16:   c3                      retq   
  17:   66 0f 1f 84 00 00 00    nopw   0x0(%rax,%rax,1)
  1e:   00 00 

0000000000000020 <fun2>:
  20:   48 89 f8                mov    %rdi,%rax
  23:   48 0f af c7             imul   %rdi,%rax
  27:   c3                      retq   

EDIT

even if you specify all 64 bit unsigned it generates the same result

0x00FF * 0x00FF = 0xFE01
0xFFFF * 0xFFFF = 0xFFFE0001
so
0xFF * 0xFF = 0x01

sign extension doesnt matter for the lower 64 bits so you can use imul for 8, 16, 32 and 64 bit operands signed or unsigned.

Domenic answered 3/3, 2017 at 20:26 Comment(3)
based on what harold is saying fun2 is a 32 bit multiply as well instead of a 64 bit?Domenic
But fun2 uses imul too, but takes an unsigned integerIntine
No I meant it's 64x64->64 as opposed to 64x64->128. Signedness only affects the upper half of the 128 bit productEmblematize

© 2022 - 2024 — McMap. All rights reserved.