x86 MUL Instruction from VS 2008/2010

Will modern (2008/2010) incarnations of Visual Studio or Visual C++ Express produce x86 MUL instructions (unsigned multiply) in the compiled code? I cannot seem to find or contrive an example where one appears, even when using unsigned types.

If VS does not compile using MUL, is there a rationale why?

Benthos answered 28/10, 2010 at 2:49 Comment(6)
What instruction(s) would it use otherwise? – Cronyism
@Jeff M: I think perhaps the poster meant that IMUL is used instead in the compiled code. – Wooded
@pst: I was just asking because I didn't have access to the compiler and couldn't see what instructions were actually used. I caved in and booted up my dev machine to figure it out. :) – Cronyism
@Jeff M: I'm curious (but not that curious ;-) and was trying to prompt the poster to add clarification :P – Wooded
(edit wasn't working) To further clarify my question, I was essentially wondering whether Intel had released some kind of optimization suggestions surrounding MUL vs IMUL, or whether MS had furnished a rationale for the instructions they use (less likely). – Benthos
Possible duplicate of: Why is imul used for multiplying unsigned numbers? – Genaro

imul (signed) and mul (unsigned) both have a one-operand form that does edx:eax = eax * src, i.e. a 32x32 => 64-bit full multiply (or rdx:rax = rax * src, a 64x64 => 128-bit multiply, in 64-bit mode).

The 186 added an imul dest(reg), src(reg/mem), immediate form, and the 386 added an imul r32, r/m32 form, both of which only compute the lower half of the result. (According to NASM's appendix B; see also the x86 tag wiki.)

When multiplying two 32-bit values, the least significant 32 bits of the result are the same, whether you consider the values to be signed or unsigned. In other words, the difference between a signed and an unsigned multiply becomes apparent only if you look at the "upper" half of the result, which one-operand imul/mul puts in edx and two or three operand imul puts nowhere. Thus, the multi-operand forms of imul can be used on signed and unsigned values, and there was no need for Intel to add new forms of mul as well. (They could have made multi-operand mul a synonym for imul, but that would make disassembly output not match the source.)

In C, results of arithmetic operations have the same type as the operands (after integer promotion for narrow integer types). If you multiply two int together, you get an int, not a long long: the "upper half" is not retained. Hence, the C compiler only needs what imul provides, and since imul is easier to use than mul, the C compiler uses imul to avoid needing mov instructions to get data into / out of eax.
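A quick illustration of that equivalence (my own sketch, not from the original answer; the function names are made up): the low 32 bits of a product are identical whether the inputs are interpreted as signed or unsigned, which is why one imul form serves both.

```c
#include <stdint.h>

/* Low 32 bits of a product, computed two ways. The unsigned version
 * wraps modulo 2^32; the signed version multiplies in 64 bits
 * (avoiding signed-overflow undefined behavior) and then truncates.
 * Both yield the same bit pattern for the same input bits. */
static uint32_t low_mul_unsigned(uint32_t a, uint32_t b) {
    return a * b;
}

static uint32_t low_mul_signed(int32_t a, int32_t b) {
    return (uint32_t)((int64_t)a * (int64_t)b);
}
```

For example, 0xFFFFFFFF * 3 as unsigned and -1 * 3 as signed leave the same 32 low bits (0xFFFFFFFD).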

As a second-order effect, since C compilers use the multiple-operand form of imul heavily, Intel and AMD have invested effort in making it as fast as possible. It writes only one output register, not e/rdx:e/rax, so CPUs could optimize it more easily than the one-operand form. This makes imul even more attractive.

The one-operand form of mul/imul is useful when implementing big number arithmetic. In C, in 32-bit mode, you should get some mul invocations by multiplying unsigned long long values together. But, depending on the compiler and OS, those mul opcodes may be hidden in some dedicated function, so you will not necessarily see them. In 64-bit mode, long long has only 64 bits, not 128, and the compiler will simply use imul.
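To sketch why big-number code wants the one-operand form (my own illustration, not from any particular library; the helper name is made up): each limb multiply needs the full double-width product, which is exactly what mul leaves in edx:eax.

```c
#include <stdint.h>

/* One step of schoolbook bignum multiplication: a 32x32 -> 64-bit
 * partial product plus an incoming carry. The full product plus a
 * 32-bit carry can never overflow 64 bits, since
 * (2^32-1)^2 + (2^32-1) = 2^64 - 2^32 < 2^64. */
static uint32_t mul_limb(uint32_t a, uint32_t b,
                         uint32_t carry_in, uint32_t *carry_out) {
    uint64_t p = (uint64_t)a * b + carry_in;
    *carry_out = (uint32_t)(p >> 32);  /* high half: what mul puts in edx */
    return (uint32_t)p;                /* low half: what mul puts in eax */
}
```

In 32-bit mode a compiler can implement the `(uint64_t)a * b` step with a single one-operand mul.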

Welles answered 28/10, 2010 at 7:51 Comment(6)
Are you certain of the causality of IMUL/MUL optimizations? Is it possible that VS prefers IMUL because it happens to already be faster (rather than compilers preferring it, causing Intel/AMD to make it faster)? – Benthos
@Mike: on the 80386, mul and imul offer the same speed, and C compilers were already using imul because of the convenience of choosing the registers. So I think that compilers chose first, and processor vendors followed, not the other way round. – Welles
In 64-bit mode it'll use mul for __int128. – Genaro
In 64-bit mode does mul return the upper 64-bit word? I mean, is the lower 64-bit word in rax and the upper in rdx? – Athenian
The immediate form of the imul instruction was added on the 186, not the 286. – Conceal
@Myria: thanks, the current version of the NASM manual agrees with you and says 186. No idea why the older version had it wrong. – Tyne

There are three different types of multiply instruction on x86. The first is MUL reg, which does an unsigned multiply of EAX by reg and puts the (64-bit) result into EDX:EAX. The second is IMUL reg, which does the same with a signed multiply. The third type is either IMUL reg1, reg2 (multiplies reg1 by reg2 and stores the 32-bit result into reg1) or IMUL reg1, reg2, imm (multiplies reg2 by imm and stores the 32-bit result into reg1).

Since in C, multiplies of two 32-bit values produce 32-bit results, compilers normally use the third type (signedness doesn't matter, the low 32 bits agree between signed and unsigned 32x32 multiplies). VC++ will generate the "long multiply" versions of MUL/IMUL if you actually use the full 64-bit results, e.g. here:

unsigned long long prod(unsigned int a, unsigned int b)
{
  return (unsigned long long) a * b;
}

The 2-operand (and 3-operand) versions of IMUL are faster than the one-operand versions simply because they don't produce a full 64-bit result. Wide multipliers are large and slow; it's much easier to build a smaller multiplier and synthesize long multiplies using microcode if necessary. Also, MUL/IMUL writes two registers, which is usually resolved by breaking the instruction into multiple operations internally; it's much easier for the instruction-reordering hardware to track two dependent instructions that each write one register (most x86 instructions look like that internally) than to track one instruction that writes two.

Unscientific answered 1/11, 2010 at 4:23 Comment(2)
It's splitting the result that's the problem for modern Intel CPUs (SnB-family), not actually doing the wide multiply. imul r,r (any size, including 64-bit) is 1 uop. imul r/m32 is 3 uops, while imul r/m64 is only 2 uops, maybe because it doesn't have to split the output of the 64-bit multiplier hardware. imul r/m8 is 1 uop, presumably because the result goes in AX, so it's still only a single register. BTW, CPU architects would say that most x86 integer instructions write two output registers, including the flags. – Tyne
(Update: x86 instructions technically have 2 outputs including FLAGS, but internally CPUs handle that by having each physical-register-file entry able to hold an integer result and a FLAGS output, so normal instructions like add edx, ecx and not eax are both single-uop, one-output instructions in terms of allocating resources and write ports, even though add writes FLAGS and not doesn't. And by renaming CF separately from the rest (SPAZO), inc eax is also a single uop with no FLAGS input, despite leaving CF unmodified.) – Tyne

According to http://gmplib.org/~tege/x86-timing.pdf, the IMUL instruction has a lower latency and higher throughput (if I'm reading the table correctly). Perhaps VS is simply using the faster instruction (that's assuming IMUL and MUL always produce the same output, which they do for the low half that is kept here).

I don't have Visual Studio handy, so I tried the same thing with GCC. I also always get some variation of IMUL.

This:

unsigned int func(unsigned int a, unsigned int b)
{ 
    return a * b;
}

Assembles to this (with -O2):

_func:
LFB2:
        pushq   %rbp
LCFI0:
        movq    %rsp, %rbp
LCFI1:
        movl    %esi, %eax
        imull   %edi, %eax
        movzbl  %al, %eax
        leave
        ret
Thordia answered 28/10, 2010 at 3:38 Comment(0)

My intuition tells me that the compiler chose IMUL arbitrarily (or because it was the faster of the two), since the bits will be the same whether it uses unsigned MUL or signed IMUL. A one-operand 32-bit multiply produces a 64-bit result spanning two registers, EDX:EAX. The overflow goes into EDX, which is essentially ignored since we only care about the 32-bit result in EAX. Using IMUL will sign-extend into EDX as necessary, but again we don't care, since we're only interested in the 32-bit result.

Cronyism answered 28/10, 2010 at 3:20 Comment(0)

Right after I looked at this question I found MULQ in my generated code when dividing.

The full code turns a large binary number into chunks of a billion so that it can be easily converted to a string.

C++ code:

for_each(TempVec.rbegin(), TempVec.rend(), [&](Short & Num){
    Remainder <<= 32;
    Remainder += Num;
    Num = Remainder / 1000000000;
    Remainder %= 1000000000;//equivalent to Remainder %= DecimalConvert
});

Optimized Generated Assembly

00007FF7715B18E8  lea         r9,[rsi-4]  
00007FF7715B18EC  mov         r13,12E0BE826D694B2Fh  
00007FF7715B18F6  nop         word ptr [rax+rax] 
00007FF7715B1900  shl         r8,20h  
00007FF7715B1904  mov         eax,dword ptr [r9]  
00007FF7715B1907  add         r8,rax  
00007FF7715B190A  mov         rax,r13  
00007FF7715B190D  mul         rax,r8  
00007FF7715B1910  mov         rcx,r8  
00007FF7715B1913  sub         rcx,rdx  
00007FF7715B1916  shr         rcx,1  
00007FF7715B1919  add         rcx,rdx  
00007FF7715B191C  shr         rcx,1Dh  
00007FF7715B1920  imul        rax,rcx,3B9ACA00h  
00007FF7715B1927  sub         r8,rax  
00007FF7715B192A  mov         dword ptr [r9],ecx  
00007FF7715B192D  lea         r9,[r9-4]  
00007FF7715B1931  lea         rax,[r9+4]  
00007FF7715B1935  cmp         rax,r14  
00007FF7715B1938  jne         NumToString+0D0h (07FF7715B1900h)  

Notice the MUL instruction partway down the listing (mul rax,r8). This generated code is extremely unintuitive, I know; it looks nothing like the source. But DIV is extremely slow (~25 cycles for a 32-bit div, and ~75 according to this chart on modern PCs) compared with MUL or IMUL (around 3 or 4 cycles), so it makes sense to get rid of DIV even at the cost of all sorts of extra instructions.

I don't fully understand the optimization here, but if you would like to see a rationale and a mathematical explanation of using compile-time multiplication to divide by constants, see this paper.
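As a sketch of what the listing is doing (my reconstruction in C, using the GCC/Clang unsigned __int128 extension; the function name is made up), the mul/sub/shr/add/shr sequence computes n / 1000000000 without any div:

```c
#include <stdint.h>

/* Divide a 64-bit value by 1000000000 via multiply and shifts,
 * mirroring the generated assembly above: the magic constant is the
 * 12E0BE826D694B2Fh loaded into r13, the >> 64 is the one-operand
 * mul's high half (rdx), and the sub/shr/add/shr matches the
 * rcx arithmetic in the listing. */
static uint64_t div_1e9(uint64_t n) {
    const uint64_t M = 0x12E0BE826D694B2Fu;
    uint64_t hi = (uint64_t)(((unsigned __int128)M * n) >> 64);
    return (hi + ((n - hi) >> 1)) >> 29;
}
```

The remainder then falls out as n - 1000000000 * div_1e9(n), which is what the imul rax,rcx,3B9ACA00h / sub r8,rax pair computes.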

This is an example of the compiler making use of the performance and capability of the full 64 by 64 bit untruncated multiply without showing the C++ coder any sign of it.

Moro answered 3/3, 2015 at 5:53 Comment(5)
This is compiled using VS 2013, with default release settings. I also found the exact same optimization with GCC -O2. – Moro
A tremendous amount of effort was put into "strength reduction" of division (and multiplication) in GCC back in the 90s, when a lot of CPUs didn't even have a hardware divide instruction. The author of the paper you found was responsible for much of that work. If you are curious, read through expmed.c, the bulk of which implements this optimization. I didn't know MSVC could do this too, but it doesn't surprise me. – Aponeurosis
Blast from the past ^_^ That is an interesting example, and thanks for the link to the paper. @Aponeurosis thanks for the additional context. – Benthos
All decent compilers do this optimization: Why does GCC use multiplication by a strange number in implementing integer division? – Genaro
mul rax,r8 is unusual syntax. The standard syntax is mul r8, with RAX as an implicit input and RDX:RAX as implicit outputs. felixcloutier.com/x86/mul – Tyne

As already explained, standard C/C++ has no word*word to double-word operation, which is what the one-operand mul instruction is best at. So for the cases where you do want word*word to double-word, you need an extension to C/C++.

GCC, Clang, and ICC provide a builtin type __int128 which you can use to indirectly get the mul instruction.

MSVC provides the _umul128 intrinsic (since at least VS 2010), which generates the mul instruction. Combined with the _addcarry_u64 intrinsic, you could build your own efficient __int128 type for MSVC.
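For illustration, here is a sketch of such a full 64x64 => 128-bit multiply in GCC/Clang syntax (the helper names are mine, not a real API; _umul128 plays the same role on MSVC):

```c
#include <stdint.h>

/* Full 64x64 -> 128-bit unsigned multiply using the GCC/Clang
 * unsigned __int128 extension; on x86-64 this compiles down to a
 * single one-operand mul. Returns the low half, stores the high
 * half through *hi (the rdx:rax split). */
static uint64_t mul_full_u64(uint64_t a, uint64_t b, uint64_t *hi) {
    unsigned __int128 p = (unsigned __int128)a * b;
    *hi = (uint64_t)(p >> 64);  /* high half (rdx) */
    return (uint64_t)p;         /* low half (rax) */
}
```

The MSVC spelling would be `lo = _umul128(a, b, &hi);` with the same operand roles.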

Athenian answered 24/11, 2015 at 13:17 Comment(0)

Classic mul (schoolbook long multiplication):

[1 qword * 1 qword] is 1 op;
[2 qwords * 2 qwords] is 4 ops;
[4 qwords * 4 qwords] is about 20 ops.

Imul

imul: probably 1 op
mul: probably 11 ops

So when doing long (multi-word) arithmetic, the second is optimal; otherwise use the first variant.

Risa answered 15/3 at 12:43 Comment(2)
Are you talking about bigint math, like if the hardware only had an 8x8 -> 16-bit multiplier and had to implement a 2-byte x 2-byte (16x16 -> 32-bit) multiply in terms of that? That would indeed take four multiplies, but also some adds. But in actual reality, CPUs have 64-bit multiply hardware, so any operand-size for mul or imul is 2 or 3 uops on Intel CPUs. uops.info. mul rcx is 2 uops (64x64 -> 128-bit, with the second uop just writing RDX with the high-half result). mul ecx is 3 uops, 32x32 -> 64-bit, with the extra uop perhaps splitting the low 64 bits of the ALU output. – Tyne
(Some low-power CPUs like first-gen Silvermont and older AMD had only 32-bit integer multiply hardware, so imul rdx, rcx cost multiple uops, with mul rcx only costing a couple more. uops.info/… shows Bonnell (first-gen Silvermont) with 6-uop imul r64, r64 and 8-uop mul r64, and single-uop imul r32, r32.) – Tyne
