How can I multiply two hex 128 bit numbers in assembly
Asked Answered
B

2

5

I have two 128 bit numbers in memory in hexadecimal, for example (little endian):

x:0x12 0x45 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
y:0x36 0xa1 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00

I've to perform the unsigned multiplication between these two numbers so my new number will be:

z:0xcc 0xe3 0x7e 0x2b 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00

Now, I'm aware that I can move the half x and y number into rax and rbx registers and, for example, do the mul operation, and do the same with the other half. The problem is that by doing so I lose the carry-over and I've no idea how I can avoid that. It's about 4 hours I'm facing this problem and the only solution that can I see is the conversion in binary (and <-> shl,1).

Can you give me some input about this problem?
I think the best solution is to take one byte par time.

Billet answered 24/11, 2016 at 22:55 Comment(9)
x86 provides a 64 × 64 → 128 multiplication with the result in rdx:rax. You can use that operation to get the desired result. If you are interested, I can give details. Please tell me whether you want signed or unsigned multiplication.Burkitt
Gladly. The multiplication has to be unsigned. But I've to do a 128 x 128 -> 128 multiplication. If it goes in overflow I've to keep only the lowest part (it's normal) and set the OF flag. Let me knowBillet
Simply: The max number is (2^128)-1Billet
@Billet Are you satisfied with my answer? Is there any additional information you need?Burkitt
I'm trying to understand it. If I've question on the matter I will ask you again. Many thanks :-)Billet
Does Karatsuba algo serves your purpose?Aphasic
@AdrianColomitchi the numbers here are too short for that to help, the simple way to do it is already only 3 multiplicationsSyllabic
@harold: his numbers are short enough to do a 16x16 = 32 bit multiplication, but I think that's just an exampleVesical
@AdrianColomitchi Indeed, Karatsuba would be useful if OP cared about the high 128 bit of the result, too.Burkitt
G
10

As usual, ask a compiler how to do something efficiently: GNU C on 64-bit platforms supports __int128_t and __uint128_t.

__uint128_t mul128(__uint128_t a, __uint128_t b) { return a*b; }

compiles to (gcc6.2 -O3 on Godbolt)

   imul    rsi, rdx        # a_hi * b_lo
   mov     rax, rdi
   imul    rcx, rdi        # b_hi * a_lo
   mul     rdx             # a_lo * b_lo  widening multiply
   add     rcx, rsi        # add the cross products ...
   add     rdx, rcx        # ... into the high 64 bits.
   ret

Since this is targeting the x86-64 System V calling convention, a is in RSI:RDI, while b is in RCX:RDX. The result is returned in RDX:RAX.

Pretty nifty that it only takes one MOV instruction, since gcc doesn't need the high-half result of a_upper * b_lower or vice versa. It can destroy the high halves of the inputs with the faster 2-operand form of IMUL since they're only used once.

With -march=haswell to enable BMI2, gcc uses MULX to avoid even the one MOV.


Sometimes compiler output isn't perfect, but very often the general strategy is a good starting point for optimizing by hand.


Of course, if what you really wanted in the first place was 128-bit multiplies in C, just use the compiler's built-in support for it. That lets the optimizer do its job, often giving better results than if you'd written a couple parts in inline-asm. (https://gcc.gnu.org/wiki/DontUseInlineAsm).

Gerhardt answered 25/11, 2016 at 4:36 Comment(7)
With this example, rdx contains the half 128 bit numbers and rdi the other half?Billet
@Davide: highlighted the return-result location for you. The 128b result has its high half in RDX, low half in RAX. Notice that the low 64 bits of the result only depends on the low 64 bits of the two inputs. (Produced with MUL).Gerhardt
Doesn't works for me. I want to multiply 596a*18c2. Then in rsi I move 59h and in rdi 6ah. In rcx I move 18h and in rdx 0c2h. The result is in rdx 0x4d62 and in rax 0x5054, that doesn't correspond to the real multiplication, Result has to be 0x8a5b254Billet
@Davide: You don't split the inputs in half, you split them on a 64-bit boundary. You multiplied 0x000000590000006a * 0x00000018000000c2, and then apparently looked at the low 64 bits of the result. The result is 0x00004d6200005054, which this code returns as RDX = 00004d62, and RAX = 0x00005054, which is what you got.Gerhardt
Thanks Peter, sorry, but maybe you haven't understand. My result must be 0x0000000008a5b254 with that multiplication. With RDX = 00004d62 and RAX = 0x00005054 I can't do nothingBillet
@Davide: apparently you don't understand. You need to set rsi=0:rdi=0x596a and rcx=0:rdx=0x18c2 to do pass the inputs you described, and get the result you want. Also, if you know the high halves are zero, just use a single MUL instruction to get a 128b result.Gerhardt
great code works perfectly!!Chronometer
B
9

Let μ = 264, then we can decompose your 128 bit numbers a and b into a = a1μ + a2 and b = b1μ + b2. Then we can compute c = ab with 64 · 64 → 128 bit multiplications by first computing partial products:

q1μ + q2 = a2b2
r1μ + r2 = a1b2
s1μ + s2 = a2b1
t1μ + t2 = a1b1

and then accumulating them into a 256 bit result (watch the overflow when doing the additions!):

c = t1μ3 + (t2 + s1 + r1) μ2 + (s2 + r2 + q1) μ + q2

Burkitt answered 24/11, 2016 at 23:11 Comment(2)
Just an observation: might be easier for OP to relate better if you used his x, y, zChaunceychaunt
If you only need the low 128 of the result, you don't need some of the terms, and don't need to accumulate the full 256b result in the first place. Carry propagates strictly right-to-left, so nothing that happens to bits above the 128 you want can affect the result. (I'm sure you know this, but worth pointing out since not everyone does). Thus, the OP won't even need to compute the a1b1 multiply of high halves.Gerhardt

© 2022 - 2024 — McMap. All rights reserved.