How can I multiply two hex 128 bit numbers in assembly

Asked 24/11, 2016 at 22:55 Answered 25/11, 2016 at 4:36

Solved algorithm assembly byte x86-64 multiplication

I have two 128 bit numbers in memory in hexadecimal, for example (little endian):

x:0x12 0x45 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
y:0x36 0xa1 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00

I've to perform the unsigned multiplication between these two numbers so my new number will be:

z:0xcc 0xe3 0x7e 0x2b 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00

Now, I'm aware that I can move the half x and y number into rax and rbx registers and, for example, do the mul operation, and do the same with the other half. The problem is that by doing so I lose the carry-over and I've no idea how I can avoid that. It's about 4 hours I'm facing this problem and the only solution that can I see is the conversion in binary (and <-> shl,1).

Can you give me some input about this problem?
I think the best solution is to take one byte par time.

Billet answered 24/11, 2016 at 22:55 Comment(9)

x86 provides a 64 × 64 → 128 multiplication with the result in rdx:rax. You can use that operation to get the desired result. If you are interested, I can give details. Please tell me whether you want signed or unsigned multiplication. – Burkitt 24/11, 2016 at 22:58

Gladly. The multiplication has to be unsigned. But I've to do a 128 x 128 -> 128 multiplication. If it goes in overflow I've to keep only the lowest part (it's normal) and set the OF flag. Let me know – Billet 24/11, 2016 at 23:2

Simply: The max number is (2^128)-1 – Billet 24/11, 2016 at 23:39

@Billet Are you satisfied with my answer? Is there any additional information you need? – Burkitt 24/11, 2016 at 23:40

I'm trying to understand it. If I've question on the matter I will ask you again. Many thanks :-) – Billet 24/11, 2016 at 23:41

Does Karatsuba algo serves your purpose? – Aphasic 25/11, 2016 at 0:17

@AdrianColomitchi the numbers here are too short for that to help, the simple way to do it is already only 3 multiplications – Syllabic 25/11, 2016 at 0:30

@harold: his numbers are short enough to do a 16x16 = 32 bit multiplication, but I think that's just an example – Vesical 25/11, 2016 at 8:46

@AdrianColomitchi Indeed, Karatsuba would be useful if OP cared about the high 128 bit of the result, too. – Burkitt 25/11, 2016 at 13:22

As usual, ask a compiler how to do something efficiently: GNU C on 64-bit platforms supports __int128_t and __uint128_t.

__uint128_t mul128(__uint128_t a, __uint128_t b) { return a*b; }

compiles to (gcc6.2 -O3 on Godbolt)

   imul    rsi, rdx        # a_hi * b_lo
   mov     rax, rdi
   imul    rcx, rdi        # b_hi * a_lo
   mul     rdx             # a_lo * b_lo  widening multiply
   add     rcx, rsi        # add the cross products ...
   add     rdx, rcx        # ... into the high 64 bits.
   ret

Since this is targeting the x86-64 System V calling convention, a is in RSI:RDI, while b is in RCX:RDX. The result is returned in RDX:RAX.

Pretty nifty that it only takes one MOV instruction, since gcc doesn't need the high-half result of a_upper * b_lower or vice versa. It can destroy the high halves of the inputs with the faster 2-operand form of IMUL since they're only used once.

With -march=haswell to enable BMI2, gcc uses MULX to avoid even the one MOV.

Sometimes compiler output isn't perfect, but very often the general strategy is a good starting point for optimizing by hand.

Of course, if what you really wanted in the first place was 128-bit multiplies in C, just use the compiler's built-in support for it. That lets the optimizer do its job, often giving better results than if you'd written a couple parts in inline-asm. (https://gcc.gnu.org/wiki/DontUseInlineAsm).

Is there a 128 bit integer in gcc? for GNU C unsigned __int128
https://learn.microsoft.com/en-us/cpp/intrinsics/umul128?view=msvc-170 MSVC's _umul128 that does 64x64 => 128-bit multiply (on 64-bit CPUs only). Takes args as 64-bit halves, returns two halves.
Getting the high part of 64 bit integer multiplication - Including with MSVC intrinsics, but still only for 64-bit CPUs.
An efficient way to do basic 128 bit integer calculations in C++?

Gerhardt answered 25/11, 2016 at 4:36 Comment(7)

With this example, rdx contains the half 128 bit numbers and rdi the other half? – Billet 25/11, 2016 at 23:47

@Davide: highlighted the return-result location for you. The 128b result has its high half in RDX, low half in RAX. Notice that the low 64 bits of the result only depends on the low 64 bits of the two inputs. (Produced with MUL). – Gerhardt 26/11, 2016 at 4:21

Doesn't works for me. I want to multiply 596a*18c2. Then in rsi I move 59h and in rdi 6ah. In rcx I move 18h and in rdx 0c2h. The result is in rdx 0x4d62 and in rax 0x5054, that doesn't correspond to the real multiplication, Result has to be 0x8a5b254 – Billet 28/11, 2016 at 19:51

@Davide: You don't split the inputs in half, you split them on a 64-bit boundary. You multiplied 0x000000590000006a * 0x00000018000000c2, and then apparently looked at the low 64 bits of the result. The result is 0x00004d6200005054, which this code returns as RDX = 00004d62, and RAX = 0x00005054, which is what you got. – Gerhardt 29/11, 2016 at 0:55

Thanks Peter, sorry, but maybe you haven't understand. My result must be 0x0000000008a5b254 with that multiplication. With RDX = 00004d62 and RAX = 0x00005054 I can't do nothing – Billet 29/11, 2016 at 1:5

@Davide: apparently you don't understand. You need to set rsi=0:rdi=0x596a and rcx=0:rdx=0x18c2 to do pass the inputs you described, and get the result you want. Also, if you know the high halves are zero, just use a single MUL instruction to get a 128b result. – Gerhardt 29/11, 2016 at 1:12

great code works perfectly!! – Chronometer 31/12, 2023 at 11:42

Let μ = 2⁶⁴, then we can decompose your 128 bit numbers a and b into a = a₁μ + a₂ and b = b₁μ + b₂. Then we can compute c = ab with 64 · 64 → 128 bit multiplications by first computing partial products:

q₁μ + q₂ = a₂b₂
r₁μ + r₂ = a₁b₂
s₁μ + s₂ = a₂b₁
t₁μ + t₂ = a₁b₁

and then accumulating them into a 256 bit result (watch the overflow when doing the additions!):

c = t₁μ³ + (t₂ + s₁ + r₁) μ² + (s₂ + r₂ + q₁) μ + q₂

Burkitt answered 24/11, 2016 at 23:11 Comment(2)

Just an observation: might be easier for OP to relate better if you used his x, y, z – Chaunceychaunt 25/11, 2016 at 1:55

If you only need the low 128 of the result, you don't need some of the terms, and don't need to accumulate the full 256b result in the first place. Carry propagates strictly right-to-left, so nothing that happens to bits above the 128 you want can affect the result. (I'm sure you know this, but worth pointing out since not everyone does). Thus, the OP won't even need to compute the a1b1 multiply of high halves. – Gerhardt 25/11, 2016 at 4:25

Recommended topics

Hot tags