Getting the high part of 64 bit integer multiplication

In C++, say that:

uint64_t i;
uint64_t j;

then i * j will yield a uint64_t whose value is the low part of the product of i and j, i.e., (i * j) mod 2^64. Now, what if I wanted the high part of the multiplication? I know that there exists an assembly instruction to do something like that when using 32-bit integers, but I am not familiar at all with assembly, so I was hoping for help.

What is the most efficient way to make something like:

uint64_t k = mulhi(i, j);
Gamali answered 5/3, 2015 at 1:23 Comment(22)
Reference: blogs.msdn.com/b/oldnewthing/archive/2014/12/08/10578956.aspxAniseed
GCC has uint128_t for this purpose. Visual Studio has no such option though.Eelgrass
@MooingDuck Looks like uint128_t doesn't exist under my environment (I am using Xcode under OS X). Moreover, that will explicitly compute both the higher and lower part of that multiplication, which I would like to avoid.Gamali
@NickyC thanks for the reference! As I said, I have little to no experience with assembly. Could you provide a simple example code that will do what I need? Sorry, I should definitely study assembly once and for all!Gamali
@MatteoMonti It is not feasible to compute higher part without lower part because the carry from lower part propagates to the higher part.Aniseed
@MatteoMonti It is not about assembly. I am just trying to show you the Maths.Aniseed
If performance is not a big concern, try an arbitrary length integer class to get the result.Tristichous
@NeilKirk performance is my main concern, actually...Gamali
So if I migrate to a platform where I have uint128_t that is likely to be the most efficient way to do what i need?Gamali
If performance is the real concern. You need to learn enough assembly to code this inline. On a 64 bit processor, there will (should? ) be instructions to multiply upper and lower 32 bit numbers.Chaff
#25096241 #28767255 #88271 #28807841Sperrylite
There is __int128 in gcc as well as llvm including Apple Clang. #13188129Sperrylite
Some more high bits of long multiplication in Java? Computing high 64 bits of a 64x64 int product in C Reasonably portable way to get top 64-bits from 64x64 bit multiply? Pure high-bit multiplication in assembly?Sperrylite
@LưuVĩnhPhúc thank you, I think I will just use 128-bit multiplication at this point. That sounds better-performing than any other solution I could implement on my own, since I suppose that any possible optimization must have already been implemented by the compiler developers.Gamali
@phuclv: This question is not a duplicate of the one linked. That other question is focusing on 32 bit multiplications while this one is focusing on 64 bit multiplications. When people come to this question they follow the link (like I did) and have to go back to this question. I think it should be reopened (and maybe closed again with a better dup).Dubious
@Dubious there should be no difference. Simply double every variable type and the problem is solvedSperrylite
but yes, probably the other question doesn't have good enough generic answerSperrylite
@Sperrylite You cannot double the variable types as easily. You would need a 128 bit integer type.Dubious
@Dubious no, you don't need that just to take the higher part of a 64x64 multiplication, for example just widen the assembly instructions in the other question and you're good to go. And did you see my other linked questions?Sperrylite
@Sperrylite This is a C++ question, not an assembly question. Of course I can find a solution in assembly multiplying two 64 bit registers. The whole point is to know whether this is possible portably in C++. And if you are stuck in portable C++, the 32 bit question has a trivial answer (multiply two std::uint64_t) and the 64 bit question is difficult (because we don't have a std::uint128_t)Dubious
Let us continue this discussion in chat.Dubious
A better dupe is Computing high 64 bits of a 64x64 int product in C - and that has an answer that clearly shows how to derive good results for similar problems.Navada

If you're using gcc and the version you have supports 128-bit numbers (try __uint128_t), then performing the 128-bit multiply and extracting the upper 64 bits is likely to be the most efficient way of getting the result.
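
For example, a minimal sketch (not part of the original answer), assuming __uint128_t is available:

uint64_t mulhi(uint64_t a, uint64_t b)
{
    // Widen one operand so the multiply is done in 128 bits,
    // then keep only the upper 64 bits.
    return (uint64_t)(((__uint128_t)a * b) >> 64);
}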

If your compiler doesn't support 128-bit numbers, then Yakk's answer is correct. However, it may be too brief for general consumption. In particular, an actual implementation has to be careful not to overflow 64-bit integers.

The simple and portable solution he proposes is to break each of a and b into two 32-bit halves and then multiply those halves using the 64-bit multiply operation. If we write:

uint64_t a_lo = (uint32_t)a;
uint64_t a_hi = a >> 32;
uint64_t b_lo = (uint32_t)b;
uint64_t b_hi = b >> 32;

then it is obvious that:

a = (a_hi << 32) + a_lo;
b = (b_hi << 32) + b_lo;

and:

a * b = ((a_hi << 32) + a_lo) * ((b_hi << 32) + b_lo)
      = ((a_hi * b_hi) << 64) +
        ((a_hi * b_lo) << 32) +
        ((b_hi * a_lo) << 32) +
          a_lo * b_lo

provided the calculation is performed using 128 bit (or greater) arithmetic.

But this problem requires that we perform all the calculations using 64-bit arithmetic, so we have to worry about overflow.

Since a_hi, a_lo, b_hi, and b_lo are all unsigned 32 bit numbers, their product will fit in an unsigned 64 bit number without overflow. However, the intermediate results of the above calculation will not.

The following code will implement mulhi(a, b) when the mathematics must be performed modulo 2^64:

uint64_t    a_lo = (uint32_t)a;
uint64_t    a_hi = a >> 32;
uint64_t    b_lo = (uint32_t)b;
uint64_t    b_hi = b >> 32;

uint64_t    a_x_b_hi =  a_hi * b_hi;
uint64_t    a_x_b_mid = a_hi * b_lo;
uint64_t    b_x_a_mid = b_hi * a_lo;
uint64_t    a_x_b_lo =  a_lo * b_lo;

uint64_t    carry_bit = ((uint64_t)(uint32_t)a_x_b_mid +
                         (uint64_t)(uint32_t)b_x_a_mid +
                         (a_x_b_lo >> 32) ) >> 32;

uint64_t    multhi = a_x_b_hi +
                     (a_x_b_mid >> 32) + (b_x_a_mid >> 32) +
                     carry_bit;

return multhi;
                                              

As Yakk points out, if you don't mind being off by +1 in the upper 64 bits, you can omit the calculation of the carry bit.
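
For reference, here is a minimal sketch of that carry-free approximation (my illustration, not code from the original answer), reusing the variables above:

uint64_t    multhi_approx = a_x_b_hi +
                            (a_x_b_mid >> 32) + (b_x_a_mid >> 32);
// The carry out of the low 64 bits is ignored, so this can be
// slightly below the exact high half.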

Horacehoracio answered 6/3, 2015 at 17:49 Comment(0)

TL:DR with GCC for a 64-bit ISA: (a * (unsigned __int128)b) >> 64 compiles nicely, to a single full-multiply or high-half multiply instruction. No need to mess around with inline asm.


Unfortunately current compilers don't optimize @craigster0's nice portable version, so if you want to take advantage of 64-bit CPUs, you can't use it except as a fallback for targets you don't have an #ifdef for. (I don't see a generic way to optimize it; you need a 128-bit type or an intrinsic.)


GNU C (gcc, clang, or ICC) has unsigned __int128 on most 64-bit platforms. (Or in older versions, __uint128_t). GCC doesn't implement this type on 32-bit platforms, though.

This is an easy and efficient way to get the compiler to emit a 64-bit full-multiply instruction and keep the high half. (GCC knows that a uint64_t cast to a 128-bit integer still has the upper half all zero, so you don't get a 128-bit multiply using three 64-bit multiplies.)

MSVC also has a __umulh intrinsic for 64-bit high-half multiplication, but again it's only available on 64-bit platforms (specifically x86-64 and AArch64). The docs also mention IPF (IA-64) having _umul128 available, but I don't have MSVC for Itanium to check. (Probably not relevant anyway.)

#define HAVE_FAST_mul64 1

#ifdef __SIZEOF_INT128__     // GNU C
 static inline
 uint64_t mulhi64(uint64_t a, uint64_t b) {
     unsigned __int128 prod =  a * (unsigned __int128)b;
     return prod >> 64;
 }

#elif defined(_M_X64) || defined(_M_ARM64)     // MSVC
   // MSVC for x86-64 or AArch64
   // possibly also  || defined(_M_IA64) || defined(_WIN64)
   // but the docs only guarantee x86-64!  Don't use *just* _WIN64; it doesn't include AArch64 Android / Linux

  // https://learn.microsoft.com/en-gb/cpp/intrinsics/umulh
  #include <intrin.h>
  #define mulhi64 __umulh

#elif defined(_M_IA64) // || defined(_M_ARM)       // MSVC again
  // https://learn.microsoft.com/en-gb/cpp/intrinsics/umul128
  // The docs incorrectly say that _umul128 is available for ARM,
  // which would be weird because there's no single insn on AArch32
  #include <intrin.h>
  static inline
  uint64_t mulhi64(uint64_t a, uint64_t b) {
     unsigned __int64 HighProduct;
     (void)_umul128(a, b, &HighProduct);
     return HighProduct;
  }

#else

# undef HAVE_FAST_mul64
  uint64_t mulhi64(uint64_t a, uint64_t b);  // non-inline prototype
  // or you might want to define @craigster0's version here so it can inline.
#endif

For x86-64, AArch64, and PowerPC64 (and others), this compiles to one mul instruction, and a couple movs to deal with the calling convention (which should optimize away after this inlines). From the Godbolt compiler explorer (with source + asm for x86-64, PowerPC64, and AArch64):

     # x86-64 gcc7.3.  clang and ICC are the same.  (x86-64 System V calling convention)
     # MSVC makes basically the same function, but with different regs for x64 __fastcall
    mov     rax, rsi
    mul     rdi              # RDX:RAX = RAX * RDI
    mov     rax, rdx
    ret

(or with clang -march=haswell to enable BMI2: mov rdx, rsi / mulx rax, rcx, rdi to put the high-half in RAX directly. gcc is dumb and still uses an extra mov.)

For AArch64 (with gcc unsigned __int128 or MSVC with __umulh):

test_var:
    umulh   x0, x0, x1
    ret

With a compile-time constant power of 2 multiplier, we usually get the expected right-shift to grab a few high bits. But gcc amusingly uses shld (see the Godbolt link).


Unfortunately current compilers don't optimize @craigster0's nice portable version. You get 8x shr r64,32, 4x imul r64,r64, and a bunch of add/mov instructions for x86-64. i.e. it compiles to a lot of 32x32 => 64-bit multiplies and unpacks of the results. So if you want something that takes advantage of 64-bit CPUs, you need some #ifdefs.

A full-multiply mul 64 instruction is 2 uops on Intel CPUs, but still only 3 cycle latency, same as imul r64,r64 which only produces a 64-bit result. So the __int128 / intrinsic version is 5 to 10 times cheaper in latency and throughput (impact on surrounding code) on modern x86-64 than the portable version, from a quick eyeball guess based on http://agner.org/optimize/.

Check it out on the Godbolt compiler explorer on the above link.

gcc does fully optimize this function when multiplying by 16, though: you get a single right shift, more efficient than with unsigned __int128 multiply.
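
For instance (a sketch of my own, reusing the mulhi64 above): with a power-of-2 constant like 16, the high half is mathematically just the top 4 bits of a, so ideally the whole thing reduces to a single shift:

uint64_t mulhi64_by16(uint64_t a) {
    return mulhi64(a, 16);   // high 64 bits of a*16, i.e. a >> 60
}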

Bagman answered 21/6, 2018 at 0:36 Comment(1)
Related: it seems I wrote an earlier version of this answer on a different question a couple years before this: Getting the high half and low half of a full integer multiplyBagman

This is a unit-tested version that I came up with tonight that provides the full 128-bit product. On inspection it seems to be simpler than most of the other solutions online (e.g. in the Botan library and other answers here) because it takes advantage of how the MIDDLE PART does not overflow, as explained in the code comments.

For context I wrote it for this github project: https://github.com/catid/fp61

//------------------------------------------------------------------------------
// Portability Macros

// Compiler-specific force inline keyword
#ifdef _MSC_VER
# define FP61_FORCE_INLINE inline __forceinline
#else
# define FP61_FORCE_INLINE inline __attribute__((always_inline))
#endif


//------------------------------------------------------------------------------
// Portable 64x64->128 Multiply
// CAT_MUL128: r{hi,lo} = x * y

// Returns low part of product, and high part is set in r_hi
FP61_FORCE_INLINE uint64_t Emulate64x64to128(
    uint64_t& r_hi,
    const uint64_t x,
    const uint64_t y)
{
    const uint64_t x0 = (uint32_t)x, x1 = x >> 32;
    const uint64_t y0 = (uint32_t)y, y1 = y >> 32;
    const uint64_t p11 = x1 * y1, p01 = x0 * y1;
    const uint64_t p10 = x1 * y0, p00 = x0 * y0;
    /*
        This is implementing schoolbook multiplication:

                x1 x0
        X       y1 y0
        -------------
                   00  LOW PART
        -------------
                00
             10 10     MIDDLE PART
        +       01
        -------------
             01 
        + 11 11        HIGH PART
        -------------
    */

    // 64-bit product + two 32-bit values
    const uint64_t middle = p10 + (p00 >> 32) + (uint32_t)p01;

    /*
        Proof that 64-bit products can accumulate two more 32-bit values
        without overflowing:

        Max 32-bit value is 2^32 - 1.
        PSum = (2^32-1) * (2^32-1) + (2^32-1) + (2^32-1)
             = 2^64 - 2^32 - 2^32 + 1 + 2^32 - 1 + 2^32 - 1
             = 2^64 - 1
        Therefore it cannot overflow regardless of input.
    */

    // 64-bit product + two 32-bit values
    r_hi = p11 + (middle >> 32) + (p01 >> 32);

    // Add LOW PART and lower half of MIDDLE PART
    return (middle << 32) | (uint32_t)p00;
}

#if defined(_MSC_VER) && defined(_WIN64)
// Visual Studio 64-bit

# include <intrin.h>
# pragma intrinsic(_umul128)
# define CAT_MUL128(r_hi, r_lo, x, y) \
    r_lo = _umul128(x, y, &(r_hi));

#elif defined(__SIZEOF_INT128__)
// Compiler supporting 128-bit values (GCC/Clang)

# define CAT_MUL128(r_hi, r_lo, x, y)                   \
    {                                                   \
        unsigned __int128 w = (unsigned __int128)x * y; \
        r_lo = (uint64_t)w;                             \
        r_hi = (uint64_t)(w >> 64);                     \
    }

#else
// Emulate 64x64->128-bit multiply with 64x64->64 operations

# define CAT_MUL128(r_hi, r_lo, x, y) \
    r_lo = Emulate64x64to128(r_hi, x, y);

#endif // End CAT_MUL128
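
A hypothetical usage sketch (not from the original post), showing one way to wrap the macro when only the high half is needed:

static uint64_t mul128_hi(uint64_t x, uint64_t y)
{
    uint64_t hi, lo;
    CAT_MUL128(hi, lo, x, y);   // hi:lo = full 128-bit product of x and y
    return hi;
}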
Sulphide answered 30/7, 2018 at 5:5 Comment(5)
(Your comments alternate between above the code and below the code.)Ulrike
I ported this to C#, and it's faster than any other 64x64 function I've come across!Literati
Don't know if it matters, but this breaks on Aarch64. _umul128 is not available.Leucocyte
How about signed multiply? Does that just remember the sign, do unsigned multiply, and apply the sign?Plenipotent
Thanks! That was what I was looking for! I adjusted it to the 256 case and this is the only version that works faster then my previous implementation. All attempts based on Karatsuba algorithm were actually slower due to amount of temporary carry over bit states I had to store.Elm

Long multiplication should be OK for performance.

Split a*b into (hia + loa) * (hib + lob). This gives four 32-bit multiplies plus some shifts. Do them in 64 bits, handle the carries manually, and you'll get the high portion.

Note that an approximation of the high portion can be done with fewer multiplies -- accurate to within 2^33 or so with one multiply, and to within 1 with three multiplies.
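
As a rough illustration of that idea (my sketch, not code from this answer), the three-multiply approximation simply drops the low*low contribution:

uint64_t mulhi_approx(uint64_t a, uint64_t b)
{
    uint64_t a_hi = a >> 32, a_lo = (uint32_t)a;
    uint64_t b_hi = b >> 32, b_lo = (uint32_t)b;
    // The a_lo * b_lo term (and its carry) is ignored, so the
    // result can be slightly below the exact high half.
    return a_hi * b_hi + ((a_hi * b_lo) >> 32) + ((a_lo * b_hi) >> 32);
}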

I do not think there is a portable alternative.

Herren answered 5/3, 2015 at 4:25 Comment(2)
Why not portable? You can even do arbitrary precision math in C portably without any assemblySperrylite
@luru I mean fast portable alternative. This is basically a bignum with a tiny max size.Herren

This will provide two values:

  • mult_result, equal to ((a * b) % (2 ** 64))
  • mult_carry, equal to (((a * b) - ((a * b) % (2 ** 64))) / (2 ** 64))

This would be equivalent to multiplying (in base-10) (3 * 7), which results in 21. In this case, mult_result would be equal to 1, and mult_carry equal to 2. Here, mult_result represents (1 * (10 ** 0)) (1), while mult_carry represents (2 * (10 ** 1)) (20). Added together, they will yield the expected result: 21.

Another example would be (((2 ** 64) - 1) * ((2 ** 64) - 1)). We will wind up with the following values:

  • mult_result: 1
  • mult_carry: 18446744073709551614

Here, mult_result represents (1 * ((2 ** 64) ** 0)), equal to 1. The mult_carry represents (18446744073709551614 * ((2 ** 64) ** 1)), equal to 340282366920938463426481119284349108224 (too large to represent with a uint64_t). If they are expanded and summed, the expected result is yielded: 340282366920938463426481119284349108225.

This means that mult_carry must always be handled with the understanding that it is a representation of the next "digit" up in base 2^64, much the same as we do with base-10 (see the example for (3 * 7)). It is, however, safe in that it will avoid overflow.

C++ code:

#include <cstdint>
#include <limits>
#include <tuple>

inline constexpr uint64_t UINT64_T_MAX = std::numeric_limits<uint64_t>::max();

std::tuple<uint64_t, uint64_t> overflow_multiply(const uint64_t a, const uint64_t b) {
  const uint64_t a_upper_half = (a >> 32);
  const uint64_t a_lower_half = (a & ((1ULL << 32) - 1));
  const uint64_t b_upper_half = (b >> 32);
  const uint64_t b_lower_half = (b & ((1ULL << 32) - 1));
  const uint64_t p1 = (a_upper_half * b_upper_half);
  const uint64_t p2 = (a_upper_half * b_lower_half);
  const uint64_t p3 = (a_lower_half * b_upper_half);
  const uint64_t p4 = (a_lower_half * b_lower_half);
  const uint64_t p2_r32 = (p2 >> 32);
  const uint64_t p3_r32 = (p3 >> 32);
  const uint64_t p2_l32 = (p2 << 32);
  const uint64_t p3_l32 = (p3 << 32);
  const uint64_t mult_result = (p4 + p2_l32 + p3_l32);
  const uint64_t mult_carry = (
    p1 + p2_r32 + p3_r32 + 
    (mult_result < p4) +
    (p3_l32 > (UINT64_T_MAX - p2_l32))
  );
  return { mult_result, mult_carry };
}
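
A hypothetical usage sketch reproducing the second example above:

#include <cstdio>

int main() {
    const auto [result, carry] = overflow_multiply(UINT64_T_MAX, UINT64_T_MAX);
    // Expect result == 1 and carry == 18446744073709551614
    std::printf("%llu %llu\n", (unsigned long long)result, (unsigned long long)carry);
}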
Falsity answered 15/1 at 16:4 Comment(2)
Is this faster than craigster0's code in practice on some real 32-bit architectures with some compilers? (Or 64-bit architectures for portable code that is willing to sacrifice a lot of speed to avoid needing extensions like GNU unsigned __int128 or MSVC __umulh?) It looks pretty similar, but looks to me like this is doing a bit more work even for just the parts that lead to mult_carry, assuming the rest optimizes away if the mult_result part isn't used by the caller.Bagman
There's another Q&A about doing a full multiply to produce a 64 or 128-bit result from 32 or 64-bit inputs, if that was the goal. Getting the high half and low half of a full integer multiply It doesn't have an answer showing a portable fallback.Bagman

Here's the inline asm for the ARMv8 / AArch64 version:

// High (p1) and low (p0) product
uint64_t p0, p1;
// multiplicand and multiplier
uint64_t a = ..., b = ...;

p0 = a*b; asm ("umulh %0,%1,%2" : "=r"(p1) : "r"(a), "r"(b));

And here's the asm for old DEC compilers:

p0 = a*b; p1 = asm("umulh %a0, %a1, %v0", a, b);

If you have x86's BMI2 and would like to use mulxq:

asm ("mulxq %3, %0, %1" : "=r"(p0), "=r"(p1) : "d"(a), "r"(b));

And the generic x86 multiply using mulq:

asm ("mulq %3" : "=a"(p0), "=d"(p1) : "a"(a), "g"(b) : "cc");
Leucocyte answered 18/10, 2019 at 21:12 Comment(8)
This is GNU C inline asm, which means you could have used unsigned __int128 instead, like my answer shows. What's the use-case for this? Do some GCC or clang versions fail to emit just a umulh for (a * (unsigned __int128)b) >> 64? Oh, I just had a look at my answer, and it shows AArch64 GCC emitting umulh.Bagman
@Peter - You did not answer OP's question. He wanted the asm for the operation; not a disassembly of C code.Leucocyte
That's obviously not the case; the accepted answer is pure C++ with no mention of asm or inline asm. I would highly recommend against future readers ever using inline asm for this, especially on a 64-bit target, so showing how to wrap umulh with GNU C inline asm seems totally useless to me. Especially when my answer already shows that the instruction exists.Bagman
Whatever, @Peter. I try to answer the question that was asked. You are free to answer the question you wished was asked.Leucocyte
I don't think I'm engaging in wishful thinking. I see we disagree over interpreting the question (and/or what might be useful to future readers). The question says "but I am not familiar at all with assembly, so I was hoping for help." They're avoiding an XY problem by asking how to get the high half in C++, with inline asm as an option they think might be useful, not a requirement. It's not even tagged [inline-assembly].Bagman
Anyway, I'm also free to downvote answers I think are poor advice for future readers, and have done so. Is there a compiler that accepts this but not __int128? (Other than ancient gcc). If so, say so in your answer and I'll upvote.Bagman
"g" includes immediate, which x86 mul doesn't support. godbolt.org/z/r0NeIi. Probably best to use "r" to stop clang from shooting itself in the foot and storing first if you give it the option of memory.Bagman
The OP clearly isn't asking specifically for assembly. They only say they know such assembly exists and ask for the best way to get the compiler to emit those instructions. Inline assembly hasn't been advisable where it can be avoided for well over a decade.Wounded
