How can I multiply 64-bit operands and get a 128-bit result portably?

For x64 I can use this:

{
    uint64_t hi, lo;
    // hi:lo = 64-bit x 64-bit multiply of c[0] and b[0]

    __asm__("mulq %3\n\t"
            : "=d" (hi),
              "=a" (lo)
            : "%a" (c[0]),
              "rm" (b[0])
            : "cc");

    a[0] += hi;
    a[1] += lo;
}

But I'd like to perform the same calculation portably, so that it also works on 32-bit x86, for instance.

Ruffled asked 2/8, 2014 at 13:53 Comment(10)
What are the types of c[0] and b[0]? Why not just multiply two uint64_t values?Refugiorefulgence
What is the problem? and what is the question?Remediable
mulq is a 64-bit instruction, which is the problem; c and b are uint64_t.Ruffled
Write it in C++ for correctness. MEASURE whether it's too slow. If it is, check the generated 32-bit assembly.Pronuba
There is no generated 32-bit assembly at all, and it's C, not C++. Why point me toward other approaches instead of a basic solution?Ruffled
If it is C rather than C++, why did you tag the question C++? Very hard to understand why you would want to use asm to perform trivial multiplication. I also cannot understand your question. I don't know what you are asking.Bernadine
I deleted the C++ tag. I just want to do it without inline asm, that's all, and I want to be able to compile it for 32-bit machines; mulq is a 64-bit instruction.Ruffled
OK, you want a pure C 64 bit multiplication that produces a 128 bit result? Does this do what you need? codeproject.com/Tips/618570/UInt-Multiplication-SquaringBernadine
Or maybe this is what you're looking for: #8776573Nicks
@DavidHeffernan Kudos on the cleanup!Proa
18

As I understand the question, you want a portable, pure C implementation of 64-bit multiplication, with the 128-bit result stored in two 64-bit values. In that case, this article purports to have what you need. That code is written for C++; it doesn't take much to turn it into C:

void mult64to128(uint64_t op1, uint64_t op2, uint64_t *hi, uint64_t *lo)
{
    /* Split both operands into 32-bit halves and do a grade-school
       multiply, propagating carries between the partial products. */
    uint64_t u1 = (op1 & 0xffffffff);
    uint64_t v1 = (op2 & 0xffffffff);
    uint64_t t = (u1 * v1);           /* low x low */
    uint64_t w3 = (t & 0xffffffff);   /* bits 0..31 of the result */
    uint64_t k = (t >> 32);           /* carry into bits 32..63 */

    op1 >>= 32;                       /* op1 is now its high half */
    t = (op1 * v1) + k;               /* high x low, plus carry */
    k = (t & 0xffffffff);
    uint64_t w1 = (t >> 32);          /* carry into the high word */

    op2 >>= 32;                       /* op2 is now its high half */
    t = (u1 * op2) + k;               /* low x high, plus pending sum */
    k = (t >> 32);

    *hi = (op1 * op2) + w1 + k;       /* high x high, plus carries */
    *lo = (t << 32) + w3;
}
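
A minimal way to exercise it (my own test sketch, assuming the function above is in the same file; not part of the linked article):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint64_t hi, lo;
    /* (2^64 - 1)^2 = 0xFFFFFFFFFFFFFFFE_0000000000000001 */
    mult64to128(UINT64_MAX, UINT64_MAX, &hi, &lo);
    printf("hi = %016llx, lo = %016llx\n",
           (unsigned long long)hi, (unsigned long long)lo);
    return 0;
}
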
Bernadine answered 2/8, 2014 at 14:44 Comment(5)
This uses 4 multiplications. You could reduce it to 3 by using Karatsuba's trick. Not sure how much it would increase speed.Deceitful
@ThomasAhle: It wouldn't; modern CPUs typically have pipelined multipliers, so you can get 1 result per 1 or 2 clocks. Having to do a lot more adds is worse, especially when you have to work around C's lack of an add-with-carry primitive operation so 128-bit additions cost even more than they would in asm.Necromancy
You mean it'd be slow because we'd have to worry about overflow when adding two 64-bit numbers? Though sometimes you know the numbers are less than 2^60, say, so you can split into 30-bit chunks and have room to spare for carries.Deceitful
@ThomasAhle: didn't see your reply since you didn't @ notify me. It'd be slower because you have significantly more total operations. Multiply isn't that slow compared to addition, and often a more important consideration is total number of micro-uops. (Usually one per instruction). Especially with out-of-order exec being able to overlap execution with surrounding code that hopefully doesn't bottleneck on multiply throughput. If actually compiling this C for a 64-bit CPU, the asm for just this will have a lot of non-multiply instructions to split up 64-bit integer regs to 32-bit halves...Necromancy
Working with 30-bit or 60-bit chunks can be a good technique if you're limited to pure C. Although probably not needed here since none of these 64-bit additions need to generate carry-out or take carry-in.Necromancy
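
As an aside on the add-with-carry point raised in these comments: a minimal pure C sketch of the usual workaround is below (add64_carry is a hypothetical helper name, not from the answer).

#include <stdint.h>

/* Pure C has no add-with-carry primitive, but the carry-out of an
   unsigned 64-bit add can be recovered by checking whether the
   wrapped sum is smaller than one of the operands. */
static uint64_t add64_carry(uint64_t a, uint64_t b, unsigned *carry_out)
{
    uint64_t sum = a + b;     /* wraps modulo 2^64 on overflow */
    *carry_out = (sum < a);   /* 1 iff the addition carried out */
    return sum;
}
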
12

Since you have gcc as a tag, note that you can just use gcc's 128-bit integer type:

typedef unsigned __int128 uint128_t;
// ...
uint64_t x, y;
// ...
uint128_t result = (uint128_t)x * y;
uint64_t lo = (uint64_t)result;          // low 64 bits
uint64_t hi = (uint64_t)(result >> 64);  // high 64 bits
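
A complete, compilable version of the same idea (my own sketch; guarded on __SIZEOF_INT128__, which GCC and Clang define when the type is available):

#include <stdio.h>
#include <stdint.h>

#ifdef __SIZEOF_INT128__
typedef unsigned __int128 uint128_t;

int main(void)
{
    uint64_t x = 0xDEADBEEFDEADBEEFULL;  /* arbitrary test values */
    uint64_t y = 0xFEEDFACEFEEDFACEULL;
    uint128_t result = (uint128_t)x * y;
    uint64_t lo = (uint64_t)result;
    uint64_t hi = (uint64_t)(result >> 64);
    printf("hi = %016llx, lo = %016llx\n",
           (unsigned long long)hi, (unsigned long long)lo);
    return 0;
}
#endif
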
Williwaw answered 5/3, 2015 at 4:49 Comment(2)
Getting the high part of 64 bit integer multiplication shows an ifdeffed version of that with MSVC's __umulh intrinsic. MSVC also has _umul128.Necromancy
Non-standard extensions are not "portable", and "portable" is even in the title of the question.Annunciate
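
For reference, the MSVC x64 route mentioned in the first comment (a sketch; mul_hi_lo is a hypothetical wrapper name):

#if defined(_MSC_VER) && defined(_M_X64)
#include <intrin.h>

/* _umul128 returns the low 64 bits of the product and writes the
   high 64 bits through the out-parameter; __umulh returns only the
   high half. */
static unsigned __int64 mul_hi_lo(unsigned __int64 x, unsigned __int64 y,
                                  unsigned __int64 *hi)
{
    return _umul128(x, y, hi);
}
#endif
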
6

The accepted solution isn't really the best solution, in my opinion.

  • It is confusing to read.
  • It has some funky carry handling.
  • It doesn't take advantage of the fact that 64-bit arithmetic may be available.
  • It displeases ARMv6, the God of Absolutely Ridiculous Multiplies. Whoever uses UMAAL shall not lag but have eternal 64-bit to 128-bit multiplies in 4 instructions.

Joking aside, it is much better to optimize for ARMv6 than any other platform because it will have the most benefit. x86 needs a complicated routine and it would be a dead end optimization.

The best way I have found (and the one used in xxHash3) selects among multiple implementations with preprocessor macros. It is a tiny bit slower than mult64to128 on x86 (by 1-2 instructions), but a lot faster on ARMv6:

#include <stdint.h>
#ifdef _MSC_VER
#  include <intrin.h>
#endif

/* Prevents a partial vectorization from GCC. */
#if defined(__GNUC__) && !defined(__clang__) && defined(__i386__)
  __attribute__((__target__("no-sse")))
#endif
static uint64_t multiply64to128(uint64_t lhs, uint64_t rhs, uint64_t *high)
{
    /*
     * GCC and Clang usually provide __uint128_t on 64-bit targets,
     * although Clang also defines it on WASM despite having to use
     * builtins for most purposes - including multiplication.
     */
#if defined(__SIZEOF_INT128__) && !defined(__wasm__)
    __uint128_t product = (__uint128_t)lhs * (__uint128_t)rhs;
    *high = (uint64_t)(product >> 64);
    return (uint64_t)(product & 0xFFFFFFFFFFFFFFFF);

    /* Use the _umul128 intrinsic on MSVC x64 to hint for mulq. */
#elif defined(_MSC_VER) && defined(_M_X64)
#   pragma intrinsic(_umul128)
    /* This intentionally has the same signature. */
    return _umul128(lhs, rhs, high);

#else
    /*
     * Fast yet simple grade school multiply that avoids
     * 64-bit carries with the properties of multiplying by 11
     * and takes advantage of UMAAL on ARMv6 to only need 4
     * calculations.
     */

    /* First calculate all of the cross products. */
    uint64_t lo_lo = (lhs & 0xFFFFFFFF) * (rhs & 0xFFFFFFFF);
    uint64_t hi_lo = (lhs >> 32)        * (rhs & 0xFFFFFFFF);
    uint64_t lo_hi = (lhs & 0xFFFFFFFF) * (rhs >> 32);
    uint64_t hi_hi = (lhs >> 32)        * (rhs >> 32);

    /* Now add the products together. These will never overflow. */
    uint64_t cross = (lo_lo >> 32) + (hi_lo & 0xFFFFFFFF) + lo_hi;
    uint64_t upper = (hi_lo >> 32) + (cross >> 32)        + hi_hi;

    *high = upper;
    return (cross << 32) | (lo_lo & 0xFFFFFFFF);
#endif /* portable */
}

On ARMv6, you can't get much better than this, at least on Clang:

multiply64to128:
        push    {r4, r5, r11, lr}
        umull   r12, r5, r2, r0
        umull   r2, r4, r2, r1
        umaal   r2, r5, r3, r0
        umaal   r4, r5, r3, r1
        ldr     r0, [sp, #16]
        mov     r1, r2
        strd    r4, r5, [r0]
        mov     r0, r12
        pop     {r4, r5, r11, pc}

The accepted solution generates a bunch of adds and adc, as well as an extra umull in Clang due to an instcombine bug.

I further explain the portable method in the link I posted.
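
A quick sanity check for the portable path (my own sketch, assuming multiply64to128 above is in the same translation unit, which already includes stdint.h):

#include <stdio.h>

int main(void)
{
    uint64_t hi, lo;
    /* Expected: hi = 0xFFFFFFFFFFFFFFFE, lo = 0x0000000000000001 */
    lo = multiply64to128(UINT64_MAX, UINT64_MAX, &hi);
    printf("hi = %016llx, lo = %016llx\n",
           (unsigned long long)hi, (unsigned long long)lo);
    return 0;
}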

Sclerotic answered 14/10, 2019 at 16:54 Comment(3)
__attribute__((__target__("no-sse"))) will probably stop it from inlining into functions with different target options, which might defeat constant propagation as well as adding call/ret overhead (especially in the nasty stack-args calling convention most 32-bit code uses). But that's only for 32-bit x86 so it's probably not going to hurt many use-cases.Necromancy
That is true. However, from testing on a Sandy Bridge, the shuffles completely bottleneck the algorithm.Sclerotic
Have you reported the missed-optimization bug to gcc's bugzilla? I was just pointing out that the workaround isn't perfect, but IDK if there's a way to do anything cheaper without using -fno-tree-vectorize for the whole file. Your attribute might well be the best choice if -O3 -march=native shoots itself in the foot that badly.Necromancy
