_umul128 on 32-bit Windows

In Visual C++, _umul128 is undefined when targeting 32-bit Windows.

How can two unsigned 64-bit integers be multiplied when targeting Win32? The solution only needs to work on Visual C++ 2017 targeting 32-bit Windows.

Chiquia answered 22/10, 2017 at 4:2 Comment(4)
msdn.microsoft.com/en-us/library/windows/desktop/…Jagir
Do you need to do a lot of this? It could possibly be worth using SSE2 or AVX2, depending on what you want to do. Although probably not, because there's no SIMD add-with-carry. But still, see #17863911 for getting a 64-bit result. It would take significantly more instructions to get a 128b result, though. #6738783Wabash
SIMD signed with unsigned multiplication for 64-bit * 64-bit to 128-bit, a possible speedup for 32-bit code vs. using adc + mul (or BMI2 mulx) as building blocks, for large arrays of numbers if you have AVX2 or AVX512; but for one at a time, scalar is probably better.Wabash
Also see SSE multiplication of 2 64-bit integers and Is it possible to use SSE and SSE2 to make a 128-bit wide integer?Treed

I found the following code (from xmrrig), which seems to do the job just fine:

static inline uint64_t __umul128(uint64_t multiplier, uint64_t multiplicand, 
    uint64_t *product_hi) 
{
    // multiplier   = ab = a * 2^32 + b
    // multiplicand = cd = c * 2^32 + d
    // ab * cd = a * c * 2^64 + (a * d + b * c) * 2^32 + b * d
    uint64_t a = multiplier >> 32;
    uint64_t b = multiplier & 0xFFFFFFFF;
    uint64_t c = multiplicand >> 32;
    uint64_t d = multiplicand & 0xFFFFFFFF;

    //uint64_t ac = a * c;
    uint64_t ad = a * d;
    //uint64_t bc = b * c;
    uint64_t bd = b * d;

    uint64_t adbc = ad + (b * c);
    uint64_t adbc_carry = adbc < ad ? 1 : 0;

    // multiplier * multiplicand = product_hi * 2^64 + product_lo
    uint64_t product_lo = bd + (adbc << 32);
    uint64_t product_lo_carry = product_lo < bd ? 1 : 0;
    *product_hi = (a * c) + (adbc >> 32) + (adbc_carry << 32) + product_lo_carry;

    return product_lo;
}
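
For a quick sanity check, here is a hypothetical driver (not part of the original snippet) that multiplies the two largest 64-bit values:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    // (2^64 - 1)^2 = 0xFFFFFFFFFFFFFFFE * 2^64 + 1
    uint64_t hi;
    uint64_t lo = __umul128(0xFFFFFFFFFFFFFFFFULL, 0xFFFFFFFFFFFFFFFFULL, &hi);
    printf("hi=%016llx lo=%016llx\n", (unsigned long long)hi, (unsigned long long)lo);
    // expected output: hi=fffffffffffffffe lo=0000000000000001
    return 0;
}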
Chiquia answered 25/10, 2017 at 2:43 Comment(5)
Unfortunately this compiles really poorly on MSVC. It uses a function-call to __allmul for uint64*uint64 => 64-bit multiplication even when the upper half of both inputs is a compile-time 0. i.e. when it could have used one mul instruction to do a 32x32 => 64b full-multiply. godbolt.org/g/53Qxyo. If there's an intrinsic for 32x32 => 64b multiply, it would make it a lot more efficient. clang compiles it fairly nicely; there's still a lot of work to do, but it uses the adc instruction for all the add-with-carry. MSVC uses some adc, but branches for others :/Wabash
Also, make sure you don't use this in 64-bit code where the intrinsic is available. (I guess using the same name is a good way to ensure compile-time error if the compiler provides it). It compiles in 64-bit mode, but to crap instead of a single mul instruction.Wabash
If you need this to actually be fast in 32-bit code, maybe consider using gmplib.org. They have a 32-bit asm implementation for multiply, but I don't see a special case for 2 limbs * 2 limbs = 4 limbs. IDK how fast the general-case function is when you pass it those sizes. gmplib.org/repo/gmp/file/tip/mpn/x86/mul_basecase.asmWabash
@Peter Cordes You can fix the MSVC __allmul problem by changing the type of a-d to uint32_t and casting to uint64_t at the point of the multiply. MSVC uses adc for uint64_t arithmetic. I think the only way to make it generate adc for the explicit carries is with the _addcarry_u## intrinsics, as seen here. _addcarry_u64 is not available in 32-bit code so you may have to roll your own 64-bit adds with two uint32_ts.Vogul
In MSVC you should use __emulu(a,b) instead of a * b, for better compiler code generationDuplex

This answer has a version of the xmrrig function from the other answer optimized for MSVC 32-bit mode. The original is fine with other compilers, especially clang.


I looked at MSVC's output for the function in the other answer, and it's really bad. Using __emulu for 32x32 => 64b multiplication improved it significantly (because MSVC is dumb and doesn't optimize uint64_t * uint64_t => uint64_t for the case where the inputs are known to really only be 32-bit, with the upper halves zero). Other compilers (gcc and clang) generate a single mul instruction instead of calling a helper function. There are other problems with MSVC's code-gen for this that I don't know how to fix by tweaking the source, though. I think if you want good performance on that compiler, you'll have to use inline asm (or a separately-compiled asm function).

If you need more flexible arbitrary-precision (larger numbers), see GMPlib's low-level functions which have asm implementations, instead of trying to build a 256b multiply out of this __umul128. But if you need this exactly, then it's worth trying. Sticking with C++ enables constant-propagation and CSE optimizations that you wouldn't get with asm.


clang compiles this with no major problems, actually using adc for all the add-with-carry (except one which it saves with a setc instruction). MSVC branches on the carry checks, and just makes nasty code. GCC doesn't do a very good job either, with some branching on carry. (Because gcc doesn't know how to turn carry = sum < a into an adc, gcc bug 79173.)

IDK if MSVC or gcc support any add-with-carry intrinsics for 64-bit integers in 32-bit mode. _addcarry_u64 generates poor code with gcc anyway (in 64-bit mode), but ICC might do ok. IDK about MSVC.

If you want an asm implementation, I'd suggest using clang 5.0's output from this function. You could probably find some optimizations by hand, but it's certainly better than MSVC's. But of course most of the arguments in https://gcc.gnu.org/wiki/DontUseInlineAsm apply: blocking constant-propagation is a major problem if you ever multiply by something that inlining turns into a constant, or by a number with the upper half known to be zero.

Source + asm output for MSVC 32-bit and clang5.0 32-bit on Godbolt

Nice code with clang. Kinda bad code with MSVC, but better than before. Kinda bad with gcc too (no change vs. the other answer).

#include <stdint.h>

#ifdef _MSC_VER
#  include <intrin.h>
#else
// MSVC doesn't optimize 32x32 => 64b multiplication without its intrinsic
// But good compilers can just use this to get a single mul instruction
static inline
uint64_t __emulu(uint32_t x, uint32_t y) {
     return x * (uint64_t)y;
}
#endif

// This is still pretty ugly with MSVC, branching on the carry
//  and using XMM store / integer reload to zero a register!
// But at least it inlines 4 mul instructions
//  instead of calling a generic 64x64 => 64b multiply helper function
uint64_t __umul128(uint64_t multiplier, uint64_t multiplicand, 
    uint64_t *product_hi) 
{
    // multiplier   = ab = a * 2^32 + b
    // multiplicand = cd = c * 2^32 + d
    // ab * cd = a * c * 2^64 + (a * d + b * c) * 2^32 + b * d
    uint64_t a = multiplier >> 32;
    uint64_t b = (uint32_t)multiplier; // & 0xFFFFFFFF;
    uint64_t c = multiplicand >> 32;
    uint64_t d = (uint32_t)multiplicand; // & 0xFFFFFFFF;

    //uint64_t ac = __emulu(a, c);
    uint64_t ad = __emulu(a, d);
    //uint64_t bc = __emulu(b, c);
    uint64_t bd = __emulu(b, d);

    uint64_t adbc = ad + __emulu(b , c);
    uint64_t adbc_carry = (adbc < ad); // ? 1 : 0;
    // MSVC compiles the ternary to worse code than a boolean used in an integer context

    // multiplier * multiplicand = product_hi * 2^64 + product_lo
    uint64_t product_lo = bd + (adbc << 32);
    uint64_t product_lo_carry = (product_lo < bd); // ? 1 : 0;
    *product_hi = __emulu(a , c) + (adbc >> 32) + (adbc_carry << 32) + product_lo_carry;

    return product_lo;
}

Make sure you only use this in 32-bit code. In 64-bit code, it fails to optimize to a single 64-bit mul instruction (which produces both 64-bit halves of the full result). Compilers that implement the GNU C++ extensions (clang, gcc, ICC) can use unsigned __int128 and get good code. e.g. a * (unsigned __int128)b produces a 128b result. (Example on Godbolt).
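
As a minimal sketch of that 64-bit-mode approach (the wrapper name here is illustrative, not from this answer):

#ifdef __SIZEOF_INT128__
// GNU C extension (gcc/clang/ICC): one widening multiply produces both halves.
static inline uint64_t umul128_gnu(uint64_t a, uint64_t b, uint64_t *hi)
{
    unsigned __int128 product = (unsigned __int128)a * b;
    *hi = (uint64_t)(product >> 64);
    return (uint64_t)product;
}
#endif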

Wabash answered 25/10, 2017 at 4:59 Comment(3)
For adbc_carry = (adbc < ad) you might try adbc_carry = !!(adbc < ad). The MSVC compiler recognizes the pattern and does the boolean to int conversion. I read about it on a Microsoft blog.Treed
Useful answer, upvoted! There is also the _addcarry_u64() intrinsic, which adds with carry and returns the carry bit; I don't know how clever MSVC is about using it automatically, but using it explicitly may improve speed a bit. Also, I don't know if this intrinsic is really available on all 32-bit machines, but at least the linked URL doesn't mention any special CPU requirement.Mere
@Arty: MSVC, clang, and GCC unfortunately don't define _addcarry_u64 for 32-bit targets. (I checked on gcc.godbolt.org/z/9jMGbYKnr). No obvious reason they couldn't; adc/adc is easy to emit. But in general 64-bit scalar intrinsics are only defined for x86-64 target machines, which makes sense for _lzcnt_u64 and so on where there's no super obvious emulation. I could have tried to use _addcarry_u32, but since I want two uint64_t halves that would be harder to use. And compilers other than ICC tend to do a bad job with _addcarry anyway, not keeping carry in CF.Wabash
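
For illustration, the 32-bit emulation mentioned above could look like this: a hypothetical, untested sketch that chains two _addcarry_u32 calls to add uint64_t halves (the helper name is made up):

#include <stdint.h>
#include <intrin.h>  // MSVC; gcc/clang declare _addcarry_u32 in <immintrin.h>

// Hypothetical stand-in for _addcarry_u64 on 32-bit targets.
static inline unsigned char addcarry_u64_emul(unsigned char c_in,
    uint64_t a, uint64_t b, uint64_t *sum)
{
    uint32_t lo, hi;
    unsigned char c = _addcarry_u32(c_in, (uint32_t)a, (uint32_t)b, &lo);
    c = _addcarry_u32(c, (uint32_t)(a >> 32), (uint32_t)(b >> 32), &hi);
    *sum = ((uint64_t)hi << 32) | lo;
    return c;
}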