In Visual C++, _umul128 is undefined when targeting 32-bit Windows. How can two unsigned 64-bit integers be multiplied when targeting Win32? The solution only needs to work on Visual C++ 2017 targeting 32-bit Windows.
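On x64, the intrinsic is declared in <intrin.h> and used like this minimal sketch (the wrapper name is mine); the same code fails to compile when targeting Win32 because _umul128 isn't declared there:

#include <intrin.h>
#include <stdint.h>

uint64_t mul_full(uint64_t a, uint64_t b, uint64_t *hi)
{
    // Compiles when targeting x64; fails for Win32, where _umul128 is not available.
    return _umul128(a, b, hi);
}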
I found the following code (from xmrrig), which seems to do the job just fine:
#include <stdint.h>

static inline uint64_t __umul128(uint64_t multiplier, uint64_t multiplicand,
                                 uint64_t *product_hi)
{
    // multiplier   = ab = a * 2^32 + b
    // multiplicand = cd = c * 2^32 + d
    // ab * cd = a * c * 2^64 + (a * d + b * c) * 2^32 + b * d
    uint64_t a = multiplier >> 32;
    uint64_t b = multiplier & 0xFFFFFFFF;
    uint64_t c = multiplicand >> 32;
    uint64_t d = multiplicand & 0xFFFFFFFF;

    //uint64_t ac = a * c;
    uint64_t ad = a * d;
    //uint64_t bc = b * c;
    uint64_t bd = b * d;

    uint64_t adbc = ad + (b * c);
    uint64_t adbc_carry = adbc < ad ? 1 : 0;  // carry out of the middle 64-bit sum

    // multiplier * multiplicand = product_hi * 2^64 + product_lo
    uint64_t product_lo = bd + (adbc << 32);
    uint64_t product_lo_carry = product_lo < bd ? 1 : 0;
    *product_hi = (a * c) + (adbc >> 32) + (adbc_carry << 32) + product_lo_carry;

    return product_lo;
}
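As a quick sanity check, one can verify (2^64 - 1)^2 = 2^128 - 2^65 + 1, so the high half should be 0xFFFFFFFFFFFFFFFE and the low half 1:

#include <assert.h>

int main(void)
{
    uint64_t hi;
    uint64_t lo = __umul128(0xFFFFFFFFFFFFFFFFULL, 0xFFFFFFFFFFFFFFFFULL, &hi);
    assert(hi == 0xFFFFFFFFFFFFFFFEULL && lo == 1);  // (2^64 - 1)^2
    return 0;
}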
MSVC calls __allmul for uint64*uint64 => 64-bit multiplication even when the upper half of both inputs is a compile-time 0, i.e. when it could have used one mul instruction to do a 32x32 => 64b full-multiply: godbolt.org/g/53Qxyo. If there's an intrinsic for 32x32 => 64b multiply, it would make it a lot more efficient. clang compiles it fairly nicely; there's still a lot of work to do, but it uses the adc instruction for all the add-with-carry. MSVC uses some adc, but branches for others :/ – Wabash

… mul instruction. – Wabash

The __allmul problem can be worked around by changing the type of a-d to uint32_t and casting to uint64_t at the point of the multiply. MSVC uses adc for uint64_t arithmetic. I think the only way to make it generate adc for the explicit carries is with the _addcarry_u## intrinsics, as seen here. _addcarry_u64 is not available in 32-bit code, so you may have to roll your own 64-bit adds with two uint32_ts. – Vogul
This answer has a version of the xmrrig function from the other answer optimized for MSVC 32-bit mode. The original is fine with other compilers, especially clang.
I looked at MSVC's output for @Augusto's function, and it's really bad. Using __emulu for 32x32 => 64b multiplication improved it significantly (because MSVC is dumb and doesn't optimize uint64_t * uint64_t = uint64_t for the case where the inputs are known to really only be 32-bit, with the upper half zero). Other compilers (gcc and clang) generate a single mul instruction instead of calling a helper function. There are other problems with MSVC's code-gen for this that I don't know how to fix by tweaking the source, though. I think if you want good performance on that compiler, you'll have to use inline asm (or a separately-compiled asm function).

If you need more flexible arbitrary-precision (larger numbers), see GMPlib's low-level functions, which have asm implementations, instead of trying to build a 256b multiply out of this __umul128. But if you need exactly this, then it's worth trying. Sticking with C++ enables constant-propagation and CSE optimizations that you wouldn't get with asm.
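For example, a 128b x 128b => 256b multiply on top of GMP's mpn layer could look like this sketch (the wrapper name and the 32-bit-limb assumption are mine; mpn_mul_n writes a 2n-limb product and its destination must not overlap the inputs):

#include <gmp.h>

// 128b x 128b => 256b, assuming 32-bit limbs (so 4 limbs per input).
// Limbs are stored least-significant first.
void mul_128x128_256(mp_limb_t product[8],
                     const mp_limb_t x[4], const mp_limb_t y[4])
{
    mpn_mul_n(product, x, y, 4);   // product receives 2*4 = 8 limbs
}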
clang compiles this with no major problems, actually using adc for all the add-with-carry (except one which it saves with a setc instruction). MSVC branches on the carry checks, and just makes nasty code. GCC doesn't do a very good job either, with some branching on carry. (Because gcc doesn't know how to turn carry = sum < a into an adc, gcc bug 79173.)
IDK if MSVC or gcc support any add-with-carry intrinsics for 64-bit integers in 32-bit mode. _addcarry_u64 generates poor code with gcc anyway (in 64-bit mode), but ICC might do ok. IDK about MSVC.
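If you wanted to experiment, a 64-bit add-with-carry can at least be expressed with two _addcarry_u32 steps; this is a sketch (function name mine), with no guarantee any compiler actually keeps the carry in CF across calls:

#include <stdint.h>
#ifdef _MSC_VER
# include <intrin.h>
#else
# include <immintrin.h>   // gcc/clang/ICC declare _addcarry_u32 here
#endif

// *sum = a + b + carry_in (mod 2^64); returns the carry out.
static inline unsigned char addcarry_u64_emulated(unsigned char carry_in,
                                                  uint64_t a, uint64_t b,
                                                  uint64_t *sum)
{
    unsigned int lo, hi;
    unsigned char c = _addcarry_u32(carry_in, (uint32_t)a, (uint32_t)b, &lo);
    c = _addcarry_u32(c, (uint32_t)(a >> 32), (uint32_t)(b >> 32), &hi);
    *sum = ((uint64_t)hi << 32) | lo;
    return c;
}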
If you want an asm implementation, I'd suggest using clang 5.0's output from this function. You could probably find some optimizations by hand, but it's certainly better than MSVC's. But of course most of the arguments in https://gcc.gnu.org/wiki/DontUseInlineAsm apply: blocking constant-propagation is a major problem if you ever multiply by something that inlining turns into a constant, or by a number with the upper half known to be zero.
Source + asm output for MSVC 32-bit and clang5.0 32-bit on Godbolt: nice code with clang; kinda bad code with MSVC, but better than before; kinda bad with gcc also (no change vs. the other answer).
#include <stdint.h>

#ifdef _MSC_VER
# include <intrin.h>
#else
// MSVC doesn't optimize 32x32 => 64b multiplication without its intrinsic,
// but good compilers can just use this to get a single mul instruction.
static inline
uint64_t __emulu(uint32_t x, uint32_t y) {
    return x * (uint64_t)y;
}
#endif

// This is still pretty ugly with MSVC, branching on the carry
// and using XMM store / integer reload to zero a register!
// But at least it inlines 4 mul instructions
// instead of calling a generic 64x64 => 64b multiply helper function.
uint64_t __umul128(uint64_t multiplier, uint64_t multiplicand,
                   uint64_t *product_hi)
{
    // multiplier   = ab = a * 2^32 + b
    // multiplicand = cd = c * 2^32 + d
    // ab * cd = a * c * 2^64 + (a * d + b * c) * 2^32 + b * d
    uint64_t a = multiplier >> 32;
    uint64_t b = (uint32_t)multiplier;   // & 0xFFFFFFFF;
    uint64_t c = multiplicand >> 32;
    uint64_t d = (uint32_t)multiplicand; // & 0xFFFFFFFF;

    //uint64_t ac = __emulu(a, c);
    uint64_t ad = __emulu(a, d);
    //uint64_t bc = __emulu(b, c);
    uint64_t bd = __emulu(b, d);

    uint64_t adbc = ad + __emulu(b, c);
    uint64_t adbc_carry = (adbc < ad); // ? 1 : 0;
    // MSVC gets confused by the ternary and makes worse code
    // than using a boolean in an integer context for 1 : 0.

    // multiplier * multiplicand = product_hi * 2^64 + product_lo
    uint64_t product_lo = bd + (adbc << 32);
    uint64_t product_lo_carry = (product_lo < bd); // ? 1 : 0;
    *product_hi = __emulu(a, c) + (adbc >> 32) + (adbc_carry << 32) + product_lo_carry;

    return product_lo;
}
Make sure you only use this in 32-bit code. In 64-bit code, it fails to optimize to a single 64-bit mul instruction (which produces both 64-bit halves of the full result). Compilers that implement the GNU C++ extensions (clang, gcc, ICC) can use unsigned __int128 and get good code, e.g. a * (unsigned __int128)b produces a 128b result. (Example on Godbolt.)
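On 64-bit targets with those compilers, the whole function reduces to something like this sketch (function name mine):

#include <stdint.h>

// 64x64 => 128-bit multiply via the GNU C extension (clang/gcc/ICC, 64-bit targets only).
uint64_t umul128_gnu(uint64_t a, uint64_t b, uint64_t *hi)
{
    unsigned __int128 product = a * (unsigned __int128)b;
    *hi = (uint64_t)(product >> 64);
    return (uint64_t)product;   // low 64 bits
}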
Instead of adbc_carry = (adbc < ad) you might try adbc_carry = !!(adbc < ad). The MSVC compiler recognizes the pattern and does the boolean-to-int conversion. I read about it on a Microsoft blog. – Treed
Compilers don't provide _addcarry_u64 for 32-bit targets. (I checked on gcc.godbolt.org/z/9jMGbYKnr.) No obvious reason they couldn't; adc/adc is easy to emit. But in general 64-bit scalar intrinsics are only defined for x86-64 target machines, which makes sense for _lzcnt_u64 and so on where there's no super obvious emulation. I could have tried to use _addcarry_u32, but since I want two uint64_t halves that would be harder to use. And compilers other than ICC tend to do a bad job with _addcarry anyway, not keeping carry in CF. – Wabash
SIMD can be worth it vs. adc + mul (or BMI2 mulx) as building blocks for large arrays of numbers if you have AVX2 or AVX512, but for one at a time scalar is probably better. – Wabash