TL;DR: with GCC for a 64-bit ISA, (a * (unsigned __int128)b) >> 64 compiles nicely, to a single full-multiply or high-half multiply instruction. No need to mess around with inline asm.
Unfortunately current compilers don't optimize @craigster0's nice portable version, so if you want to take advantage of 64-bit CPUs, you can't use it except as a fallback for targets you don't have an #ifdef for. (I don't see a generic way to optimize it; you need a 128-bit type or an intrinsic.)
GNU C (gcc, clang, or ICC) has unsigned __int128 on most 64-bit platforms (or __uint128_t in older versions). GCC doesn't implement this type on 32-bit platforms, though.
This is an easy and efficient way to get the compiler to emit a 64-bit full-multiply instruction and keep the high half. (GCC knows that a uint64_t cast to a 128-bit integer still has the upper half all zero, so you don't get a 128-bit multiply using three 64-bit multiplies.)
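For contrast, a quick sketch (my example, not from the original answer): if both inputs are full 128-bit values, the compiler does have to synthesize the multiply from several 64-bit ones:

    // 128 x 128 => low 128 bits: GCC emits one mul plus two imuls here
    // (lo*lo in full, plus the low halves of lo*hi and hi*lo;
    //  hi*hi only affects bits above 128, so it's dropped).
    unsigned __int128 mullo128(unsigned __int128 a, unsigned __int128 b) {
        return a * b;
    }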
MSVC also has a __umulh intrinsic for 64-bit high-half multiplication, but again it's only available on 64-bit platforms (specifically x86-64 and AArch64). The docs also mention IPF (IA-64) having _umul128, but I don't have MSVC for Itanium to test with. (It's probably not relevant anyway.)
    #define HAVE_FAST_mul64 1

    #ifdef __SIZEOF_INT128__                       // GNU C
      static inline
      uint64_t mulhi64(uint64_t a, uint64_t b) {
          unsigned __int128 prod = a * (unsigned __int128)b;
          return prod >> 64;
      }

    #elif defined(_M_X64) || defined(_M_ARM64)     // MSVC for x86-64 or AArch64
      // possibly also || defined(_M_IA64) || defined(_WIN64)
      // but the docs only guarantee x86-64!  Don't use *just* _WIN64;
      // it doesn't include AArch64 Android / Linux.
      // https://learn.microsoft.com/en-gb/cpp/intrinsics/umulh
      #include <intrin.h>
      #define mulhi64 __umulh

    #elif defined(_M_IA64) // || defined(_M_ARM)   // MSVC again
      // https://learn.microsoft.com/en-gb/cpp/intrinsics/umul128
      // (the docs incorrectly say that _umul128 is available for ARM,
      // which would be weird because there's no single insn on AArch32)
      #include <intrin.h>
      static inline
      uint64_t mulhi64(uint64_t a, uint64_t b) {
          unsigned __int64 HighProduct;
          (void)_umul128(a, b, &HighProduct);
          return HighProduct;
      }

    #else
    # undef HAVE_FAST_mul64
      uint64_t mulhi64(uint64_t a, uint64_t b);  // non-inline prototype
      // or you might want to define @craigster0's version here so it can inline
      // (see the portable sketch below)
    #endif
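For the #else branch, here's a sketch of a portable fallback along those lines (splitting into 32-bit halves; my rewrite of the idea, not necessarily identical to @craigster0's exact code):

    #include <stdint.h>

    // Portable mulhi64: four 32x32 => 64-bit partial products.
    // a*b = hi_hi*2^64 + (hi_lo + lo_hi)*2^32 + lo_lo
    uint64_t mulhi64_portable(uint64_t a, uint64_t b) {
        uint64_t a_lo = (uint32_t)a, a_hi = a >> 32;
        uint64_t b_lo = (uint32_t)b, b_hi = b >> 32;

        uint64_t lo_lo = a_lo * b_lo;
        uint64_t hi_lo = a_hi * b_lo;
        uint64_t lo_hi = a_lo * b_hi;
        uint64_t hi_hi = a_hi * b_hi;

        // Carry out of the low 64 bits of the product: sum the middle
        // partial products with lo_lo's high half.  This can't overflow
        // 64 bits (the maximum possible value is exactly 2^64 - 1).
        uint64_t cross = (lo_lo >> 32) + (uint32_t)hi_lo + lo_hi;
        return hi_hi + (hi_lo >> 32) + (cross >> 32);
    }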
For x86-64, AArch64, and PowerPC64 (and others), the __int128 / intrinsic version compiles to one mul instruction, plus a couple of movs to deal with the calling convention (which should optimize away after this inlines).
From the Godbolt compiler explorer (with source + asm for x86-64, PowerPC64, and AArch64):
    # x86-64 gcc7.3.  clang and ICC are the same.  (x86-64 System V calling convention)
    # MSVC makes basically the same function, but with different regs for x64 __fastcall
        mov     rax, rsi
        mul     rdi              # RDX:RAX = RAX * RDI
        mov     rax, rdx         # return the high half
        ret
(Or with clang -march=haswell to enable BMI2: mov rdx, rsi / mulx rax, rcx, rdi puts the high half in RAX directly. gcc is dumb and still uses an extra mov.)
For AArch64 (with gcc unsigned __int128, or MSVC with __umulh):
    test_var:
        umulh   x0, x0, x1       // x0 = high 64 bits of x0 * x1
        ret
With a compile-time-constant power-of-2 multiplier, we usually get the expected right shift to grab a few high bits. But gcc amusingly uses shld (see the Godbolt link).
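For example (my test case, assuming the mulhi64 from above): the high half of a * 16 is just the top 4 bits of a, so the whole thing can fold to a shift:

    // mulhi64(a, 1ULL << 4) == a >> 60: the multiply reduces to a shift.
    uint64_t mulhi_by_16(uint64_t a) {
        return mulhi64(a, 16);
    }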
Unfortunately current compilers don't optimize @craigster0's nice portable version. For x86-64 you get 8x shr r64,32, 4x imul r64,r64, and a bunch of add/mov instructions; i.e. it compiles to a lot of 32x32 => 64-bit multiplies and unpacking of the results. So if you want something that takes advantage of 64-bit CPUs, you need some #ifdefs.
A full-multiply mul r64 instruction is 2 uops on Intel CPUs, but still only 3-cycle latency, same as imul r64,r64, which only produces a 64-bit result. So the __int128 / intrinsic version is 5 to 10 times cheaper in latency and throughput (impact on surrounding code) on modern x86-64 than the portable version, from a quick eyeball guess based on http://agner.org/optimize/.
Check it out on the Godbolt compiler explorer on the above link.
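As a usage sketch (my example, not from the question): mulhi64 is handy for fixed-point scaling, treating b as a 0.64-bit fraction of 2^64:

    // Scale x by frac/2^64, a 0.64 fixed-point fraction.
    // e.g. frac = 0x8000000000000000 (i.e. 0.5) halves x.
    uint64_t scale_u64(uint64_t x, uint64_t frac) {
        return mulhi64(x, frac);
    }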
gcc does fully optimize the portable version when multiplying by 16, though: you get a single right shift, more efficient than with the unsigned __int128 multiply.