Unsigned 64x64->128 bit integer multiply on 32-bit platforms

As part of an exploratory effort, I have started to look at integer and fixed-point arithmetic building blocks for 32-bit platforms. My primary target is ARM32 (specifically armv7), with a side glance at RISC-V32, which I expect to grow in importance in the embedded space. The first sample building block I chose to examine is unsigned 64x64->128 bit integer multiplication. Other questions on this site about this building block do not provide detailed coverage of 32-bit platforms.

Over the past thirty years, I have implemented this and other arithmetic building blocks multiple times, but always in assembly language, for various architectures. However, at this point my hope is that these can be programmed in straight ISO C, without the use of intrinsics. Ideally, a single version of the C code would generate good machine code across architectures. I know that the approach of manipulating HLL code to control the generated machine code is generally brittle, but I hope that processor architectures and toolchains have matured enough to make this feasible.

Some approaches used in assembly language implementations are not well suited to porting to C. In the example code below I have selected six variants that seemed amenable to an HLL implementation. Besides the generation of the partial products, which is common to all variants, the two basic approaches are: (1) Sum the partial products using 64-bit arithmetic, letting the compiler take care of the carry propagation between 32-bit halves. In this case there are multiple choices for the order in which to sum the partial products. (2) Use 32-bit arithmetic for the summing, simulating the carry flag directly. In this case we have a choice of generating the carry after an addition (a = a + b; carry = a < b;) or before the addition (carry = ~a < b; a = a + b;). Variants 1 through 4 below fall into the former category, variants 5 and 6 into the latter.
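
To make the two carry-simulation idioms concrete, here is a minimal sketch of a single 32-bit limb addition written both ways (the function names add_carry_after and add_carry_before are purely illustrative; the complete variants appear in the code further below):

#include <stdint.h>

/* simulated carry computed after the addition (as in variant 5) */
uint32_t add_carry_after (uint32_t a, uint32_t b, uint32_t *cy)
{
    a = a + b;
    *cy = a < b;     /* the sum wrapped around, so there was a carry-out */
    return a;
}

/* simulated carry computed before the addition (as in variant 6) */
uint32_t add_carry_before (uint32_t a, uint32_t b, uint32_t *cy)
{
    uint32_t t = ~a; /* 0xffffffff - a */
    *cy = t < b;     /* b > 0xffffffff - a, so the sum will carry out */
    a = a + b;
    return a;
}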

At Compiler Explorer, I focused on the toolchains gcc 12.2 and clang 15.0 for the platforms of interest. I compiled with -O3. The general finding is that, on average, clang generates more efficient code than gcc, and that the differences between the variants (number of instructions and registers used) are more pronounced with clang. While this may be understandable in the case of RISC-V as the newer architecture, it surprised me in the case of armv7, which has been around for well over a dozen years.

Three cases in particular struck me as noteworthy. While I have worked with compiler engineers before and have a reasonable understanding of basic code transformations, phase-ordering issues, etc., the only technique I am aware of that might apply to this code is idiom recognition, and I do not see how this could explain the observations by itself. The first case is variant 3, where clang 15.0 produces extremely tight code comprising just 10 instructions that I don't think can be improved upon:

umul64wide:
        push    {r4, lr}
        umull   r12, r4, r2, r0
        umull   lr, r0, r3, r0
        umaal   lr, r4, r2, r1
        umaal   r0, r4, r3, r1
        ldr     r1, [sp, #8]
        strd    r0, r4, [r1]
        mov     r0, r12
        mov     r1, lr
        pop     {r4, pc}
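
As background, and to the best of my understanding, umaal computes RdHi:RdLo = Rn * Rm + RdLo + RdHi, which can never overflow because (2^32-1)^2 + 2*(2^32-1) equals 2^64-1. The C pattern a compiler would have to recognize in order to emit it should therefore look roughly like the following sketch (the helper umaal_c is purely illustrative, not part of my code):

#include <stdint.h>

/* sketch of what a single umaal does: *hi:*lo = n * m + *lo + *hi */
static void umaal_c (uint32_t *lo, uint32_t *hi, uint32_t n, uint32_t m)
{
    uint64_t acc = (uint64_t)n * m + *lo + *hi;  /* always fits in 64 bits */
    *lo = (uint32_t)acc;
    *hi = (uint32_t)(acc >> 32);
}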

By contrast, gcc generates twice the number of instructions and requires twice the number of registers. I hypothesize that it does not recognize how to use the multiply-accumulate instruction umaal here, but is that the full story? The reverse situation, but not quite as dramatic, occurs in variant 6, where gcc 12.2 produces this sequence of 18 instructions, with low register usage:

umul64wide:
        mov     ip, r0
        push    {r4, r5, lr}
        mov     lr, r1
        umull   r0, r1, r0, r2
        ldr     r4, [sp, #12]
        umull   r5, ip, r3, ip
        adds    r1, r1, r5
        umull   r2, r5, lr, r2
        adc     ip, ip, #0
        umull   lr, r3, lr, r3
        adds    r1, r1, r2
        adc     r2, ip, #0
        adds    r2, r2, r5
        adc     r3, r3, #0
        adds    r2, r2, lr
        adc     r3, r3, #0
        strd    r2, r3, [r4]
        pop     {r4, r5, pc}

The generated code nicely turns the simulated carry propagation into real carry propagation. clang 15.0 uses nine more instructions and five more registers, and I cannot really make out what it is trying to do without spending much more time on analysis. The third observation concerns the differences in the machine code produced for variant 5 vs. variant 6, in particular with clang. These use the same basic algorithm, with one variant computing the simulated carry before each addition and the other after it. I did find in the end that one variant, namely variant 4, seems to be efficient across both toolchains and both architectures. However, before I proceed to other building blocks and face a similar struggle, I would like to inquire:

(1) Are there coding idioms or algorithms I have not considered in the code below that might lead to superior results? (2) Are there specific optimization switches, e.g. a hypothetical -ffrobnicate (see here), not included in -O3, that would help the compilers generate efficient code more consistently for this kind of bit-manipulation scenario? Explanations as to which compiler mechanisms are likely responsible for the significant differences in code generation observed, and how one might influence or work around them, would also be helpful.

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

#define VARIANT         (3)
#define USE_X64_ASM_REF (0)

/* Multiply two unsigned 64-bit operands a and b. Returns least significant 64 
   bits of product as return value, most significant 64 bits of product via h.
*/
uint64_t umul64wide (uint64_t a, uint64_t b, uint64_t *h)
{
    uint32_t a_lo = (uint32_t)a;
    uint32_t a_hi = a >> 32;
    uint32_t b_lo = (uint32_t)b;
    uint32_t b_hi = b >> 32;
    uint64_t p0 = (uint64_t)a_lo * b_lo;
    uint64_t p1 = (uint64_t)a_lo * b_hi;
    uint64_t p2 = (uint64_t)a_hi * b_lo;
    uint64_t p3 = (uint64_t)a_hi * b_hi;
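    /* Mathematically, a * b = p0 + ((p1 + p2) << 32) + (p3 << 64); the variants
       below differ only in how the carries between the 32-bit columns of this
       sum are collected. */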
#if VARIANT == 1
    uint32_t c = (uint32_t)(((p0 >> 32) + (uint32_t)p1 + (uint32_t)p2) >> 32);
    *h = p3 + (p1 >> 32) + (p2 >> 32) + c;
    return p0 + ((p1 + p2) << 32);
#elif VARIANT == 2
    uint64_t s = (p0 >> 32) + p1;
    uint64_t t = (uint32_t)s + p2;
    *h = (s >> 32) + (t >> 32) + p3;
    return (uint32_t)p0 + (t << 32);
#elif VARIANT == 3
    *h = (p1 >> 32) + (((p0 >> 32) + (uint32_t)p1 + p2) >> 32) + p3;
    return p0 + ((p1 + p2) << 32);
#elif VARIANT == 4
    uint64_t t = (p0 >> 32) + p1 + (uint32_t)p2;
    *h = (p2 >> 32) + (t >> 32) + p3;
    return (uint32_t)p0 + (t << 32);
#elif VARIANT == 5
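    /* 32-bit limb arithmetic; simulated carry computed after each addition */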
    uint32_t r0, r1, r2, r3, r4, r5, r6;
    r0 = (uint32_t)p0;
    r1 = p0 >> 32;    
    r5 = (uint32_t)p1;
    r2 = p1 >> 32;
    r1 = r1 + r5;
    r6 = r1 < r5;
    r2 = r2 + r6;
    r6 = (uint32_t)p2;
    r5 = p2 >> 32;
    r1 = r1 + r6;
    r6 = r1 < r6;
    r2 = r2 + r6;
    r4 = (uint32_t)p3;
    r3 = p3 >> 32;
    r2 = r2 + r5;
    r6 = r2 < r5;
    r3 = r3 + r6;
    r2 = r2 + r4;
    r6 = r2 < r4;
    r3 = r3 + r6;
    *h =   ((uint64_t)r3 << 32) | r2;
    return ((uint64_t)r1 << 32) | r0;
#elif VARIANT == 6
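    /* 32-bit limb arithmetic; simulated carry computed before each addition */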
    uint32_t r0, r1, r2, r3, r4, r5, r6;
    r0 = (uint32_t)p0;
    r1 = p0 >> 32;    
    r5 = (uint32_t)p1;
    r2 = p1 >> 32;
    r4 = ~r1;
    r4 = r4 < r5;
    r1 = r1 + r5;
    r2 = r2 + r4;
    r6 = (uint32_t)p2;
    r5 = p2 >> 32;
    r4 = ~r1;
    r4 = r4 < r6;
    r1 = r1 + r6;
    r2 = r2 + r4;
    r4 = (uint32_t)p3;
    r3 = p3 >> 32;
    r6 = ~r2;
    r6 = r6 < r5;
    r2 = r2 + r5;
    r3 = r3 + r6;
    r6 = ~r2;
    r6 = r6 < r4;
    r2 = r2 + r4;
    r3 = r3 + r6;
    *h =   ((uint64_t)r3 << 32) | r2;
    return ((uint64_t)r1 << 32) | r0;
#else
#error unsupported VARIANT
#endif
}

#if defined(__SIZEOF_INT128__)
uint64_t umul64wide_ref (uint64_t a, uint64_t b, uint64_t *h)
{
    unsigned __int128 prod = ((unsigned __int128)a) * b;
    *h = (uint64_t)(prod >> 64);
    return (uint64_t)prod;
}
#elif defined(_MSC_VER) && defined(_WIN64)
#include <intrin.h>
uint64_t umul64wide_ref (uint64_t a, uint64_t b, uint64_t *h)
{
    *h = __umulh (a, b);
    return a * b;
}
#elif USE_X64_ASM_REF
uint64_t umul64wide_ref (uint64_t a, uint64_t b, uint64_t *h)
{
    uint64_t res_l, res_h;
    __asm__ (
        "movq  %2, %%rax;\n\t"          // rax = a
        "mulq  %3;\n\t"                 // rdx:rax = a * b
        "movq  %%rdx, %0;\n\t"          // res_h = rdx
        "movq  %%rax, %1;\n\t"          // res_l = rax
        : "=rm" (res_h), "=rm"(res_l)
        : "rm"(a), "rm"(b)
        : "%rax", "%rdx");
    *h = res_h;
    return res_l;
}
#else // generic (and slow) reference implementation
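/* 32-bit additions with a simulated carry flag cy:
   ADDcc  : t0 = a + b,      cy = carry-out
   ADDCcc : t0 = a + b + cy, cy = carry-out
   ADDC   : evaluates to a + b + cy, carry-out discarded
*/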
#define ADDCcc(a,b,cy,t0,t1) \
  (t0=(b)+cy, t1=(a), cy=t0<cy, t0=t0+t1, t1=t0<t1, cy=cy+t1, t0=t0)

#define ADDcc(a,b,cy,t0,t1) \
  (t0=(b), t1=(a), t0=t0+t1, cy=t0<t1, t0=t0)

#define ADDC(a,b,cy,t0,t1) \
  (t0=(b)+cy, t1=(a), t0+t1)

uint64_t umul64wide_ref (uint64_t a, uint64_t b, uint64_t *h)
{
    uint32_t cy, t0, t1;
    uint32_t a_lo = (uint32_t)a;
    uint32_t a_hi = a >> 32;
    uint32_t b_lo = (uint32_t)b;
    uint32_t b_hi = b >> 32;
    uint64_t p0 = (uint64_t)a_lo * b_lo;
    uint64_t p1 = (uint64_t)a_lo * b_hi;
    uint64_t p2 = (uint64_t)a_hi * b_lo;
    uint64_t p3 = (uint64_t)a_hi * b_hi;
    uint32_t p0_lo = (uint32_t)p0;
    uint32_t p0_hi = p0 >> 32;
    uint32_t p1_lo = (uint32_t)p1;
    uint32_t p1_hi = p1 >> 32;
    uint32_t p2_lo = (uint32_t)p2;
    uint32_t p2_hi = p2 >> 32;
    uint32_t p3_lo = (uint32_t)p3;
    uint32_t p3_hi = p3 >> 32;
    uint32_t r0 = p0_lo;
    uint32_t r1 = ADDcc  (p0_hi, p1_lo, cy, t0, t1);
    uint32_t r2 = ADDCcc (p1_hi, p2_hi, cy, t0, t1);
    uint32_t r3 = ADDC   (p3_hi,     0, cy, t0, t1);
    r1 = ADDcc  (r1, p2_lo, cy, t0, t1);
    r2 = ADDCcc (r2, p3_lo, cy, t0, t1);
    r3 = ADDC   (r3,     0, cy, t0, t1);
    *h =   ((uint64_t)r3 << 32) + r2;
    return ((uint64_t)r1 << 32) + r0;
}
#endif 

/*
  https://groups.google.com/forum/#!original/comp.lang.c/qFv18ql_WlU/IK8KGZZFJx4J
  From: geo <[email protected]>
  Newsgroups: sci.math,comp.lang.c,comp.lang.fortran
  Subject: 64-bit KISS RNGs
  Date: Sat, 28 Feb 2009 04:30:48 -0800 (PST)

  This 64-bit KISS RNG has three components, each nearly
  good enough to serve alone.    The components are:
  Multiply-With-Carry (MWC), period (2^121+2^63-1)
  Xorshift (XSH), period 2^64-1
  Congruential (CNG), period 2^64
*/
static uint64_t kiss64_x = 1234567890987654321ULL;
static uint64_t kiss64_c = 123456123456123456ULL;
static uint64_t kiss64_y = 362436362436362436ULL;
static uint64_t kiss64_z = 1066149217761810ULL;
static uint64_t kiss64_t;
#define MWC64  (kiss64_t = (kiss64_x << 58) + kiss64_c, \
                kiss64_c = (kiss64_x >> 6), kiss64_x += kiss64_t, \
                kiss64_c += (kiss64_x < kiss64_t), kiss64_x)
#define XSH64  (kiss64_y ^= (kiss64_y << 13), kiss64_y ^= (kiss64_y >> 17), \
                kiss64_y ^= (kiss64_y << 43))
#define CNG64  (kiss64_z = 6906969069ULL * kiss64_z + 1234567ULL)
#define KISS64 (MWC64 + XSH64 + CNG64)

int main (void)
{
    uint64_t a, b, res_hi, res_lo, ref_hi, ref_lo, count = 0;

    printf ("Smoke test of umul64wide variant %d\n", VARIANT);
    do {
        a = KISS64;
        b = KISS64;
        ref_lo = umul64wide_ref (a, b, &ref_hi);
        res_lo = umul64wide (a, b, &res_hi);
        if ((res_lo ^ ref_lo) | (res_hi ^ ref_hi)) {
            printf ("!!!! error: a=%016llx b=%016llx res=%016llx_%016llx ref=%016llx_%016llx\n",
                    a, b, res_hi, res_lo, ref_hi, ref_lo);
            return EXIT_FAILURE;
        }
        if (!(count & 0xfffffff)) printf ("\r%llu", count);
        count++;
    } while (count);
    return EXIT_SUCCESS;
}
Monophthong asked 7/12, 2022 at 8:22 Comment(3)
It's not clear if you're willing to depend on extensions, e.g., gcc and clang may define __UINT_FAST64_TYPE__ which can be used for 32 x 32 => 64, letting the compiler do the work - as would uint64_t - but this might result in function calls. Are you after pure idiomatic C where only uint32_t can be relied on, with no special 32 x 32 => 64 or inline assembly like umulh on ARM, etc., and using this as a building block for 64 x 64 ?Simonetta
@BrettHale Pure ISO C99 code is what I desire, so stdint.h is available. I mentioned "no intrinsics" in the question. By the same token, no inline assembly. Vendor extensions in compilers are not desired (portability). 40 years ago, I hand-assembled Z80 instructions and entered the resulting hex codes. Now, no human can stand up to a computer in either chess or Go, and I would like my compilers to transform C code into well selected and arranged machine instructions on its own. I last coded this particular building block in assembly language for a Maxwell-class GPU about five years ago.Monophthong
@BrettHale: If you were willing to use extensions, you'd use Clang _ExtInt(128), which should work even on 32-bit machines (unlike unsigned __int128). ISO C23 may standardize unsigned _BitInt(128) foo. (en.wikipedia.org/wiki/C2x). But as yet, only clang supports that, not GCC trunk even with -std=gnu2x: godbolt.org/z/r3cejeKsr . (A value that wide gets returned by reference on 32-bit ISAs, but otherwise clang seems to compile it somewhat efficiently for ARM32, while x86-32 looks messy, setting CF too early and having to save it across a mul.)Troudeloup

I avoided the use of the ((x += y) < y) overflow test, since not every ISA handles condition flags efficiently, and using the results of the flag register(s) may inhibit re-ordering; x86[-64] is the obvious example, though later BMI(2) instructions may help mitigate this. I also added a 32 x 32 -> 64 bit C implementation for comparison - but I would expect any modern ISA to at least supply a 'high word' multiply like ARM's umulh.

/******************************************************************************/

/* stackoverflow.com/questions/74713642 */

#include <inttypes.h>
#include <stdio.h>

/* umul_32_32 : 32 x 32 => 64 */

/* force inline (non-portable), or implement it as macro, e.g.,
 * #define umul_32_32(rh, rl, x, y) do { ... } while (0) */

#if (1)

static inline __attribute__((always_inline))
uint64_t umul_32_32 (uint32_t x, uint32_t y)
{
    return (((uint64_t) x) * y);
}

#else

/* if no widening multiply is available, the compiler probably
 * generates something at least as efficient as the following -
 * or (worst case) it calls a builtin function. */

static inline __attribute__((always_inline))
uint64_t umul_32_32 (uint32_t x, uint32_t y)
{
    uint32_t m0, m1, m2, m3; /* (partial products) */
    uint32_t x0, x1, y0, y1;

    x0 = x & UINT16_MAX, x1 = x >> (16);
    y0 = y & UINT16_MAX, y1 = y >> (16);

    m0 = x0 * y0, m1 = x1 * y0;
    m2 = x0 * y1, m3 = x1 * y1;
    m1 += m0 >> (16);
    m3 += m2 >> (16);
    m1 += m2 & UINT16_MAX;

    uint32_t rh = m3 + (m1 >> (16));
    uint32_t rl = m1 << (16) | (m0 & UINT16_MAX);

    return (((uint64_t) rh) << 32 | rl);

    /* 32 x 32 => 64 : no branching or carry overflow tests. */
}

#endif

/* ensure the function is called to inspect code gen / assembly,
 * otherwise gcc and clang evaluate this at compile time. */

__attribute__((noinline)) void umul_64_64 (

    uint64_t *rh, uint64_t *rl, uint64_t x, uint64_t y)
{
    uint64_t m0, m1, m2, m3; /* (partial products) */
    uint32_t x0, x1, y0, y1;

    x0 = (uint32_t) (x), x1 = (uint32_t) (x >> (32));
    y0 = (uint32_t) (y), y1 = (uint32_t) (y >> (32));

    m0 = umul_32_32(x0, y0), m1 = umul_32_32(x1, y0);
    m2 = umul_32_32(x0, y1), m3 = umul_32_32(x1, y1);
    m1 += m0 >> (32);
    m3 += m2 >> (32);
    m1 += m2 & UINT32_MAX;

    *rh = m3 + (m1 >> (32));
    *rl = m1 << (32) | (m0 & UINT32_MAX);

    /* 64 x 64 => 128 : no branching or carry overflow tests. */
}

#if (0)

int main (void)
{
    uint64_t x = UINT64_MAX, y = UINT64_MAX, rh, rl;

    umul_64_64(& rh, & rl, x, y);
    fprintf(stdout, "0x%016" PRIX64 ":0x%016" PRIX64 "\n", rh, rl);

    return (0);
}

#endif

/******************************************************************************/

For ARMv7, I'm getting more or less the same results as your 'variant 3' code, which isn't surprising, since it's the same essential idea. I tried different flags on gcc-12 and gcc-trunk, but couldn't improve on it.

I'd hazard a guess that with Apple's investment in AArch64 silicon, there's simply been more aggressive optimization and funding directed toward clang, which benefits 32-bit ARMv7 as well. But that's pure speculation. It's a pretty glaring disparity for such a major platform, though.

Simonetta answered 7/12, 2022 at 19:44 Comment(6)
Note that the target architectures here are primarily ARM and secondarily RISC-V, each in their 32-bit incarnation. RISC-V would be more aptly named MIPS 5.0, and as such does not provide a carry flag and must emulate it with some sort of SETcc. My variants 5 and 6 are an exploration of the question: If I were to write C code in a kind of "native RISC-V" style, would I get decent code for that from ARM compilers? Based on this one case, the answer seems to be "possibly, but not very likely".Monophthong
@Monophthong Re: "possibly, but not very likely": does it mean that "toolchains have not matured enough to make this feasible"? If so, then is this still true in 2024?Involutional
@Involutional You can easily try the current state of affairs by using Compiler Explorer. From what I can see, compilers improve all the time on reducing N different HLL source representations into a single "canonical" internal representation. However, there are certainly still differences in being able to convert that into the most "ARM-esque" machine code, where (for 32-bit ARM!) gcc has a clear edge over clang by using architectural features in the most efficient manner. For other architectures, notably 32-bit RISC-V, 64-bit ARM, and x86-64, clang wins.Monophthong
@Monophthong Under "architectural features" do you mean "HW features (extensions)"? Both GCC and Clang allow specifying the "supported extensions for each architecture" using the -march option. Here is an example. Also: 1) Have a look at "target-features"="xxx" in the generated IR (from the example above). 2) There is some confusion in the terminology. In some places (e.g. /proc/cpuinfo on Ubuntu on ARM, LLVM IR on ARM) the term "feature" is used, in other places (GCC / Clang manuals) the term "extension" is used.Involutional
@Monophthong 3) FYI: cpuinfo is not (?) required to use feature names from the (ARM) manual. Hence, feature names in cpuinfo may differ from those in the manual.Involutional
@Involutional By architectural features I meant ISA plus available ISA-variations (-march=, -marm, etc). -mtune seems to have (close to) zero influence on codes that I have looked at. Both gcc and clang appear to support the same relevant command line flags. For the codes I have looked at for various 32-bit ARM targets (Cortex A and Cortex M; -O3 and -Os), gcc generally generated better assembly code than clang. Depending on what codes you look at, your experience may differ.Monophthong
