In the context of exploratory activity I have started to take a look at integer & fixed-point arithmetic building blocks for 32-bit platforms. My primary target would be ARM32 (specifically armv7
), with a side glance to RISC-V32 which I expect to grow in importance in the embedded space. The first sample building block I chose to examine is unsigned 64x64->128 bit integer multiplication. Other questions on this site about this building block do not provide detailed coverage of 32-bit platforms.
Over the past thirty years, I have implemented this and other arithmetic building blocks multiple times, but always in assembly language, for various architectures. However, at this point in time my hope and desire is that these could be programmed in straight ISO-C, without the use of intrinsics. Ideally a single version of the C code would generate good machine code across architectures. I know that the approach of manipulating HLL code to control machine code is generally brittle, but hope that processor architectures and toolchains have matured enough to make this feasible.
Some approaches used in assembly language implementations are not well suited for porting to C. In the exemplary code below I have selected six variants that seemed amenable to an HLL implementation. Besides the generation of partial products, which is common to all variants, the two basic approaches are: (1) Sum the partial products using 64-bit arithmetic, letting the compiler take care of the carry propagation between 32-bit halves. In this case there are multiple choices in which order to sum the partial products. (2) Use 32-bit arithmetic for the summing, simulating the carry flag directly. In this case we have a choice of generating the carry after an addition (a = a + b; carry = a < b;
) or before the addition (carry = ~a < b; a = a + b;
). Variants 1 through 3 below fall into the former category, variants 5 and 6 fall into the latter.
At Compiler Explorer, I focused on the toolchains gcc 12.2 and clang 15.0 for the platforms of interest. I compiled with -O3
. The general finding is that on average clang generates more efficient code than gcc, and that the differences between the variants (number of instructions and registers used) are more pronounced with clang. While this may be understandable in the case of RISC-V as the newer architecture, it surprised me in the case of armv7
which has been around for well over a dozen years.
Three cases in particular struck me as noteworthy. While I have worked with compiler engineers before and have a reasonable understanding of basic code transformation, phase ordering issues, etc, the only technique I aware of that might apply to this code is idiom recognition, and I do not see how this could explain the observations by itself. The first case is variant 3, where clang 15.0 produces extremely tight code comprising just 10 instructions that I don't think can be improved upon:
umul64wide:
push {r4, lr}
umull r12, r4, r2, r0
umull lr, r0, r3, r0
umaal lr, r4, r2, r1
umaal r0, r4, r3, r1
ldr r1, [sp, #8]
strd r0, r4, [r1]
mov r0, r12
mov r1, lr
pop {r4, pc}
By contrast, gcc generates twice the number of instructions and requires twice the number of registers. I hypothesize that it does not recognize how to use the multiply-accumulate instruction umaal
here, but is that the full story? The reverse situation, but not quite as dramatic, occurs in variant 6, where gcc 12.2 produces this sequence of 18 instructions, with low register usage:
umul64wide:
mov ip, r0
push {r4, r5, lr}
mov lr, r1
umull r0, r1, r0, r2
ldr r4, [sp, #12]
umull r5, ip, r3, ip
adds r1, r1, r5
umull r2, r5, lr, r2
adc ip, ip, #0
umull lr, r3, lr, r3
adds r1, r1, r2
adc r2, ip, #0
adds r2, r2, r5
adc r3, r3, #0
adds r2, r2, lr
adc r3, r3, #0
strd r2, r3, [r4]
pop {r4, r5, pc}
The generated code nicely turns the simulated carry propagation into real carry propagation. clang 15.0 uses nine instructions and five registers more, and I cannot really make out what it is trying to do without spending much more time on analysis. The third observation is with regard to the differences seen in the machine code produced for variant 5 vs. variant 6, in particular with clang. These use the same basic algorithm, with one variant computing the simulated carry before the additions, the other after it. I did find in the end that one variant, namely variant 4, seems to be efficient across both tool chains and both architectures. However, before I proceed to other building blocks and face a similar struggle, I would like to inquire:
(1) Are there coding idioms or algorithms I have not considered in the code below that might lead to superior results? (2) Are there specific optimization switches, e.g. a hypothetical -ffrobnicate
(see here), that are not included in -O3
that would help the compilers generate efficient code more consistently for these kind of bit-manipulation scenarios? Explanations as to what compiler mechanisms are likely responsible for the cases of significant differences in code generation observed, and how one might influence or work round them, could also be helpful.
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#define VARIANT (3)
#define USE_X64_ASM_REF (0)
/* Multiply two unsigned 64-bit operands a and b. Returns least significant 64
bits of product as return value, most significant 64 bits of product via h.
*/
uint64_t umul64wide (uint64_t a, uint64_t b, uint64_t *h)
{
uint32_t a_lo = (uint32_t)a;
uint32_t a_hi = a >> 32;
uint32_t b_lo = (uint32_t)b;
uint32_t b_hi = b >> 32;
uint64_t p0 = (uint64_t)a_lo * b_lo;
uint64_t p1 = (uint64_t)a_lo * b_hi;
uint64_t p2 = (uint64_t)a_hi * b_lo;
uint64_t p3 = (uint64_t)a_hi * b_hi;
#if VARIANT == 1
uint32_t c = (uint32_t)(((p0 >> 32) + (uint32_t)p1 + (uint32_t)p2) >> 32);
*h = p3 + (p1 >> 32) + (p2 >> 32) + c;
return p0 + ((p1 + p2) << 32);
#elif VARIANT == 2
uint64_t s = (p0 >> 32) + p1;
uint64_t t = (uint32_t)s + p2;
*h = (s >> 32) + (t >> 32) + p3;
return (uint32_t)p0 + (t << 32);
#elif VARIANT == 3
*h = (p1 >> 32) + (((p0 >> 32) + (uint32_t)p1 + p2) >> 32) + p3;
return p0 + ((p1 + p2) << 32);
#elif VARIANT == 4
uint64_t t = (p0 >> 32) + p1 + (uint32_t)p2;
*h = (p2 >> 32) + (t >> 32) + p3;
return (uint32_t)p0 + (t << 32);
#elif VARIANT == 5
uint32_t r0, r1, r2, r3, r4, r5, r6;
r0 = (uint32_t)p0;
r1 = p0 >> 32;
r5 = (uint32_t)p1;
r2 = p1 >> 32;
r1 = r1 + r5;
r6 = r1 < r5;
r2 = r2 + r6;
r6 = (uint32_t)p2;
r5 = p2 >> 32;
r1 = r1 + r6;
r6 = r1 < r6;
r2 = r2 + r6;
r4 = (uint32_t)p3;
r3 = p3 >> 32;
r2 = r2 + r5;
r6 = r2 < r5;
r3 = r3 + r6;
r2 = r2 + r4;
r6 = r2 < r4;
r3 = r3 + r6;
*h = ((uint64_t)r3 << 32) | r2;
return ((uint64_t)r1 << 32) | r0;
#elif VARIANT == 6
uint32_t r0, r1, r2, r3, r4, r5, r6;
r0 = (uint32_t)p0;
r1 = p0 >> 32;
r5 = (uint32_t)p1;
r2 = p1 >> 32;
r4 = ~r1;
r4 = r4 < r5;
r1 = r1 + r5;
r2 = r2 + r4;
r6 = (uint32_t)p2;
r5 = p2 >> 32;
r4 = ~r1;
r4 = r4 < r6;
r1 = r1 + r6;
r2 = r2 + r4;
r4 = (uint32_t)p3;
r3 = p3 >> 32;
r6 = ~r2;
r6 = r6 < r5;
r2 = r2 + r5;
r3 = r3 + r6;
r6 = ~r2;
r6 = r6 < r4;
r2 = r2 + r4;
r3 = r3 + r6;
*h = ((uint64_t)r3 << 32) | r2;
return ((uint64_t)r1 << 32) | r0;
#else
#error unsupported VARIANT
#endif
}
#if defined(__SIZEOF_INT128__)
uint64_t umul64wide_ref (uint64_t a, uint64_t b, uint64_t *h)
{
unsigned __int128 prod = ((unsigned __int128)a) * b;
*h = (uint64_t)(prod >> 32);
return (uint64_t)prod;
}
#elif defined(_MSC_VER) && defined(_WIN64)
#include <intrin.h>
uint64_t umul64wide_ref (uint64_t a, uint64_t b, uint64_t *h)
{
*h = __umulh (a, b);
return a * b;
}
#elif USE_X64_ASM_REF
uint64_t umul64wide_ref (uint64_t a, uint64_t b, uint64_t *h)
{
uint64_t res_l, res_h;
__asm__ (
"movq %2, %%rax;\n\t" // rax = a
"mulq %3;\n\t" // rdx:rax = a * b
"movq %%rdx, %0;\n\t" // res_h = rdx
"movq %%rax, %1;\n\t" // res_l = rax
: "=rm" (res_h), "=rm"(res_l)
: "rm"(a), "rm"(b)
: "%rax", "%rdx");
*h = res_h;
return res_l;
}
#else // generic (and slow) reference implementation
#define ADDCcc(a,b,cy,t0,t1) \
(t0=(b)+cy, t1=(a), cy=t0<cy, t0=t0+t1, t1=t0<t1, cy=cy+t1, t0=t0)
#define ADDcc(a,b,cy,t0,t1) \
(t0=(b), t1=(a), t0=t0+t1, cy=t0<t1, t0=t0)
#define ADDC(a,b,cy,t0,t1) \
(t0=(b)+cy, t1=(a), t0+t1)
uint64_t umul64wide_ref (uint64_t a, uint64_t b, uint64_t *h)
{
uint32_t cy, t0, t1;
uint32_t a_lo = (uint32_t)a;
uint32_t a_hi = a >> 32;
uint32_t b_lo = (uint32_t)b;
uint32_t b_hi = b >> 32;
uint64_t p0 = (uint64_t)a_lo * b_lo;
uint64_t p1 = (uint64_t)a_lo * b_hi;
uint64_t p2 = (uint64_t)a_hi * b_lo;
uint64_t p3 = (uint64_t)a_hi * b_hi;
uint32_t p0_lo = (uint32_t)p0;
uint32_t p0_hi = p0 >> 32;
uint32_t p1_lo = (uint32_t)p1;
uint32_t p1_hi = p1 >> 32;
uint32_t p2_lo = (uint32_t)p2;
uint32_t p2_hi = p2 >> 32;
uint32_t p3_lo = (uint32_t)p3;
uint32_t p3_hi = p3 >> 32;
uint32_t r0 = p0_lo;
uint32_t r1 = ADDcc (p0_hi, p1_lo, cy, t0, t1);
uint32_t r2 = ADDCcc (p1_hi, p2_hi, cy, t0, t1);
uint32_t r3 = ADDC (p3_hi, 0, cy, t0, t1);
r1 = ADDcc (r1, p2_lo, cy, t0, t1);
r2 = ADDCcc (r2, p3_lo, cy, t0, t1);
r3 = ADDC (r3, 0, cy, t0, t1);
*h = ((uint64_t)r3 << 32) + r2;
return ((uint64_t)r1 << 32) + r0;
}
#endif
/*
https://groups.google.com/forum/#!original/comp.lang.c/qFv18ql_WlU/IK8KGZZFJx4J
From: geo <[email protected]>
Newsgroups: sci.math,comp.lang.c,comp.lang.fortran
Subject: 64-bit KISS RNGs
Date: Sat, 28 Feb 2009 04:30:48 -0800 (PST)
This 64-bit KISS RNG has three components, each nearly
good enough to serve alone. The components are:
Multiply-With-Carry (MWC), period (2^121+2^63-1)
Xorshift (XSH), period 2^64-1
Congruential (CNG), period 2^64
*/
static uint64_t kiss64_x = 1234567890987654321ULL;
static uint64_t kiss64_c = 123456123456123456ULL;
static uint64_t kiss64_y = 362436362436362436ULL;
static uint64_t kiss64_z = 1066149217761810ULL;
static uint64_t kiss64_t;
#define MWC64 (kiss64_t = (kiss64_x << 58) + kiss64_c, \
kiss64_c = (kiss64_x >> 6), kiss64_x += kiss64_t, \
kiss64_c += (kiss64_x < kiss64_t), kiss64_x)
#define XSH64 (kiss64_y ^= (kiss64_y << 13), kiss64_y ^= (kiss64_y >> 17), \
kiss64_y ^= (kiss64_y << 43))
#define CNG64 (kiss64_z = 6906969069ULL * kiss64_z + 1234567ULL)
#define KISS64 (MWC64 + XSH64 + CNG64)
int main (void)
{
uint64_t a, b, res_hi, res_lo, ref_hi, ref_lo, count = 0;
printf ("Smoke test of umul64wide variant %d\n", VARIANT);
do {
a = KISS64;
b = KISS64;
ref_lo = umul64wide_ref (a, b, &ref_hi);
res_lo = umul64wide (a, b, &res_hi);
if ((res_lo ^ ref_lo) | (res_hi ^ ref_hi)) {
printf ("!!!! error: a=%016llx b=%016llx res=%016llx_%016llx ref=%016llx_%016llx\n",
a, b, res_hi, res_lo, ref_hi, ref_lo);
return EXIT_FAILURE;
}
if (!(count & 0xfffffff)) printf ("\r%llu", count);
count++;
} while (count);
return EXIT_SUCCESS;
}
__UINT_FAST64_TYPE__
which can be used for 32 x 32 => 64, letting the compiler do the work - as woulduint64_t
- but this might result in function calls. Are you after pure idiomatic C where onlyuint32_t
can be relied on, with no special32 x 32 => 64
or inline assembly likeumulh
on ARM, etc., and using this as a building block for 64 x 64 ? – Simonettastdint.h
is available. I mentioned "no intrinsics" in the question. By the same token, no inline assembly. Vendor extensions in compilers are not desired (portability). 40 years ago, I hand-assembled Z80 instructions and entered the resulting hex codes. Now, no human can stand up to a computer in either chess or Go, and I would like my compilers to transform C code into well selected and arranged machine instructions on its own. I last coded this particular building block in assembly language for a Maxwell-class GPU about five years ago. – Monophthong_ExtInt(128)
which should work even on 32-bit machines (unlikeunsigned __int128
). ISO C23 may standardizeunsigned _BitInt(128) foo
. (en.wikipedia.org/wiki/C2x). But as yet, only clang supports that, not GCC trunk even with-std=gnu2x
: godbolt.org/z/r3cejeKsr . (A value that wide gets returned by reference on 32-bit ISAs, but otherwise clang sees to compile it somewhat efficiently for ARM32, but x86 32 looks messy, setting CF too early and having to save it across a mul.) – Troudeloup