Efficient overflow-immune arithmetic mean in C/C++

The arithmetic mean of two unsigned integers is defined as:

mean = (a+b)/2

Directly implementing this in C/C++ may overflow and produce a wrong result. A correct implementation would avoid this. One way of coding it could be:

mean = a/2 + b/2 + (a%2 + b%2)/2

But this produces rather a lot of code with typical compilers. In assembler, this usually can be done much more efficiently. For example, the x86 can do this in the following way (assembler pseudo code, I hope you get the point):

ADD a,b   ; addition, leaving the overflow condition in the carry bit
RCR a,1   ; rotate right through carry, effectively a division by 2

After those two instructions, the result is in a, and the remainder of the division is in the carry bit. If correct rounding is desired, a third ADC instruction would have to add the carry into the result.

Note that the RCR instruction is used, which rotates a register through the carry. In our case, it is a rotate by one position, so that the previous carry becomes the most significant bit in the register, and the new carry holds the previous LSB from the register. It seems that MSVC doesn't even offer an intrinsic for this instruction.
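For reference, the carry-out of the addition can at least be captured explicitly with compiler builtins. Below is a minimal sketch using GCC/Clang's __builtin_add_overflow (the name mean_rcr is just illustrative); it only mirrors the semantics of ADD + RCR, with no claim that a compiler will actually emit those two instructions for it:

#include <limits.h>

/* Unsigned mean with the carry-out captured explicitly, mirroring ADD + RCR. */
unsigned mean_rcr(unsigned a, unsigned b)
{
    unsigned sum;
    unsigned carry = __builtin_add_overflow(a, b, &sum);  /* 1 if a + b wrapped */
    /* Shift the sum right by one and put the carry back in as the new MSB. */
    return (sum >> 1) | (carry << (sizeof(unsigned) * CHAR_BIT - 1));
}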

Is there a known C/C++ pattern that can be expected to be recognized by an optimizing compiler so that it produces such efficient code? Or, more generally, is there a rational way to program at the C/C++ source level so that the carry bit is used by the compiler to optimize the generated code?

EDIT:

A 1-hour lecture about std::midpoint: https://www.youtube.com/watch?v=sBtAGxBh-XI

Wow!

EDIT2: Great discussion on Microsoft blog
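For reference, the C++20 facility in question is std::midpoint from <numeric>. A minimal usage sketch (it is specified never to overflow; when the sum is odd it rounds toward the first argument):

#include <cstdint>
#include <numeric>   // std::midpoint (C++20)

uint32_t mean(uint32_t a, uint32_t b)
{
    return std::midpoint(a, b);   // overflow-free by specification
}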

Madriene answered 7/2, 2022 at 13:6 Comment(19)
In C++20: en.cppreference.com/w/cpp/numeric/midpoint – Laurenalaurence
@m88: Good find! On MSVC, however, the generated code is still not quite as efficient, because it seems to implement the formula: mean = a + (b - a)/2 when a<b, and a - (a - b)/2 otherwise. Godbolt shows some 10 assembler instructions. – Madriene
Wouldn't overflow rather set the overflow flag and not carry? How does this even make sense on the machine code level? – Eusebioeusebius
Consider ((wider_type)a+b)/2. – Overflight
@Lundin: On the x86, there are two flags: carry and overflow. The overflow flag captures the overflow state for signed integer add, whereas the carry captures the overflow from unsigned integer add. The same instructions are used for either type of integer. It is the unsigned case I'm interested in. – Madriene
You should clarify that you're looking for unsigned arithmetic mean. Your "add + rcr" gives the wrong answer for signed integers. Many compilers have an intrinsic for "add and report carry-out" which you can use. Then divide by 2 and set the top bit based on carry-out, or use a rotation intrinsic. – Enormous
@Raymond Chen: Good point! I've edited the text. – Madriene
@Madriene Yeah well you mention "If correct rounding is desired, a third ADC instruction would have to add the carry into the result." This extra addition by +1 only makes sense for rounding direction of signed negative numbers. But in case of signed numbers, carry isn't set, but overflow. Hence my confusion. – Eusebioeusebius
int a = 50, b = 100; cout << ((a >> 1) + (b >> 1)); This seems to be working for me. It would not be trustworthy with two odd numbers because of truncation ( (7+3)/2 would show up as 4). – Homeomorphism
@Topological Sort: The odd numbers are the problem. – Madriene
@sh-: I'm wondering if using the carry flag is that efficient on a modern x86 CPU. There's only one flag register, after all, and it's shared by many operations. I think it's still renamed, but it does introduce some quite tricky dependencies for the CPU to track. – Costume
@MSalters: Reading CF is no worse than reading the integer output of another instruction. It always has to get renamed whenever written, usually sharing a PRF entry with an integer reg for instructions that write a register and set FLAGS, and like any reg is capable of being renamed <pipeline-width> times in the same clock cycle. CF is even renamed separately from the rest of EFLAGS, since some instructions write only it, others leave it unmodified. Instructions that read CF, like adc and rcr, simply have an input dependency on that register, along with the integer reg(s). – Diploblastic
@MSalters: On the x86, the conditional jump instructions also depend on the flags, so both the compiler and the processor are bound to have infrastructure to keep track of them. I wouldn't expect any significant performance problem. – Madriene
@chux-ReinstateMonica: Widening compiles really efficiently, as expected for types less than a reg width. Surprisingly also fairly decent for uint64_t using unsigned __int128 as the wider type: compilers materialize the high half 0/1 and then shrd it in. godbolt.org/z/sz53eEYh9 shows that and other formulae proposed in answers and comments. On ARM64, that only takes 3 total instructions, adds/adcs/extr. So it's only 1 worse than you could do in asm, if ARM64 has an RCR. – Diploblastic
@Peter Cordes: Super comparison! Very interesting to see such differences in optimization! – Madriene
@m88: Now that C++20 provides a portable way to expose it, compilers might want to start providing an intrinsic / builtin for non-overflowing unsigned midpoint that can compile to add/rcr or whatever is optimal on the target ISA. – Diploblastic
Not really a solution, but worth mentioning, SSE2/AVX2 have instructions for packed averaging of uint8/uint16: felixcloutier.com/x86/pavgb:pavgw – Elfrieda
Added an issue: github.com/llvm/llvm-project/issues/53648 – Paraplegia
@RaymondChen: Nice write-up in your blog article, especially showing various ISAs and the effect of their calling conventions requiring sign-extended or whatever for the case where this function doesn't inline. Your AArch64 code has a missed optimization, though: clang uses add x8, x8, w0, uxtw to fold one of the two uxtw operations into the add. godbolt.org/z/9McP9MTjc (ARM64 has a nice selection of stuff like that for mixing 32 and 64-bit integers / pointers, including addressing modes that SXTW a 32-bit val). – Diploblastic
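To illustrate the widening approach discussed in the comments above for the widest native type, here is a sketch for uint64_t that assumes a compiler providing unsigned __int128 (GCC and Clang on 64-bit targets):

#include <cstdint>

// Widen to 128 bits so the sum cannot wrap, then halve.
uint64_t mean_u64(uint64_t a, uint64_t b)
{
    return static_cast<uint64_t>((static_cast<unsigned __int128>(a) + b) >> 1);
}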

The following method avoids overflow and should result in fairly efficient assembly (example) without depending on non-standard features:

    mean = (a&b) + (a^b)/2;
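This works because a + b == (a ^ b) + 2*(a & b): the AND term holds the bits the operands share (which would generate carries), and the XOR term holds the differing bits, so halving the XOR term and adding the AND term yields the floored mean without ever forming the full sum. A quick sanity-check sketch (the wrapper name and the brute-force harness are just illustrative):

#include <cassert>
#include <cstdint>

// Overflow-free unsigned mean: shared bits plus half of the differing bits.
uint32_t mean(uint32_t a, uint32_t b)
{
    return (a & b) + ((a ^ b) >> 1);
}

int main()
{
    // Exhaustive check over all 8-bit operand pairs against the plain formula,
    // which cannot overflow for such small inputs.
    for (uint32_t a = 0; a < 256; ++a)
        for (uint32_t b = 0; b < 256; ++b)
            assert(mean(a, b) == (a + b) / 2);
}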
Retinite answered 7/2, 2022 at 14:1 Comment(3)
@MSalters: I think it's intended to give identical results to add/rcr, i.e. the unsigned mean, as discussed in comments on the question. The Godbolt link in this answer shows it for unsigned mean(unsigned a, unsigned b) – Diploblastic
Quite good! I found this: stackoverflow.com/questions/4844165/… which isn't about C/C++, but offers the same solution. Thanks! – Madriene
This trick is also used in SWAR and can even be used to average 16/24-bit color pixels: Quick colour averaging, Fast Averaging of High Color (16 bit) Pixels, How does this color blending trick that works on color components in parallel work? – Iconostasis

There are three typical methods of computing the average without overflow, one of which (widening) is limited to uint32_t on 64-bit architectures, since it needs a wider integer type.

// average "SWAR" / Montgomery
uint32_t avg(uint32_t a, uint32_t b) {
   return (a & b) + ((a ^ b) >> 1);
}

// in case the relative magnitudes are known
uint32_t avg2(uint32_t min, uint32_t max) {
  return min + (max - min) / 2;
}
// in case the relative magnitudes are not known (still constrained: b - a must not overflow int32_t; see the comments below)
uint32_t avg2_constrained(uint32_t a, uint32_t b) {
  return a + (int32_t)(b - a) / 2;
}

// average by widening (not applicable to uint64_t)
uint32_t avg3(uint32_t a, uint32_t b) {
   return ((uint64_t)a + b) >> 1;
}

The corresponding assembler sequences from clang are, first for x86-64:

avg(unsigned int, unsigned int)
    mov     eax, esi
    and     eax, edi
    xor     esi, edi
    shr     esi
    add     eax, esi

avg2(unsigned int, unsigned int)
    sub     esi, edi
    shr     esi
    lea     eax, [rsi + rdi]

avg3(unsigned int, unsigned int)
    mov     ecx, edi
    mov     eax, esi
    add     rax, rcx
    shr     rax

and for ARM64:

avg(unsigned int, unsigned int)         
    and     w8, w1, w0
    eor     w9, w1, w0
    add     w0, w8, w9, lsr #1
    ret
avg2(unsigned int, unsigned int)
    sub     w8, w1, w0
    add     w0, w0, w8, lsr #1
    ret
avg3(unsigned int, unsigned int):                                       
    mov     w8, w1
    add     x8, x8, w0, uxtw
    lsr     x0, x8, #1
    ret

Of these three versions, avg2 performs on ARM64 as well as the optimal carry-flag sequence. It is likely that avg3 would perform as well too: the mov w8, w1 is only there to clear the top 32 bits, which may be unnecessary when the compiler knows they were already cleared by whatever earlier instruction produced the value.

A similar statement can be made about the x86-64 version of avg3, which in the optimal case would compile to just the two meaningful instructions:

add     rax, rcx
shr     rax

See https://godbolt.org/z/5TMd3zr81 for online comparison.

The "SWAR"/Montgomery version is typically only justified, when trying to compute multiple averages packed to a single (large) integer in which case the full formula contains masking with the bit positions of the highest bits: return (a & b) + (((a ^ b) >> 1) & ~kH;.

Paraplegia answered 7/2, 2022 at 14:25 Comment(6)
godbolt.org/z/sz53eEYh9 shows widening and various other formulae with GCC and clang for x86-64 and ARM64. (including for uint64_t widening to unsigned __int128, where they materialize a high half 0 or 1 and double-shift / extr it back in, emulating RCR.) – Diploblastic
Is avg2 correct when a > b? – Madriene
@sh-: No, it's not. Try it with a constant input on Godbolt, e.g. just add a=10, b=5; to that function, and it then optimizes to mov eax, -2147483641. I wondered if that was too good to be true; turns out it is unless you know which input is always greater. – Diploblastic
@Peter Cordes: This has a better chance, but I'm still not entirely sure there is no invalid corner case: a + int(b - a) / 2 (of course, the correct size int must be used). – Madriene
Oh yes, I grabbed the formula from the T* midpoint(T *a, T *b) overload, where (b-a) is signed. – Paraplegia
@AkiSuihkonen: So the constraint on avg2_constrained is that there's no signed overflow in b-a and/or the unsigned->signed conversion doesn't overflow? For a=0; b=0xFFFFFFFFu, it produces 0 rather than 2147483647, so it doesn't work for all unsigned inputs. godbolt.org/z/WrfG9fjTd – Diploblastic
