The arithmetic mean of two unsigned integers is defined as:
mean = (a+b)/2
Directly implementing this in C/C++ may overflow and produce a wrong result. A correct implementation would avoid this. One way of coding it could be:
mean = a/2 + b/2 + (a%2 + b%2)/2
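For concreteness, a small sketch of both versions in C++ (the function name and the 32-bit test values are just for illustration):

#include <cstdint>
#include <iostream>

// Truncating mean without overflow: halve each operand first, then add
// back the 1 that is lost when both operands are odd.
static std::uint32_t mean_no_overflow(std::uint32_t a, std::uint32_t b)
{
    return a / 2 + b / 2 + (a % 2 + b % 2) / 2;
}

int main()
{
    std::uint32_t a = 4000000000u, b = 4000000000u;
    std::cout << (a + b) / 2 << '\n';            // naive version wraps: prints 1852516352
    std::cout << mean_no_overflow(a, b) << '\n'; // prints 4000000000
}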
But this produces rather a lot of code with typical compilers. In assembler, this can usually be done much more efficiently. For example, the x86 can do it in the following way (assembler pseudo code, I hope you get the point):
ADD a,b ; addition, leaving the overflow condition in the carry bit
RCR a,1 ; rotate right through carry, effectively a division by 2
After those two instructions, the result is in a, and the remainder of the division is in the carry bit. If correct rounding is desired, a third ADC instruction would have to add the carry into the result.
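For illustration, a sketch of that two-instruction sequence using GNU extended inline asm (GCC/Clang on x86-64; the function name is mine, and MSVC for x64 does not support inline asm at all):

#include <cstdint>

// Computes the truncated mean (a + b) / 2; the remainder ends up in the
// carry flag and is discarded here.
static std::uint64_t mean_add_rcr(std::uint64_t a, std::uint64_t b)
{
    asm("add %1, %0\n\t"   // a += b, the carry flag receives the overflow bit
        "rcr $1, %0"       // rotate right through carry: carry becomes the new MSB
        : "+r"(a)
        : "r"(b)
        : "cc");
    return a;
}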
Note that the RCR instruction is used, which rotates a register through the carry. In our case, it is a rotate by one position, so that the previous carry becomes the most significant bit of the register, and the new carry holds the register's previous LSB. It seems that MSVC doesn't even offer an intrinsic for this instruction.
Is there a known C/C++ pattern that can be expected to be recognized by an optimizing compiler so that it produces such efficient code? Or, more generally, is there a rational way to write C/C++ source code so that the compiler makes use of the carry bit to optimize the generated code?
EDIT: A 1-hour lecture about std::midpoint: https://www.youtube.com/watch?v=sBtAGxBh-XI
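For reference, a minimal C++20 example of std::midpoint from <numeric> (values chosen so that the naive sum would overflow):

#include <cstdint>
#include <iostream>
#include <numeric>   // std::midpoint (C++20)

int main()
{
    std::uint32_t a = 4000000000u, b = 4000000000u;
    // std::midpoint never overflows; note that it rounds toward the first
    // argument, so for a > b it rounds up rather than truncating.
    std::cout << std::midpoint(a, b) << '\n';   // prints 4000000000
}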
Comments:

Wow! mean = a + (b - a)/2 when a > b, and a - (a - b)/2 otherwise. Godbolt shows some 10 assembler instructions. – Madriene

((wider_type)a+b)/2. – Overflight

Instructions that read the carry, such as adc and rcr, simply have an input dependency on that register (the flags), along with the integer reg(s). – Diploblastic

For uint64_t, using unsigned __int128 as the wider type: compilers materialize the high half 0/1 and then shrd it in. godbolt.org/z/sz53eEYh9 shows that and other formulae proposed in answers and comments. On ARM64, that only takes 3 total instructions, adds/adcs/extr. So it's only 1 worse than you could do in asm, if ARM64 had an RCR. – Diploblastic

For uint8/uint16: felixcloutier.com/x86/pavgb:pavgw – Elfrieda

add x8, x8, w0, uxtw folds one of the two uxtw operations into the add. godbolt.org/z/9McP9MTjc (ARM64 has a nice selection of stuff like that for mixing 32- and 64-bit integers / pointers, including addressing modes that SXTW a 32-bit val). – Diploblastic