Clang does the same thing. Probably for compiler-construction and CPU architecture reasons:
Disentangling that logic into just a swap may allow better optimization in some cases; it's definitely something that makes sense for a compiler to do early, so it can follow values through the swap.
Xor-swap is total garbage for swapping registers; the only advantage is that it doesn't need a temporary. But xchg reg,reg already does that better.
I'm not surprised that GCC's optimizer recognizes the xor-swap pattern and disentangles it to follow the original values. In general, this makes constant-propagation and value-range optimizations possible through swaps, especially for cases where the swap wasn't conditional on the values of the vars being swapped. This pattern-recognition probably happens soon after transforming the program logic to GIMPLE (SSA) representation, so at that point it will forget that the original source ever used an xor swap, and not think about emitting asm that way.
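For example (a hypothetical function, not from the question), once the swap is disentangled, constant propagation sees right through it:
#include <stdint.h>

// Hypothetical illustration: an unconditional xor-swap of two known values.
// Once the compiler sees through the swap in its SSA form, constant
// propagation works as if a temporary had been used, and the whole function
// typically folds to a single  mov eax, 42  with no xor instructions at all.
uint32_t swap_then_sum(void)
{
    uint32_t a = 10, b = 32;
    a ^= b;  b ^= a;  a ^= b;   // xor-swap: afterwards a == 32, b == 10
    return a + b;               // 42, computable at compile time
}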
Hopefully that sometimes lets it optimize down to only a single mov, or two movs, depending on register allocation for the surrounding code (e.g. if one of the vars can move to a new register, instead of having to end up back in the original location), and on whether both variables are actually used later, or only one. Or, if it can fully disentangle an unconditional swap, maybe no mov instructions at all.
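As a concrete (hypothetical) example of the good case: when only one of the swapped variables is used later, a compiler that has disentangled the swap can collapse the whole thing to a single mov:
#include <stdint.h>

// Only the post-swap 'a' (i.e. the original 'b') is used, so a compiler that
// sees through the swap can emit just
//   mov eax, esi
//   ret
// instead of materializing three xor instructions.
uint32_t use_only_one(uint32_t a, uint32_t b)
{
    a ^= b;  b ^= a;  a ^= b;   // swap a and b
    return a;                   // == original b
}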
But the worst case, three mov instructions needing a temporary register, is still better, unless the compiler is running out of registers. I'd guess GCC isn't smart enough to use xchg reg,reg instead of spilling something else or saving/restoring another tmp reg, so there might be corner cases where this optimization actually hurts.
(Apparently GCC -Os does have a peephole optimization to use xchg reg,reg instead of 3x mov: PR 92549 was fixed for GCC10. It looks for that quite late, during RTL -> assembly. And yes, it works here, turning your xor-swap into an xchg: https://godbolt.org/z/zs969xh47)
xor-swap has worse latency and defeats mov-elimination
From the question: “With no memory reads, and the same number of instructions, I don't see any bad impacts and it feels odd that it would be changed. Clearly there is something I did not think through, but what is it?”
Instruction count is only a rough proxy for one of three things that are relevant for perf analysis: front-end uops, latency, and back-end execution ports. (And machine-code size in bytes: x86 machine-code instructions are variable-length.)
It's the same size in machine-code bytes, and same number of front-end uops, but the critical-path latency is worse: 3 cycles from input a to output a for xor-swap, and 2 from input b to output a, for example.
MOV-swap has at worst 1-cycle and 2-cycle latencies from inputs to outputs, or less with mov-elimination. (Which can also avoid using back-end execution ports, especially relevant for CPUs like IvyBridge and Tiger Lake with a front-end wider than the number of integer ALU ports. And Ice Lake, except Intel disabled mov-elimination on it as an erratum workaround; not sure if it's re-enabled for Tiger Lake or not.)
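A rough C-level sketch of the two dependency structures (locals standing in for registers; this is just to show the chain shapes, not code you'd compile for that purpose):
#include <stdint.h>

// xor-swap: every step depends on the one before it, so the final value of
// *a sits at the end of a 3-deep serial chain of ALU operations.
static inline void xor_swap(uint32_t *a, uint32_t *b)
{
    *a ^= *b;   // step 1: needs the original a and b
    *b ^= *a;   // step 2: needs the result of step 1
    *a ^= *b;   // step 3: needs the result of step 2
}

// mov-swap: the copies are independent except that writing *b needs the
// temporary, so no output is more than two copies away from an input, and
// mov-elimination can make some of the copies effectively free.
static inline void mov_swap(uint32_t *a, uint32_t *b)
{
    uint32_t tmp = *a;  // copy 1
    *a = *b;            // copy 2: independent of copy 1
    *b = tmp;           // copy 3: depends only on copy 1
}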
Also related:
If you're going to branch, just duplicate the averaging code
GCC's real missed optimization here (even with -O3) is tail-duplication: it results in about the same static code size, just a couple of extra bytes since these are mostly 2-byte instructions. The big win is that the a<b path then becomes the same length as the other, instead of twice as long: first doing a swap and then running the same 3 uops for averaging.
Update: GCC will do this for you with -ftracer (https://godbolt.org/z/es7a3bEPv), optimizing away the swap. (That's only enabled manually or as part of -fprofile-use, not at -O3, so it's probably not a good idea to use it all the time without PGO, as it could bloat machine code in cold functions / code-paths.)
Doing it manually in the source (Godbolt):
uint32_t average_1_taildup(uint32_t a, uint32_t b)
{
    if (a < b) {
        return a + (b - a) / 2;
    } else {
        return b + (a - b) / 2;
    }
}
# GCC11.2 -O3
average_1_taildup:
        cmp     edi, esi
        jnb     .L5
        sub     esi, edi
        shr     esi
        lea     eax, [rsi+rdi]
        ret
.L5:
        sub     edi, esi
        shr     edi
        lea     eax, [rdi+rsi]
        ret
Clang compiles both version 1 and 1_taildup into code using cmov (e.g. cmp / mov / cmovb / cmovb, or making a bit of a mess for the tail-duplication version).
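For reference, a hedged C sketch of the kind of select-then-average form that the cmov output corresponds to (the function name is illustrative, not from the question):
#include <stdint.h>

// Pick min and max with conditional selects (typically lowered to cmp +
// cmov rather than a branch), then average the non-negative difference
// so nothing wraps.
uint32_t average_select(uint32_t a, uint32_t b)
{
    uint32_t lo = (a < b) ? a : b;
    uint32_t hi = (a < b) ? b : a;
    return lo + (hi - lo) / 2;
}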
But if you're going to go branchless, then your average_3 is superior:
uint32_t average_3(uint32_t a, uint32_t b)
{
    return (a & b) + ((a ^ b) >> 1);
}
# clang 13.0 -O3
average_3:
        mov     eax, esi
        and     eax, edi
        xor     esi, edi
        shr     esi
        add     eax, esi
        ret
Both GCC's and clang's versions are only 5 instructions (plus ret), but clang arranges it so the critical-path latency is only 3 cycles (3 single-uop instructions) from either input to the output, even without mov-elimination. (GCC has one chain that's 4 instructions long, including a mov.)
Efficient non-overflowing unsigned mean
See also Efficient overflow-immune arithmetic mean in C/C++ - widening to uint64_t can be even cheaper, especially when inlining, on a 64-bit machine. (As discussed in comments under the question, e.g. https://godbolt.org/z/sz53eEYh9 shows code from the existing answers at the time I commented.)
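A minimal sketch of the widening version, assuming a 64-bit target where the 64-bit add and shift are single cheap instructions:
#include <stdint.h>

// Widen so the sum can't overflow, then halve and truncate back to 32 bits.
uint32_t average_widen(uint32_t a, uint32_t b)
{
    return (uint32_t)(((uint64_t)a + b) / 2);
}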
Another good option is this, but usually not as good as widening:
return (a&b) + (a^b)/2;
If compilers recognized any of these idioms, though, they could use the asm add/rcr trick, which is even more efficient than widening to unsigned __int128 for averaging uint64_t.
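For uint64_t, a hedged way to express the full-width sum in C is GCC/clang's __builtin_add_overflow; current compilers tend to turn this into add + setc/adc + shrd rather than add/rcr, but it computes the same 65-bit sum shifted right by one:
#include <stdint.h>

// Capture the carry-out of the 64-bit add and put it back in as the top bit
// of the halved sum.  Uses the GCC/clang __builtin_add_overflow extension.
uint64_t average_u64(uint64_t a, uint64_t b)
{
    uint64_t sum;
    uint64_t carry = __builtin_add_overflow(a, b, &sum);  // 1 if a + b wrapped
    return (sum >> 1) | (carry << 63);                    // (65-bit sum) >> 1
}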
Comments:

An xor instruction takes more resources in the processor than a mov. To compute the result, the processor dispatches the instruction to an execution unit that performs an XOR. That is one of the simpler execution units, it is very fast, and there are usually several of them in modern processors. But a mov often takes even less resources: the processor does not dispatch the instruction to an execution unit; it merely performs a “rename” in its internal register file… – Osteoarthritis

Older instructions still in flight may need the values eax and ebx had several instructions ago, while a new add instruction is being dispatched using the values they have from a later time. To perform mov eax, edi, the processor merely notes that the “current” contents of eax are now in internal register 107, which also holds the contents of edi. – Osteoarthritis

if(a < b){ uint32_t tmp = a; a = b; b = tmp; } – Finnougrian

if(a < b){ uint32_t tmp = a; a = b; b = tmp; } :) – Philippeville

Widening to uint64_t: return ((uint64_t)a + (uint64_t)b) / 2; is shorter and simpler source code, and compiles to two zero-extending movs, one lea addition, one shift, and no compares or branches. – Skink

For uint64_t, widening needs a setc or adc to materialize the carry flag, and one shrd, plus a few extra moves and zeros up front. shrd reg, reg, imm8 is usually cheap (single-uop on most modern Intel CPUs, it appears, though more on AMD). – Skink