Why is the XOR swap optimized into a normal swap using the MOV instruction?

While testing things on Compiler Explorer, I tried out the following overflow-free function for calculating the average of two unsigned 32-bit integers:

uint32_t average_1(uint32_t a, uint32_t b)
{
    if(a < b){
        a ^= b;
        b ^= a;
        a ^= b;
    }
    return b + (a - b) / 2;
}

which gets compiled into this (the same with -O1, -O2, or -O3 optimization enabled):

average_1:
        cmp     edi, esi
        jnb     .L2
        mov     eax, edi
        mov     edi, esi
        mov     esi, eax
.L2:
        sub     edi, esi
        shr     edi
        lea     eax, [rdi+rsi]
        ret

in which the swapping part of the code is optimized to use mov instructions with one additional register.

I've read through these questions:

Why don't people use xor swaps?
Cost of swapping variables through mov, xor

and got that:

  • Using the XOR swap makes the code harder to read and explain, and
  • The mov method has no negative effect on execution speed, since it requires no extra memory reads.

But in this case, since only the eax, esi, and edi registers are used (no memory at all), I thought the compiled assembly could equally well have been generated as:

average_1:
        cmp     edi, esi
        jnb     .L2
        xor     edi, esi
        xor     esi, edi
        xor     edi, esi

...

With no memory reads and the same number of instructions, I don't see any downside, and it feels odd that it gets changed. Clearly there is something I did not think through, but what is it?


Edit: To be clear, my question here is not "why not use the XOR swap" but rather
"when the XOR swap is used, and it doesn't seem to affect execution speed in this particular case, why is it still optimized away?"

Philippeville answered 7/3, 2022 at 14:15 Comment(11)
Peter Cordes may come along and explain the details of various processors as they relate to this. But, in general, an xor instruction takes more resources in the processor than a mov. To compute the result, the processor dispatches the instruction to an execution unit that performs an XOR. That is one of the simpler execution units, is very fast, and there are usually several of them in modern processors. But a mov often takes even less resources: The processor does not dispatch the instruction to an execution unit; it merely performs a “rename” in its internal register file…Osteoarthritis
… A typical processor in a modern general-purpose computer has many more registers than the named architectural registers. It uses them to execute many instructions at once: some multiply instruction may be using the values that eax and ebx had several instructions ago, while a new add instruction is being dispatched using the values they hold at a later point. To perform mov eax, edi, the processor merely notes that the “current” contents of eax are now in internal register 107, which also holds the contents of edi.Osteoarthritis
So the compiler designers have added code to the compiler to recognize the idiomatic XOR swap and replace it with more efficient move instructions.Osteoarthritis
I think it's the other way around: xor swaps have no positive effect on execution speed. Checking Godbolt, not even gcc for 8-bit AVR picked xor instructions. The main purposes of xor swaps seem rather to be: 1) posing 2) obfuscation. There seems to be no reason not to write it as if(a < b){ uint32_t tmp = a; a = b; b = tmp; }Finnougrian
@Finnougrian Yup, there is no reason to do so. But it's one of those fancy tricks that everyone knows about and is sorta cool, so it's hard to resist trying it out. My question here is not about "why not use the XOR swap" but rather "when the XOR swap is used, since it doesn't seem to affect execution speed, why is it still optimized away?"Philippeville
After all, the one I keep using will still be if(a < b){ uint32_t tmp = a; a = b; b = tmp; } :)Philippeville
@Eric Maybe Peter has already explained the details? Definitely worth a read if you have some spare time. :-)Pennell
@AdrianMole Ah, I'd run through the "free MOV" section of the Intel dev manual but not this SO post. Much love.Philippeville
see Reasons for avoidance of XOR swap in practiceCairngorm
Based on your calling conventions, you are building for x86-64, so you can do much better by simply upcasting to uint64_t. return ((uint64_t)a + (uint64_t)b) / 2; is shorter and simpler source code, and compiles to two zero-extend mov, one lea addition, one shift, and no compares or branches.Skink
Even as 32-bit code this is arguably superior. You get one addition, one setc or adc to materialize the carry flag, and one shrd, plus a few extra moves and zeros upfront. shrd reg, reg, imm8 is usually cheap (single-uop on most modern Intel CPUs, it appears, though more on AMD).Skink

Clang does the same thing. Probably for compiler-construction and CPU architecture reasons:

  • Disentangling that logic into just a swap may allow better optimization in some cases; definitely something it makes sense for a compiler to do early so it can follow values through the swap.

  • Xor-swap is total garbage for swapping registers, the only advantage being that it doesn't need a temporary. But xchg reg,reg already does that better.


I'm not surprised that GCC's optimizer recognizes the xor-swap pattern and disentangles it to follow the original values. In general, this makes constant-propagation and value-range optimizations possible through swaps, especially for cases where the swap wasn't conditional on the values of the vars being swapped. This pattern-recognition probably happens soon after transforming the program logic to GIMPLE (SSA) representation, so at that point it will forget that the original source ever used an xor swap, and not think about emitting asm that way.
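
As a toy illustration (hypothetical code, not from the question): once the pattern is recognized, an unconditional xor-swap followed by a use of one variable is just a use of the other input, and compilers typically reduce it to a single register move or less.

#include <stdint.h>

uint32_t pick_second(uint32_t a, uint32_t b)
{
    a ^= b;        // unconditional xor-swap
    b ^= a;
    a ^= b;
    return a;      // after the swap this is simply the original b
}
// GCC and clang typically compile this to just: mov eax, esi / ret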

Hopefully sometimes that lets it then optimize down to only a single mov, or two movs, depending on register allocation for the surrounding code (e.g. if one of the vars can move to a new register, instead of having to end up back in the original locations). And whether both variables are actually used later, or only one. Or if it can fully disentangle an unconditional swap, maybe no mov instructions.
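
Another hypothetical example of that: if only one of the swapped values is live afterwards, a conditional xor-swap collapses into what is effectively a max operation, and compilers typically emit a cmp plus a cmov (or a branch around a single mov) with no xor sequence left at all.

uint32_t max_via_swap(uint32_t a, uint32_t b)
{
    if (a < b) {       // conditional xor-swap...
        a ^= b;
        b ^= a;
        a ^= b;
    }
    return a;          // ...but only 'a' is used afterwards: this is max(a, b)
}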

But worst case, three mov instructions needing a temporary register is still better, unless it's running out of registers. I'd guess GCC is not smart enough to use xchg reg,reg instead of spilling something else or saving/restoring another tmp reg, so there might be corner cases where this optimization actually hurts.

(Apparently GCC -Os does have a peephole optimization to use xchg reg,reg instead of 3x mov: PR 92549 was fixed for GCC10. It looks for that quite late, during RTL -> assembly. And yes, it works here: turning your xor-swap into an xchg: https://godbolt.org/z/zs969xh47)

xor-swap has worse latency and defeats mov-elimination

With no memory reads and the same number of instructions, I don't see any downside, and it feels odd that it gets changed. Clearly there is something I did not think through, but what is it?

Instruction count is only a rough proxy for one of three things that are relevant for perf analysis: front-end uops, latency, and back-end execution ports. (And machine-code size in bytes: x86 machine-code instructions are variable-length.)

It's the same size in machine-code bytes, and the same number of front-end uops, but the critical-path latency is worse: 3 cycles from either input to output a for the xor-swap, and 2 cycles from either input to output b.

MOV-swap has at worst 1-cycle and 2-cycle latencies from inputs to outputs, or less with mov-elimination. (Which can also avoid using back-end execution ports, especially relevant for CPUs like IvyBridge and Tiger Lake with a front-end wider than the number of integer ALU ports. And Ice Lake, except Intel disabled mov-elimination on it as an erratum workaround; not sure if it's re-enabled for Tiger Lake or not.)
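
To make the chains concrete, here is the xor-swap annotated with an assumed 1 cycle per ALU op; the cycle numbers are illustrative, not measurements:

void xor_swap_chain(uint32_t *pa, uint32_t *pb)
{
    uint32_t a = *pa, b = *pb;
    a ^= b;      // cycle 1: needs the original a and b
    b ^= a;      // cycle 2: needs the cycle-1 result
    a ^= b;      // cycle 3: needs the cycle-1 and cycle-2 results
    *pa = a;     // the swapped-in value is ready only after 3 dependent ops
    *pb = b;     // ...this one after 2
}
// The mov-swap (tmp = a; a = b; b = tmp;) is at worst two independent
// 1- and 2-cycle chains instead, and mov-elimination can make it ~free.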

If you're going to branch, just duplicate the averaging code

GCC's real missed optimization here (even with -O3) is tail-duplication: it would result in about the same static code size, just a couple of extra bytes since these are mostly 2-byte instructions. The big win is that the a<b path would then be the same length as the other, instead of twice as long: first doing a swap and then running the same 3 uops for averaging.

update: GCC will do this for you with -ftracer (https://godbolt.org/z/es7a3bEPv), optimizing away the swap. (That's only enabled manually or as part of -fprofile-use, not at -O3, so it's probably not a good idea to use all the time without PGO, potentially bloating machine code in cold functions / code-paths.)

Doing it manually in the source (Godbolt):

uint32_t average_1_taildup(uint32_t a, uint32_t b)
{
    if(a < b){
        return a + (b - a) / 2;
    }else {
        return b + (a - b) / 2;
    }
}
# GCC11.2 -O3
average_1_taildup:
        cmp     edi, esi
        jnb     .L5
        sub     esi, edi
        shr     esi
        lea     eax, [rsi+rdi]
        ret
.L5:
        sub     edi, esi
        shr     edi
        lea     eax, [rdi+rsi]
        ret

Clang compiles both version 1 and 1_taildup into code using cmov (e.g. cmp / mov / cmovb / cmovb, or making a bit of a mess for the tail-duplication version).

But if you're going to go branchless then your average_3 is superior:

uint32_t average_3(uint32_t a, uint32_t b)
{
    return (a & b) + ((a ^ b) >> 1);
}
# clang 13.0 -O3
average_3:
        mov     eax, esi
        and     eax, edi
        xor     esi, edi
        shr     esi
        add     eax, esi
        ret

Both GCC and clang's versions are only 5 instructions (plus ret), but clang arranges it so the critical-path latency is only 3 cycles (3 single-uop instructions) from either input to the output, even without mov-elimination. (GCC has one chain that's 4 instructions long including a mov.)

Efficient non-overflowing unsigned mean

See also Efficient overflow-immune arithmetic mean in C/C++ - widening to uint64_t can be even cheaper, especially when inlining, on a 64-bit machine. (As discussed in comments under the question, e.g. https://godbolt.org/z/sz53eEYh9 shows code from the existing answers at the time I commented.)
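
A minimal sketch of the widening idea (assuming a 64-bit target, where the 64-bit add and shift are single cheap instructions):

uint32_t average_widen(uint32_t a, uint32_t b)
{
    return (uint32_t)(((uint64_t)a + b) / 2);   // the 33-bit sum can't overflow a uint64_t
}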

Another good option is this, but usually not as good as widening. It works because a + b = 2*(a & b) + (a ^ b), so floor((a + b) / 2) = (a & b) + (a ^ b)/2 with no intermediate overflow:

  return (a&b) + (a^b)/2;

If compilers recognized any of these idioms, they could use the asm add/rcr trick, though, which is even more efficient than widening to unsigned __int128 for averaging uint64_t.
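
A hand-written sketch of that trick (GNU extended asm for x86-64, the function name is mine; purely illustrative, since compilers don't currently generate this from the C idioms above):

static inline uint64_t avg_u64_add_rcr(uint64_t a, uint64_t b)
{
    __asm__("add %[b], %[a]\n\t"   // 65-bit sum: low 64 bits in a, bit 64 in CF
            "rcr $1, %[a]"         // rotate CF back in at the top while shifting right
            : [a] "+r"(a)
            : [b] "r"(b)
            : "cc");
    return a;                      // floor((a + b) / 2), no overflow possible
}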

Unspeakable answered 7/3, 2022 at 20:19 Comment(3)
Thank you. Should I see GCC's missed optimization as a bug, or does it simply not do tail duplication (if so, that would be weird)?Philippeville
The GCC docs say tail duplication is applied when the -ftracer flag is used, which isn't enabled by -O1, -O2, or -O3; might that be the issue?Philippeville
@Uduru: Interesting, yeah adding -ftracer gets GCC to optimize your original source that way, eliminating the swap. godbolt.org/z/es7a3bEPv (thus being a great example of why it's useful for GCC to fully see through xor-swaps.) GCC does do some loop and function tail-duplication sometimes, but IDK what kind of threshold it has for it. -ftracer is on with -fprofile-use, which makes sense: it can increase code-size more than you'd want if done too aggressively in cold code; it's neutral here because the swap can then optimize away.Unspeakable
