The ISO C++ rules apply to your source code (always, regardless of the target machine), not to the asm the compiler chooses to make, especially for targets where signed integer wrapping just works.
The "as if" rules requires that the asm implementation of the function produce the same result as the C++ abstract machine, for every input value where the abstract machine doesn't encounter signed integer overflow (or other undefined behaviour). It doesn't matter how the asm produces those results, that's the entire point of the as-if rule. In some cases, like yours, the most efficient implementation would wrap and unwrap for some values that the abstract machine wouldn't. (Or in general, not wrap where the abstract machine does for unsigned
or gcc -fwrapv
.)
One effect of signed integer overflow being UB in the C++ abstract machine is that it lets the compiler optimize an `int` loop counter to pointer width, not redoing sign-extension every time through the loop, and things like that. Also, compilers can infer value-range restrictions. But that's totally separate from how they implement the logic in asm for some target machine. UB doesn't mean "required to fail"; in fact, just the opposite, unless you compile with `-fsanitize=undefined`. It's extra freedom for the optimizer to make asm that doesn't match the source, if you interpreted the source with more guarantees than ISO C++ actually gives (plus any guarantees the implementation makes beyond that, e.g. if you use gcc `-fwrapv`).
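For example, here's a minimal sketch (my own illustrative function, not from the question) of the kind of loop where that freedom matters: because `i` can't overflow without UB, the compiler may keep it in a full-width register and fold it straight into the addressing mode.

```c++
// Hypothetical example: signed overflow being UB lets the compiler assume the
// int counter never wraps past INT_MAX, so it can effectively widen `i` to
// pointer width and avoid re-doing sign extension every iteration for arr[i].
long long sum_first_n(const long long *arr, int n) {
    long long total = 0;
    for (int i = 0; i < n; ++i)
        total += arr[i];
    return total;
}
```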
For an expression like `x/2`, every possible `int x` has well-defined behaviour. For `2*x`, the compiler can assume that `x >= INT_MIN/2` and `x <= INT_MAX/2`, because larger magnitudes would involve UB.
`2*(x-1)+1` implies a legal value-range for `x` from `(INT_MIN+1)/2` to `(INT_MAX+1)/2`, e.g. on a 32-bit 2's complement target, -1073741823 (0xc0000001) to 1073741824 (0x40000000). On the positive side, `2*0x3fffffff` doesn't overflow, and the increment doesn't wrap because `2*(x-1)` was even (so at most `INT_MAX - 1`).
`2*x - 1` implies a legal value-range for `x` from `INT_MIN/2 + 1` to `INT_MAX/2`, e.g. on a 32-bit 2's complement target, -1073741823 (0xc0000001) to 1073741823 (0x3fffffff). So the largest value the expression can produce is `INT_MAX - 2` (2^31 - 3 here), because `INT_MAX` will be odd.
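One way to sanity-check those boundary values (my own sketch, not part of the original argument) is to evaluate both expressions in a wider type, where nothing can overflow, at the endpoints quoted above for 32-bit `int`:

```c++
#include <cstdio>

int main() {
    // Endpoints of the legal ranges for a 32-bit int, evaluated in 64-bit math
    // so the check itself can't overflow.
    long long lo1 = -1073741823LL, hi1 = 1073741824LL;  // legal x for 2*(x-1)+1
    long long lo2 = -1073741823LL, hi2 = 1073741823LL;  // legal x for 2*x - 1
    std::printf("2*(x-1)+1: %lld .. %lld\n", 2*(lo1-1)+1, 2*(hi1-1)+1);  // -2147483647 .. 2147483647
    std::printf("2*x - 1  : %lld .. %lld\n", 2*lo2-1,     2*hi2-1);      // -2147483647 .. 2147483645
}
```

Both stay within `[INT_MIN, INT_MAX]`, and the second expression indeed tops out at `INT_MAX - 2`.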
In this case, the more complicated expression's legal value-range is a superset of the simpler expression's, but in general that's not always the case.
They produce the same result for every `x` that's a well-defined input for both of them. And x86 asm (where wrapping is well-defined) that works like one or the other can implement either, producing correct results for all non-UB cases. So the compiler would be doing a bad job if it didn't make the same efficient asm for both.
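To make that concrete, here's a sketch of the two expressions as separate functions (function names are mine); a compiler targeting x86-64 is allowed to emit the same wrapping instruction sequence for both, e.g. the `lea`/`dec` pair discussed below:

```c++
int f1(int x) { return 2*(x - 1) + 1; }  // UB outside [-1073741823, 1073741824] with 32-bit int
int f2(int x) { return 2*x - 1; }        // UB outside [-1073741823, 1073741823]

// Both compute the same value wherever both are defined, so asm like
//     lea eax, [rdi + rdi]
//     dec eax
// (which simply wraps on inputs that would be UB) is a valid implementation of either.
```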
In general, 2's complement and unsigned binary integer math is commutative and associative (for operations where that's mathematically true, like `+` and `*`), and compilers can and should take full advantage, e.g. rearranging `a+b+c+d` into `(a+b)+(c+d)` to shorten dependency chains. (See an answer on Why doesn't GCC optimize a*a*a*a*a*a to (a*a*a)*(a*a*a)? for an example of GCC doing it with integer, but not FP.)
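For instance, a sketch of that reassociation (the function is my own example):

```c++
// Source spells this as ((a + b) + c) + d: a 3-add dependency chain.
// Because 2's complement addition is associative modulo 2^n, the compiler may
// evaluate it as (a + b) + (c + d): two independent adds that can run in
// parallel, then one final add, shortening the critical path.
int sum4(int a, int b, int c, int d) {
    return a + b + c + d;
}
```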
Unfortunately, GCC has sometimes been reluctant to do signed-int optimizations like that because its internals were treating signed integer math as non-associative, perhaps because of a misguided application of C++ UB rules to optimizing asm for the target machine. That's a GCC missed optimization; Clang didn't have that problem.
The whole situation is basically a mess, and the designers of C didn't anticipate the current sophistication of optimizing compilers. Languages like Rust are better suited to it: if you want wrapping, you can (and must) tell the compiler about it on a per-operation basis, for both signed and unsigned types, e.g. `x.wrapping_add(1)`.
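In C++, the closest spelled-out equivalent (my own sketch, not something from the answer above) is to do the operation in an unsigned type, where wrapping is defined, and convert back:

```c++
// Explicit wrapping increment: unsigned arithmetic wraps modulo 2^N by definition.
// Converting the result back to int is modular (well-defined) since C++20, and
// implementation-defined (but 2's complement in practice) before that.
int wrapping_inc(int x) {
    return static_cast<int>(static_cast<unsigned>(x) + 1u);
}
```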
Re: why does clang split up the `2*x` and the `-1` with `lea`/`dec`:
Clang is optimizing for latency on Intel CPUs before Ice Lake, saving one cycle of latency at the cost of an extra uop of throughput. (Compilers often favour latency, since modern CPUs are often wide enough to chew through the throughput costs, although it does eat up space in the out-of-order exec window for hiding cache-miss latency.)
`lea eax, [rdi + rdi - 1]` has 3-cycle latency on Skylake, vs. 1 for the LEA it used. (See Why does C++ code for testing the Collatz conjecture run faster than hand-written assembly? for some details.) On the AMD Zen family, it's break-even for latency (a complex LEA only has 2c latency) while still costing an extra uop. On Ice Lake and later Intel, even a 3-component LEA is still only 1 cycle, so it's pure downside there. See https://uops.info/, the entry for `LEA_B_I_D8 (R32)` (base, index, 8-bit displacement, with scale-factor = 1).
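As a sketch of the trade-off (the cycle counts are the ones quoted above; which sequence you actually get is up to the compiler and tuning flags):

```c++
int f(int x) { return 2*x - 1; }
// Two plausible x86-64 code-gen choices for f:
//   (a) lea eax, [rdi + rdi - 1]   // 1 uop, but 3c latency on Skylake (3-component LEA)
//   (b) lea eax, [rdi + rdi]       // 1c
//       dec eax                    // 1c  -> 2c total, at the cost of an extra uop
// Clang picks (b) to save a cycle on pre-Ice Lake Intel; on Zen a complex LEA is 2c
// so it's latency-neutral, and on Ice Lake and later it's 1c, so (b) is pure downside there.
```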
This tuning decision is unrelated to integer overflow.