If you need this non-zero guarantee for an input to __builtin_clz (which counts leading-zero bits, but gives an undefined result for an input of zero), a useful trick is y = x|1. That doesn't affect the number of leading zeros for any non-zero input, because you're setting the last position CLZ will look at, which can only change the result if all the higher bits are zero. And if x=1, then x|1 is still 1.
Similarly, as setup for __builtin_ctz (count trailing zeros), y = x | (1UL<<31) will set the MSB of a 32-bit integer. Or more portably, x | ( 1ULL << (sizeof(x)*CHAR_BIT-1) ).
Unconditionally setting one of the bits will change some non-zero values; this is only useful in certain cases.
To get exactly the semantics you asked for, write y = x ? x : 1, or x + !x as suggested in another answer, which asks the compiler to materialize a boolean integer and add it.
Of course, a smart compiler will hopefully pick whichever way is efficient on the target machine, regardless of how you write the source. For x86, one might hope for something like this, if it's ok to overwrite x:
cmp eax, 1 ; CF=1 only if EAX==0 because only 0 is < 1U
adc eax, 0 ; x += (eax < 1U)
While for AArch64, tst or cmp and then cinc can conditionally copy-and-increment a value depending on any flags condition.
On the Godbolt compiler explorer -
int make_nonzero_booleanize(int x){
return x + !x;
}
int make_nonzero_ternary(int x){
return x ? x : 1;
}
int make_nonzero_if(int x){
// the original. Compilers may compile it to branchless code;
// gcc -O3 is more aggressive about if-conversion than -O2
int y;
if (x)
y = x;
else
y = 1;
return y;
}
# GCC12 -O3 for x86-64
make_nonzero_booleanize(int): # return x + !x
cmp edi, 1
mov eax, edi
adc eax, 0 # Apparently a GCC dev already thought of this peephole and taught GCC10 about it
# I was expecting compilers to naively do xor-zeroing + test/setcc + add
ret
make_nonzero_ternary(int):
mov eax, edi
test edi, edi
mov edx, 1
cmove eax, edx # standard ternary using conditional-move
ret
make_nonzero_if(int):
# same asm as ternary; GCC did an if-conversion optimization
// ARM64 gcc12 -O2
make_nonzero_booleanize(int):
cmp w0, 0
cinc w0, w0, eq // return x==0 ? x+1 : x
ret
make_nonzero_ternary(int):
cmp w0, 0
csinc w0, w0, wzr, ne // return x!=0 ? x : 0+1
ret
make_nonzero_if(int):
// identical to ternary, if-conversion even with -O2
Both ways are equally good; the csinc result still has a data dependency on the input w0 even when the result comes from incrementing the zero register (wzr).
# RISC-V 32-bit clang 15 -O3
make_nonzero_booleanize(int):
seqz a1, a0 # tmp = (x==0)
add a0, a0, a1 # return x + tmp
ret
make_nonzero_ternary(int):
mv a1, a0 # tmp = x
li a0, 1 # retval = 1
beqz a1, .LBB1_2 # if (x != 0) {
mv a0, a1 # retval = tmp
.LBB1_2: # }
ret # return retval
make_nonzero_if(int):
# identical to ternary, same branching.
Clang didn't even do a good job with the branching strategy it picked here; it could have used a bnez to branch over a retval=1, avoiding both mv instructions. Unless there's some tuning reason, where a branch over an mv is handled by some RISC-V microarchitectures as a conditional move? Hopefully after inlining it wouldn't be this bad, and the branch target for the ==0 special case could go somewhere else, so the common path didn't have to jump over it.
So x + !x is better with some real-world compilers on x86-64 and RISC-V, and equal on AArch64. If you care about this in some specific context with a specific compiler for some ISA, look at how it compiles in your code (assuming you know enough asm to judge what's efficient).
Older GCC versions:
GCC7 and earlier do better with the ternary, avoiding the mov eax, edi by instead doing mov eax, 1 and then using cmovnz to conditionally copy EDI. Hopefully the inefficiency in GCC8 and later only happens when not inlining this tiny function; GCC is known to waste instructions when it has to work around hard register constraints from calling conventions or inline asm.
GCC9 and earlier don't know the cmp/adc trick for x86-64, instead using a literal implementation of x + !x, which is unfortunately not great because x86 sucks at materializing a boolean condition as a 0/1 integer, due to the poor design of setcc.
# GCC9 and earlier -O3
make_nonzero_booleanize(int):
xor eax, eax # zero the full reg
test edi, edi # set FLAGS according to x
sete al # EAX = !x
add eax, edi # EAX += x
ret
To be fair, adc eax, 0 costs 2 uops on Intel before Sandybridge. (And GCC tuning choices maybe didn't realize that adc with immediate 0 was special-cased; in general adc costs 2 uops on Intel before Broadwell.)
Branchless might be slower if x == 0 is very rare
Even in the best case, the branchless code for x86-64, AArch64, and RISC-V lengthens the critical-path dependency chain by 2 instructions, each with 1-cycle latency on most CPUs. (For example, csinc can't run until the flags result of cmp is ready.)
But a correctly predicted branch is a control dependency, so it's handled separately by the CPU, optimistically assuming that y=x will be the case, only bailing out if it later detects a mispredict.
But writing an if() in the C source doesn't guarantee you'll get branchy asm. (Though that, or an ||, makes it more likely to compile branchy than other ways of writing it.)