Branchless code means typically evaluating all possible outcomes of a conditional statement with a weight from the set [0, 1], so that the Sum{ weight_i } = 1. Most of the calculations are essentially discarded. Some optimization can result from the fact, that E_i
doesn't have to be correct when the corresponding weight w_i
(or mask m_i
) is zero.
result = (w_0 * E_0) + (w_1 * E_1) + ... + (w_n * E_n) ;; or
result = (m_0 & E_0) | (m_1 & E_1) | ... | (m_n * E_n)
where m_i stands for a bitmask.
Speed can be achieved also through parallel processing of E_i with a horizontal collapse.
This is contradictory to the semantics of if (a) b; else c;
or it's ternary shorthand a ? b : c
, where only one expression out of [b, c] is evaluated.
Thus ternary operation is no magic bullet for branchless code. A decent compiler produces branchless code equally for
t = data[n];
if (t >= 128) sum+=t;
vs.
movl -4(%rdi,%rdx), %ecx
leal (%rax,%rcx), %esi
addl $-128, %ecx
cmovge %esi, %eax
Variations of branchless code include presenting the problem through other branchless non-linear functions, such as ABS, if present in the target machine.
e.g.
2 * min(a,b) = a + b - ABS(a - b),
2 * max(a,b) = a + b + ABS(a - b)
or even:
ABS(x) = sqrt(x*x) ;; caveat -- this is "probably" not efficient
In addition to <<
and ~
, it may be equally beneficial to use bool
and !bool
instead of (possibly undefined) (int >> 31). Likewise, if the condition evaluates as [0, 1], one can generate a working mask with negation:
-[0, 1] = [0, 0xffffffff] in 2's complement representation
?:
with a branchlesscmov
instruction. – Pumping