How can I make branchless code and how does it work?
Asked Answered
T

3

37

Related to this answer.

In the above answer, it's mentioned how you can avoid branch prediction fails by avoiding branches.

The user demonstrates this by replacing:

if (data[c] >= 128)
{
    sum += data[c];
}

With:

int t = (data[c] - 128) >> 31;
sum += ~t & data[c];

How are these two equivalent (for the specific data set, not strictly equivalent)?

What are some general ways I can do similar things in similar situations? Would it always be by using >> and ~?

Tungstite answered 19/8, 2015 at 23:10 Comment(4)
hackersdelight.org is a nice collection of short functions, often exploiting clever bit-hacks. I think there's another collection that focuses more on bit-hacks like you're talking about, but I can't think of it atm.Teevens
Some compilers can replace the conditional operator ?: with a branchless cmov instruction.Pumping
hackersdelight.org appears to have died. Sad times.Concomitant
Fortunately, it's archived: web.archive.org/web/20190915025154/http://…Howl
L
42
int t = (data[c] - 128) >> 31;

The trick here is that if data[c] >= 128, then data[c] - 128 is nonnegative, otherwise it is negative. The highest bit in an int, the sign bit, is 1 if and only if that number is negative. >> is a shift that extends the sign bit, so shifting right by 31 makes the whole result 0 if it used to be nonnegative, and all 1 bits (which represents -1) if it used to be negative. So t is 0 if data[c] >= 128, and -1 otherwise. ~t switches these possibilities, so ~t is -1 if data[c] >= 128, and 0 otherwise.

x & (-1) is always equal to x, and x & 0 is always equal to 0. So sum += ~t & data[c] increases sum by 0 if data[c] < 128, and by data[c] otherwise.

Many of these tricks can be applied elsewhere. This trick can certainly be generally applied to have a number be 0 if and only if one value is greater than or equal to another value, and -1 otherwise, and you can mess with it some more to get <=, <, and so on. Bit twiddling like this is a common approach to making mathematical operations branch-free, though it's certainly not always going to be built out of the same operations; ^ (xor) and | (or) also come into play sometimes.

Luetic answered 19/8, 2015 at 23:21 Comment(1)
if data[c] >= 128, then ... - only if the subtraction doesn't have signed wraparound. e.g. INT_MIN - 128 wraps to become a positive integer. This is why signed conditions in asm check the Overflow and Sign flags, not just Sign. e.g. on x86, cmovge (greater or equal) checks SF == OF. Only looking at the sign bit is safe if data[c] is a narrower type that gets sign-extended to int, because that makes signed overflow impossible. Or in this case if you know that data[c] >= INT_MIN+128 will always be true.Teevens
T
15

While Louis Wasserman's answer is correct, I want to show you a more general (and much clearer) way to write branchless code. You can just use ? : operator:

    int t = data[c];
    sum += (t >= 128 ? t : 0);

JIT compiler sees from the execution profile that the condition is poorly predicted here. In such cases the compiler is smart enough to replace a conditional branch with a conditional move instruction:

    mov    0x10(%r14,%rbp,4),%r9d  ; load R9d from array
    cmp    $0x80,%r9d              ; compare with 128
    cmovl  %r8d,%r9d               ; if less, move R8d (which is 0) to R9d

You can verify yourself that this version works equally fast for both sorted and unsorted array.

Tuscarora answered 19/8, 2015 at 23:52 Comment(16)
Note that when branches are predictable, turning control dependencies into data dependencies on the critical path can be a bad thing. cmov puts both its inputs into the dependency chain. Also, good compilers can often convert simple if clauses into cmov.Teevens
Yes doing this improves the speed of both sorted and unsorted, but if it's unsorted it's not as effective as making it branchless. See results -> Unsorted = 11.3 seconds, ?: op = 9.18 seconds, Branchless sorted = 5.89 seconds, Branchless unsorted = 5.86, Sorted = 5.56 seconds and ?:op sorted = 4.77 secondsTungstite
@Tungstite What CPU and Java version do you run? On my PC the results are equal for both sorted and unsorted. Note parenthesis around the expression - they are necessary here for producing cmov (due to int-to-long conversion issues).Tuscarora
Java 1.8_51 i7 4930MX. Similar results with 1.8_60Tungstite
@Tungstite Seems like 32-bit Java. OK, try int t = data[c]; sum += (t >= 128 ? t : 0);Tuscarora
yes 32 bit java. That change makes it slower for sorted arrays to ~5.7 comparable but faster than branchless. But greatly speeds up the unsorted one to ~4.89 comparable to the original ?: op line you hadTungstite
? : ternary operator https://en.wikipedia.org/wiki/%3F: +1 for this answer as it is better readable!Withoutdoors
The problem with this approach is that you're relying on compiler optimization. So if you want to be sure that you can branchless code (e.g. to avoid timing attacks in cryptography code), using the ugly workaround is still a good idea (even if it isn't a 100% guarantee either).Pumping
? : is a branch, it just has the advantage over if() that it's an expression rather than a statement. An inline function with an if() and two returns would be basically equivalent to a ? :. If it results in branchless machine language, that's due to optimization. See @aki-suihkonen's answer https://mcmap.net/q/415340/-how-can-i-make-branchless-code-and-how-does-it-work for a longer discussion.Sharp
@Sharp I don't say it is not a branch. Neither I say it is. It is an operator, a language construction. It may or may not be compiled to a branch. In Java bytecode there is a branch (just because there are no other conditional bytecodes except branches). But what I say is that JIT compiler recognizes this particular bytecode pattern and translates it to a conditional move instruction.Tuscarora
@apangin: eww, 32bit java? Try using the server VM, if it still isn't the default in 32bit JVMs. It's the default in 64bit JVMs. It optimizes more.Teevens
? : is in no way a "more general" way to write branchless code, because it isn't a "way" to write branchless code at all. Semantically ? : creates a branch, you even say it's visible in the bytecode. The JIT may or may not turn branched code into branchless code.Sharp
Conditionals create branches, otherwise how does the conditional act on the condition? So maybe another way to ask the question is: How does one create code without conditionals?Sharp
@Sharp I completely agree with your point on conditionals. My answer is not about writing code without conditionals. However, when we talk about microarchitecture, a branch is a little bit different concept. For example, in ARM architecture almost every instruction can be conditional. You can think of conditional instructions as functions with an extra argument. So, in the context of branch prediction I think my answer still makes sense.Tuscarora
Though if you are not content with the wording, please feel free to edit my answer.Tuscarora
Interesting point about ARM conditional instructions. From a branch prediction standpoint, I guess you are right that they should be considered branchless. So strike my equating branchless with conditionless. I maintain that the syntax choice of if() vs ? : is irrelevant to the question asked.Sharp
C
11

Branchless code means typically evaluating all possible outcomes of a conditional statement with a weight from the set [0, 1], so that the Sum{ weight_i } = 1. Most of the calculations are essentially discarded. Some optimization can result from the fact, that E_i doesn't have to be correct when the corresponding weight w_i (or mask m_i) is zero.

  result = (w_0 * E_0) + (w_1 * E_1) + ... + (w_n * E_n)    ;; or
  result = (m_0 & E_0) | (m_1 & E_1) | ... | (m_n * E_n)

where m_i stands for a bitmask.

Speed can be achieved also through parallel processing of E_i with a horizontal collapse.

This is contradictory to the semantics of if (a) b; else c; or it's ternary shorthand a ? b : c, where only one expression out of [b, c] is evaluated.

Thus ternary operation is no magic bullet for branchless code. A decent compiler produces branchless code equally for

t = data[n];
if (t >= 128) sum+=t;

vs.

    movl    -4(%rdi,%rdx), %ecx
    leal    (%rax,%rcx), %esi
    addl    $-128, %ecx
    cmovge  %esi, %eax

Variations of branchless code include presenting the problem through other branchless non-linear functions, such as ABS, if present in the target machine.

e.g.

 2 * min(a,b) = a + b - ABS(a - b),
 2 * max(a,b) = a + b + ABS(a - b)

or even:

 ABS(x) = sqrt(x*x)      ;; caveat -- this is "probably" not efficient

In addition to << and ~, it may be equally beneficial to use bool and !bool instead of (possibly undefined) (int >> 31). Likewise, if the condition evaluates as [0, 1], one can generate a working mask with negation:

-[0, 1] = [0, 0xffffffff]  in 2's complement representation
Concelebrate answered 20/8, 2015 at 8:16 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.