Is using the result of comparison as int really branchless?
Asked Answered
G

2

8

I have seen code like this in many answers, and the authors say this is branchless:

template <typename T> 
inline T imax (T a, T b)
{
    return (a > b) * a + (a <= b) * b;
}

But is this really branchless on current architectures? (x86, ARM...) And is there a real standard guarantee that this is branchless?

Gerardgerardo answered 8/12, 2015 at 15:1 Comment(5)
Why do you need such guarantee?Abut
The C++ standard guarantees nothing about the machine code (or even that there is machine code), except that it, or whatever is used instead, will reproduce the effects required by the semantics of the C++ statements. These effects are changes to memory contents and calls of library functions.Disfavor
It is not branchless on a number of embedded processors in 2015.Seacoast
What Alf said. Note that there is also the variant where you use an array, const T v[] = {b, a}; return v[a>b], which, depending on the predictability of the code, may run even faster. However, only do this kind of microoptimization if you have measured relevance.Kitty
@Abut : Because I want to do heavy premature optimizations & over-engineering.Gerardgerardo
V
7

x86 has the SETcc family of instructions which set a byte register to 1 or 0 depending on the value of a flag. This is commonly used by compilers to implement this kind of code without branches.

If you use the “naïve” approach

int imax(int a, int b) {
    return a > b ? a : b;
}

The compiler would generate even more efficient branch-less code using the CMOVcc (conditional move) family of instructions.

ARM has the ability to conditionally execute every instruction which allowed the compiler to compile both your and the naïve implementation efficiently, the naïve implementation being faster.

Variegated answered 8/12, 2015 at 15:14 Comment(8)
So in other words: by attempting to manually optimize the code, instead of writing as readable and simple code as possible, the code turned both uglier and slower.Zaller
Which is not surprising at all. Compiler writers generally know chipset instructions much better than C++ programmers, by definition.Splint
That's a good answer for the platform part, but is there really a guarantee to be branchless in the standard? And I suppose that using if/else will work as your "naïve" approach, which basically means that these pseudo optimized answers I'm citing as degrading performance compared to std functions (min, max, ...)Gerardgerardo
@Gerardgerardo I'm not familiar with the C++ standard but if it's anything like the C standard, it makes absolutely now guarantees about how something is implemented. From the perspective of the standard, even addition might be implemented with branching.Variegated
@Gerardgerardo I removed the sentence you added to the answer as I'm not familiar with the C++ standard and don't want to make any claims about it.Variegated
I'm not sure the x86 solution would work on a non-integer version of this function.Turnover
@Turnover When an 387 FPU is used, it's possible. I'm not sure about MMX and SSE though.Variegated
@Turnover And for SSE, there's the minsd instruction which does that in one operation.Variegated
H
0

I stumbled upon this SO question because I was asking me the same… turns out it’s not always. For instance, the following code…

const struct op {
    const char *foo;
    int bar;
    int flags;
} ops[] = {
    { "foo", 5, 16 },
    { "bar", 9, 16 },
    { "baz", 13, 0 },
    { 0, 0, 0 }
};

extern int foo(const struct op *, int);

int
bar(void *a, void *b, int c, const struct op *d)
{
    c |= (a == b) && (d->flags & 16);
    return foo(d, c) + 1;
}

… compiles to branching code with both gcc 3.4.6 (i386) and 8.3.0 (amd64, i386) in all optimisation levels. The one from 3.4.6 is more manually legibe, I’ll demonstrate with gcc -O2 -S -masm=intel x.c; less x.s:

[…]
    .text
    .p2align 2,,3
    .globl   bar
    .type    bar , @function
bar:
    push     %ebp
    mov      %ebp, %esp
    push     %ebx
    push     %eax
    mov      %eax, DWORD PTR [%ebp+12]
    xor      %ecx, %ecx
    cmp      DWORD PTR [%ebp+8], %eax
    mov      %edx, DWORD PTR [%ebp+16]
    mov      %ebx, DWORD PTR [%ebp+20]
    je       .L4
.L2:
    sub      %esp, 8
    or       %edx, %ecx
    push     %edx
    push     %ebx
    call     foo
    inc      %eax
    mov      %ebx, DWORD PTR [%ebp-4]
    leave
    ret
    .p2align 2,,3
.L4:
    test     BYTE PTR [%ebx+8], 16
    je       .L2
    mov      %cl, 1
    jmp      .L2
    .size    bar , . - bar

Turns out the pointer comparison operation invokes a comparison and even a subroutine to insert 1.

Not even using !!(a == b) makes a difference here.

tl;dr

Check the actual compiler output (assembly with -S or disassembly with objdump -d -Mintel x.o; drop the -Mintel if not on x86, it merely makes the assembly more legible) of the actual compilation; compilers are unpredictable beasts.

Hurtful answered 2/8, 2019 at 18:6 Comment(4)
Note that if you compile for 32 bit, you might need to specify a sufficiently recent architecture (e.g. -march=i686) because the required branch-free instructions were only added some 25 years ago.Variegated
And note that in this specific case, branch-free code cannot be used because the compiler cannot assume that it is allowed to dereference d->flags if a != b. Thus, it has to emit a branch, guarding this dereference.Variegated
The culprit here I suspect is the && logical operator, which needs branching for the short-circuit evaluation, rather than the actual comparison operator. Try c |= (a == b) & (d->flags >> 4 & 1); instead.Hirundine
@user16217248 aaaaaaah you’re right… and the compiler cannot generally make that equivalence because of the short-circuiting (d may be nil after all)… good point, and with that it also is branchless (cmp + sete) in GCC 3.4.6 even for i386.Hurtful

© 2022 - 2024 — McMap. All rights reserved.