Is using the result of comparison as int really branchless?

Asked 8/12, 2015 at 15:1 Answered 2/8, 2019 at 18:6

Solved c++c comparison branch-prediction branchless

I have seen code like this in many answers, and the authors say this is branchless:

template <typename T> 
inline T imax (T a, T b)
{
    return (a > b) * a + (a <= b) * b;
}

But is this really branchless on current architectures? (x86, ARM...) And is there a real standard guarantee that this is branchless?

Gerardgerardo answered 8/12, 2015 at 15:1 Comment(5)

Why do you need such guarantee? – Abut 8/12, 2015 at 15:4

The C++ standard guarantees nothing about the machine code (or even that there is machine code), except that it, or whatever is used instead, will reproduce the effects required by the semantics of the C++ statements. These effects are changes to memory contents and calls of library functions. – Disfavor 8/12, 2015 at 15:9

It is not branchless on a number of embedded processors in 2015. – Seacoast 8/12, 2015 at 15:16

What Alf said. Note that there is also the variant where you use an array, const T v[] = {b, a}; return v[a>b], which, depending on the predictability of the code, may run even faster. However, only do this kind of microoptimization if you have measured relevance. – Kitty 8/12, 2015 at 15:20

@Abut : Because I want to do heavy premature optimizations & over-engineering. – Gerardgerardo 8/12, 2015 at 15:29

x86 has the SETcc family of instructions which set a byte register to 1 or 0 depending on the value of a flag. This is commonly used by compilers to implement this kind of code without branches.

If you use the “naïve” approach

int imax(int a, int b) {
    return a > b ? a : b;
}

The compiler would generate even more efficient branch-less code using the CMOVcc (conditional move) family of instructions.

ARM has the ability to conditionally execute every instruction which allowed the compiler to compile both your and the naïve implementation efficiently, the naïve implementation being faster.

Variegated answered 8/12, 2015 at 15:14 Comment(8)

So in other words: by attempting to manually optimize the code, instead of writing as readable and simple code as possible, the code turned both uglier and slower. – Zaller 8/12, 2015 at 15:24

Which is not surprising at all. Compiler writers generally know chipset instructions much better than C++ programmers, by definition. – Splint 8/12, 2015 at 15:25

That's a good answer for the platform part, but is there really a guarantee to be branchless in the standard? And I suppose that using if/else will work as your "naïve" approach, which basically means that these pseudo optimized answers I'm citing as degrading performance compared to std functions (min, max, ...) – Gerardgerardo 8/12, 2015 at 15:34

@Gerardgerardo I'm not familiar with the C++ standard but if it's anything like the C standard, it makes absolutely now guarantees about how something is implemented. From the perspective of the standard, even addition might be implemented with branching. – Variegated 8/12, 2015 at 15:35

@Gerardgerardo I removed the sentence you added to the answer as I'm not familiar with the C++ standard and don't want to make any claims about it. – Variegated 9/12, 2015 at 9:19

I'm not sure the x86 solution would work on a non-integer version of this function. – Turnover 23/12, 2015 at 0:46

@Turnover When an 387 FPU is used, it's possible. I'm not sure about MMX and SSE though. – Variegated 23/12, 2015 at 0:48

@Turnover And for SSE, there's the minsd instruction which does that in one operation. – Variegated 23/12, 2015 at 0:50

I stumbled upon this SO question because I was asking me the same… turns out it’s not always. For instance, the following code…

const struct op {
    const char *foo;
    int bar;
    int flags;
} ops[] = {
    { "foo", 5, 16 },
    { "bar", 9, 16 },
    { "baz", 13, 0 },
    { 0, 0, 0 }
};

extern int foo(const struct op *, int);

int
bar(void *a, void *b, int c, const struct op *d)
{
    c |= (a == b) && (d->flags & 16);
    return foo(d, c) + 1;
}

… compiles to branching code with both gcc 3.4.6 (i386) and 8.3.0 (amd64, i386) in all optimisation levels. The one from 3.4.6 is more manually legibe, I’ll demonstrate with gcc -O2 -S -masm=intel x.c; less x.s:

[…]
    .text
    .p2align 2,,3
    .globl   bar
    .type    bar , @function
bar:
    push     %ebp
    mov      %ebp, %esp
    push     %ebx
    push     %eax
    mov      %eax, DWORD PTR [%ebp+12]
    xor      %ecx, %ecx
    cmp      DWORD PTR [%ebp+8], %eax
    mov      %edx, DWORD PTR [%ebp+16]
    mov      %ebx, DWORD PTR [%ebp+20]
    je       .L4
.L2:
    sub      %esp, 8
    or       %edx, %ecx
    push     %edx
    push     %ebx
    call     foo
    inc      %eax
    mov      %ebx, DWORD PTR [%ebp-4]
    leave
    ret
    .p2align 2,,3
.L4:
    test     BYTE PTR [%ebx+8], 16
    je       .L2
    mov      %cl, 1
    jmp      .L2
    .size    bar , . - bar

Turns out the pointer comparison operation invokes a comparison and even a subroutine to insert 1.

Not even using !!(a == b) makes a difference here.

tl;dr

Check the actual compiler output (assembly with -S or disassembly with objdump -d -Mintel x.o; drop the -Mintel if not on x86, it merely makes the assembly more legible) of the actual compilation; compilers are unpredictable beasts.

Hurtful answered 2/8, 2019 at 18:6 Comment(4)

Note that if you compile for 32 bit, you might need to specify a sufficiently recent architecture (e.g. -march=i686) because the required branch-free instructions were only added some 25 years ago. – Variegated 28/9, 2020 at 7:50

And note that in this specific case, branch-free code cannot be used because the compiler cannot assume that it is allowed to dereference d->flags if a != b. Thus, it has to emit a branch, guarding this dereference. – Variegated 28/9, 2020 at 7:53

The culprit here I suspect is the && logical operator, which needs branching for the short-circuit evaluation, rather than the actual comparison operator. Try c |= (a == b) & (d->flags >> 4 & 1); instead. – Hirundine 12/11, 2023 at 19:55

@user16217248 aaaaaaah you’re right… and the compiler cannot generally make that equivalence because of the short-circuiting (d may be nil after all)… good point, and with that it also is branchless (cmp + sete) in GCC 3.4.6 even for i386. – Hurtful 13/11, 2023 at 21:57

tl;dr

Recommended topics

Hot tags