Branch Prediction at no cost?
I've just stumbled upon this, and I'm really curious whether modern CPUs (current desktop ones, and perhaps mobile/embedded ones as well) actually incur no branching cost in the situation below.

1. Let's say we have this:

x += a; // let's assume they are both declared earlier as simple ints  
if (flag)  
   do A  // let's assume A is not the same as B  
else  
   do B  // and of course B is different than A  

2. Compared to this:

if (flag)  
{  
  x += a   
  do A  
}  
else  
{  
   x += a  
   do B  
}

Assuming A and B are completely different in terms of pipeline instructions (fetch, decode, execute, etc.):

  1. Is the 2nd approach going to be faster?

  2. Are CPUs smart enough to tell that no matter what the flag is, the next instruction is the same (so they won't have to discard pipeline stages for it because of a branch misprediction)?

Note:

In the first case the CPU has no option but to discard the first few pipeline stages of do A or do B if a branch misprediction happened, because they are different. I see the 2nd example as a kind of delayed branching: "I'm going to check that flag, but even if I don't know it yet, I can get on with the next instruction, because it is the same no matter what the flag is; I already have the next instruction and it's OK for me to use it."

EDIT:
I did some research and I have some nice results. How would you explain this behavior? Sorry for my latest edit, but as far as I could tell I had some cache problems; these should be more accurate results and code samples.

Here is the code, compiled with gcc version 4.8.2 (Ubuntu 4.8.2-19ubuntu1) using -O3.

Case 1.

#include <stdio.h>

extern int * cache;
extern bool * b;
extern int * x;
extern int * a;
extern unsigned long * loop;

extern void A();
extern void B();

int main()
{
    for (unsigned long i = 0; i < *loop; ++i)
    {
        ++*cache;

        *x += *a;

        if (*b)
        {
            A();
        }
        else
        {
            B();
        }
    }

    delete b;
    delete x;
    delete a;
    delete loop;
    delete cache;

    return 0;
}

int * cache = new int(0);
bool * b = new bool(true);
int * x = new int(0);
int * a = new int(0);
unsigned long * loop = new unsigned long(0x0ffffffe);

void A() { --*x; *b = false; }
void B() { ++*x; *b = true; }

Case 2.

#include <stdio.h>

extern int * cache;
extern bool * b;
extern int * x;
extern int * a;
extern unsigned long * loop;

extern void A();
extern void B();

int main()
{
    for (unsigned long i = 0; i < *loop; ++i)
    {
        ++*cache;

        if (*b)
        {
            *x += *a;
            A();
        }
        else
        {
            *x += *a;
            B();
        }
    }

    delete b;
    delete x;
    delete a;
    delete loop;
    delete cache;

    return 0;
}

int * cache = new int(0);
bool * b = new bool(true);
int * x = new int(0);
int * a = new int(0);
unsigned long * loop = new unsigned long(0x0ffffffe);

void A() { --*x; *b = false; }
void B() { ++*x; *b = true; }

There is a pretty much unnoticeable difference between the -O3 versions of the two approaches, but without -O3 the second case does run slightly faster, at least on my machine. I tested without -O3 and with loop = 0xfffffffe.
Best times:
alin@ubuntu:~/Desktop$ time ./1

real 0m20.231s
user 0m20.224s
sys 0m0.020s

alin@ubuntu:~/Desktop$ time ./2

real 0m19.932s
user 0m19.890s
sys 0m0.060s

Motherwort answered 27/9, 2015 at 9:55 Comment(7)
Such things are generally optimized by compilers, not at execution/CPU level.Saxony
I suspect the compiler's optimizer would do its job and factor that out to yield the same code.Entreaty
PS: thank you for the code edit (it's my very first post, sorry about that). So in other words, I could write case 2 as case 1 and trust the compiler to notice this?Motherwort
@Calvin Factoring out the common code would defeat the optimization attempt.Lolanthe
@AlinIonutLipan: I haven't seen compilers on x86 machines doing this (transforming case 1 into case 2), but I have seen it on RISC machines decades ago (though not exactly like this), and it was indeed being done by the compiler. Generally speaking, you cannot depend on compiler optimization too much, but this one is a relatively simple and obvious peephole optimization. I'd recommend always writing case 1 though, as it is easier for the compiler to handle.Gingras
Note that, in general, case 1 and case 2 are NOT equivalent, as it is possible that pointers 'x' and 'b' point to the same memory location. If so, this would mean that case 1 updates the value then checks it, whereas case 2 checks the value then updates it. A compiler is NOT allowed to make this type of transformation unless either (a) it can prove there is no aliasing (which it might be able to do here in this simple case), or (b) you've promised it that your pointers don't alias, using 'restrict' in C / '__restrict'/'__restrict__' in C++, or using a compiler option (see the sketch after this comment thread).Jagatai
Hi Ian, I'm just wondering how x and b could point to the same memory location, since each of them gets its memory from its own new call? Could you please be more specific on this?Motherwort
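
For reference, a minimal sketch of option (b) from the comment above, using hypothetical names that are not from the original post. With GCC/Clang's '__restrict__' extension ('restrict' in C99) the programmer promises that the pointers never alias, so the compiler is allowed to hoist the common '*x += *a' out of both branches:

// Hypothetical helper, not from the original post. '__restrict__' promises the
// compiler that x, a and b never refer to the same memory, so reading *b before
// or after '*x += *a' yields the same value and the addition can be hoisted.
void step(int* __restrict__ x, const int* __restrict__ a, bool* __restrict__ b)
{
    if (*b)
    {
        *x += *a;
        // do A
    }
    else
    {
        *x += *a;
        // do B
    }
}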
There are two parts to this:

First, does the compiler optimize this?

Let's run an experiment:

test.cc

#include <random>
#include "test2.h"

int main() {
  std::default_random_engine e;
  std::uniform_int_distribution<int> d(0,1);
  int flag = d(e);

  int x = 0;
  int a = 1;

  if (flag) {
    x += a;
    doA(x);
    return x;
  } else {
    x += a;
    doB(x);
    return x;
  }
}

test2.h

void doA(int& x);
void doB(int& x);

test2.cc

void doA(int& x) {}
void doB(int& x) {}

test2.cc and test2.h both exist solely to prevent the compiler from optimizing away everything. The compiler cannot be certain that there isn't a side effect because these functions exist in another translation unit.

Now we compile to assembly:

gcc -std=c++11 -S test.cc

And let's jump to the part of the assembly that's interesting:

  call  _ZNSt24uniform_int_distributionIiEclISt26linear_congruential_engineImLm16807ELm0ELm2147483647EEEEiRT_
  movl  %eax, -40(%rbp); <- setting flag
  movl  $0, -44(%rbp);   <- setting x
  movl  $1, -36(%rbp);   <- setting a
  cmpl  $0, -40(%rbp);   <- first part of if (flag)
  je    .L2;             <- second part of if (flag)
  movl  -44(%rbp), %edx  <- setting up x
  movl  -36(%rbp), %eax  <- setting up a
  addl  %edx, %eax       <- adding x and a
  movl  %eax, -44(%rbp)  <- assigning back to x
  leaq  -44(%rbp), %rax  <- grabbing address of x
  movq  %rax, %rdi       <- bookkeeping for function call
  call  _Z3doARi         <- function call doA
  movl  -44(%rbp), %eax
  jmp   .L4
.L2:
  movl  -44(%rbp), %edx  <- setting up x
  movl  -36(%rbp), %eax  <- setting up a
  addl  %edx, %eax       <- perform the addition
  movl  %eax, -44(%rbp)  <- move it back to x
  leaq  -44(%rbp), %rax  <- and so on
  movq  %rax, %rdi
  call  _Z3doBRi
  movl  -44(%rbp), %eax
.L4:

So we can see that the compiler didn't optimize it. But we also didn't actually ask it to.

g++ -std=c++11 -S -O3 test.cc

and then the interesting assembly:

main:
.LFB4729:
  .cfi_startproc
  subq  $56, %rsp
  .cfi_def_cfa_offset 64
  leaq  32(%rsp), %rdx
  leaq  16(%rsp), %rsi
  movq  $1, 16(%rsp)
  movq  %fs:40, %rax
  movq  %rax, 40(%rsp)
  xorl  %eax, %eax
  movq  %rdx, %rdi
  movl  $0, 32(%rsp)
  movl  $1, 36(%rsp)
  call  _ZNSt24uniform_int_distributionIiEclISt26linear_congruential_engineImLm16807ELm0ELm2147483647EEEEiRT_RKNS0_10param_typeE
  testl %eax, %eax
  movl  $1, 12(%rsp)
  leaq  12(%rsp), %rdi
  jne   .L83
  call  _Z3doBRi
  movl  12(%rsp), %eax
.L80:
  movq  40(%rsp), %rcx
  xorq  %fs:40, %rcx
  jne   .L84
  addq  $56, %rsp
  .cfi_remember_state
  .cfi_def_cfa_offset 8
  ret
.L83:
  .cfi_restore_state
  call  _Z3doARi
  movl  12(%rsp), %eax
  jmp   .L80

This is a bit beyond my ability to cleanly show a 1-to-1 relationship between the assembly and the code, but you can tell from the calls to doA and doB that the setup is all common and done outside of the if statement (above the line jne .L83). So yes, compilers do perform this optimization.
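
Reading that assembly back into C++ (a hand-written approximation reusing the names from test.cc above, not actual compiler output), the -O3 version of main behaves roughly like this:

// Rough shape of the optimized main; names are the ones from test.cc above.
int flag = d(e);               // the call to the distribution
int x = 1;                     // 'x = 0; a = 1; x += a;' folded into a single store
                               // of 1, performed once, before the branch
if (flag) doA(x); else doB(x); // only the call itself depends on the flag
return x;                      // reload x after the call and return it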

Part 2:

How can we know if CPUs do this optimization if given the first code?

I'm actually not aware of a way to test this. So I do not know. I'd rate it as plausible given that out-of-order and speculative execution exist. But the proof is in the pudding, and I don't have a way of testing this pudding. So I'm reluctant to make a claim one way or another.
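
That said, a rough way to poke at it (a sketch under stated assumptions, not a conclusive test) would be to make the branch genuinely unpredictable, keep the common statement duplicated in both arms, and compare against a variant with the statement hoisted above the if. Compiled at -O0, so the compiler does not hoist it for you, and run under something like 'perf stat -e branch-misses' on Linux (assuming perf is available), it would at least show whether the timing or the misprediction counts differ:

#include <chrono>
#include <cstdio>
#include <random>
#include <vector>

int main()
{
    // Unpredictable 50/50 flags so the branch predictor cannot learn a pattern.
    std::mt19937 gen(42);
    std::bernoulli_distribution coin(0.5);
    std::vector<char> flags(1 << 24);
    for (auto& f : flags) f = coin(gen);

    int x = 0;
    int a = 1;

    auto t0 = std::chrono::steady_clock::now();
    for (char f : flags)
    {
        // Case-2 shape: the common statement duplicated in both arms.
        // For the comparison run, move 'x += a;' above the 'if' (case-1 shape).
        if (f) { x += a; --x; }   // "do A"
        else   { x += a; ++x; }   // "do B"
    }
    auto t1 = std::chrono::steady_clock::now();

    // Printing x keeps the loop from being discarded at higher -O levels.
    std::printf("x = %d, %lld ms\n", x,
                (long long)std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count());
    return 0;
}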

Hadsall answered 27/9, 2015 at 10:36 Comment(8)
The same explanation with equivalent C code would be less confusing.Deductive
The only real differences would be lack of name mangling and different random function name calls. This is fine imo. I skipped over most of the setup in both cases.Hadsall
Thank you for your answer, and yes, I understand that we should always write case 1 with no fuss. I was wondering whether it is possible for case 2 to be faster than case 1 (let's assume the compiler knows nothing about the values, let's assume we had pointers all over the place and the compiler can't know the side effects just yet). Without knowing those, how could it possibly optimize case 1? I'm gonna do some testing myself and see if case 2 can be any faster and if so, by how much.Motherwort
I've only tested case 2 to show that it will compile to something semantically equivalent to case 1. With the limited example you gave, I can't see how case 2 could possibly be faster than case 1 (only equal to). Perhaps you can give more detail?Hadsall
That's what I mean: name mangling is confusing to non-C++ programmers, and the question is tagged C as well; flag = rand(); would be simple enough.Deductive
@Deductive If you propose an edit with these changes (c code and commented generated assembly), I'll be happy to accept it. I just went into c++ mode because it's my default.Hadsall
Aside from the language choice, your answer is interesting: the compiler can optimize option 2 into code similar to option 1, but it does not really address the OP's question about whether correctly predicted branches are transparent to the CPU's pipeline handling regarding register dependencies. What you are exhibiting is that g++ seems to make this assumption, not whether it is correct, nor to what extent.Deductive
@chqrlie, thanks. Mostly I wrote it because people were saying things like "Compilers don't optimize that," which I knew from previous experience was incorrect. Unfortunately, I could not think of a convincing way to demonstrate the internal behavior of my CPU, thus I could not adequately answer OP's direct question. It seems reasonable to think that g++ will do the "right thing" from a performance perspective. But I cannot prove it, of course.Hadsall

Back in the day CPUs explicitly supported something a bit like this - after a branch instruction the next instruction would always be executed whether or not the branch was actually taken (look up "branch delay slot").

I'm pretty sure modern CPUs just dump the whole pipeline on a branch mispredict. There's no point attempting to do the optimisation you suggest at execution time when the compiler can easily do it at compilation time.

Solfatara answered 27/9, 2015 at 10:17 Comment(3)
Ah, I was just trying to remember the name "delay slot" to post almost exactly the same answer as yours. :DGingras
Thank you, I didn't know about the delay slot; that seems to be exactly the info I was missing :) So I see no point, then, in writing the unclean case 2.Motherwort
Write whatever is clearest in the circumstances - which will usually be 1.Solfatara
