I've just stumbled upon this, and I'm really curious whether modern CPUs (current desktop ones, and maybe mobile/embedded ones as well) actually incur a branching cost in the situation below.
1. Let's say we have this:

x += a; // assume both are declared earlier as simple ints
if (flag)
    do A // assume A is not the same as B
else
    do B // and of course B is different from A

2. Compared to this:

if (flag)
{
    x += a
    do A
}
else
{
    x += a
    do B
}
Assuming A and B are completely different in terms of pipeline instructions (fetch, decode, execute, etc.):

Is the second approach going to be faster?

Are CPUs smart enough to tell that, no matter what the flag is, the next instruction is the same (so they won't have to discard pipeline stages for it on a branch misprediction)?
Note:
In the first case the CPU has no option but to discard the first few pipeline stages of do A or do B if a branch misprediction happens, because they are different. I see the second example as a kind of delayed branching: "I'm going to check that flag, but even before I know it, I can get on with the next instruction, because it's the same no matter what the flag is; I already have the next instruction, and it's fine to use it."
EDIT:
I did some research and got some interesting results. How would you explain this behavior? Sorry for my latest edit, but as far as I can tell I had some caching problems before; these results and code samples should be more accurate.
Here is the code, compiled with gcc version 4.8.2 (Ubuntu 4.8.2-19ubuntu1) using -O3.
Case 1.
#include <stdio.h>

extern int * cache;
extern bool * b;
extern int * x;
extern int * a;
extern unsigned long * loop;
extern void A();
extern void B();

int main()
{
    for (unsigned long i = 0; i < *loop; ++i)
    {
        ++*cache;
        *x += *a;
        if (*b)
        {
            A();
        }
        else
        {
            B();
        }
    }
    delete b;
    delete x;
    delete a;
    delete loop;
    delete cache;
    return 0;
}

int * cache = new int(0);
bool * b = new bool(true);
int * x = new int(0);
int * a = new int(0);
unsigned long * loop = new unsigned long(0x0ffffffe);

void A() { --*x; *b = false; }
void B() { ++*x; *b = true; }
Case 2
#include <stdio.h>

extern int * cache;
extern bool * b;
extern int * x;
extern int * a;
extern unsigned long * loop;
extern void A();
extern void B();

int main()
{
    for (unsigned long i = 0; i < *loop; ++i)
    {
        ++*cache;
        if (*b)
        {
            *x += *a;
            A();
        }
        else
        {
            *x += *a;
            B();
        }
    }
    delete b;
    delete x;
    delete a;
    delete loop;
    delete cache;
    return 0;
}

int * cache = new int(0);
bool * b = new bool(true);
int * x = new int(0);
int * a = new int(0);
unsigned long * loop = new unsigned long(0x0ffffffe);

void A() { --*x; *b = false; }
void B() { ++*x; *b = true; }
There is pretty much no noticeable difference between the -O3 versions of the two approaches, but without -O3 the second case does run slightly faster, at least on my machine.
I tested without -O3 and with loop = 0xfffffffe.
Best times:
alin@ubuntu:~/Desktop$ time ./1
real 0m20.231s
user 0m20.224s
sys 0m0.020s
alin@ubuntu:~/Desktop$ time ./2
real 0m19.932s
user 0m19.890s
sys 0m0.060s