Let A
be an array that contains an odd number of zeros and ones. If n
is the size of A
, then A
is constructed such that the first ceil(n/2)
elements are 0
and the remaining elements 1
.
So if n = 9
, A
would look like this:
0,0,0,0,0,1,1,1,1
The goal is to find the sum of 1s
in the array and we do this by using this function:
s = 0;
void test1(int curIndex){
//A is 0,0,0,...,0,1,1,1,1,1...,1
if(curIndex == ceil(n/2)) return;
if(A[curIndex] == 1) return;
test1(curIndex+1);
test1(size-curIndex-1);
s += A[curIndex+1] + A[size-curIndex-1];
}
This function is rather silly for the problem given, but it's a simulation of a different function that I want to look like this and is producing the same amount of branch mispredictions.
Here is the entire code of the experiment:
#include <iostream>
#include <fstream>
using namespace std;
int size;
int *A;
int half;
int s;
void test1(int curIndex){
//A is 0,0,0,...,0,1,1,1,1,1...,1
if(curIndex == half) return;
if(A[curIndex] == 1) return;
test1(curIndex+1);
test1(size - curIndex - 1);
s += A[curIndex+1] + A[size-curIndex-1];
}
int main(int argc, char* argv[]){
size = atoi(argv[1]);
if(argc!=2){
cout<<"type ./executable size{odd integer}"<<endl;
return 1;
}
if(size%2!=1){
cout<<"size must be an odd number"<<endl;
return 1;
}
A = new int[size];
half = size/2;
int i;
for(i=0;i<=half;i++){
A[i] = 0;
}
for(i=half+1;i<size;i++){
A[i] = 1;
}
for(i=0;i<100;i++) {
test1(0);
}
cout<<s<<endl;
return 0;
}
Compile by typing g++ -O3 -std=c++11 file.cpp
and run by typing ./executable size{odd integer}
.
I am using an Intel(R) Core(TM) i5-3470 CPU @ 3.20GHz with 8 GB of RAM, L1 cache 256 KB, L2 cache 1 MB, L3 cache 6 MB.
Running perf stat -B -e branches,branch-misses ./cachetests 111111
gives me the following:
Performance counter stats for './cachetests 111111':
32,639,932 branches
1,404,836 branch-misses # 4.30% of all branches
0.060349641 seconds time elapsed
if I remove the line
s += A[curIndex+1] + A[size-curIndex-1];
I get the following output from perf:
Performance counter stats for './cachetests 111111':
24,079,109 branches
39,078 branch-misses # 0.16% of all branches
0.027679521 seconds time elapsed
What does that line have to do with branch predictions when it's not even an if statement?
The way I see it, in the first ceil(n/2) - 1
calls of test1()
, both if statements will be false. In the ceil(n/2)-th
call, if(curIndex == ceil(n/2))
will be true. In the remaining n-ceil(n/2)
calls, the first statement will be false, and the second statement will be true.
Why does Intel fail to predict such a simple behavior?
Now let's look at a second case. Suppose that A
now has alternating zeros and ones. We will always start from 0. So if n = 9
A
will look like this:
0,1,0,1,0,1,0,1,0
The function we are going to use is the following:
void test2(int curIndex){
//A is 0,1,0,1,0,1,0,1,....
if(curIndex == size-1) return;
if(A[curIndex] == 1) return;
test2(curIndex+1);
test2(curIndex+2);
s += A[curIndex+1] + A[curIndex+2];
}
And here is the entire code of the experiment:
#include <iostream>
#include <fstream>
using namespace std;
int size;
int *A;
int s;
void test2(int curIndex){
//A is 0,1,0,1,0,1,0,1,....
if(curIndex == size-1) return;
if(A[curIndex] == 1) return;
test2(curIndex+1);
test2(curIndex+2);
s += A[curIndex+1] + A[curIndex+2];
}
int main(int argc, char* argv[]){
size = atoi(argv[1]);
if(argc!=2){
cout<<"type ./executable size{odd integer}"<<endl;
return 1;
}
if(size%2!=1){
cout<<"size must be an odd number"<<endl;
return 1;
}
A = new int[size];
int i;
for(i=0;i<size;i++){
if(i%2==0){
A[i] = false;
}
else{
A[i] = true;
}
}
for(i=0;i<100;i++) {
test2(0);
}
cout<<s<<endl;
return 0;
}
I run perf using the same commands as before:
Performance counter stats for './cachetests2 111111':
28,560,183 branches
54,204 branch-misses # 0.19% of all branches
0.037134196 seconds time elapsed
And removing that line again improved things a little bit:
Performance counter stats for './cachetests2 111111':
28,419,557 branches
16,636 branch-misses # 0.06% of all branches
0.009977772 seconds time elapsed
Now if we analyse the function, if(curIndex == size-1)
will be false n-1
times, and if(A[curIndex] == 1)
will alternate from true to false.
As I see it, both functions should be easy to predict, however this is not the case for the first function. At the same time I am not sure what is happening with that line and why it plays a role in improving branch behavior.
curIndex
ifcurIndex
is not pointing to the last0
and also is not pointing to a1
. If the array is indexed from0
, the second last0
will be in position(floor(n/2) - 1)
and the highest jump we will make is going to be towardsn-(floor(n/2) - 1)-1 = n - floor(n/2)
which should point to the element after the last0
. If we are in position0
, we will jump to(n-0-1)
which will point to the last element in the array. As for the second function, we do the same, when we reach the last0
, the index will be equal ton-1
so we will stop. – Surgeonfish-O2
? It looks like-O3
results in a lot of brancy recursion unrolling, and I'm curious if that makes a significant difference here. – Zamudio-O2
the branch misses of the first method increases to~12%
. The second function stays under1%
. – SurgeonfishcurIndex == size - 2
, you will requesttest1(size - 1) + test1(size)
. Have you looked at the assembly your code produces? – Narakaif
s are unlikely, we can improve the assembly. before: godbolt.org/g/tvywxe, after: godbolt.org/g/M9nc4x – Naraka