I am trying to understand how does a branch prediction unit work in a CPU.
I have used papi
and also linux's perf-events
but both of them do not give accurate results (for my case).
This is my code:
void func(int* arr, int sequence_len){
for(int i = 0; i < sequence_len; i++){
// region starts
if(arr[i]){
do_sth();
}
// region ends
}
}
My array consists of 0's and 1's. It has a pattern with a size of sequence_len
. For example, if my size is 8, then it has a pattern of 0 1 0 1 0 0 1 1
or something like that.
Trial 1:
I am trying to understand how CPU predicts those branches. So, I have used papi and set up performance counter for branch predictions mispredicted (I know that it also counts indirect branches).
int func(){
papi_read(r1);
for(){
//... same as above
}
papi_read(r2);
return r2-r1;
}
int main(){
init_papi();
for(int i = 0; i < 10; i++)
res[i] = func();
print(res[i]);
}
What I see as an output is that (for sequence length of 200)
100 #iter1
40 #iter2
10 #iter3
3
0
0
#...
So, at first, the CPU blindly predicts the sequence, only success half of the time. In the next iterations, the CPU can predict better and better. After some amount of iterations, the CPU can guess that perfectly.
Trial 2
I would like to see, at which array index does the CPU misprediction.
int* func(){
int* results;
for(){
papi_read(r1);
if(arr[i])
do_sth();
papi_read(r2);
res[i] = r2-r1;
}
return res;
}
int main(){
init_papi();
for(int i = 0; i < 10; i++)
res[i] = func();
print(res[i]);
}
Expected result:
#1st iteration, 0 means no mispred, 1 means mispred
1 0 0 1 1 0 0 0 1 1 0... # total of 200 results
Mispred: 100/200
#2nd iteration
0 0 0 0 1 0 0 0 1 0 0... # total of 200 results
Mispred: 40/200 # it learned from previous iteration
#3rd iteration
0 0 0 0 0 0 0 0 1 0 0... # total of 200 results
Mispred: 10/200 # continues to learn
#...
Received result:
#1st iteration
1 0 0 1 1 0 0 0 1 1 0... # total of 200 results
Mispred: 100/200
#2nd iteration
1 0 0 0 1 1 0 1 0 0 0... # total of 200 results
Mispred: 100/200 # it DID NOT learn from previous iteration
#3rd iteration
0 1 0 1 0 1 0 1 1 0 0... # total of 200 results
Mispred: 100/200 # NO LEARNING
#...
My observation
When I measure the misprediction outside of the for loop, I can see that CPU learns from its mispredictions. However, when I try to measure single branch instructions misprediction, then the CPU either cannot learn, or I am measuring it wrongly.
My explanation
I am giving 200 as a sequence length. The CPU has one small branch predictor, like 2-3 bit saturated counter in Intels, and one big global branch predictor. When I measure outside of the loop, I introduce less noise to the measurement. By less noise, I mean the papi
calls.
Think about this: outside of the loop measurement
global history is: papi_start, branch_outcome1, branch_outcome2, branch_outcome3, ..., papi_end, papi_start (2nd loop of main iteration), branch_outcome1, ...
So, the branch predictor somehow finds the pattern in the same branch.
However, if I try to measure single branch instruction then the global history is:
papi_start, branchoutcome1, papiend, papistart, branchoutcome2, papiend...
So, I am introducing more and more branches to global history. I assume the global history cannot hold many branch entries and therefore, it cannot find any correlation/pattern in the desired if statement(branch).
As a result
I need to measure a single branch prediction outcome. I know that the CPU can learn the 200 pattern if I don't introduce papi too much. I have looked at the papi calls and I have seen lots of for loops, if conditions.
That is why I need better measurement. I have tried linux perf-event
but it makes ioctl
calls, which is a system call and I pollute the global history with system calls, and therefore, not a good measurement.
I have read that rdpmc
and rdmsr
instructions and I assume that since they are only instructions, I will not pollute the global history, and I can measure single branch instruction at a time.
However, I have no clue about how I can do that. I have AMD 3600 CPU. These are the links that I found online but I couldn't figure out how to do that. In addition to that, am I missing something?