Kcachegrind/callgrind is inaccurate for dispatcher functions?

I have a model code on which kcachegrind/callgrind reports strange results. It is kind of dispatcher function. The dispatcher is called from 4 places; each call says, which actual do_J function to run (so the first2 will call only do_1 and do_2 and so on)

Source (this is a model of actual code)

#define N 1000000

int a[N];
int do_1(int *a) { int i; for(i=0;i<N/4;i++) a[i]+=1; }
int do_2(int *a) { int i; for(i=0;i<N/2;i++) a[i]+=2; }
int do_3(int *a) { int i; for(i=0;i<N*3/4;i++) a[i]+=3; }
int do_4(int *a) { int i; for(i=0;i<N;i++) a[i]+=4; }

int dispatcher(int *a, int j) {
    if(j==1) do_1(a);
    else if(j==2) do_2(a);
    else if(j==3) do_3(a);
    else do_4(a);
}

int first2(int *a) { dispatcher(a,1); dispatcher(a,2); }
int last2(int *a) { dispatcher(a,4); dispatcher(a,3); }
int inner2(int *a) { dispatcher(a,2); dispatcher(a,3); }
int outer2(int *a) { dispatcher(a,1); dispatcher(a,4); }

int main(){
    first2(a);
    last2(a);
    inner2(a);
    outer2(a);
}

Compiled with gcc -O0; Callgrinded with valgrind --tool=callgrind; kcachegrinded with kcachegrind and qcachegrind-0.7.

Here is a full callgraph of the application. All paths to do_J go through dispatcher and this is good (the do_1 is just hided as too fast, but it is here really, just left to do_2)

Full

Lets focus on do_1 and check, who called it (this picture is incorrect):

enter image description here

And this is very strange, I think, only first2 and outer2 called do_1 but not all.

Is it a limitation of callgrind/kcachegrind? How can I get accurate callgraph with weights (proportional to running time of every function, with and without its childs)?

Recommended topics

Hot tags