My orginial idea was to give an elegant code example, that would demonstrate the impact of instruction cache limitations. I wrote the following piece of code, that creates a large amount of identical functions, using template metaprogramming.
volatile int checksum;
void (*funcs[MAX_FUNCS])(void);
template <unsigned t>
__attribute__ ((noinline)) static void work(void) { ++checksum; }
template <unsigned t>
static void create(void) { funcs[t - 1] = &work<t - 1>; create<t - 1>(); }
template <> void create<0>(void) { }
int main()
{
create<MAX_FUNCS>();
for (unsigned range = 1; range <= MAX_FUNCS; range *= 2)
{
checksum = 0;
for (unsigned i = 0; i < WORKLOAD; ++i)
{
funcs[i % range]();
}
}
return 0;
}
The outer loop varies the amount of different functions to be called using a jump table. For each loop pass, the time taken to invoke WORKLOAD
functions is then measured. Now what are the results? The following chart shows the average run time per function call in relation to the used range. The blue line shows the data measured on a Core i7 machine. The comparative measurement, depicted by the red line, was carried out on a Pentium 4 machine. Yet when it comes to interpreting these lines, I seem to be somehow struggling...
The only jumps of the piecewise constant red curve occur exactly where the total memory consumption for all functions within range exceed the capacity of one cache level on the tested machine, which has no dedicated instruction cache. For very small ranges (below 4 in this case) however, run time still increases with the amount of functions. This may be related to branch prediction efficiency, but since every function call reduces to an unconditional jump in this case, I'm not sure if there should be any branching penalty at all.
The blue curve behaves quite differently. Run time is constant for small ranges and increases logarithmic thereafter. Yet for larger ranges, the curve seems to be approaching a constant asymptote again. How exactly can the qualitative differences of both curves be explained?
I am currently using GCC MinGW Win32 x86 v.4.8.1 with g++ -std=c++11 -ftemplate-depth=65536
and no compiler optimization.
Any help would be appreciated. I am also interested in any idea on how to improve the experiment itself. Thanks in advance!