I'm currently trying to understand some very very strange behavior in one of my C programs. Apparently, adding or removing a seemingly inconsequential line at the end of it drastically affects the performance in the rest of the program.
My program looks a bit like this:
#include <stdio.h>

int large_buffer[10000];

void compute(FILE *input) {
    for (int i = 0; i < 100; i++) {
        do_lots_of_stuff();
        printf(".");
        fflush(stdout);
    }
}

int main() {
    FILE *input = fopen("input.txt", "r");
    compute(input);
    fclose(input); // <--- everything gets 2x slower if I comment this out (!)
    return 0;
}
In theory, that fclose(input) line at the end of the main function should not matter, since the OS should automatically close the file at the end of the program anyway. However, I observed that my program took 2.5 seconds to run when I included the fclose statement and 5 seconds when I commented it out. A factor of 2 difference! And this was not due to latency at the start or end of the program: the speed at which the "." characters are printed out is visibly faster in the version with the fclose statement.
I suspect that this might have to do with some memory alignment or cache miss issue. If I replace the fclose with another function such as ftell, the program also takes 5 seconds to run, and if I reduce large_buffer to <= 8000 elements it always runs in 2.5 seconds, regardless of whether the fclose statement is there or not.
But I would really like to be 100% sure of the culprit behind this strange behavior. Would it be possible to run my program under some sort of profiler or other tool that would give me that information? So far I have tried running both versions under valgrind --tool=cachegrind, but it reported the same number of cache misses (0%) for both versions of my program.
edit 1: After running both versions of my program under perf stat -d -d -d, I got the following results:
Performance counter stats for './no-fclose examples/bench.o':
5625.535086 task-clock (msec) # 1.000 CPUs utilized
38 context-switches # 0.007 K/sec
0 cpu-migrations # 0.000 K/sec
54 page-faults # 0.010 K/sec
17,851,853,580 cycles # 3.173 GHz (53.23%)
6,421,955,412 stalled-cycles-frontend # 35.97% frontend cycles idle (53.23%)
4,919,383,925 stalled-cycles-backend # 27.56% backend cycles idle (53.23%)
13,294,878,129 instructions # 0.74 insn per cycle
# 0.48 stalled cycles per insn (59.91%)
3,178,485,061 branches # 565.010 M/sec (59.91%)
440,171,927 branch-misses # 13.85% of all branches (59.92%)
4,778,577,556 L1-dcache-loads # 849.444 M/sec (60.19%)
125,313 L1-dcache-load-misses # 0.00% of all L1-dcache hits (60.22%)
12,110 LLC-loads # 0.002 M/sec (60.25%)
<not supported> LLC-load-misses
<not supported> L1-icache-loads
20,196,491 L1-icache-load-misses (60.22%)
4,793,012,927 dTLB-loads # 852.010 M/sec (60.18%)
683 dTLB-load-misses # 0.00% of all dTLB cache hits (60.13%)
3,443 iTLB-loads # 0.612 K/sec (53.38%)
90 iTLB-load-misses # 2.61% of all iTLB cache hits (53.31%)
<not supported> L1-dcache-prefetches
51,382 L1-dcache-prefetch-misses # 0.009 M/sec (53.24%)
5.627225926 seconds time elapsed
Performance counter stats for './yes-fclose examples/bench.o':
2652.609254 task-clock (msec) # 1.000 CPUs utilized
15 context-switches # 0.006 K/sec
0 cpu-migrations # 0.000 K/sec
57 page-faults # 0.021 K/sec
8,277,447,108 cycles # 3.120 GHz (53.39%)
2,453,171,903 stalled-cycles-frontend # 29.64% frontend cycles idle (53.46%)
1,235,728,409 stalled-cycles-backend # 14.93% backend cycles idle (53.53%)
13,296,127,857 instructions # 1.61 insn per cycle
# 0.18 stalled cycles per insn (60.20%)
3,177,698,785 branches # 1197.952 M/sec (60.20%)
71,034,122 branch-misses # 2.24% of all branches (60.20%)
4,790,733,157 L1-dcache-loads # 1806.046 M/sec (60.20%)
74,908 L1-dcache-load-misses # 0.00% of all L1-dcache hits (60.20%)
15,289 LLC-loads # 0.006 M/sec (60.19%)
<not supported> LLC-load-misses
<not supported> L1-icache-loads
140,750 L1-icache-load-misses (60.08%)
4,792,716,217 dTLB-loads # 1806.793 M/sec (59.93%)
1,010 dTLB-load-misses # 0.00% of all dTLB cache hits (59.78%)
113 iTLB-loads # 0.043 K/sec (53.12%)
167 iTLB-load-misses # 147.79% of all iTLB cache hits (53.44%)
<not supported> L1-dcache-prefetches
29,744 L1-dcache-prefetch-misses # 0.011 M/sec (53.36%)
2.653584624 seconds time elapsed
Looks like there were few data-cache misses in both cases, as cachegrind had reported, but the slower version of the program had worse branch prediction and more instruction-cache misses and iTLB loads. Which of these differences would be the most likely to be responsible for the 2x difference in runtime between the test cases?
edit 2: Fun fact: apparently I can still keep the strange behavior if I replace the fclose call with a single NOP instruction.
edit 3: My processor is an Intel i5-2310 (Sandy Bridge)
Edit 4: Turns out that if I resize the arrays by editing the assembly file, it does not get faster. Apparently the reason it got faster when I changed their sizes in the C code was that gcc decided to rearrange the order of things in the binary.
Edit 5: More evidence that what matters is the precise addresses of the JMP instructions: if I add a single NOP (instead of a printf) at the start of my code, it gets faster. Similarly, if I remove a useless instruction from the start of my code, it also gets faster. And when I compiled my code with a different version of gcc, it also got faster, despite the generated assembly code being the same. The only differences were debug info at the start and that the sections of the binary file were in a different order.
Comments:

"[..] the speed at which the . are printed out is visibly faster [..]" o.O Is this a joke? The speed at which those characters are printed depends not directly on your program but rather on the buffering of the standard library and the terminal... Do you have any better-suited measurement which indicates that the difference in runtime is really not at the end of the program? – Dorise

The fact that resizing the large_buffer array had an effect on the runtime suggests that this is some sort of memory issue and not just an output buffering artifact. I used /usr/bin/time to measure how long each program took to run. – Wargo

[...] fclose, not slower. – Dorise

[...] do_lots_of_stuff() [...] – Melatonin

What is do_lots_of_stuff() doing? – Forde