How can I pinpoint if the slowness in my program is a CPU cache issue (on Linux)?

I'm currently trying to understand some very, very strange behavior in one of my C programs. Apparently, adding or removing a seemingly inconsequential line at the end of it drastically affects the performance of the rest of the program.

My program looks a bit like this:

#include <stdio.h>

void do_lots_of_stuff(void); /* the actual workload; body omitted here */

int large_buffer[10000];

void compute(FILE * input) {
    for(int i=0; i<100; i++) {
        do_lots_of_stuff();
        printf(".");
        fflush(stdout);
    }
}

int main() {
    FILE *input = fopen("input.txt", "r");
    compute(input);
    fclose(input); // <--- everything gets 2x slower if I comment this out (!)
    return 0;
}

In theory, that fclose(input) line at the end of the main function should not matter, since the OS should automatically close the file at the end of the program anyway. However, I observed that my program took 2.5 seconds to run when I included the fclose statement and 5 seconds when I commented it out. A factor of 2 difference! And this was not due to latency at the start or end of the program: the speed at which the dots are printed out is visibly faster in the version with the fclose statement.

I suspect that this might have to do with some memory alignment or cache miss issue. If I replace the fclose with another function such as ftell, it also takes 5 seconds to run, and if I reduce the size of large_buffer to <= 8000 elements, then it always runs in 2.5 seconds, regardless of whether the fclose statement is there or not.

But I would really like to be 100% sure of what the culprit behind this strange behavior is. Would it be possible to run my program under some sort of profiler or other tool that would give me that information? So far I have tried running both versions under valgrind --tool=cachegrind, but it reported the same number of cache misses (0%) for both versions of my program.


edit 1: After running both versions of my program under perf stat -d -d -d I got the following results:

 Performance counter stats for './no-fclose examples/bench.o':

       5625.535086      task-clock (msec)         #    1.000 CPUs utilized          
                38      context-switches          #    0.007 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
                54      page-faults               #    0.010 K/sec                  
    17,851,853,580      cycles                    #    3.173 GHz                      (53.23%)
     6,421,955,412      stalled-cycles-frontend   #   35.97% frontend cycles idle     (53.23%)
     4,919,383,925      stalled-cycles-backend    #   27.56% backend cycles idle      (53.23%)
    13,294,878,129      instructions              #    0.74  insn per cycle         
                                                  #    0.48  stalled cycles per insn  (59.91%)
     3,178,485,061      branches                  #  565.010 M/sec                    (59.91%)
       440,171,927      branch-misses             #   13.85% of all branches          (59.92%)
     4,778,577,556      L1-dcache-loads           #  849.444 M/sec                    (60.19%)
           125,313      L1-dcache-load-misses     #    0.00% of all L1-dcache hits    (60.22%)
            12,110      LLC-loads                 #    0.002 M/sec                    (60.25%)
   <not supported>      LLC-load-misses                                             
   <not supported>      L1-icache-loads                                             
        20,196,491      L1-icache-load-misses                                         (60.22%)
     4,793,012,927      dTLB-loads                #  852.010 M/sec                    (60.18%)
               683      dTLB-load-misses          #    0.00% of all dTLB cache hits   (60.13%)
             3,443      iTLB-loads                #    0.612 K/sec                    (53.38%)
                90      iTLB-load-misses          #    2.61% of all iTLB cache hits   (53.31%)
   <not supported>      L1-dcache-prefetches                                        
            51,382      L1-dcache-prefetch-misses #    0.009 M/sec                    (53.24%)

       5.627225926 seconds time elapsed

 Performance counter stats for './yes-fclose examples/bench.o':

       2652.609254      task-clock (msec)         #    1.000 CPUs utilized          
                15      context-switches          #    0.006 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
                57      page-faults               #    0.021 K/sec                  
     8,277,447,108      cycles                    #    3.120 GHz                      (53.39%)
     2,453,171,903      stalled-cycles-frontend   #   29.64% frontend cycles idle     (53.46%)
     1,235,728,409      stalled-cycles-backend    #   14.93% backend cycles idle      (53.53%)
    13,296,127,857      instructions              #    1.61  insn per cycle         
                                                  #    0.18  stalled cycles per insn  (60.20%)
     3,177,698,785      branches                  # 1197.952 M/sec                    (60.20%)
        71,034,122      branch-misses             #    2.24% of all branches          (60.20%)
     4,790,733,157      L1-dcache-loads           # 1806.046 M/sec                    (60.20%)
            74,908      L1-dcache-load-misses     #    0.00% of all L1-dcache hits    (60.20%)
            15,289      LLC-loads                 #    0.006 M/sec                    (60.19%)
   <not supported>      LLC-load-misses                                             
   <not supported>      L1-icache-loads                                             
           140,750      L1-icache-load-misses                                         (60.08%)
     4,792,716,217      dTLB-loads                # 1806.793 M/sec                    (59.93%)
             1,010      dTLB-load-misses          #    0.00% of all dTLB cache hits   (59.78%)
               113      iTLB-loads                #    0.043 K/sec                    (53.12%)
               167      iTLB-load-misses          #  147.79% of all iTLB cache hits   (53.44%)
   <not supported>      L1-dcache-prefetches                                        
            29,744      L1-dcache-prefetch-misses #    0.011 M/sec                    (53.36%)

       2.653584624 seconds time elapsed

It looks like there were very few data-cache misses in both cases, just as cachegrind had reported, but the slower version of the program had worse branch prediction, more instruction-cache misses, and more iTLB loads. Which of these differences would be the most likely to be responsible for the 2x difference in runtime between the test cases?


edit 2: Fun fact: apparently I can still keep the strange behavior if I replace the fclose call with a single NOP instruction.


edit 3: My processor is an Intel i5-2310 (Sandy Bridge)


Edit 4: Turns out that if I resize the arrays by editing the assembly file, it does not get faster. Apparently, the reason it got faster when I changed their sizes in the C code was that gcc decided to rearrange the order of the stuff in the binary.


Edit 5: More evidence that what matters is the precise addresses of the JMP instructions: if I add a single NOP (instead of a printf) at the start of my code, it gets faster. Similarly, if I remove a useless instruction from the start of my code, it also gets faster. And when I compiled my code with a different version of gcc, it also got faster, despite the generated assembly code being the same. The only differences were the debug info at the start and that the sections of the binary file were in a different order.

Wargo answered 5/4, 2017 at 21:24 Comment(12)
'the OS should automatically close the file at the end of the program anyway' - well in C, that's optimistic... It will free the fd etc, sure, but it will probably not flush buffered data as close() would do.Thallophyte
"[..] the speed at which the . are printed out is visibly faster [..]" o.O is this a joke? The speed at which those characters are printed depends not directly on your program but rather on the buffering of the standard library and the terminal... Do you have any ... better suitable measurement which indicates that the difference in runtime is really not at the end of the program?Dorise
@DanielJour: Good point. In the original version of the program there was an fflush after each printf that I forgot to include in this question (I edited to include that now). But even then, the fact that changing the size of the large_buffer array had an effect on the runtime suggests that this is some sort of memory issue and not just an output buffering artifact. I used /usr/bin/time to measure how long each program took to run.Wargo
I don't think this has anything to do with caches or memory alignment. fclose is a library (system) call and will certainly take a certain number of cycles to complete. If the rest of your program happens to take less than or equal to this cycle count, it will look like a 2X improvement. Try increasing your loop limit to 10000. Now the effect of fclose won't be 2X!Marylou
@IsuruH the curiosity seems to be that the code runs faster with the fclose, not slower.Dorise
@IsuruH: In addition to what Daniel said, the program already has a big loop inside the do_lots_of_stuff function. It takes multiple seconds to run, which is orders of magnitude more than a single system call.Wargo
@DanielJour my apologies. read it too quickly. In that case, it is strange!!Marylou
I'd be interested to see how you do_lots_of_stuff()Melatonin
Can't reproduce on Windows; I've measured with and without fclose and the elapsed time is the same. What is that do_lots_of_stuff() doing?Forde
@BPL: I doubt anyone will be able to reproduce this unless they run the exact same program, compiler version, CPU hardware and operating system as I did. And I can't make a minimal example because tiny seemingly inconsequential changes make the weird behavior disappear. This is why I am asking about what profiling or diagnostic tools I can use on Linux to figure out what is going on.Wargo
But if you are curious, the original program was a brainfuck interpreter running a particular brainfuck benchmark.Wargo
So you're asking why a language that was designed to mess with your brain is doing exactly that?Myriad

The Key Metric

Your culprit is branch misses:

440,171,927      branch-misses             #   13.85% of all branches

vs.

71,034,122      branch-misses             #    2.24% of all branches

I'm not sure which processor you're running, but if you are hypothetically running a 2.5 GHz processor on Haswell, you'll see that each branch prediction miss costs around 15 cycles (usually a bit more, because other stuff stalls too), and each cycle is 0.4 ns. So, 0.4 ns/cycle * 15 cycles/missed branch * (440,171,927 - 71,034,122) branch misses ≈ 2.2 seconds. It will depend on your exact processor and the machine code, but this explains the bulk of the difference right there.

Cause

The branch prediction algorithms of different chips are proprietary, but if you research here (http://www.agner.org/optimize/microarchitecture.pdf) you can learn more about different processors and their limitations. Essentially, what you'll find is that certain processors do a better job of avoiding collisions in the branch prediction tables that they store to make predictions about branches taken or not.

So, why is that relevant? Well, what has happened (99% chance) is that by rearranging your program very slightly, you changed where exactly the different branches sit in memory. When too many of them map to the same bucket in the processor's branch prediction tables, predictions start colliding. By changing the executable slightly, this problem goes away. You have to have a very specific distance between the two branch points to trigger this issue, and you have unluckily managed to do that.

Simple Solution

As you found, many changes will in fact cause the program not to hit this degraded performance. Essentially, anything that changes the distance between the two critical branch points will fix the problem. You can accomplish this by literally inserting 16 bytes (or enough to move the branch points to different buckets) of machine-code nops somewhere between the two places. You can also, as you did, change something else that shifts this distance into a non-pathological case.
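
For instance, here is a minimal sketch of what that nop padding could look like from C with GCC inline assembly; the placement (the start of main) and the count of 16 are guesses you would have to tune, and do_lots_of_stuff is only a stub standing in for the real workload:

#include <stdio.h>

int large_buffer[10000];

static void do_lots_of_stuff(void) { /* stub for the real workload */ }

/* Emit 16 one-byte nops so every instruction assembled after this point
 * shifts by 16 bytes; tune the count until the conflicting branches stop
 * sharing a predictor bucket. */
#define PAD_NOPS() __asm__ volatile(".rept 16\n\tnop\n\t.endr")

void compute(FILE *input) {
    (void)input; /* silences the unused-parameter warning in this stub */
    for (int i = 0; i < 100; i++) {
        do_lots_of_stuff();
        printf(".");
        fflush(stdout);
    }
}

int main(void) {
    PAD_NOPS(); /* executed once; only its effect on code layout matters */
    FILE *input = fopen("input.txt", "r");
    compute(input);
    fclose(input);
    return 0;
}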

Digging Deeper

If you want to truly understand what causes it in this situation, you'll need to get your hands dirty. Fun! You will need to select one of many tools to find the specific branch that is being mispredicted. Here is one way: How to measure mispredictions for a single branch on Linux?

After you identify the mispredicted branch, you can figure out if there is a way to remove the branch (almost always a good idea for performance anyway). If not, you can place a hint for it or, worst case, just move things around to ensure that the same entry isn't shared as previously suggested.
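
As a hedged illustration of the hinting option with GCC's __builtin_expect (the loop, the data, and the "rarely taken" assumption are invented for this example and are not from the original program):

#include <stdio.h>

#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Hypothetical hot loop: only __builtin_expect itself is real GCC;
 * everything else is made up for illustration. */
static long run(const int *ops, long n) {
    long acc = 0;
    for (long i = 0; i < n; i++) {
        if (unlikely(ops[i] == 0))   /* hint: this branch is almost never taken */
            acc -= 1;
        else
            acc += ops[i];
    }
    return acc;
}

int main(void) {
    int ops[] = {1, 2, 3, 0, 4};
    printf("%ld\n", run(ops, 5));   /* prints 9 */
    return 0;
}

Note that such a hint only helps if the prediction really is that lopsided; a wrong hint can make things slower.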

Broader lessons

Programmers underestimate the cost of branching (when the compiler is unable to remove the branches at compile time). As you've discovered, each branch puts more strain on the branch prediction buffers of a processor and increases the chance of mispredictions. So, even branches that are 100% predictable to the processor can degrade performance by reducing the resources available for predicting other branches. Further, when a branch is mispredicted, it costs a minimum of 12-15 cycles and can be much more if the needed instructions aren't in the L1 cache, L2 cache, L3 cache, or heaven help you, main memory. Additionally, the compiler can't make all optimizations across branches.
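
To make "removing the branch" concrete, here is a small generic sketch (a clamp to zero, not code from the question) comparing a branchy form with a branch-free form:

#include <stdio.h>

/* Branchy version: the compiler may emit a conditional jump. */
static int clamp_branchy(int x) {
    if (x < 0) return 0;
    return x;
}

/* Branchless version: no jump; assumes 32-bit int with arithmetic right shift. */
static int clamp_branchless(int x) {
    return x & ~(x >> 31);
}

int main(void) {
    printf("%d %d\n", clamp_branchy(-5), clamp_branchless(-5));   /* 0 0 */
    printf("%d %d\n", clamp_branchy(7),  clamp_branchless(7));    /* 7 7 */
    return 0;
}

Compilers frequently perform this kind of if-conversion on their own at -O2/-O3, so it is worth checking the generated assembly before rewriting by hand.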

Wyon answered 12/4, 2017 at 22:58 Comment(1)
Thanks for the math! That is really good evidence that it was the branch prediction all along. And I made some extra tests (see my edits) that indeed suggest that it was just the branch prediction and that memory locality had nothing to do with it.Wargo

First, you want to sanity-check by disassembling all functions and making sure the only difference is in main. You can disassemble with objdump -d and hack around to compare the output with diff.

The addition of fclose pulls in a new symbol (so part of the file is already modified), and after that the main function is also modified. This in turn changes addresses and offsets.

So the suspicion is that you get excessive cache thrashing not present in the previous version of the program.

In your question you state that cachegrind was executed, but it reported 0%. This does not add up. Even if the aforementioned suspicion is incorrect, you are bound to get several misses anyway. Please paste both results.

The canonical tool to play with on Linux is perf (https://perf.wiki.kernel.org/index.php/Tutorial). Make sure to run it several times, as for such short runtimes you will get a lot of variance.

You can play with explicit alignment of both variables and functions by annotating them with

  __attribute__((aligned(64)))

Play with the number.
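
A minimal sketch of what that could look like applied to the names from the question; the value 64 and the choice of what to align are just starting points to experiment with:

#include <stdio.h>

/* Align the buffer to a 64-byte (cache line) boundary. */
int large_buffer[10000] __attribute__((aligned(64)));

/* The same attribute works on functions; -falign-functions=N at compile
 * time is another way to move function start addresses around. */
__attribute__((aligned(64)))
void compute(FILE *input) {
    (void)input;   /* body as in the question */
}

int main(void) {
    FILE *input = fopen("input.txt", "r");
    compute(input);
    fclose(input);
    return 0;
}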

Prudence answered 9/4, 2017 at 10:40 Comment(5)
Thanks, I'll check this on monday when I can reproduce this again.Wargo
More testing also revealed that the weird behavior only happens on -O3, when everything gets inlined inside the main function. I double checked the generated assembly and both cases are very similar except that one of the programs has an extra block of code in the middle for the fclose. I ran the programs under perf (thanks for the recommendation, I wasn't aware of it) and posted the results on an edit to my question. Tomorrow, I'll try to do some experiments with the alignment, didn't get a chance to do that yet.Wargo
Given the data, that's almost certainly branch prediction. You can profile to see which branches are getting mispredicted and how often. If something goes one way the vast majority of the time, you can annotate it by putting the condition within __builtin_expect((condition), 1) (or 0). The real fix consists of having fewer branches though (e.g. maybe you have a branch in a loop, but you can move it outside and have 2 variants of the body instead).Prudence
What can I use to find out what branches are getting mispredicted? perf stat only reports the total number of mispredictions.Wargo
@Wargo Use perf record instead, then perf report --stdioTritanopia

The visible change is a change in timing, so you'd want to start by aiming a time profiler at runs that both exhibit and do not exhibit the behavior to see what changes in where the time is spent. If you're lucky, you can compile for tracing and see what gprof can spit out without perturbing the behavior.

If you truly have just one humongous function making up most of your program, anything that does time roll-up by function is not going to give terribly granular results. In that case, you might try breaking up the one mega-function into a network of smaller functions in order to get more granular time cost statistics.

If you're unlucky, and compiling for profiling makes it so the behavior difference disappears, then skip to profiling kernel function calls (which can be done dynamically) and look for time differences with regard to the kernel function call trace rather than the user function call trace.

Once you know where the time is going, you should have more precise questions to ask and possibly be able to rule out some factors. At that point, you might want to break out perf-tools.

Cerelia answered 9/4, 2017 at 23:12 Comment(2)
I don't think breaking into smaller functions is an option here. After further analysis, it seems that this weird behavior only happens at the -O3 optimization level, when pretty much the entirety of my code ends up inlined inside the main function. Anyway, I ran my code under perf-tools and posted the results as an edit to my question. Do you think we can conclude something based on those numbers?Wargo
Branch misses are like six times higher in the slow run case. The next question is what branches are being missed so much for virtually the same code. And how much run time is actually attributable to (a) misses and (b) different code paths. You might be able to get line level cost info out of gprof, which might give a very clear picture of where time is being spent.Cerelia
