How to handle branch mispredictions that seem to depend on machine code position?

While trying to benchmark implementations of a simple sparse unit lower triangular backward solve in CSC format, I observe strange behavior: the performance varies drastically depending on where in the executable the assembly instructions end up. I observe this in many different variations of the same problem. One minimal example is to duplicate the instructions of the implementation

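// Solves the transposed system L^T x = b in place, where L is unit lower
// triangular and stored in CSC format: x holds the right-hand side b on entry
// and the solution on exit. The unit diagonal is not stored; the below-diagonal
// entries of column j occupy [col_begin_indices[j], col_begin_indices[j+1]).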
void lowerUnitTriangularTransposedBacksolve(const EntryIndex* col_begin_indices,
                                            const Index* row_indices,
                                            const Value* values,
                                            const Index dimension, Value* x) {
  if (dimension == 0) return;

  EntryIndex entry_index = col_begin_indices[dimension];
  Index col_index = dimension;
  do {
    col_index -= 1;
    const EntryIndex col_begin = col_begin_indices[col_index];

    if (entry_index > col_begin) {
      Value x_temp = x[col_index];
      do {
        entry_index -= 1;
        x_temp -= values[entry_index] * x[row_indices[entry_index]];
      } while (entry_index != col_begin);
      x[col_index] = x_temp;
    }
  } while (col_index != 0);
}

into two functions benchmarkBacksolve1 and benchmarkBacksolve2 (source code using google/benchmark). The difference in performance between the two functions, which are identical apart from their offsets, then also shows up on quick-bench.com. This is not a fluke; benchmarkBacksolve1 is consistently 10% slower than benchmarkBacksolve2. Keeping the same order of execution but swapping the positions of the machine code in the executable by defining benchmarkBacksolve2 before benchmarkBacksolve1 also swaps the performance difference.
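
For reference, here is a minimal sketch of how such a setup could look with google/benchmark. This is an assumption of mine, not the code behind the linked source or the quick-bench run: the solver shown above is assumed to be visible here and marked __attribute__((always_inline)), so each benchmark function receives its own copy of its machine code at a different position in the executable. The EntryIndex/Index/Value typedefs, the Problem struct, the makeProblem helper, and the tiny bidiagonal test matrix are made up purely to keep the sketch self-contained.

#include <benchmark/benchmark.h>

#include <cstdint>
#include <vector>

using EntryIndex = std::int64_t;  // assumed typedefs; the real ones are not shown above
using Index = std::int32_t;
using Value = double;

// Hypothetical test problem: a unit lower bidiagonal matrix of size n in CSC,
// i.e. column j holds a single below-diagonal entry at row j + 1.
struct Problem {
  std::vector<EntryIndex> col_begin;
  std::vector<Index> rows;
  std::vector<Value> values;
  std::vector<Value> x;  // right-hand side on entry, solution on exit
};

static Problem makeProblem(Index n) {
  Problem p{std::vector<EntryIndex>(n + 1), std::vector<Index>(n - 1),
            std::vector<Value>(n - 1, 0.5), std::vector<Value>(n, 1.0)};
  for (Index j = 0; j + 1 < n; ++j) {
    p.col_begin[j] = j;
    p.rows[j] = j + 1;
  }
  p.col_begin[n - 1] = p.col_begin[n] = n - 1;  // last column has no stored entries
  return p;
}

static void benchmarkBacksolve1(benchmark::State& state) {
  const Index n = 1000;  // arbitrary problem size for this sketch
  Problem p = makeProblem(n);
  for (auto _ : state) {
    // With always_inline, this call becomes copy #1 of the solver's machine code.
    lowerUnitTriangularTransposedBacksolve(p.col_begin.data(), p.rows.data(),
                                           p.values.data(), n, p.x.data());
    benchmark::DoNotOptimize(p.x.data());
    benchmark::ClobberMemory();
  }
}
BENCHMARK(benchmarkBacksolve1);

static void benchmarkBacksolve2(benchmark::State& state) {
  const Index n = 1000;
  Problem p = makeProblem(n);
  for (auto _ : state) {
    // Identical code as above, but copy #2 ends up at a different address.
    lowerUnitTriangularTransposedBacksolve(p.col_begin.data(), p.rows.data(),
                                           p.values.data(), n, p.x.data());
    benchmark::DoNotOptimize(p.x.data());
    benchmark::ClobberMemory();
  }
}
BENCHMARK(benchmarkBacksolve2);

BENCHMARK_MAIN();

With a setup like this, the only intended difference between the two benchmarks is the position of the duplicated machine code in the executable.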

On my local machine (gcc 9.4.0, -O3, AMD Ryzen 7 PRO 4750U), the difference is even more extreme:

--------------------------------------------------------------
Benchmark                    Time             CPU   Iterations
--------------------------------------------------------------
benchmarkBacksolve1       1919 ns         1907 ns       364269
benchmarkBacksolve2        918 ns          908 ns       773814

The function defined first is twice as slow. Running the benchmark for only one of the functions with --benchmark_filter, or using --benchmark_enable_random_interleaving=true with --benchmark_repetitions=n, only confirms the measurements.

The most notable observation I found during profiling is a large difference in branch misses:

$ perf stat ./a.out --benchmark_filter=benchmarkBacksolve1
                        ...
     1.231.964.948      branches                  #    1,354 G/sec                    (83,31%)
        97.548.817      branch-misses             #    7,92% of all branches          (83,22%)
                        ...
$ perf stat ./a.out --benchmark_filter=benchmarkBacksolve2
                        ...
     2.241.666.453      branches                  #    2,779 G/sec                    (83,64%)
         3.751.915      branch-misses             #    0,17% of all branches          (83,41%)
                        ...

Almost 8% of all branches seem to be mispredicted for the function defined first, compared to only 0.2% for the function defined last.


On different machines, I have to modify the setup a bit to see this effect, but other experiments confirm how brittle this seems to be on most machines. Any kind of change in offset can completely change the benchmark measurements. As just one more example, removing the __attribute__((always_inline)) on my machine results in equal times for the two functions, but that time is neither the faster nor the slower one from before; it is a distinct new value of around 1250 ns.


Through interviews with and talks by Emery Berger, I am aware that unexpected factors like link ordering and environment variables can influence performance. Still, I would like to understand this example better.

  • Why does branch prediction in this code in particular seem to be influenced so much by code position?
  • Is there anything I can do to avoid these deviations so that I can evaluate different implementations properly? After observing these effects, I am worried that my answer in Optimizing the backward solve for a sparse lower triangular linear system was measuring anything but real performance improvements.
  • Is there something one can pay attention to during implementation to avoid such problems with the branch predictors of most modern CPUs?
Underage answered 20/8, 2022 at 16:8 Comment(0)

Modern CPUs have branch predictors that learn the behavior of a branch by remembering past outcomes. But most code has more than one branch in a loop, so for this to have any chance of working, the CPU has to have multiple branch predictors and some way to make different branches use different predictors, so that it can learn the behavior of each branch separately.

One simple way to do this is to use the last few bits of the branch's address, or a simple hash of the address, to pick a branch predictor. But then one layout of the code may cause a collision where two branches pick the same predictor, while a different layout has the branches using separate predictors. This can obviously change the prediction success rate for each branch.
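
To make the aliasing effect concrete, here is a toy model (my own illustration, not a description of any real CPU's predictor): a small table of 2-bit saturating counters indexed by a few low bits of the branch address. When an always-taken branch and a never-taken branch happen to map to the same counter, the always-taken branch is mispredicted on every execution, while with separate counters it mispredicts only while warming up. The branch addresses used below are hypothetical values chosen only to force or avoid a collision.

#include <cstddef>
#include <cstdint>
#include <cstdio>

// Table of 2-bit saturating counters, indexed by branch address bits [2..5].
struct ToyPredictor {
  static constexpr std::size_t kEntries = 16;
  std::uint8_t counters[kEntries] = {};  // 0..3, predict "taken" if >= 2

  std::size_t index(std::uintptr_t address) const { return (address >> 2) % kEntries; }
  bool predict(std::uintptr_t address) const { return counters[index(address)] >= 2; }
  void update(std::uintptr_t address, bool taken) {
    std::uint8_t& c = counters[index(address)];
    if (taken && c < 3) ++c;
    if (!taken && c > 0) --c;
  }
};

// Branch A is always taken, branch B is never taken; they execute alternately.
// Returns how often branch A is mispredicted out of 1000 executions.
static int missesForBranchA(std::uintptr_t address_a, std::uintptr_t address_b) {
  ToyPredictor predictor;
  int misses = 0;
  for (int i = 0; i < 1000; ++i) {
    if (!predictor.predict(address_a)) ++misses;  // actual outcome of A is "taken"
    predictor.update(address_a, true);
    predictor.update(address_b, false);  // B is "not taken"; only A's misses are counted
  }
  return misses;
}

int main() {
  // 0x1000 and 0x2000 share the same index bits (a collision); 0x1000 and 0x1004 do not.
  std::printf("branch A misses, colliding with B: %4d / 1000\n",
              missesForBranchA(0x1000, 0x2000));
  std::printf("branch A misses, separate from B:  %4d / 1000\n",
              missesForBranchA(0x1000, 0x1004));
}

In this toy model, branch B's "not taken" outcomes keep pulling the shared counter back down, so branch A never gets a chance to train. Real predictors are much more sophisticated (see the comments below), but this is the basic failure mode that a change in code position can trigger or avoid.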

One thing you should do (at least with gcc) is to not blindly use -O3. -O3 includes optimizations that generally hurt performance or are still experimental. It is something to use only selectively after checking that it actually has a benefit. And to even judge whether it has a benefit, you have to fuzz the binary a lot, measuring lots of different alignments of code, stack and data, and different orderings of the code.

Payroll answered 20/8, 2022 at 16:31 Comment(7)
Thanks for the nice, concise explanation of what I agree is probably the general problem in this example. I suppose your point about -O3 is in general also fair. However, I am not sure how it applies to the question really. Are you saying that -O3 is in general more likely to cause such a collision in branch prediction? The optimization level is of course also an influencing factor on the measurements, but one can observe the same described behavior on -O2 as well. I just happened to find an illustrative example with -O3 sooner.Underage
@mjacobse: No, the stuff about -O3 is just a side note, suggesting (perhaps too pessimistically, IMO) that it might make your code slower for other reasons. Different optimization choices happening to make two important branches alias each other or not is pretty much random chance.Cornia
If anything, -O3 is more likely to do if-conversion to branchless code, e.g. cmov. Or even auto-vectorize. Both of these would remove the possibility of mispredicts where they happen, and reduce competition for branch-predictor space. Auto-vectorization does introduce some extra branching outside loops to check whether the size is at least a full vector width, and to figure out how much cleanup to run. But those branches only run once for the whole function, so they don't tend to dominate the prediction if multiple branches alias to the same predictor slot.Cornia
(Modern predictors aren't simply indexed by low bits, with rest as tags. IT-TAGE (Haswell, Zen2 and later) also uses branch history as part of the index, and branches that alias can share a predictor instead of evicting the other prediction if the tag doesn't match. So the hot one can dominate instead of having its prediction thrown away. See xania.org/201602/bpu-part-two / Why did Intel change the static branch prediction mechanism over these years?) But the general principle of branches aliasing is still a thing that can happen.Cornia
@mjacobse: AFAIK, there isn't any reliable way to write your source code to avoid branch aliasing problems, even if you're only considering one CPU microarchitecture (like your Zen2). It's just something you have to measure and maybe work around.Cornia
I'm not sure -O3 results in any more use of cmov than -O2. One thing I noticed, though I'm fairly uncertain about it, is that -O3 will unroll loops more, just repeating the same code multiple times. For example, it repeats the code 4 times and only then branches back to the top. While this saves 3 branches, those branches would have been perfectly predicted in any long-running loop anyway. But now all the other branches are multiplied by 4, and the CPU has to use 4 times the branch predictors and train each separately. I can't see how that can really be a benefit anymore. It also wastes I-cache.Payroll
Modern CPUs basically unroll loops in hardware.Payroll
