How to handle branch mispredictions that seem to depend on machine code position?

While trying to benchmark implementations of a simple sparse unit lower triangular backward solve in CSC format, I observe strange behavior: the performance varies drastically depending on where in the executable the assembly instructions end up. I observe this in many different variations of the same problem. One minimal example is to duplicate the instructions of the implementation

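// Solves the transposed system L^T x = b in place, where L is unit lower
// triangular and stored in CSC format: x holds the right-hand side b on entry
// and the solution on exit. The unit diagonal is not stored; the below-diagonal
// entries of column j occupy [col_begin_indices[j], col_begin_indices[j+1]).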
void lowerUnitTriangularTransposedBacksolve(const EntryIndex* col_begin_indices,
                                            const Index* row_indices,
                                            const Value* values,
                                            const Index dimension, Value* x) {
  if (dimension == 0) return;

  EntryIndex entry_index = col_begin_indices[dimension];
  Index col_index = dimension;
  do {
    col_index -= 1;
    const EntryIndex col_begin = col_begin_indices[col_index];

    if (entry_index > col_begin) {
      Value x_temp = x[col_index];
      do {
        entry_index -= 1;
        x_temp -= values[entry_index] * x[row_indices[entry_index]];
      } while (entry_index != col_begin);
      x[col_index] = x_temp;
    }
  } while (col_index != 0);
}

into two functions benchmarkBacksolve1 and benchmarkBacksolve2 (source code using google/benchmark). The difference in performance between the two functions, which are identical apart from their offsets, then also shows up on quick-bench.com. This is not a fluke; benchmarkBacksolve1 is consistently 10% slower than benchmarkBacksolve2. Keeping the same order of execution but swapping the positions of the machine code in the executable by defining benchmarkBacksolve2 before benchmarkBacksolve1 also swaps the performance difference.
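
For reference, here is a minimal sketch of how such a setup could look with google/benchmark. This is an assumption of mine, not the code behind the linked source or the quick-bench run: the solver shown above is assumed to be visible here and marked __attribute__((always_inline)), so each benchmark function receives its own copy of its machine code at a different position in the executable. The EntryIndex/Index/Value typedefs, the Problem struct, the makeProblem helper, and the tiny bidiagonal test matrix are made up purely to keep the sketch self-contained.

#include <benchmark/benchmark.h>

#include <cstdint>
#include <vector>

using EntryIndex = std::int64_t;  // assumed typedefs; the real ones are not shown above
using Index = std::int32_t;
using Value = double;

// Hypothetical test problem: a unit lower bidiagonal matrix of size n in CSC,
// i.e. column j holds a single below-diagonal entry at row j + 1.
struct Problem {
  std::vector<EntryIndex> col_begin;
  std::vector<Index> rows;
  std::vector<Value> values;
  std::vector<Value> x;  // right-hand side on entry, solution on exit
};

static Problem makeProblem(Index n) {
  Problem p{std::vector<EntryIndex>(n + 1), std::vector<Index>(n - 1),
            std::vector<Value>(n - 1, 0.5), std::vector<Value>(n, 1.0)};
  for (Index j = 0; j + 1 < n; ++j) {
    p.col_begin[j] = j;
    p.rows[j] = j + 1;
  }
  p.col_begin[n - 1] = p.col_begin[n] = n - 1;  // last column has no stored entries
  return p;
}

static void benchmarkBacksolve1(benchmark::State& state) {
  const Index n = 1000;  // arbitrary problem size for this sketch
  Problem p = makeProblem(n);
  for (auto _ : state) {
    // With always_inline, this call becomes copy #1 of the solver's machine code.
    lowerUnitTriangularTransposedBacksolve(p.col_begin.data(), p.rows.data(),
                                           p.values.data(), n, p.x.data());
    benchmark::DoNotOptimize(p.x.data());
    benchmark::ClobberMemory();
  }
}
BENCHMARK(benchmarkBacksolve1);

static void benchmarkBacksolve2(benchmark::State& state) {
  const Index n = 1000;
  Problem p = makeProblem(n);
  for (auto _ : state) {
    // Identical code as above, but copy #2 ends up at a different address.
    lowerUnitTriangularTransposedBacksolve(p.col_begin.data(), p.rows.data(),
                                           p.values.data(), n, p.x.data());
    benchmark::DoNotOptimize(p.x.data());
    benchmark::ClobberMemory();
  }
}
BENCHMARK(benchmarkBacksolve2);

BENCHMARK_MAIN();

With a setup like this, the only intended difference between the two benchmarks is the position of the duplicated machine code in the executable.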

On my local machine (gcc 9.4.0, -O3, AMD Ryzen 7 PRO 4750U), the difference is even more extreme:

--------------------------------------------------------------
Benchmark                    Time             CPU   Iterations
--------------------------------------------------------------
benchmarkBacksolve1       1919 ns         1907 ns       364269
benchmarkBacksolve2        918 ns          908 ns       773814

The function defined first is twice as slow. Running the benchmark for only one of the functions with --benchmark_filter, or using --benchmark_enable_random_interleaving=true with --benchmark_repetitions=n, only confirms the measurements.

The most notable observation I found during profiling is a large difference in branch misses:

$ perf stat ./a.out --benchmark_filter=benchmarkBacksolve1
                        ...
     1.231.964.948      branches                  #    1,354 G/sec                    (83,31%)
        97.548.817      branch-misses             #    7,92% of all branches          (83,22%)
                        ...
$ perf stat ./a.out --benchmark_filter=benchmarkBacksolve2
                        ...
     2.241.666.453      branches                  #    2,779 G/sec                    (83,64%)
         3.751.915      branch-misses             #    0,17% of all branches          (83,41%)
                        ...

Almost 8% of all branches seem to be mispredicted for the function defined first, compared to only 0.2% for the function defined last.


On different machines, I have to modify the setup a bit to see this effect, but other experiments confirm how brittle this seems to be on most machines. Any kind of change in offset can completely change the benchmark measurements. As just one more example, removing the __attribute__((always_inline)) on my machine results in equal times for the two functions, but that time is neither the faster nor the slower one from before; it is a distinct new value of around 1250 ns.


Through interviews with and talks by Emery Berger, I am aware that unexpected factors like link ordering and environment variables can influence performance. Still, I would like to understand this example better.

  • Why does branch prediction in this code in particular seem to be influenced so much by code position?
  • Is there anything I can do to avoid these deviations so that I can evaluate different implementations properly? After observing these effects, I am worried that my answer in Optimizing the backward solve for a sparse lower triangular linear system was measuring anything but real performance improvements.
  • Is there something one can pay attention to during implementation to avoid such problems with the branch predictors of most modern CPUs?
Underage answered 20/8, 2022 at 16:8 Comment(0)

Modern CPUs have branch predictors that learn the behavior of a branch by remembering past outcomes. But most code has more than one branch in a loop, so for this to have any chance of working, the CPU has to have multiple branch predictors and some way to make different branches use different predictors, so that it can learn the behavior of each branch separately.

One simple way to do this is to use the last few bits of the branch's address, or a simple hash of the address, to pick a branch predictor. But then one layout of the code may cause a collision where two branches pick the same predictor, while a different layout has the branches using separate predictors. This can obviously change the prediction success rate for each branch.
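
To make the aliasing effect concrete, here is a toy model (my own illustration, not a description of any real CPU's predictor): a small table of 2-bit saturating counters indexed by a few low bits of the branch address. When an always-taken branch and a never-taken branch happen to map to the same counter, the always-taken branch is mispredicted on every execution, while with separate counters it mispredicts only while warming up. The branch addresses used below are hypothetical values chosen only to force or avoid a collision.

#include <cstddef>
#include <cstdint>
#include <cstdio>

// Table of 2-bit saturating counters, indexed by branch address bits [2..5].
struct ToyPredictor {
  static constexpr std::size_t kEntries = 16;
  std::uint8_t counters[kEntries] = {};  // 0..3, predict "taken" if >= 2

  std::size_t index(std::uintptr_t address) const { return (address >> 2) % kEntries; }
  bool predict(std::uintptr_t address) const { return counters[index(address)] >= 2; }
  void update(std::uintptr_t address, bool taken) {
    std::uint8_t& c = counters[index(address)];
    if (taken && c < 3) ++c;
    if (!taken && c > 0) --c;
  }
};

// Branch A is always taken, branch B is never taken; they execute alternately.
// Returns how often branch A is mispredicted out of 1000 executions.
static int missesForBranchA(std::uintptr_t address_a, std::uintptr_t address_b) {
  ToyPredictor predictor;
  int misses = 0;
  for (int i = 0; i < 1000; ++i) {
    if (!predictor.predict(address_a)) ++misses;  // actual outcome of A is "taken"
    predictor.update(address_a, true);
    predictor.update(address_b, false);  // B is "not taken"; only A's misses are counted
  }
  return misses;
}

int main() {
  // 0x1000 and 0x2000 share the same index bits (a collision); 0x1000 and 0x1004 do not.
  std::printf("branch A misses, colliding with B: %4d / 1000\n",
              missesForBranchA(0x1000, 0x2000));
  std::printf("branch A misses, separate from B:  %4d / 1000\n",
              missesForBranchA(0x1000, 0x1004));
}

In this toy model, branch B's "not taken" outcomes keep pulling the shared counter back down, so branch A never gets a chance to train. Real predictors are much more sophisticated (see the comments below), but this is the basic failure mode that a change in code position can trigger or avoid.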

One thing you should do (at least with gcc) is to not blindly use -O3. -O3 includes optimizations that generally hurt performance or are still experimental. It is something to use only selectively after checking that it actually has a benefit. And to even judge whether it has a benefit, you have to fuzz the binary a lot, measuring lots of different alignments of code, stack and data, and different orderings of the code.

Payroll answered 20/8, 2022 at 16:31 Comment(7)
Thanks for the nice, concise explanation of what I agree is probably the general problem in this example. I suppose your point about -O3 is in general also fair. However, I am not sure how it applies to the question really. Are you saying that -O3 is in general more likely to cause such a collision in branch prediction? The optimization level is of course also an influencing factor on the measurements, but one can observe the same described behavior on -O2 as well. I just happened to find an illustrative example with -O3 sooner.Underage
@mjacobse: No, the stuff about -O3 is just a side note, suggesting (perhaps too pessimistically, IMO) that it might make your code slower for other reasons. Different optimization choices happening to make two important branches alias each other or not is pretty much random chance.Cornia
If anything, -O3 is more likely to do if-conversion to branchless code, e.g. cmov. Or even auto-vectorize. Both of these would remove the possibility of mispredicts where they happen, and reduce competition for branch-predictor space. Auto-vectorization does introduce some extra branching outside loops to check whether the size is at least a full vector width, and to figure out how much cleanup to run. But those branches only run once for the whole function, so they don't tend to dominate the prediction if multiple branches alias to the same predictor slot.Cornia
(Modern predictors aren't simply indexed by low bits, with rest as tags. IT-TAGE (Haswell, Zen2 and later) also uses branch history as part of the index, and branches that alias can share a predictor instead of evicting the other prediction if the tag doesn't match. So the hot one can dominate instead of having its prediction thrown away. See xania.org/201602/bpu-part-two / Why did Intel change the static branch prediction mechanism over these years?) But the general principle of branches aliasing is still a thing that can happen.Cornia
@mjacobse: AFAIK, there isn't any reliable way to write your source code to avoid branch aliasing problems, even if you're only considering one CPU microarchitecture (like your Zen2). It's just something you have to measure and maybe work around.Cornia
I'm not sure -O3 results in any more use of cmov than -O2. One thing I noticed, though I'm fairly uncertain about it, is that -O3 will unroll loops more, just repeating the same code multiple times. For example, it repeats the code 4 times and only then branches back to the top. While this saves 3 branches, those branches would have been perfectly predicted in any long-running loop anyway. But now all the other branches are multiplied by 4, and the CPU has to use 4 times the branch predictors and train each separately. I can't see how that can really be a benefit anymore. It also wastes I-cache.Payroll
Modern CPUs basically unroll loops in hardware.Payroll
