While trying to benchmark implementations of a simple sparse unit lower triangular backward solve in CSC format, I observe strange behavior: the performance varies drastically depending on where in the executable the assembly instructions end up. I observe this in many different variations of the same problem. One minimal example is to duplicate the instructions of the implementation
void lowerUnitTriangularTransposedBacksolve(const EntryIndex* col_begin_indices,
                                            const Index* row_indices,
                                            const Value* values,
                                            const Index dimension, Value* x) {
  if (dimension == 0) return;
  EntryIndex entry_index = col_begin_indices[dimension];
  Index col_index = dimension;
  do {
    col_index -= 1;
    const EntryIndex col_begin = col_begin_indices[col_index];
    if (entry_index > col_begin) {
      Value x_temp = x[col_index];
      do {
        entry_index -= 1;
        x_temp -= values[entry_index] * x[row_indices[entry_index]];
      } while (entry_index != col_begin);
      x[col_index] = x_temp;
    }
  } while (col_index != 0);
}
in two functions benchmarkBacksolve1 and benchmarkBacksolve2 (source code using google/benchmark). The performance difference between the two functions, which are identical apart from their offsets, then also shows up on quick-bench.com. This is not a fluke; benchmarkBacksolve1 is consistently 10% slower than benchmarkBacksolve2. Keeping the same order of execution but swapping the positions of the machine code in the executable, by defining benchmarkBacksolve2 before benchmarkBacksolve1, also swaps the performance difference.
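For context, here is a minimal sketch of what such a harness can look like. It is not the linked quick-bench source: the type aliases, the stand-in bidiagonal test matrix, the problem size of 1000, the refill of the right-hand side, and the placement of __attribute__((always_inline)) (mentioned further below) are assumptions made purely for illustration.

// A minimal sketch of the benchmark harness; not the original quick-bench source.
// The type aliases, the stand-in test matrix, the problem size, and the placement
// of __attribute__((always_inline)) are assumptions made for illustration.
#include <benchmark/benchmark.h>

#include <algorithm>
#include <cstdint>
#include <vector>

using Index = std::int32_t;
using EntryIndex = std::int64_t;
using Value = double;

// Same implementation as above; force-inlining gives each benchmark its own
// copy of the loops, so the copies differ only in their position in the binary.
__attribute__((always_inline)) inline void lowerUnitTriangularTransposedBacksolve(
    const EntryIndex* col_begin_indices, const Index* row_indices,
    const Value* values, const Index dimension, Value* x) {
  if (dimension == 0) return;
  EntryIndex entry_index = col_begin_indices[dimension];
  Index col_index = dimension;
  do {
    col_index -= 1;
    const EntryIndex col_begin = col_begin_indices[col_index];
    if (entry_index > col_begin) {
      Value x_temp = x[col_index];
      do {
        entry_index -= 1;
        x_temp -= values[entry_index] * x[row_indices[entry_index]];
      } while (entry_index != col_begin);
      x[col_index] = x_temp;
    }
  } while (col_index != 0);
}

// Stand-in problem: a unit lower bidiagonal matrix in CSC format storing only
// the strictly-lower entries (one entry of value 0.5 below each diagonal element).
struct Problem {
  std::vector<EntryIndex> col_begin_indices;
  std::vector<Index> row_indices;
  std::vector<Value> values;
  std::vector<Value> x;
  Index dimension;
};

static Problem makeProblem(const Index dimension) {
  Problem p;
  p.dimension = dimension;
  p.col_begin_indices.resize(dimension + 1);
  for (Index j = 0; j < dimension; ++j) p.col_begin_indices[j] = j;
  p.col_begin_indices[dimension] = dimension - 1;  // last column has no entries
  for (Index j = 0; j + 1 < dimension; ++j) {
    p.row_indices.push_back(j + 1);
    p.values.push_back(0.5);
  }
  p.x.assign(dimension, 1.0);  // right-hand side, overwritten in place by the solve
  return p;
}

static void benchmarkBacksolve1(benchmark::State& state) {
  Problem p = makeProblem(1000);
  for (auto _ : state) {
    // Re-fill the right-hand side so the in-place solve sees the same data every
    // iteration; in this sketch the fill is simply part of the timed region.
    std::fill(p.x.begin(), p.x.end(), 1.0);
    lowerUnitTriangularTransposedBacksolve(p.col_begin_indices.data(),
                                           p.row_indices.data(), p.values.data(),
                                           p.dimension, p.x.data());
    benchmark::DoNotOptimize(p.x.data());
    benchmark::ClobberMemory();
  }
}
BENCHMARK(benchmarkBacksolve1);

// Byte-for-byte the same body; only its position in the executable differs.
static void benchmarkBacksolve2(benchmark::State& state) {
  Problem p = makeProblem(1000);
  for (auto _ : state) {
    std::fill(p.x.begin(), p.x.end(), 1.0);
    lowerUnitTriangularTransposedBacksolve(p.col_begin_indices.data(),
                                           p.row_indices.data(), p.values.data(),
                                           p.dimension, p.x.data());
    benchmark::DoNotOptimize(p.x.data());
    benchmark::ClobberMemory();
  }
}
BENCHMARK(benchmarkBacksolve2);

BENCHMARK_MAIN();

Because the solve is force-inlined, each benchmark body carries its own copy of the inner loops, which is what allows two otherwise identical instruction sequences to sit at different offsets in the binary.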
On my local machine (gcc 9.4.0, -O3, AMD Ryzen 7 PRO 4750U), the difference is even more extreme:
--------------------------------------------------------------
Benchmark Time CPU Iterations
--------------------------------------------------------------
benchmarkBacksolve1 1919 ns 1907 ns 364269
benchmarkBacksolve2 918 ns 908 ns 773814
The function defined first is twice as slow. Running only the benchmark for one of the functions with --benchmark_filter, or using --benchmark_enable_random_interleaving=true with --benchmark_repetitions=n, only confirms the measurements.
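Concretely, those runs look like the following (the binary name matches the perf invocations below; the repetition count of 10 is just an example):

$ ./a.out --benchmark_filter=benchmarkBacksolve1
$ ./a.out --benchmark_enable_random_interleaving=true --benchmark_repetitions=10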
The most notable observation I found during profiling is a large difference in branch misses:
$ perf stat ./a.out --benchmark_filter=benchmarkBacksolve1
...
1.231.964.948 branches # 1,354 G/sec (83,31%)
97.548.817 branch-misses # 7,92% of all branches (83,22%)
...
$ perf stat ./a.out --benchmark_filter=benchmarkBacksolve2
...
2.241.666.453 branches # 2,779 G/sec (83,64%)
3.751.915 branch-misses # 0,17% of all branches (83,41%)
...
Almost 8% of all branches seem to be mispredicted for the function defined first, compared to only 0.2% for the function defined last.
On different machines, I have to modify the setup a bit to see this effect, but other experiments confirm how brittle this is on most machines: any kind of change in offset can completely change the benchmark measurements. As just one more example, removing the __attribute__((always_inline)) on my machine results in equal times for the two functions, but it is neither the faster nor the slower time from before; it is a distinct new one of around 1250 ns.
Through interviews with and talks by Emery Berger, I am aware that unexpected factors like link order and environment variables can influence performance. Still, I would like to understand this example better:
- Why does branch prediction in this code in particular seem to be influenced so much by code position?
- Is there anything I can do to avoid these deviations, so that I can evaluate different implementations properly? After observing these effects, I am worried that my answer in Optimizing the backward solve for a sparse lower triangular linear system was measuring anything but real performance improvements.
- Is there something one can pay attention to during implementation to avoid such problems with the branch predictors of most modern CPUs?
-O3 is in general also fair. However, I am not sure how it really applies to the question. Are you saying that -O3 is in general more likely to cause such a collision in branch prediction? The optimization level is of course also an influencing factor on the measurements, but one can observe the same behavior described above with -O2 as well. I just happened to find an illustrative example with -O3 sooner. – Underage