TAGE prediction accuracy improves with loop over larger array?

Loop branches typically mispredict only on the last iteration when execution falls through out of the loop instead of jumping to the top. (For fairly obvious reasons: they quickly learn that the branch is always-taken, and predict that way.)

The more iterations your loops run, the more correctly-predicted taken branches you have for the same number of not-taken special cases that mispredict.

Fun fact: on modern Intel CPUs (like Haswell / Skylake), their IT-TAGE branch predictors can "learn" a pattern up to about 22 iterations, correctly predicting the loop exit. With a very long outer loop to give the CPU time to learn the pattern, an inner loop that runs only 22 or fewer iterations tends to predict even the loop-exit branches correctly. So there's a significant dropoff in performance (and instruction throughput) when the inner-loop size grows past that point, if the loop body is pretty simple.

But it probably takes quite a few outer-loop iterations to train the predictors with that much history. I was testing 10 million outer-loop iterations or so, to average out noise and startup overhead for a whole process with perf stat on real hardware under Linux. So the startup / learning phase was negligible.

With older simpler branch predictors (before TAGE), I think some CPUs did implement loop-pattern prediction with a counter to predict loop exits for inner loops that ran a constant number of iterations every time they were reached. https://danluu.com/branch-prediction/ says the same, that "modern CPUs" "often" have such predictors.

Recommended topics

Hot tags