How does Branch Prediction affect performance in R?

set.seed(128) #or making a vector with 1e7 myvec <- rnorm(1e8, 128, 128) myvecsorted <- sort(myvec) mysumU = 0 mysumS = 0 SvU <- microbenchmark::microbenchmark( Unsorted = for (i in 1:length(myvec)) { if (myvec[i] > 128) { mysumU = mysumU + myvec[i] } } , Sorted = for (i in 1:length(myvecsorted)) { if (myvecsorted[i] > 128) { mysumS = mysumS + myvecsorted[i] } } , times = 10) ggplot2::autoplot(SvU)

R version 3.6.1 (2019-07-05) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows 10 x64 (build 16299) Matrix products: default locale: [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252 [4] LC_NUMERIC=C LC_TIME=English_United States.1252 attached base packages: [1] compiler stats graphics grDevices utils datasets methods base other attached packages: [1] rstudioapi_0.10 reprex_0.3.0 cli_1.1.0 pkgconfig_2.0.3 evaluate_0.14 rlang_0.4.0 [7] Rcpp_1.0.2 microbenchmark_1.4-7 ggplot2_3.2.1

Interpreter overhead, and just being an interpreter, explains most of the average difference. I don't have an explanation for the higher variance.

R is an interpreted language, not JIT compiled to machine code like Java, or ahead-of-time like C. (I don't know much about R internals, just CPUs and performance, so I'm making a lot of assumptions here.)

The code that's running on the actual CPU hardware is the R interpreter, not exactly your R program.

Control dependencies in the R program (like an if()) become data dependencies in the interpreter. The current thing being executed is just data for the interpreter running on a real CPU.

Different operations in the R program become control dependencies in the interpreter. For example, evaluating myvec[i] then the + operator would probably be done by two different functions in the interpreter. And a separate function for > and for if() statements.

The classic interpreter loop is based around an indirect branch that dispatches from a table of function pointers. Instead of a taken/not-taken choice, the CPU needs a prediction for one of many recently-used target addresses. I don't know if R uses a single indirect branch like that or if tries to be fancier like having the end of each interpreter block dispatch to the next one, instead of returning to a main dispatch loop.

Modern Intel CPUs (like Haswell and later) have IT-TAGE (Indirect TAgged GEometric history length) prediction. The taken/not-taken state of previous branches along the path of execution are used as an index into a table of predictions. This mostly solves the interpreter branch-prediction problem, allowing it to do a surprisingly good job, especially when the interpreted code (the R code in your case) does the same thing repeatedly.

Branch Prediction and the Performance of Interpreters - Don’t Trust Folklore (2015) - Haswell's ITTAGE is a huge improvement for interpreters, invalidating the previous wisdom that a single indirect branch for interpreter dispatch was a disaster. I don't know what R actually uses; there are tricks that were useful.
X86 prefetching optimizations: "computed goto" threaded code has more links.
https://comparch.net/2013/06/30/why-tage-is-the-best/
https://danluu.com/branch-prediction/ has some links about that at the bottom. Also mentions that AMD has used Perceptron predictors in Bulldozer-family and Zen: like a neural net.

The if() being taken does result in needing to do different operations, so it does actually still make some branching in the R interpreter more or less predictable depending on data. But of course as an interpreter, it's doing much more work at each step than a simple machine-code loop over an array.

So extra branch mispredicts are a much smaller fraction of the total time because of interpreter overhead.

Of course, both your tests are with the same interpreter on the same hardware. I don't know what kind of CPU you have.

If it's Intel older than Haswell or AMD older than Zen, you might be getting a lot of mispredicts even with the sorted array, unless the pattern is simple enough for an indirect branch history predictor to lock onto. That would bury the difference in more noise.

Since you do see a pretty clear difference, I'm guessing that the CPU doesn't mispredict too much in the sorted case, so there is room for it to get worse in the unsorted case.

Question:

Update:

Recommended topics

Hot tags