Why do neither V8 nor spidermonkey seem to unroll static loops?

Asked 6/2, 2021 at 20:59 Answered 6/2, 2021 at 21:37

Solved javascript v8 spidermonkey loop-unrolling

Doing a small check, it looks like neither V8 nor spidermonkey unroll loops, even if it is completely obvious, how long they are (literal as condition, declared locally):

const f = () => {
  let counter = 0;
  for (let i = 0; i < 100_000_000; i++) {
    counter++;
  }
  return counter;
};

const g = () => {
  let counter = 0;
  for (let i = 0; i < 10_000_000; i += 10) {
    counter++;
    counter++;
    counter++;
    counter++;
    counter++;
    counter++;
    counter++;
    counter++;
    counter++;
    counter++;
  }
  return counter;
}

let start = performance.now();
f();
let mid = performance.now();
g();
let end = performance.now();

console.log(
  `f took ${(mid - start).toFixed(2)}ms, g took ${(end - mid).toFixed(2)}ms, ` +
  `g was ${((mid - start)/(end - mid)).toFixed(2)} times faster.`
);

Is there any reason for this? They perform considerably more complex optimizations. Are standard for-loops just that uncommon in javascript, that it's not worth it?

Edit: Just as a note: one might argue, that perhaps optimization is delayed. This doesn't seem to be the case, although i am not an expert here. I used node --allow-natives-syntax --trace-deopt, performed optimization manually, and observed no deoptimization happening (snippet for collapsing, not actually runnable in a browser):

const { performance } = require('perf_hooks');

const f = () => {
  let counter = 0;
  for (let i = 0; i < 100_000_000; i++) {
    counter++;
  }
  return counter;
};
// collect metadata and optimize
f(); f();
%OptimizeFunctionOnNextCall(f);
f();

const start = performance.now();
f();
console.log(performance.now() - start);

Done with both the normal and unrolled version, the same effect prevails.

Uhlan answered 6/2, 2021 at 20:59 Comment(2)

Uh, do you really expect a loop of 100000000 repetitions to be unrolled? No sane compiler will generate that much code. – Strepitous 6/2, 2021 at 22:33

No, but in some cases (e.g. the microbenchmark shown above), a normal "unroll", where e.g. 10 loop steps are grouped together, is beneficial (note that as the accepted answer mentions, folding plays a role in the example given, but then again, it's also a factor 50 difference in speed on chrome). Unrolling doesn't have to be everything or nothing, it's not like that in other languages either. You just want to avoid the conditional of the loop as much as possible (even though it is caught mostly by the branch predictor, it is still a problem, if the loop body does little). – Uhlan 6/2, 2021 at 22:41

(V8 developer here.)

TL;DR: because it's rarely worth it for real-world code.

Loop unrolling, like other optimizations that increase code size (such as inlining), is a double-edged sword. Yes, it can help; in particular it often helps for tiny toy examples like the one posted here. But it can also hurt performance, most obviously because it increases the amount of work that the compiler has to do (and hence the time it takes to do that work), but also through secondary effects like bigger code benefitting less from the CPU's caching efforts.

V8's optimizing compiler actually does like to unroll the first iteration of loops. Also, as it happens, we currently have an ongoing project to unroll more loops; the current status is that it sometimes helps and sometimes hurts, so we're still fine-tuning the heuristics for when it should kick in, and when it shouldn't. This difficulty also indicates that for real-world JavaScript, the benefits will typically be quite small.

It doesn't matter whether it's a "standard for-loop" or not; any loop can be unrolled in theory. It just happens to be the case that aside from microbenchmarks, loop unrolling tends to make little difference: there isn't all that much overhead to just doing another iteration, so if the loop body does more than counter++, there's not a lot to be gained from avoiding the per-iteration overhead. And, fwiw, this per-iteration overhead is not what your test is measuring: the repeated increments all get folded, so what you're really comparing here is 100M iterations of counter += 1 against 10M iterations of counter += 10.

So this is one of many examples of misleading microbenchmarks trying to trick us into drawing the wrong conclusions ;-)

Bauman answered 6/2, 2021 at 21:37 Comment(3)

Very informative answer, thank you. I was curious about this, and it's interesting to learn what's happening under the hood. – Optional 23/11, 2021 at 23:6

So the current state is that in wasm, loops with less than 5 iterations are unrolled, though not in JS correct? github.com/v8/v8/blob/lkgr/src/compiler/loop-unrolling.h (the design document seems to be internal unfortunately, or the link is outdated) – Hawkweed 23/1, 2022 at 17:49

@JonasWilms: Yes, with a minor correction: up to 5 copies of the original loop body will be created; the original iteration count doesn't matter. Needless to say, I wouldn't rely on any of the details, as they could change any day. – Bauman 23/1, 2022 at 18:9

I recommend you read this answer as it explains it quite clearly. In a nutshell, unrolling won't mean that the code will run faster.

For example, if, instead of a simple counter++, you had a function call (taken from the linked answer):

function one() {
  // something complex running here, in this case a very long comment:
  // bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla
  
  return 1;
}

for (let i = 0; i < 1000; ++i) {
  counter += one();
}

If the function is short, on-stack replacement as well as inline the function will make the unrolling code faster, however if the function is long, the loop is actually faster (all examples again, taken from the answer I linked).

  counter += one();
  counter += one();
  ...

Now, from my personal journey back in the day from creating a simple compiler back in college using Assembly language (already optimized by the processor), and all the way up to C/C++ (which by themselves can already produce unbelievably efficient ASM code), and walking my way up to higher level languages such as PHP and Javascript:

My take is that the people in charge of optimization have to do a lot of heuristics and would most likely be interested in real-life code that can produce real-life results.

Now, I am not in the position to determine whether it's more common to do arithmetic in a for loop rather than calling a function, but my gut tells me that it's unlikely that for loops with simple arithmetic matter much in the huge ecosystems that browsers have become nowadays. Then again it's a great exercise to learn and dive deeper.

Orangewood answered 6/2, 2021 at 21:24 Comment(2)

Thanks, but some problems: first, i cannot reproduce comments preventing function inlining anymore. Probably something changed (the answer is three years old). Second, i therefore cannot check, whether it really is, that the unrolled version is inferior to the loop, if function inlining doesn't happen in either. Lastly, i can see, that there is a "sensible point to stop unrolling". If above loop was fully unrolled, it would have 100 million lines, which would make the parser gulp when looking at it. It's not like that's a reasonable approach though. – Uhlan 6/2, 2021 at 21:36

I can confirm that V8 has moved away from using source size (including comments) as part of its inlining heuristics. The general point still stands though: both inlining and unrolling can help or hurt performance, it really depends. That's why it's so hard to get the engine's decision-making (i.e. heuristics) right. – Bauman 6/2, 2021 at 21:46

Recommended topics

Hot tags