Understanding loop performance in the JVM

I'm playing with JMH, and in the section about looping they say:

You might notice the larger the repetitions count, the lower the "perceived" cost of the operation being measured. Up to the point we do each addition with 1/20 ns, well beyond what hardware can actually do. This happens because the loop is heavily unrolled/pipelined, and the operation to be measured is hoisted from the loop. Morale: don't overuse loops, rely on JMH to get the measurement right.

I tried it myself

    @Benchmark
    @OperationsPerInvocation(1)
    public int measurewrong_1() {
        return reps(1);
    }      

    @Benchmark
    @OperationsPerInvocation(1000)
    public int measurewrong_1000() {
        return reps(1000);
    }      

and got the following result:

Benchmark                      Mode  Cnt  Score    Error  Units
MyBenchmark.measurewrong_1     avgt   15  2.425 ±  0.137  ns/op
MyBenchmark.measurewrong_1000  avgt   15  0.036 ±  0.001  ns/op

It indeed shows that MyBenchmark.measurewrong_1000 is dramatically faster than MyBenchmark.measurewrong_1, but I cannot really understand what optimization the JVM performs to achieve this improvement.

What do they mean by the loop being unrolled/pipelined?
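
For reference, reps is essentially the summing loop from the JMH loops sample (a sketch; x and y are fields of the benchmark class, and their exact values don't matter here):

    private int x = 1, y = 2;

    private int reps(int reps) {
        int s = 0;
        for (int i = 0; i < reps; i++) {
            s += (x + y);   // the operation being "measured"
        }
        return s;
    }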

Lingenfelter answered 28/10, 2016 at 12:36 Comment(2)
What exactly does your reps method do? – Acute
Have you tried running the 1000 version before the 1 version? – Diclinous

Loop unrolling makes pipelining possible, so a pipelined CPU (a classic RISC design, for example) can execute the unrolled iterations in parallel.

So if your CPU can execute 5 operations in parallel, your loop may be unrolled along these lines:

    // pseudo code: unrolled by a factor of 5
    int pipelines = 5;
    for (int i = 0; i < length; i += pipelines) {
        s += (x + y);
        s += (x + y);
        s += (x + y);
        s += (x + y);
        s += (x + y);
    }

The classic five-stage RISC pipeline:

IF = Instruction Fetch, ID = Instruction Decode, EX = Execute, MEM = Memory access, WB = Register write back

From an Oracle white paper:

... a standard compiler optimization that enables faster loop execution. Loop unrolling increases the loop body size while simultaneously decreasing the number of iterations. Loop unrolling also increases the effectiveness of other optimizations.

More information about pipelining: Classic RISC pipeline
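
To make that concrete, here is a hand-unrolled sketch (illustrative only, not what the JIT literally emits). With four independent partial sums, the CPU has four addition chains that can overlap in the pipeline instead of a single serial chain:

    // hand-unrolled by 4 with independent accumulators (sketch)
    static int sumUnrolled(int[] a) {
        int s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        int i = 0;
        for (; i + 3 < a.length; i += 4) {
            s0 += a[i];       // these four additions do not depend
            s1 += a[i + 1];   // on each other, so they can be in
            s2 += a[i + 2];   // different pipeline stages at once
            s3 += a[i + 3];
        }
        for (; i < a.length; i++) {
            s0 += a[i];       // tail for lengths not divisible by 4
        }
        return s0 + s1 + s2 + s3;
    }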

Lebron answered 28/10, 2016 at 13:31 Comment(0)

Loop unrolling is a technique for flattening multiple loop iterations by repeating the loop body.
E.g. the loop in the given example

    for (int i = 0; i < reps; i++) {
        s += (x + y);
    }

can be unrolled by the JIT compiler into something like

    for (int i = 0; i < reps - 15; i += 16) {
        s += (x + y);
        s += (x + y);
        // ... 16 times ...
        s += (x + y);
    }
    // (a separate tail loop, omitted here, handles the remaining reps % 16 iterations)

Then the extended loop body can be further optimized to

    for (int i = 0; i < reps - 15; i += 16) {
        s += 16 * (x + y);
    }

Obviously computing 16 * (x + y) is much faster than computing (x + y) 16 times.
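
Taken one step further, the entire loop can be folded away, which is what the quoted JMH note means by the operation being "hoisted from the loop". This is only a sketch of the end result (repsCollapsed is a hypothetical name, not something C2 literally produces), but it is consistent with the measured 0.036 ns/op:

    // equivalent to adding (x + y) reps times, even with int overflow,
    // because repeated addition and multiplication both wrap mod 2^32
    private int repsCollapsed(int reps) {
        return reps * (x + y);
    }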

Crapulent answered 28/10, 2016 at 13:8 Comment(5)
Doesn't 16 * (x + y) ultimately get evaluated as (x + y) sixteen times when the expression is evaluated? Or am I mistaken? – Wadding
@Wadding Of course not; even on microprocessors that don't have a hardware multiplier you can still use the "shift and add" algorithm. Well, for 16 that's only a shift, no add. – Unruly
@harold Of course, I get it now. My mistake, or rather a teacher's mistake. Or perhaps I mis-remembered my teacher, making it my mistake again. Thanks for the info. – Wadding
@MarkoTopolnik It's equivalent even in the face of overflow, since both expressions return the exact result mod 2**32 (or 2**64). But I guess I'm just misunderstanding you (then again, others might too). – Plating
@maartinus No, I was just wrong ☺ I didn't pay attention to the properties of integer overflow. – Acute

Loop Pipelining = Software Pipelining.

Basically, it's a technique used to optimize the efficiency of sequential loop iterations by executing some of the instructions in the loop body in parallel.

Of course, this can be done only when certain conditions are met, such as the iterations not depending on one another.

From insidehpc.com:

Software pipelining, which really has nothing to do with hardware pipelining, is a loop optimization technique to make statements within an iteration independent of each other. The goal is to remove dependencies so that seemingly sequential instructions may be executed in parallel.
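
As a hand-written illustration (again a sketch, not what the JIT literally emits), overlapping the load of the next element with the addition of the current one makes the two statements inside the loop body independent of each other:

    // software-pipelined array sum (sketch)
    static int pipelinedSum(int[] a) {
        if (a.length == 0) return 0;
        int sum = 0;
        int cur = a[0];          // prologue: first load
        int i = 1;
        for (; i < a.length; i++) {
            int next = a[i];     // load for the next iteration
            sum += cur;          // add the value loaded previously
            cur = next;          // the load and the add are independent
        }
        return sum + cur;        // epilogue: add the last element
    }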

See more here:

Scrutable answered 28/10, 2016 at 12:54 Comment(2)
That covers the pipelining part but not the unrolling part. en.wikipedia.org/wiki/Loop_unrolling – Amberlyamberoid
Pipelining/unrolling is just a minor player in getting the cost of one loop iteration to 1 picosecond, which is what OP has measured. – Acute
