Why are two separate loops faster than one?

Asked 23/2, 2018 at 16:40 Answered 5/4, 2018 at 15:14

java performance optimization benchmarking microbenchmark

I want to understand what kind of optimizations Java does to consecutive for loops. More precisely, I'm trying to check if loop fusion is performed. Theoretically, I was expecting that this optimization was not done automatically and was expecting to confirm that the fused version was faster than the version with two loops.

However, after running the benchmarks, the results show that two separate (and consecutive) loops are faster than one single loop doing all the work.

I already tried using JMH for creating the benchmarks and got the same results.

I used the javap command and it shows that the generated bytecode for the source file with two loops actually corresponds to two loops being executed (no loop unrolling or other optimization was performed).

Code being measured for BenchmarkMultipleLoops.java:

private void work() {
        List<Capsule> intermediate = new ArrayList<>();
        List<String> res = new ArrayList<>();
        int totalLength = 0;

        for (Capsule c : caps) {
            if(c.getNumber() > 100000000){
                intermediate.add(c);
            }
        }

        for (Capsule c : intermediate) {
            String s = "new_word" + c.getNumber();
            res.add(s);
        }

        //Loop to assure the end result (res) is used for something
        for(String s : res){
            totalLength += s.length();
        }

        System.out.println(totalLength);
    }

Code being measured for BenchmarkSingleLoop.java:

private void work(){
        List<String> res = new ArrayList<>();
        int totalLength = 0;

        for (Capsule c : caps) {
            if(c.getNumber() > 100000000){
                String s = "new_word" + c.getNumber();
                res.add(s);
            }
        }

        //Loop to assure the end result (res) is used for something
        for(String s : res){
            totalLength += s.length();
        }

        System.out.println(totalLength);
    }

And here is the code for Capsule.java:

public class Capsule {
    private int number;
    private String word;

    public Capsule(int number, String word) {
        this.number = number;
        this.word = word;
    }

    public int getNumber() {
        return number;
    }

    @Override
    public String toString() {
        return "{" + number +
                ", " + word + '}';
    }
}

caps is an ArrayList<Capsule> with 20 million elements populated like this in the beginning:

private void populate() {
        Random r = new Random(3);

        for(int n = 0; n < POPSIZE; n++){
            int randomN = r.nextInt();
            Capsule c = new Capsule(randomN, "word" + randomN);
            caps.add(c);
        }
    }

Before measuring, a warm up phase is executed.

I ran each of the benchmarks 10 times or, in other words, the work() method is executed 10 times for each benchmark and the average times to complete are presented below (in seconds). After each iteration, the GC was executed along with a few sleeps:

MultipleLoops: 4.9661 seconds
SingleLoop: 7.2725 seconds

OpenJDK 1.8.0_144 running on an Intel i7-7500U (Kaby Lake).

Why is the MultipleLoops version faster than the SingleLoop version, even though it has to traverse two different data structures?

UPDATE 1:

As suggested in the comments, if I change the implementation to calculate the totalLength while strings are being generated, avoiding the creation of the res list, the single loop version becomes faster.

However, that variable was only introduced so that some work was done after creating the result list, in order to avoid discarding the elements if nothing was done with them.

In other words, the intended result is to produce the final list. But this suggestion helps in better understanding what is going on.

Results:

MultipleLoops: 0.9339 seconds
SingleLoop: 0.66590005 seconds

UPDATE 2:

Here is a link for the code I used for the JMH benchmark: https://gist.github.com/FranciscoRibeiro/2d3928761f76e4f7cecfcfcdf7fc96d5

Results:

MultipleLoops: 7.397 seconds
SingleLoop: 8.092 seconds

Doronicum answered 23/2, 2018 at 16:40 Comment(17)

What happens if you increment totalSize directly, skipping creating the result list? – Gills 23/2, 2018 at 16:45

I guess it's because branch prediction. #11228309 – Myeshamyhre 23/2, 2018 at 16:46

Keep in mind that your JVM will JIT compile the bytecode, so the instructions that your CPU actually runs don't necessarily reflect unrolling or lack thereof in the bytecode. What JVM did you use, and what hardware? (e.g. i7-6700k (Skylake), or Ryzen, or Silvermont, or what? A CPU model number would be fine, if you don't know the name of the microarchitecture.) – Orfurd 23/2, 2018 at 16:58

@AndyTurner updated the question with info based on your comment – Doronicum 23/2, 2018 at 17:59

@JohannesKuhn that is an interesting observation. So, the overhead of mispredicting in the SingleLoop version is greater than in the MultipleLoops version? Because in SingleLoop, the creation of the Strings would be "reverted", whereas in MultipleLoops, the only thing that might go to waste is the addition of the Capsule objects to the intermediate list. Does this make sense? – Doronicum 23/2, 2018 at 18:4

@PeterCordes yes, JIT might be doing some of its tricks here... is there any way I can check which optimizations it is applying to the bytecode? The JVM I'm using is OpenJDK 1.8.0_144 and the CPU is an Intel i7-7500U (Kaby Lake). – Doronicum 23/2, 2018 at 18:9

Yeah, you can look at the asm produced by the JIT compiler. #1503979, and / or https://mcmap.net/q/88613/-how-do-i-check-assembly-output-of-java-code. – Orfurd 23/2, 2018 at 18:13

Just tried to reproduce your results without success. For me, both loops have approx. the same performance, see gist.github.com/l-wi/20a923af1ee707e885e087e136af0bfe. Could there be a flaw in your (or my) benchmark code? – Mediaeval 23/2, 2018 at 18:24

Usually for that sort of procedures JVM needs to be configured specifically for CPU, maybe it worth to look at garbage collector configurations and optimise it – Immigration 23/2, 2018 at 18:57

@JohannesKuhn: Branch prediction seems unlikely; both versions have the same input. Possibly the mispredict penalty is higher because of the details of the surrounding code in the combined loop, but I'd guess that the mispredict rate is the same. – Orfurd 23/2, 2018 at 19:37

Rerun with -Xmx8g -Xms8g -verbose:gc – Altimetry 23/2, 2018 at 19:57

Please provide an MCVE. Emphasis on C ... for Complete. You said: "I already tried using JMH for creating the benchmarks and got the same results." Then you should replace the code in the question with the JMH-based version and the corresponding results. Right now, there are people saying that they can't reproduce your results, but we / they don' know if they are running the same benchmark as you. – Patinous 23/2, 2018 at 23:49

@Mediaeval The collection you used has a smaller size than the one I'm using (2M vs 20M). But I think the results you got already show a difference in performance between the two alternatives. Did you try using the 20M collection size? Also, I don't rebuild the list every time I execute a benchmark method because the list does not get modified. Can this be a flaw? I also executed the benchmark you posted and got 0.107 s/op (one loop) vs 0.071 s/op (two loops). Error of +- 0.001 for both. – Doronicum 26/2, 2018 at 9:46

@StephenC I updated the question with a link to the JMH benchmark code I used – Doronicum 26/2, 2018 at 9:57

See also How do I write a correct micro-benchmark in Java?. – Scan 1/3, 2018 at 21:13

Try doing the SingleLoops test first. It could be that the test you run first is creating GC work during the test you run second. – Atmometer 29/3, 2018 at 3:39

@MattTimmermans I tried it at the time and got the same kind of results. However, GC does seem to be causing this results – Doronicum 6/4, 2018 at 16:8

I investigated this "phenomena" and look like got something like an answer.
Let's add .jvmArgs("-verbose:gc") to JMHs OptionsBuilder. Results for 1 Iteration:

Single Loop: [Full GC (Ergonomics) [PSYoungGen: 2097664K->0K(2446848K)] [ParOldGen: 3899819K->4574771K(5592576K)] 5997483K->4574771K(8039424K), [Metaspace: 6208K->6208K(1056768K)], 5.0438301 secs] [Times: user=37.92 sys=0.10, real=5.05 secs] 4.954 s/op

Multiple Loops: [Full GC (Ergonomics) [PSYoungGen: 2097664K->0K(2446848K)] [ParOldGen: 3899819K->4490913K(5592576K)] 5997483K->4490913K(8039424K), [Metaspace: 6208K->6208K(1056768K)], 3.7991573 secs] [Times: user=26.84 sys=0.08, real=3.80 secs] 4.187 s/op

JVM spent huge amount of CPU time for GC. Once per 2 test runs, JVM have to make Full GC (move 600Mb to OldGen and collect 1.5Gb of trash from previous cycles). Both garbage collectors done the same work, but spent ~25% less application time for multiple loops test case. If we decrease the POPSIZE to 10_000_000 or add to before bh.consume() Thread.sleep(3000), or add -XX:+UseG1GC to JVM args, then multiple loop boosting effect gone. I run it once again with .addProfiler(GCProfiler.class). The main difference:

Multiple Loops: gc.churn.PS_Eden_Space 374.417 ± 23 MB/sec

Single Loop: gc.churn.PS_Eden_Space 336.037 MB/sec ± 19 MB/sec

I think, we see speed up in such specific circumstances, because old-good Compare and Swap GC algorithm has CPU bottleneck for multiples test runs and uses extra "senseless" cycle for Collect Garbage from earlier runs. It's even easier to reproduce with @Threads(2), if you have enough RAM. It looks like this, if you try to profile Single_Loop test:

profiling

Goosander answered 5/4, 2018 at 15:14 Comment(2)

Some time after posting the question, I made some experiments with different GC algorithms: Serial GC, Parallel GC and G1GC. Parallel GC seemed to affect the performance of SingleLoop the most. Using G1GC or Serial GC shortened the difference between Multiple and Single, but the best I could achieve was having SingleLoop perform almost identically to MultipleLoops. Your experiments seem to point in the same direction, that is, GC is the "culprit". Thanks! – Doronicum 6/4, 2018 at 16:17

May be, I wasn't super clear. We have perfomance boost mainly because of Compare and Swap GC (which parallel as well) cheats JMH using multiple CPU cores, even in case you set @Thread=1 explicitly. Thank you for interesting quastion and good luck ;) – Goosander 6/4, 2018 at 16:54

To understand what is happening under the hood, you could add JMX behaviour to analyse running app in jvisualvm, located in JAVA_HOME\bin With 20M size of a capsule list in the memory, it ran out of memory and visualvm went in non-responding state. I had reduced capsule list size to 200k and 100M to 1M in if condition to test. After observing behaviour on visualvm, single loop execution completed prior to multiple loops. Maybe this is not the right approach, but you could experiment with it.

LoopBean.java

import java.util.List;
public interface LoopMBean {
    void multipleLoops();
    void singleLoop();
    void printResourcesStats();
}

Loop.java

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class Loop implements LoopMBean {

    private final List<Capsule> capsules = new ArrayList<>();

    {
        Random r = new Random(3);
        for (int n = 0; n < 20000000; n++) {
            int randomN = r.nextInt();
            capsules.add(new Capsule(randomN, "word" + randomN));
        }
    }

    @Override
    public void multipleLoops() {

        System.out.println("----------------------Before multiple loops execution---------------------------");
        printResourcesStats();

        final List<Capsule> intermediate = new ArrayList<>();
        final List<String> res = new ArrayList<>();
        int totalLength = 0;

        final long start = System.currentTimeMillis();

        for (Capsule c : capsules)
            if (c.getNumber() > 100000000) {
                intermediate.add(c);
            }

        for (Capsule c : intermediate) {
            String s = "new_word" + c.getNumber();
            res.add(s);
        }

        for (String s : res)
            totalLength += s.length();

        System.out.println("multiple loops=" + totalLength + " | time taken=" + (System.currentTimeMillis() - start) + " milliseconds");

        System.out.println("----------------------After multiple loops execution---------------------------");
        printResourcesStats();

        res.clear();
    }

    @Override
    public void singleLoop() {

        System.out.println("----------------------Before single loop execution---------------------------");
        printResourcesStats();

        final List<String> res = new ArrayList<>();
        int totalLength = 0;

        final long start = System.currentTimeMillis();

        for (Capsule c : capsules)
            if (c.getNumber() > 100000000) {
                String s = "new_word" + c.getNumber();
                res.add(s);
            }

        for (String s : res)
            totalLength += s.length();

        System.out.println("Single loop=" + totalLength + " | time taken=" + (System.currentTimeMillis() - start) + " milliseconds");
        System.out.println("----------------------After single loop execution---------------------------");
        printResourcesStats();

        res.clear();
    }

    @Override
    public void printResourcesStats() {
        System.out.println("Max Memory= " + Runtime.getRuntime().maxMemory());
        System.out.println("Available Processors= " + Runtime.getRuntime().availableProcessors());
        System.out.println("Total Memory= " + Runtime.getRuntime().totalMemory());
        System.out.println("Free Memory= " + Runtime.getRuntime().freeMemory());
    }
}

LoopClient.java

import javax.management.MBeanServer;
import javax.management.ObjectName;
import java.lang.management.ManagementFactory;

public class LoopClient {

    void init() {

        final MBeanServer mBeanServer = ManagementFactory.getPlatformMBeanServer();
        try {
            mBeanServer.registerMBean(new Loop(), new ObjectName("LOOP:name=LoopBean"));
        } catch (Exception e) {
            e.printStackTrace();
        }

    }

    public static void main(String[] args) {

        final LoopClient client = new LoopClient();
        client.init();
        System.out.println("Loop client is running...");
        waitForEnterPressed();
    }

    private static void waitForEnterPressed() {
        try {
            System.out.println("Press  to continue...");
            System.in.read();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

Execute with following command:

java -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=9999 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false LoopClient

You could add -Xmx3072M extra option for quick memory increase to avoid OutOfMemoryError

Gravely answered 29/3, 2018 at 3:19 Comment(0)

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

UPDATE 1:

UPDATE 2:

Recommended topics

Hot tags