I want to understand what kind of optimizations Java does to consecutive for loops. More precisely, I'm trying to check if loop fusion is performed. Theoretically, I was expecting that this optimization was not done automatically and was expecting to confirm that the fused version was faster than the version with two loops.
However, after running the benchmarks, the results show that two separate (and consecutive) loops are faster than one single loop doing all the work.
I already tried using JMH for creating the benchmarks and got the same results.
I used the javap
command and it shows that the generated bytecode for the source file with two loops actually corresponds to two loops being executed (no loop unrolling or other optimization was performed).
Code being measured for BenchmarkMultipleLoops.java
:
private void work() {
List<Capsule> intermediate = new ArrayList<>();
List<String> res = new ArrayList<>();
int totalLength = 0;
for (Capsule c : caps) {
if(c.getNumber() > 100000000){
intermediate.add(c);
}
}
for (Capsule c : intermediate) {
String s = "new_word" + c.getNumber();
res.add(s);
}
//Loop to assure the end result (res) is used for something
for(String s : res){
totalLength += s.length();
}
System.out.println(totalLength);
}
Code being measured for BenchmarkSingleLoop.java
:
private void work(){
List<String> res = new ArrayList<>();
int totalLength = 0;
for (Capsule c : caps) {
if(c.getNumber() > 100000000){
String s = "new_word" + c.getNumber();
res.add(s);
}
}
//Loop to assure the end result (res) is used for something
for(String s : res){
totalLength += s.length();
}
System.out.println(totalLength);
}
And here is the code for Capsule.java
:
public class Capsule {
private int number;
private String word;
public Capsule(int number, String word) {
this.number = number;
this.word = word;
}
public int getNumber() {
return number;
}
@Override
public String toString() {
return "{" + number +
", " + word + '}';
}
}
caps
is an ArrayList<Capsule>
with 20 million elements populated like this in the beginning:
private void populate() {
Random r = new Random(3);
for(int n = 0; n < POPSIZE; n++){
int randomN = r.nextInt();
Capsule c = new Capsule(randomN, "word" + randomN);
caps.add(c);
}
}
Before measuring, a warm up phase is executed.
I ran each of the benchmarks 10 times or, in other words, the work()
method is executed 10 times for each benchmark and the average times to complete are presented below (in seconds). After each iteration, the GC was executed along with a few sleeps:
- MultipleLoops: 4.9661 seconds
- SingleLoop: 7.2725 seconds
OpenJDK 1.8.0_144 running on an Intel i7-7500U (Kaby Lake).
Why is the MultipleLoops version faster than the SingleLoop version, even though it has to traverse two different data structures?
UPDATE 1:
As suggested in the comments, if I change the implementation to calculate the totalLength
while strings are being generated, avoiding the creation of the res
list, the single loop version becomes faster.
However, that variable was only introduced so that some work was done after creating the result list, in order to avoid discarding the elements if nothing was done with them.
In other words, the intended result is to produce the final list. But this suggestion helps in better understanding what is going on.
Results:
- MultipleLoops: 0.9339 seconds
- SingleLoop: 0.66590005 seconds
UPDATE 2:
Here is a link for the code I used for the JMH benchmark: https://gist.github.com/FranciscoRibeiro/2d3928761f76e4f7cecfcfcdf7fc96d5
Results:
- MultipleLoops: 7.397 seconds
- SingleLoop: 8.092 seconds
totalSize
directly, skipping creating theresult
list? – Gills-Xmx8g -Xms8g -verbose:gc
– Altimetry