Why is "while (i++ < n) {}" significantly slower than "while (++i < n) {}"
Apparently on my Windows 8 laptop with HotSpot JDK 1.7.0_45 (with all compiler/VM options set to default), the below loop

final int n = Integer.MAX_VALUE;
int i = 0;
while (++i < n) {
}

is at least 2 orders of magnitude faster (~10 ms vs. ~5000 ms) than:

final int n = Integer.MAX_VALUE;
int i = 0;
while (i++ < n) {
}

I happened to notice this while writing a loop to evaluate another, unrelated performance issue, and the difference between ++i < n and i++ < n was large enough to significantly influence the result.

If we look at the bytecode, the loop body of the faster version is:

iinc
iload
ldc
if_icmplt

And for the slower version:

iload
iinc
ldc
if_icmplt

So for ++i < n, the JVM first increments the local variable i by 1 and then pushes it onto the operand stack, while i++ < n performs those two steps in reverse order. But that doesn't seem to explain why the former is so much faster. Is there a temp copy involved in the latter case? Or is it something beyond the bytecode (VM implementation, hardware, etc.) that is responsible for the performance difference?

I've read some other discussions regarding ++i and i++ (not exhaustively, though), but didn't find any answer that is Java-specific and directly related to the case where ++i or i++ is involved in a value comparison.

Capitalist answered 15/8, 2014 at 7:24 Comment(14)
Maybe because in ++i < n it first increments i and then compares.Adulteration
Maybe a duplicate?Audette
You did a lot of research, and I'm impressed, but you missed the most basic step: This has been answered before. Many times.Olympus
10 ms is hardly long enough for a benchmark - let alone a Java benchmark where you have JVM warmup effects. Can you post your exact test code? Also, try reversing the order of the benchmarks.Vanden
As Mysticial said, java needs warmup time. This is for the Just In Time (JIT) compiler to do its work. If you place your code in a function and call it several times before doing your measurements you might get different results.Lorica
@CaptainCodeman in such a general form, that statement is just plain nonsense. There's much more to performance than (flawed) micro benchmarks. We switched to Java for a quite large project from C++ and gained an order of magnitude in performance. It depends on the problem you are trying to solve, the resources you have, and much more. Always choose the language that best suits your problem, and the personnel you have at hand (among other factors).Dilettantism
I cannot fathom why a JIT compiler would not optimize out a completely empty loop like this. That makes me think you're doing this without optimizations enabled. In which case, benchmarking is useless.Laborer
Because Oracle wants it like that.Iredale
@Dilettantism I'm curious, for what sort of application did switching from C++ to Java give you an order of magnitude performance increase?Benitobenjamen
@Dilettantism No compiled programming language is an order of magnitude faster than another; so the more likely scenario is that you had terrible C++ programmers or were using a very slow library.Benitobenjamen
@Benitobenjamen I didn't say "Java is an order of magnitude faster in general". That's clearly not the case. I was referring to your (now deleted?) comment "If you need performance, don't use Java".Dilettantism
@Benitobenjamen we are calculating payments and capitals for life insurance contracts. We found memory management to be a lot faster in Java (GC pauses don't matter for us). To do this right in C++ it takes a lot more development time. Also, our turnaround times are much lower in Java (reasons: first class refactoring support in the IDE, compile-on-save, hot code replace in debugging, easy database connectivity). I'm sure the same performance could be reached in C++, but not in the same development time. First Java version was running after 2 weeks of development, performance ~50x faster.Dilettantism
@Dilettantism Yeah some trigger-happy mod must have deleted my original comment. Fair enough, thanks for the explanation. I stand by my statement that whatever you were doing in C++ must have suffered from terrible design and/or implementation to be 50 times slower. I can see how using Java can save you development time due to IDE features and easy memory management, but I'm pretty sure you could reach the same and probably better performance for the same application in C++. But your example doesn't contradict my initial point: Java may make programming easier, but it's not for tight optimization.Benitobenjamen
The two loops aren't equal. The first one (with i++) has one more iteration than the second (with ++i). So it must take more time. Looks straightforward to me.Undertenant
Answer (score 119)

As others have pointed out, the test is flawed in many ways.

You did not tell us exactly how you did this test. However, I tried to implement a "naive" test (no offense) like this:

class PrePostIncrement
{
    public static void main(String args[])
    {
        for (int j=0; j<3; j++)
        {
            for (int i=0; i<5; i++)
            {
                long before = System.nanoTime();
                runPreIncrement();
                long after = System.nanoTime();
                System.out.println("pre  : "+(after-before)/1e6);
            }
            for (int i=0; i<5; i++)
            {
                long before = System.nanoTime();
                runPostIncrement();
                long after = System.nanoTime();
                System.out.println("post : "+(after-before)/1e6);
            }
        }
    }

    private static void runPreIncrement()
    {
        final int n = Integer.MAX_VALUE;
        int i = 0;
        while (++i < n) {}
    }

    private static void runPostIncrement()
    {
        final int n = Integer.MAX_VALUE;
        int i = 0;
        while (i++ < n) {}
    }
}

When running this with default settings, there seems to be a small difference. But the real flaw of the benchmark becomes obvious when you run it with the -server flag. The results in my case are then something like:

...
pre  : 6.96E-4
pre  : 6.96E-4
pre  : 0.001044
pre  : 3.48E-4
pre  : 3.48E-4
post : 1279.734543
post : 1295.989086
post : 1284.654267
post : 1282.349093
post : 1275.204583

Obviously, the pre-increment version has been completely optimized away. The reason is rather simple: The result is not used. It does not matter at all whether the loop is executed or not, so the JIT simply removes it.
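As an aside: one way to keep such a loop from being removed in a hand-rolled benchmark is to have it produce a value that the caller actually consumes. This is only a sketch of the idea (with a parameterized bound instead of the constant Integer.MAX_VALUE, so it is not exactly the original scenario; class and method names are mine):

```java
public class LoopBenchSketch {

    // Returning the final value of i gives the JIT a reason to keep the loop:
    // the result is observably used, so the loop cannot be removed as dead code.
    static int runPreIncrement(int n) {
        int i = 0;
        while (++i < n) {}
        return i;
    }

    public static void main(String[] args) {
        long before = System.nanoTime();
        int result = runPreIncrement(10_000_000);
        long after = System.nanoTime();
        // Print the result so it counts as "used" even from the JIT's point of view.
        System.out.println("result=" + result + ", ms=" + (after - before) / 1e6);
    }
}
```

Printing (or returning) the value is the same trick the JMH-based answer below applies via `return i;`; without it, the -server compiler is free to delete the whole loop.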

This is confirmed by a look at the hotspot disassembly: The pre-increment version results in this code:

[Entry Point]
[Verified Entry Point]
[Constants]
  # {method} {0x0000000055060500} 'runPreIncrement' '()V' in 'PrePostIncrement'
  #           [sp+0x20]  (sp of caller)
  0x000000000286fd80: sub    $0x18,%rsp
  0x000000000286fd87: mov    %rbp,0x10(%rsp)    ;*synchronization entry
                                                ; - PrePostIncrement::runPreIncrement@-1 (line 28)

  0x000000000286fd8c: add    $0x10,%rsp
  0x000000000286fd90: pop    %rbp
  0x000000000286fd91: test   %eax,-0x243fd97(%rip)        # 0x0000000000430000
                                                ;   {poll_return}
  0x000000000286fd97: retq   
  0x000000000286fd98: hlt    
  0x000000000286fd99: hlt    
  0x000000000286fd9a: hlt    
  0x000000000286fd9b: hlt    
  0x000000000286fd9c: hlt    
  0x000000000286fd9d: hlt    
  0x000000000286fd9e: hlt    
  0x000000000286fd9f: hlt    

The post-increment version results in this code:

[Entry Point]
[Verified Entry Point]
[Constants]
  # {method} {0x00000000550605b8} 'runPostIncrement' '()V' in 'PrePostIncrement'
  #           [sp+0x20]  (sp of caller)
  0x000000000286d0c0: sub    $0x18,%rsp
  0x000000000286d0c7: mov    %rbp,0x10(%rsp)    ;*synchronization entry
                                                ; - PrePostIncrement::runPostIncrement@-1 (line 35)

  0x000000000286d0cc: mov    $0x1,%r11d
  0x000000000286d0d2: jmp    0x000000000286d0e3
  0x000000000286d0d4: nopl   0x0(%rax,%rax,1)
  0x000000000286d0dc: data32 data32 xchg %ax,%ax
  0x000000000286d0e0: inc    %r11d              ; OopMap{off=35}
                                                ;*goto
                                                ; - PrePostIncrement::runPostIncrement@11 (line 36)

  0x000000000286d0e3: test   %eax,-0x243d0e9(%rip)        # 0x0000000000430000
                                                ;*goto
                                                ; - PrePostIncrement::runPostIncrement@11 (line 36)
                                                ;   {poll}
  0x000000000286d0e9: cmp    $0x7fffffff,%r11d
  0x000000000286d0f0: jl     0x000000000286d0e0  ;*if_icmpge
                                                ; - PrePostIncrement::runPostIncrement@8 (line 36)

  0x000000000286d0f2: add    $0x10,%rsp
  0x000000000286d0f6: pop    %rbp
  0x000000000286d0f7: test   %eax,-0x243d0fd(%rip)        # 0x0000000000430000
                                                ;   {poll_return}
  0x000000000286d0fd: retq   
  0x000000000286d0fe: hlt    
  0x000000000286d0ff: hlt    

It's not entirely clear to me why it seemingly does not remove the post-increment version. (In fact, I considered asking this as a separate question.) But at least this explains why you might see differences of an "order of magnitude"...


EDIT: Interestingly, when changing the upper limit of the loop from Integer.MAX_VALUE to Integer.MAX_VALUE-1, both versions are optimized away and require "zero" time. Somehow this limit (which still appears as 0x7fffffff in the assembly) prevents the optimization. Presumably this has something to do with the comparison being mapped to a (signed!) cmp instruction, but I cannot give a profound reason beyond that. The JIT works in mysterious ways...
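The overflow itself is easy to see without running the full loop: the last evaluation of the post-increment condition wraps i past Integer.MAX_VALUE. A small check (my own illustration, not code from the original post):

```java
public class OverflowCheck {
    public static void main(String[] args) {
        // The final evaluation of the loop condition i++ < Integer.MAX_VALUE:
        int i = Integer.MAX_VALUE;
        boolean keepLooping = (i++ < Integer.MAX_VALUE);
        System.out.println(keepLooping);            // false: the loop exits here...
        System.out.println(i == Integer.MIN_VALUE); // true: ...but i has already wrapped

        // With an upper limit of Integer.MAX_VALUE - 1, the final increment stays in range:
        int j = Integer.MAX_VALUE - 1;
        boolean keepLooping2 = (j++ < Integer.MAX_VALUE - 1);
        System.out.println(keepLooping2);           // false
        System.out.println(j == Integer.MAX_VALUE); // true: no overflow
    }
}
```

This matches the comments below: with a limit of MAX_VALUE the exiting i++ overflows, while with MAX_VALUE-1 it does not.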

Firn answered 15/8, 2014 at 8:40 Comment(6)
I'm not a java guy but I do take a passing, hobby interest in the mechanics of compilers. If you (or anyone) happens to ask your follow-up question on a separate post, please post a link. Thank you!Ventilation
Actually that was the first thing which came into my mind: when while (i++ < Integer.MAX_VALUE) exits the loop, an overflow has already happened to i. Proving the correctness of a code transformation is much harder when overflow might occur and after all, loops with overflows are not the common case so why should hotspot bother optimizing them…Dzerzhinsk
@Ventilation I posted a follow-up question at #25326877Firn
@Holger: Yeah, that sounds like a way to avoid having trouble with the optimizations violating safety constraints - it doesn't occur commonly, so it's not worth it checking for all the things that could go wrong (e.g. buffer overflows).Notarize
@Dzerzhinsk but how do you explain that if the limit is reduced from Integer.MAX_VALUE to Integer.MAX_VALUE-1, both versions are optimized? So in the i++ case, does overflow still happen while the loop is optimized at the same time?Raman
@Sumit Kumar Saha: No, with a limit of MAX_VALUE-1 no overflow occurs. The loop will end when i++ < MAX_VALUE-1 evaluates to false and that is when i++ evaluates to MAX_VALUE-1 and i will have the value MAX_VALUE afterwards.Dzerzhinsk
Answer (score 19)

The difference between ++i and i++ is that ++i effectively increments the variable and 'returns' the new value. i++, on the other hand, effectively creates a temp variable to hold the current value of i, then increments the variable and 'returns' the temp variable's value. This is where the extra overhead comes from.

// i++ evaluates to something like this
// Imagine though that somehow i was passed by reference
int temp = i;
i = i + 1;
return temp;

// ++i evaluates to
i = i + 1;
return i;

In your case it appears that the increment won't be optimized by the JVM because you are using the result in an expression. The JVM can on the other hand optimize a loop like this.

for( int i = 0; i < Integer.MAX_VALUE; i++ ) {}

This is because the result of i++ is never used. In a loop like this, i++ and ++i should perform identically.
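The value difference between the two forms can be checked directly (a trivial illustration; variable names are mine):

```java
public class IncrementDemo {
    public static void main(String[] args) {
        int i = 5;
        int post = i++; // post receives the old value; i is incremented afterwards
        System.out.println(post); // 5
        System.out.println(i);    // 6

        int j = 5;
        int pre = ++j; // j is incremented first; pre receives the new value
        System.out.println(pre);  // 6
        System.out.println(j);    // 6
    }
}
```

When the expression value is discarded, as in a plain `for` loop's update clause, this distinction disappears, which is why the JIT can treat the two forms alike there.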

Gintz answered 15/8, 2014 at 7:27 Comment(4)
It might be a bit clearer when the hotspot compiler would be mentioned explicitly.Bandicoot
As the OP mentioned, both versions result in the same number of byte code instructions. Where is the overhead you are talking about there? And what are the JVM optimizations you talk about that are possible for the ++i version that are not possible for the other?Vicechairman
Wondering how iload works... Does it actually copy the variable from the local variable table to the operand stack? If yes, for i++, i is first pushed (copied) to the operand stack and iinc increments the original i in the local variable table. ++i does exactly the same in reverse order. In both cases, there is no additional temp variable. But I could be completely wrong :)Capitalist
If you look at Eugene's answer with his added benchmarks, you see that the difference is minimal, if there is any at all. The JVM can, most of the time, optimize an i++ into a ++i: it removes the temp variable and just increments the variable directly. My only guess is that when i++ is used in the comparison, the JVM allocates an additional register for the loop when the bytecode is compiled to machine code.Gintz
Answer (score 19)

EDIT 2

You should really look here:

http://hg.openjdk.java.net/code-tools/jmh/file/f90aef7f1d2c/jmh-samples/src/main/java/org/openjdk/jmh/samples/JMHSample_11_Loops.java

EDIT: The more I think about it, the more I realise that this test is somewhat flawed; the loop will get heavily optimized by the JVM.

I think that you should just drop the @Param and let n=2.

This way you will test the performance of the while loop itself. The results I get in this case:

o.m.t.WhileTest.testFirst      avgt         5        0.787        0.086    ns/op
o.m.t.WhileTest.testSecond     avgt         5        0.782        0.087    ns/op

There is almost no difference.

The very first question you should ask yourself is how you test and measure this. This is micro-benchmarking, and in Java that is an art; a casual user (like me) will almost always get the results wrong. You should rely on a proper benchmark harness and a very good tool. I used JMH to test this:

@Measurement(iterations=5, time=1, timeUnit=TimeUnit.MILLISECONDS)
@Fork(1)
@Warmup(iterations=5, time=1, timeUnit=TimeUnit.SECONDS)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@BenchmarkMode(Mode.AverageTime)
@State(Scope.Benchmark)
public class WhileTest {
    public static void main(String[] args) throws Exception {
        Options opt = new OptionsBuilder()
            .include(".*" + WhileTest.class.getSimpleName() + ".*")
            .threads(1)
            .build();

        new Runner(opt).run();
    }


    @Param({"100", "10000", "100000", "1000000"})
    private int n;

    /*
    @State(Scope.Benchmark)
    public static class HOLDER_I {
        int x;
    }
    */


    @Benchmark
    public int testFirst(){
        int i = 0;
        while (++i < n) {
        }
        return i;
    }

    @Benchmark
    public int testSecond(){
        int i = 0;
        while (i++ < n) {
        }
        return i;
    }
}

Someone far more experienced in JMH might correct these results (I really hope so, since I am not that well versed in JMH yet), but they show that the difference is pretty darn small:

Benchmark                        (n)   Mode   Samples        Score  Score error    Units
o.m.t.WhileTest.testFirst        100   avgt         5        1.271        0.096    ns/op
o.m.t.WhileTest.testFirst      10000   avgt         5        1.319        0.125    ns/op
o.m.t.WhileTest.testFirst     100000   avgt         5        1.327        0.241    ns/op
o.m.t.WhileTest.testFirst    1000000   avgt         5        1.311        0.136    ns/op
o.m.t.WhileTest.testSecond       100   avgt         5        1.450        0.525    ns/op
o.m.t.WhileTest.testSecond     10000   avgt         5        1.563        0.479    ns/op
o.m.t.WhileTest.testSecond    100000   avgt         5        1.418        0.428    ns/op
o.m.t.WhileTest.testSecond   1000000   avgt         5        1.344        0.120    ns/op

The Score field is the one you are interested in.

Impetus answered 15/8, 2014 at 7:45 Comment(1)
From what I can tell, and correct me if I am wrong, the JVM doesn't appear to be optimizing the i++ into ++i when the result is used. Or is it just because i++ loops an extra time?Gintz
Answer (score 0)

This test is probably not enough to draw conclusions from, but if that is the case, I would say the JVM can optimize this expression by changing i++ to ++i, since the stored (pre-increment) value of i++ is never used in this loop.

Airspeed answered 18/8, 2014 at 7:47 Comment(0)
Answer (score -3)

I suggest you should (whenever possible) always use ++c rather than c++, as the former will never be slower: conceptually, a deep copy of c has to be taken in the latter case in order to return the previous value.

Indeed many optimisers will optimise away an unnecessary deep copy but they can't easily do that if you're making use of the expression value. And you're doing just that in your case.

Many folk disagree, though: they see it as a micro-optimisation.

Endways answered 15/8, 2014 at 7:35 Comment(18)
This may be true in the world of non-trivial C++ iterators, but not for primitive types...Vanden
For the most part, the difference between i++ and ++i can be safely ignored. I am personally used to using i++ in for loops, because most compilers nowadays should be able to optimize the increment. Again I say most of the time, because there are times where the result does matter and the compiler can no longer help you. Also in the case of overridden increment operators in languages like C++.Gintz
@Vanden Not sure I entirely agree. Nowhere does a standard say that unnecessary value copies will be optimised out. So you have to assume that there are platforms that suffer when using c++ unnecessarily: even on a primitive type. If I were to use c++ unnecessarily on i) Mars lander project or ii) Large Hadron Collider project then I can expect a difficult conversation with the project lead.Endways
@Endways Things can go significantly wrong if you use int i = array[++c]; instead of int i = array[c++];...Betrothal
Indeed. The expressions do have different values. But in that case the bottleneck is in creating the array.Endways
@Endways The compiler would have to be brain-dead for it to make a difference on primitive types. And if the performance mattered, they shouldn't be using a brain-dead compiler in the first place.Vanden
@Endways that is an extreme corner case. There are always going to be exceptions to the case. I would also like to see the Mars lander project written in java, I highly doubt it is. Since this question is asking about Java, Mysticial was making a correct statement in saying that primitives are special cases and can easily be optimized at runtime.Gintz
@ThorstenDittmar Again this is a special case where the result of the expression matters. We were discussing the case where the result doesn't matter and the difference in using ++i over i++ is non-existent.Gintz
Ouch. Some chipsets out there very intentionally make use of very literal compilers. They are not brain dead: they just do what it says on the tin and the output assembler is very easy to follow : can be reviewed over a coffee at 2am before burning onto a chip that nobody on Earth will ever see again! Yes, I'm talking about C compilers but I still think the principle holds. Perhaps I'm just old-fashioned.Endways
@Endways I agree that you should understand your compiler and what kind of optimizations it will do for you. In limited cases you will have to do those kinds of optimizations yourself. If you are using a compiler that will not do it for you, you will probably know. As most of these compilers are for embedded systems or have a smaller population of users.Gintz
@Smith_61: That's very sensible. But I will confess that I insist on seeing ++c in a for loop unless the expression value is used.Endways
I am on @Endways 's side. I do know that in 99% of cases (especially in Java) it makes no difference whether you write ++i or i++. However I'd rather make it a habit to write ++i because there are non-trivial cases where it makes a difference (esp. in C++ etc.). Given that ++i is no harder to read than i++, why not write the potentially safer form? Just like we write things like if (CONSTANT == var), and if (CONSTANT.equals(var))Ethiop
@AdrianShum I agree ++i should be a habit to use. Sadly I grew into the habit of using i++ in for loops because I lived in the safety of Java when I began. It is a bad habit that I should break but considering most compilers I will ever use will do it for me. I would rather stay with something I am comfortable with.Gintz
Downvote for misinformation. It is not possible to "deep copy" anything that it is possible to use the "++" operators on in Java, and stating that optimizers cannot optimize away the operation when it is used in a comparison is also misinformation.Egghead
@Score_Under: I've inserted an "easily". I am incorrect in limiting the powers of modern optimisers.Endways
In the situation where the result of an incrementing operator is used, one should use whichever operator better fits the semantics of what one is doing, since any performance difference may be offset by code changes elsewhere resulting from the choice. If the result of the operator isn't used, I prefer post-operators since it's more consistent with the noun-verb pattern used elsewhere.Baxter
This should have been a comment, not an answer.Picaresque
No I don't agree. Comments are the poor cousin of an answer since, for one thing, they cannot be down voted. So they cannot be peer reviewed as effectively as an answer. I personally have found the insightful comments made on this thread interesting. That said though, I will still insist on prefix increment whenever possible!Endways