In Java, can & be faster than &&?

In this code:

if (value >= x && value <= y) {

when value >= x and value <= y are as likely true as false with no particular pattern, would using the & operator be faster than using &&?

Specifically, I am thinking about how && lazily evaluates the right-hand-side expression (i.e. only if the LHS is true), which implies a conditional, whereas in Java & in this context guarantees strict evaluation of both (boolean) sub-expressions. The resulting value is the same either way.
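
To make that semantic difference concrete, here's a minimal sketch (the counting helper is purely illustrative) showing that & always evaluates both operands while && skips the RHS when the LHS is false:

public class EagerVsLazy {
    static int rhsEvaluations = 0;

    // Illustrative helper whose only job is to record that the RHS ran.
    static boolean rhs() {
        rhsEvaluations++;
        return true;
    }

    public static void main(String[] args) {
        boolean a = false & rhs();  // strict: rhs() is evaluated anyway
        boolean b = false && rhs(); // lazy: rhs() is skipped
        System.out.println(rhsEvaluations); // prints 1
    }
}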

But whilst a >= or <= operator will use a simple comparison instruction, the && must involve a branch, and that branch is susceptible to branch prediction failure - as per this Very Famous Question: Why is it faster to process a sorted array than an unsorted array?

So, forcing the expression to have no lazy components will surely be more deterministic and not be vulnerable to prediction failure. Right?

Notes:

  • obviously the answer to my question would be No if the code looked like this: if(value >= x && verySlowFunction()). I am focusing on "sufficiently simple" RHS expressions.
  • there's a conditional branch in there anyway (the if statement). I can't quite prove to myself that that is irrelevant, and that alternative formulations might be better examples, like boolean b = value >= x && value <= y;
  • this all falls into the world of horrendous micro-optimizations. Yeah, I know :-) ... interesting though?

Update Just to explain why I'm interested: I've been staring at the systems that Martin Thompson has been writing about on his Mechanical Sympathy blog, after he came and did a talk about Aeron. One of the key messages is that our hardware has all this magical stuff in it, and we software developers tragically fail to take advantage of it. Don't worry, I'm not about to go s/&&/\&/ on all my code :-) ... but there are a number of questions on this site on improving branch prediction by removing branches, and it occurred to me that the conditional boolean operators are at the core of test conditions.

Of course, @StephenC makes the fantastic point that bending your code into weird shapes can make it less easy for JITs to spot common optimizations - if not now, then in the future. And that the Very Famous Question mentioned above is special because it pushes the prediction complexity far beyond practical optimization.

I'm pretty much aware that in most (or almost all) situations, && is the clearest, simplest, fastest, best thing to do - although I'm very grateful to the people who have posted answers demonstrating this! I'm really interested to see if there are actually any cases in anyone's experience where the answer to "Can & be faster?" might be Yes...

Update 2: (Addressing advice that the question is overly broad. I don't want to make major changes to this question because it might compromise some of the answers below, which are of exceptional quality!) Perhaps an example in the wild is called for; this is from the Guava LongMath class (thanks hugely to @maaartinus for finding this):

public static boolean isPowerOfTwo(long x) {
    return x > 0 & (x & (x - 1)) == 0;
}
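
For anyone who hasn't met the x & (x - 1) trick: subtracting 1 flips the lowest set bit and all the bits below it, so the AND clears the lowest set bit, and the result is zero exactly when x had a single bit set. A quick worked sketch:

// x = 8:  1000 & 0111 == 0000  -> power of two
// x = 6:  0110 & 0101 == 0100  -> not a power of two
// x = 0 and negative values are rejected by the x > 0 test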

See that first &? And if you check the link, the next method is called lessThanBranchFree(...), which hints that we are in branch-avoidance territory - and Guava is really widely used: every cycle saved causes sea-levels to drop visibly. So let's put the question this way: is this use of & (where && would be more normal) a real optimization?

Grimbal answered 20/9, 2016 at 7:34 Comment(13)
If there's a difference it will be nanoseconds. This smells like premature optimization. Why is it important? If you really want to know, just look at the compiled bytecode.Thithia
@JimGarrison It's important because tests like this are commonly used in comparators (ie sorting) and filters, so millions of executions in a tight loop might be common, and then ns become ms. Also, the strict evaluation of the & operator is a little-known peculiarity of Java in terms of an alternative to &&, and in years of Java programming, I've never chosen to use it. Perhaps I've been overly dismissive!Grimbal
& <-- evaluates both operands; && <-- stops evaluating if the first operand evaluates to false, since the result will be false. So I presume sometimes && is faster than &Psittacosis
@Psittacosis - I thought I made that very clear in the question (see the verySlowFunction() note); this is about branch prediction - or should I clarify it some more? Suggestions welcome.Grimbal
&& does not necessarily lead to a branch. There are conditional instructions in x86. But the branch predictor is also very good in modern CPUs. Why don't you benchmark both versions?Broussard
Don't target "the" JVM if this is the level on which you're optimising.Huba
@Rhymoid absolutely: I'm more concerned with the general question of & vs &&. I've always assumed that && is totally better - but branch prediction makes that questionable.Grimbal
@RustyX Ok, "conditional instructions" is the beginnings of a good answer, as it means the predictor is irrelevant, as long as the RHS can be executed conditionally - and that means we don't care about the "quality of the predictor", right?Grimbal
FWIW, it looks like & over && has some real uses.Rorke
@Rorke Wonderful!! Ok - I think we've found one! And it uses the old "clear the bottom bit a&(a-1)" trick that was passed down from the Ancient Days. I wonder if it's actually performant?Grimbal
The C# compiler will generate code as though you wrote & even if you wrote && if its heuristics think that doing so would be a win. I have no idea if Java's compiler does the same, but it is an easy optimization and it would be a bit surprising if they hadn't thought of it.Selene
@Grimbal Maybe. The guys usually know what they're doing and benchmark everything, OTOH the first condition probably holds for most calls and then the conditional jump can be executed in parallel with the computation of the second condition. +++ In a side-effect free case I'd see & vs. && as a hint to the JVM about what to do when it can't see a clear advantage on either side. No idea if this is really the case.Rorke
@EricLippert Note that using & is not always better. The JVM does it unless the branch is easy to predict. They didn't get it perfectly right when I measured it some time ago. There's a threshold on the branching probability where the conditional branch (&&) is faster, but it's usually very low (4% in my case).Rorke

Ok, so you want to know how it behaves at the lower level... Let's have a look at the bytecode then!

EDIT: added the generated assembly code for AMD64 at the end. Have a look for some interesting notes.
EDIT 2 (re: OP's "Update 2"): added asm code for Guava's isPowerOfTwo method as well.

Java source

I wrote these two quick methods:

public boolean AndSC(int x, int value, int y) {
    return value >= x && value <= y;
}

public boolean AndNonSC(int x, int value, int y) {
    return value >= x & value <= y;
}

As you can see, they are exactly the same, save for the type of AND operator.

Java bytecode

And this is the generated bytecode:

  public AndSC(III)Z
   L0
    LINENUMBER 8 L0
    ILOAD 2
    ILOAD 1
    IF_ICMPLT L1
    ILOAD 2
    ILOAD 3
    IF_ICMPGT L1
   L2
    LINENUMBER 9 L2
    ICONST_1
    IRETURN
   L1
    LINENUMBER 11 L1
   FRAME SAME
    ICONST_0
    IRETURN
   L3
    LOCALVARIABLE this Ltest/lsoto/AndTest; L0 L3 0
    LOCALVARIABLE x I L0 L3 1
    LOCALVARIABLE value I L0 L3 2
    LOCALVARIABLE y I L0 L3 3
    MAXSTACK = 2
    MAXLOCALS = 4

  // access flags 0x1
  public AndNonSC(III)Z
   L0
    LINENUMBER 15 L0
    ILOAD 2
    ILOAD 1
    IF_ICMPLT L1
    ICONST_1
    GOTO L2
   L1
   FRAME SAME
    ICONST_0
   L2
   FRAME SAME1 I
    ILOAD 2
    ILOAD 3
    IF_ICMPGT L3
    ICONST_1
    GOTO L4
   L3
   FRAME SAME1 I
    ICONST_0
   L4
   FRAME FULL [test/lsoto/AndTest I I I] [I I]
    IAND
    IFEQ L5
   L6
    LINENUMBER 16 L6
    ICONST_1
    IRETURN
   L5
    LINENUMBER 18 L5
   FRAME SAME
    ICONST_0
    IRETURN
   L7
    LOCALVARIABLE this Ltest/lsoto/AndTest; L0 L7 0
    LOCALVARIABLE x I L0 L7 1
    LOCALVARIABLE value I L0 L7 2
    LOCALVARIABLE y I L0 L7 3
    MAXSTACK = 3
    MAXLOCALS = 4

The AndSC (&&) method generates two conditional jumps, as expected:

  1. It loads value and x onto the stack, and jumps to L1 if value is lower. Else it keeps running the next lines.
  2. It loads value and y onto the stack, and jumps to L1 also, if value is greater. Else it keeps running the next lines.
  3. Which happens to be a return true, in case neither of the two jumps was made.
  4. And then we have the lines marked as L1 which are a return false.

The AndNonSC (&) method, however, generates three conditional jumps!

  1. It loads value and x onto the stack and jumps to L1 if value is lower. Now it needs to save the result to compare it with the other part of the AND, so it has to execute either "save true" or "save false"; it can't do both with the same instruction.
  2. It loads value and y onto the stack and jumps to L1 if value is greater. Once again it needs to save true or false and that's two different lines depending on the comparison result.
  3. Now that both comparisons are done, the code actually executes the AND operation -- and if both are true, it jumps (for a third time) to return true; or else it continues execution onto the next line to return false.

(Preliminary) Conclusion

Though I'm not very experienced with Java bytecode and may have overlooked something, it seems to me that & will actually perform worse than && in every case: it generates more instructions to execute, including more conditional jumps to predict and possibly fail at.

A rewriting of the code to replace comparisons with arithmetical operations, as someone else proposed, might be a way to make & a better option, but at the cost of making the code much less clear.
IMHO it is not worth the hassle for 99% of the scenarios (it may very well be worth it for the 1% of loops that need to be extremely optimized, though).

EDIT: AMD64 assembly

As noted in the comments, the same Java bytecode can lead to different machine code in different systems, so while the Java bytecode might give us a hint about which AND version performs better, getting the actual ASM as generated by the compiler is the only way to really find out.
I printed the AMD64 ASM instructions for both methods; below are the relevant lines (stripped entry points etc.).

NOTE: all methods compiled with Java 1.8.0_91 unless otherwise stated.
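
For reference, I obtained the assembly through HotSpot's diagnostic flags, with an invocation along these lines (the same pattern appears in the Java 7 run further down; it requires the hsdis disassembler library to be available to the JVM):

java -XX:+UnlockDiagnosticVMOptions -XX:CompileCommand=print,*AndTest.AndSC -XX:PrintAssemblyOptions=intel AndTestMain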

Method AndSC with default options

  # {method} {0x0000000016da0810} 'AndSC' '(III)Z' in 'AndTest'
  ...
  0x0000000002923e3e: cmp    %r8d,%r9d
  0x0000000002923e41: movabs $0x16da0a08,%rax   ;   {metadata(method data for {method} {0x0000000016da0810} 'AndSC' '(III)Z' in 'AndTest')}
  0x0000000002923e4b: movabs $0x108,%rsi
  0x0000000002923e55: jl     0x0000000002923e65
  0x0000000002923e5b: movabs $0x118,%rsi
  0x0000000002923e65: mov    (%rax,%rsi,1),%rbx
  0x0000000002923e69: lea    0x1(%rbx),%rbx
  0x0000000002923e6d: mov    %rbx,(%rax,%rsi,1)
  0x0000000002923e71: jl     0x0000000002923eb0  ;*if_icmplt
                                                ; - AndTest::AndSC@2 (line 22)

  0x0000000002923e77: cmp    %edi,%r9d
  0x0000000002923e7a: movabs $0x16da0a08,%rax   ;   {metadata(method data for {method} {0x0000000016da0810} 'AndSC' '(III)Z' in 'AndTest')}
  0x0000000002923e84: movabs $0x128,%rsi
  0x0000000002923e8e: jg     0x0000000002923e9e
  0x0000000002923e94: movabs $0x138,%rsi
  0x0000000002923e9e: mov    (%rax,%rsi,1),%rdi
  0x0000000002923ea2: lea    0x1(%rdi),%rdi
  0x0000000002923ea6: mov    %rdi,(%rax,%rsi,1)
  0x0000000002923eaa: jle    0x0000000002923ec1  ;*if_icmpgt
                                                ; - AndTest::AndSC@7 (line 22)

  0x0000000002923eb0: mov    $0x0,%eax
  0x0000000002923eb5: add    $0x30,%rsp
  0x0000000002923eb9: pop    %rbp
  0x0000000002923eba: test   %eax,-0x1c73dc0(%rip)        # 0x0000000000cb0100
                                                ;   {poll_return}
  0x0000000002923ec0: retq                      ;*ireturn
                                                ; - AndTest::AndSC@13 (line 25)

  0x0000000002923ec1: mov    $0x1,%eax
  0x0000000002923ec6: add    $0x30,%rsp
  0x0000000002923eca: pop    %rbp
  0x0000000002923ecb: test   %eax,-0x1c73dd1(%rip)        # 0x0000000000cb0100
                                                ;   {poll_return}
  0x0000000002923ed1: retq   

Method AndSC with -XX:PrintAssemblyOptions=intel option

  # {method} {0x00000000170a0810} 'AndSC' '(III)Z' in 'AndTest'
  ...
  0x0000000002c26e2c: cmp    r9d,r8d
  0x0000000002c26e2f: jl     0x0000000002c26e36  ;*if_icmplt
  0x0000000002c26e31: cmp    r9d,edi
  0x0000000002c26e34: jle    0x0000000002c26e44  ;*iconst_0
  0x0000000002c26e36: xor    eax,eax            ;*synchronization entry
  0x0000000002c26e38: add    rsp,0x10
  0x0000000002c26e3c: pop    rbp
  0x0000000002c26e3d: test   DWORD PTR [rip+0xffffffffffce91bd],eax        # 0x0000000002910000
  0x0000000002c26e43: ret    
  0x0000000002c26e44: mov    eax,0x1
  0x0000000002c26e49: jmp    0x0000000002c26e38

Method AndNonSC with default options

  # {method} {0x0000000016da0908} 'AndNonSC' '(III)Z' in 'AndTest'
  ...
  0x0000000002923a78: cmp    %r8d,%r9d
  0x0000000002923a7b: mov    $0x0,%eax
  0x0000000002923a80: jl     0x0000000002923a8b
  0x0000000002923a86: mov    $0x1,%eax
  0x0000000002923a8b: cmp    %edi,%r9d
  0x0000000002923a8e: mov    $0x0,%esi
  0x0000000002923a93: jg     0x0000000002923a9e
  0x0000000002923a99: mov    $0x1,%esi
  0x0000000002923a9e: and    %rsi,%rax
  0x0000000002923aa1: cmp    $0x0,%eax
  0x0000000002923aa4: je     0x0000000002923abb  ;*ifeq
                                                ; - AndTest::AndNonSC@21 (line 29)

  0x0000000002923aaa: mov    $0x1,%eax
  0x0000000002923aaf: add    $0x30,%rsp
  0x0000000002923ab3: pop    %rbp
  0x0000000002923ab4: test   %eax,-0x1c739ba(%rip)        # 0x0000000000cb0100
                                                ;   {poll_return}
  0x0000000002923aba: retq                      ;*ireturn
                                                ; - AndTest::AndNonSC@25 (line 30)

  0x0000000002923abb: mov    $0x0,%eax
  0x0000000002923ac0: add    $0x30,%rsp
  0x0000000002923ac4: pop    %rbp
  0x0000000002923ac5: test   %eax,-0x1c739cb(%rip)        # 0x0000000000cb0100
                                                ;   {poll_return}
  0x0000000002923acb: retq   

Method AndNonSC with -XX:PrintAssemblyOptions=intel option

  # {method} {0x00000000170a0908} 'AndNonSC' '(III)Z' in 'AndTest'
  ...
  0x0000000002c270b5: cmp    r9d,r8d
  0x0000000002c270b8: jl     0x0000000002c270df  ;*if_icmplt
  0x0000000002c270ba: mov    r8d,0x1            ;*iload_2
  0x0000000002c270c0: cmp    r9d,edi
  0x0000000002c270c3: cmovg  r11d,r10d
  0x0000000002c270c7: and    r8d,r11d
  0x0000000002c270ca: test   r8d,r8d
  0x0000000002c270cd: setne  al
  0x0000000002c270d0: movzx  eax,al
  0x0000000002c270d3: add    rsp,0x10
  0x0000000002c270d7: pop    rbp
  0x0000000002c270d8: test   DWORD PTR [rip+0xffffffffffce8f22],eax        # 0x0000000002910000
  0x0000000002c270de: ret    
  0x0000000002c270df: xor    r8d,r8d
  0x0000000002c270e2: jmp    0x0000000002c270c0
  • First of all, the generated ASM code differs depending on whether we choose the default AT&T syntax or the Intel syntax.
  • With AT&T syntax:
    • The ASM code is actually longer for the AndSC method, with every bytecode IF_ICMP* translated to two assembly jump instructions, for a total of 4 conditional jumps.
    • Meanwhile, for the AndNonSC method the compiler generates more straightforward code, where each bytecode IF_ICMP* is translated to only one assembly jump instruction, keeping the original count of 3 conditional jumps.
  • With Intel syntax:
    • The ASM code for AndSC is shorter, with just 2 conditional jumps (not counting the non-conditional jmp at the end). Actually it's just two CMP, two JL/JLE and a XOR/MOV depending on the result.
    • The ASM code for AndNonSC is now longer than the AndSC one! However, it has just 1 conditional jump (for the first comparison), using the registers to directly compare the first result with the second, without any more jumps.

Conclusion after ASM code analysis

  • At AMD64 machine-language level, the & operator seems to generate ASM code with fewer conditional jumps, which might be better for high prediction-failure rates (random values for example).
  • On the other hand, the && operator seems to generate ASM code with fewer instructions (with the -XX:PrintAssemblyOptions=intel option anyway), which might be better for really long loops with prediction-friendly inputs, where the smaller number of CPU cycles per comparison can make a difference in the long run.

As I stated in some of the comments, this is going to vary greatly between systems, so if we're talking about branch-prediction optimization, the only real answer would be: it depends on your JVM implementation, your compiler, your CPU and your input data.


Addendum: Guava's isPowerOfTwo method

Here, Guava's developers have come up with a neat way of calculating if a given number is a power of 2:

public static boolean isPowerOfTwo(long x) {
    return x > 0 & (x & (x - 1)) == 0;
}

Quoting OP:

is this use of & (where && would be more normal) a real optimization?

To find out if it is, I added two similar methods to my test class:

public boolean isPowerOfTwoAND(long x) {
    return x > 0 & (x & (x - 1)) == 0;
}

public boolean isPowerOfTwoANDAND(long x) {
    return x > 0 && (x & (x - 1)) == 0;
}

ASM code (Intel syntax) for Guava's version

  # {method} {0x0000000017580af0} 'isPowerOfTwoAND' '(J)Z' in 'AndTest'
  # this:     rdx:rdx   = 'AndTest'
  # parm0:    r8:r8     = long
  ...
  0x0000000003103bbe: movabs rax,0x0
  0x0000000003103bc8: cmp    rax,r8
  0x0000000003103bcb: movabs rax,0x175811f0     ;   {metadata(method data for {method} {0x0000000017580af0} 'isPowerOfTwoAND' '(J)Z' in 'AndTest')}
  0x0000000003103bd5: movabs rsi,0x108
  0x0000000003103bdf: jge    0x0000000003103bef
  0x0000000003103be5: movabs rsi,0x118
  0x0000000003103bef: mov    rdi,QWORD PTR [rax+rsi*1]
  0x0000000003103bf3: lea    rdi,[rdi+0x1]
  0x0000000003103bf7: mov    QWORD PTR [rax+rsi*1],rdi
  0x0000000003103bfb: jge    0x0000000003103c1b  ;*lcmp
  0x0000000003103c01: movabs rax,0x175811f0     ;   {metadata(method data for {method} {0x0000000017580af0} 'isPowerOfTwoAND' '(J)Z' in 'AndTest')}
  0x0000000003103c0b: inc    DWORD PTR [rax+0x128]
  0x0000000003103c11: mov    eax,0x1
  0x0000000003103c16: jmp    0x0000000003103c20  ;*goto
  0x0000000003103c1b: mov    eax,0x0            ;*lload_1
  0x0000000003103c20: mov    rsi,r8
  0x0000000003103c23: movabs r10,0x1
  0x0000000003103c2d: sub    rsi,r10
  0x0000000003103c30: and    rsi,r8
  0x0000000003103c33: movabs rdi,0x0
  0x0000000003103c3d: cmp    rsi,rdi
  0x0000000003103c40: movabs rsi,0x175811f0     ;   {metadata(method data for {method} {0x0000000017580af0} 'isPowerOfTwoAND' '(J)Z' in 'AndTest')}
  0x0000000003103c4a: movabs rdi,0x140
  0x0000000003103c54: jne    0x0000000003103c64
  0x0000000003103c5a: movabs rdi,0x150
  0x0000000003103c64: mov    rbx,QWORD PTR [rsi+rdi*1]
  0x0000000003103c68: lea    rbx,[rbx+0x1]
  0x0000000003103c6c: mov    QWORD PTR [rsi+rdi*1],rbx
  0x0000000003103c70: jne    0x0000000003103c90  ;*lcmp
  0x0000000003103c76: movabs rsi,0x175811f0     ;   {metadata(method data for {method} {0x0000000017580af0} 'isPowerOfTwoAND' '(J)Z' in 'AndTest')}
  0x0000000003103c80: inc    DWORD PTR [rsi+0x160]
  0x0000000003103c86: mov    esi,0x1
  0x0000000003103c8b: jmp    0x0000000003103c95  ;*goto
  0x0000000003103c90: mov    esi,0x0            ;*iand
  0x0000000003103c95: and    rsi,rax
  0x0000000003103c98: and    esi,0x1
  0x0000000003103c9b: mov    rax,rsi
  0x0000000003103c9e: add    rsp,0x50
  0x0000000003103ca2: pop    rbp
  0x0000000003103ca3: test   DWORD PTR [rip+0xfffffffffe44c457],eax        # 0x0000000001550100
  0x0000000003103ca9: ret    

ASM code (Intel syntax) for the && version

  # {method} {0x0000000017580bd0} 'isPowerOfTwoANDAND' '(J)Z' in 'AndTest'
  # this:     rdx:rdx   = 'AndTest'
  # parm0:    r8:r8     = long
  ...
  0x0000000003103438: movabs rax,0x0
  0x0000000003103442: cmp    rax,r8
  0x0000000003103445: jge    0x0000000003103471  ;*lcmp
  0x000000000310344b: mov    rax,r8
  0x000000000310344e: movabs r10,0x1
  0x0000000003103458: sub    rax,r10
  0x000000000310345b: and    rax,r8
  0x000000000310345e: movabs rsi,0x0
  0x0000000003103468: cmp    rax,rsi
  0x000000000310346b: je     0x000000000310347b  ;*lcmp
  0x0000000003103471: mov    eax,0x0
  0x0000000003103476: jmp    0x0000000003103480  ;*ireturn
  0x000000000310347b: mov    eax,0x1            ;*goto
  0x0000000003103480: and    eax,0x1
  0x0000000003103483: add    rsp,0x40
  0x0000000003103487: pop    rbp
  0x0000000003103488: test   DWORD PTR [rip+0xfffffffffe44cc72],eax        # 0x0000000001550100
  0x000000000310348e: ret    

In this specific example, the JIT compiler generates far less assembly code for the && version than for Guava's & version (and, after yesterday's results, I was honestly surprised by this).
Compared to Guava's, the && version translates to 25% less bytecode for the JIT to compile, 50% fewer assembly instructions, and only two conditional jumps (the & version has four of them).

So everything points to Guava's & method being less efficient than the more "natural" && version.

... Or is it?

As noted before, I'm running the above examples with Java 8:

C:\....>java -version
java version "1.8.0_91"
Java(TM) SE Runtime Environment (build 1.8.0_91-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.91-b14, mixed mode)

But what if I switch to Java 7?

C:\....>c:\jdk1.7.0_79\bin\java -version
java version "1.7.0_79"
Java(TM) SE Runtime Environment (build 1.7.0_79-b15)
Java HotSpot(TM) 64-Bit Server VM (build 24.79-b02, mixed mode)
C:\....>c:\jdk1.7.0_79\bin\java -XX:+UnlockDiagnosticVMOptions -XX:CompileCommand=print,*AndTest.isPowerOfTwoAND -XX:PrintAssemblyOptions=intel AndTestMain
  .....
  0x0000000002512bac: xor    r10d,r10d
  0x0000000002512baf: mov    r11d,0x1
  0x0000000002512bb5: test   r8,r8
  0x0000000002512bb8: jle    0x0000000002512bde  ;*ifle
  0x0000000002512bba: mov    eax,0x1            ;*lload_1
  0x0000000002512bbf: mov    r9,r8
  0x0000000002512bc2: dec    r9
  0x0000000002512bc5: and    r9,r8
  0x0000000002512bc8: test   r9,r9
  0x0000000002512bcb: cmovne r11d,r10d
  0x0000000002512bcf: and    eax,r11d           ;*iand
  0x0000000002512bd2: add    rsp,0x10
  0x0000000002512bd6: pop    rbp
  0x0000000002512bd7: test   DWORD PTR [rip+0xffffffffffc0d423],eax        # 0x0000000002120000
  0x0000000002512bdd: ret    
  0x0000000002512bde: xor    eax,eax
  0x0000000002512be0: jmp    0x0000000002512bbf
  .....

Surprise! The assembly code generated for the & method by the JIT compiler in Java 7 has only one conditional jump now, and is way shorter! Whereas the && method (you'll have to trust me on this one, I don't want to clutter the ending!) remains about the same, with its two conditional jumps and a couple fewer instructions, tops.
Looks like Guava's engineers knew what they were doing, after all! (if they were trying to optimize Java 7 execution time, that is ;-)

So back to OP's latest question:

is this use of & (where && would be more normal) a real optimization?

And IMHO the answer is the same, even for this (very!) specific scenario: it depends on your JVM implementation, your compiler, your CPU and your input data.

Olivier answered 20/9, 2016 at 8:32 Comment(18)
@Passive Yes, that's what I was thinking when I read this answer. It's a really good analysis (thanks @Walen!), but I don't think it tells us what's going to happen at the level of CPU pipelines. Or am I wrong? I'm under the impression that the (say Hotspot) generated machine code can tear all these branches apart, but I realize I'm speaking from impression, not fact.Grimbal
Well, Java bytecode is the closest thing to ASM before delving into the specifics of every OS and CPU. Sure, the IBM javac might output different code than the official Oracle one or the OpenJDK one... And of course the machine code on an X86 machine would probably differ from a PowerPC AIX system or the Snapdragon CPUs used in many smartphones - every platform will have its own compilers and optimizations. But in a simple case like this, I doubt that the differences from one CPU to another will make a bigger difference than having 2 vs 3 bytecode conditional jumps.Olivier
While it might be "the closest thing to ASM", it is not close enough to allow you to draw any logical conclusions. Put simply, after the code has been JIT compiled, the JVM does not execute bytecodes.Doehne
Is it generating 2/3 jumps or 2/3 branches?Christenechristening
@Christenechristening it's a long post and I made a lot of edits ;-) where exactly are you referring to?Olivier
@Olivier You cleared it up. You originally said jump instead of conditional jump (which is really a branch). There's only one place to go for a jump, so there isn't anything to be predicted. Therefore there couldn't be a misprediction.Christenechristening
@Christenechristening glad I got it :-) Before the ASM edit, the only "jumps" were the bytecoded IF_ICMP* which are always conditional jumps, so I just wrote "jumps". After the edit, with the assembly code having both plain JMP and conditional JLE/JG/etc., I tried to be more specific, but I missed the bytecode part, thank you!Olivier
@Olivier I'm being a bit pedantic, but jump instructions (JMP in x86) aren't conditional. A jump is the jump opcode, and an address (relative or absolute). A "conditional jump" is actually a branch instruction.Christenechristening
@Christenechristening yes you are, but I can relate, so no problem :) Allow me to quote Intel's official Intel ® 64 and IA-32 Architectures Software Developer’s Manual: "5.1.7 Control Transfer Instructions The control transfer instructions provide jump, conditional jump, loop, and call and return operations to control program flow."Olivier
@Olivier Thanks for the clarification. I've almost exclusively worked with MIPS where jump and branch are very different instructions and I (incorrectly) assumed, with my (very little) research about x86, that it was the same way.Christenechristening
You also get different ASM depending on (among other things) how many times that method has been called. HotSpot is complicated.Tiger
@Tiger fair point. Let's assume it gets called a lot! Is it reasonable to say that with methods that are so basic, there isn't much that would make a difference between possible code-generations? It's not like there are other classes being loaded that might stimulate different in-linings, for example.Grimbal
@Grimbal the ASM you have posted is not the result when it gets called a lot. When that happens it triggers actual optimisation, which may even cause them to emit identical ASM.Tiger
@Olivier I love your answer, thank you! But I've added an update2 to address the "too broad" marker, quoting the Guava example and restating the question in those specific terms. I've tried hard to make it so all your hard work is still a suitable answer - would it be ok for you to review and fix up anything you think has been orphaned?Grimbal
@Grimbal I saw the Guava example yesterday, looks interesting! I'll try to have a look at the assembly today (CEST timezone) :)Olivier
@Grimbal I edited my answer to address your last update (Update2). Seems like Guava's version is optimized for Java 7 but not for Java 8, so it still depends on your environment #savedyouaclick :)Olivier
@Tiger These methods are called inside a loop with one million iterations for (int i = 0; i < 1000000; i++). I don't know if JIT considers that "a lot" but I hope it does! I'll take it into account next time anyway, since it's a good point.Olivier
Well, I think this is a fantastic answer. It's possible that there's some subtlety in Java8 which might make it apply further optimizations on the basis of HotSpot magic or something. In which case a new question might be born ... meanwhile, nice one! Thank you very much!Grimbal

For this kind of question you should run a microbenchmark. I used JMH for this test.

The benchmarks are implemented as

// boolean logical AND
bh.consume(value >= x & value <= y);

and

// conditional AND
bh.consume(value >= x && value <= y);

and

// bitwise OR, as suggested by Joop Eggen
bh.consume(((value - x) | (y - value)) >= 0);

With values for value, x and y according to the benchmark name.
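
For completeness, here is a minimal JMH skeleton of how benchmarks of this shape can be set up (a sketch, not the exact harness used; the class name and field values are mine, as placeholders for the "InRange" case):

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.infra.Blackhole;

@State(Scope.Thread)
public class AndBenchmark {
    // Placeholder values for the "InRange" case: x <= value <= y.
    int x = 1_000, y = 50_000, value = 15_000;

    @Benchmark
    public void isBooleanANDInRange(Blackhole bh) {
        bh.consume(value >= x & value <= y);
    }

    @Benchmark
    public void isConditionalANDInRange(Blackhole bh) {
        bh.consume(value >= x && value <= y);
    }

    @Benchmark
    public void isBitwiseORInRange(Blackhole bh) {
        bh.consume(((value - x) | (y - value)) >= 0);
    }
}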

The result (five warmup and ten measurement iterations) for throughput benchmarking is:

Benchmark                                 Mode  Cnt    Score    Error   Units
Benchmark.isBooleanANDBelowRange          thrpt   10  386.086 ± 17.383  ops/us
Benchmark.isBooleanANDInRange             thrpt   10  387.240 ±  7.657  ops/us
Benchmark.isBooleanANDOverRange           thrpt   10  381.847 ± 15.295  ops/us
Benchmark.isBitwiseORBelowRange           thrpt   10  384.877 ± 11.766  ops/us
Benchmark.isBitwiseORInRange              thrpt   10  380.743 ± 15.042  ops/us
Benchmark.isBitwiseOROverRange            thrpt   10  383.524 ± 16.911  ops/us
Benchmark.isConditionalANDBelowRange      thrpt   10  385.190 ± 19.600  ops/us
Benchmark.isConditionalANDInRange         thrpt   10  384.094 ± 15.417  ops/us
Benchmark.isConditionalANDOverRange       thrpt   10  380.913 ±  5.537  ops/us

The results are not that different for the evaluation itself. As long as no performance impact is spotted on that piece of code, I would not try to optimize it. Depending on the place in the code, the HotSpot compiler might decide to do some optimization, which is probably not covered by the above benchmarks.

Some references:

boolean logical AND - the result value is true if both operand values are true; otherwise, the result is false
conditional AND - is like &, but evaluates its right-hand operand only if the value of its left-hand operand is true
bitwise OR - the result value is the bitwise inclusive OR of the operand values

Sadden answered 20/9, 2016 at 8:30 Comment(18)
Definitely good point about JMH!! Are these tests checking for random variations in test-success and test-fail, especially around the two sub-expressions? Otherwise we're not seeing the effect of prediction-failure...Grimbal
That's the best benchmark so far, but it's flawed as well :) The blackhole takes much more time than the && or &, so you are basically measuring the blackhole performance :) Try with something like consume(a & b & c & d & f & g ... & z);Expert
When we are about extra wishes: ((value - x) | (y - value)) >= 0 would be my wish: bitwise orThirdrate
@Grimbal & generates three bytecode IF operations while && generates only two (see my answer for details). With a long enough list of randomly-generated values it should be more apparent that the short-circuit version has fewer prediction fails, if only because it has to predict just two of them... @SvetlinZarev This already shows in your test (with & errors being 17-21-17 and && being 13-16-13), but maybe you could redo them with that in mind?Olivier
@Grimbal BTW, it was a JMH bug that helped to discover that HotSpot does short-circuit the evaluation of &. So, answering the original question - no, the JVM still generates a conditional branch for &.Passive
@Sadden It's a good idea to write a benchmark that supports a theory. However pure benchmark results say nothing without proper analysis. If you include the generated assembly code that explains the benchmark results, this will be an answer.Passive
@Passive I absolutely agree with you, and it was mentioned already in my answer. The benchmark was done mainly to explain that it makes less sense without a specific test.Sadden
Nitpicking: Per the Java Language Specification §4.2.5; the boolean logical operators are &, ^, and |. && and || are the conditional-and and conditional-or operators.Boz
@JoopEggen I added the bitwise or.Sadden
Thanks. Disappointing, though certainly not bad. More something for C.Thirdrate
@Passive re JMH bug: does that mean that HotSpot would have skipped calling the RHS method in boolean b = a>b & methodWithSideEffects()?Grimbal
@Grimbal @Sadden I edited my answer to include the actual JIT-generated ASM code. And it looks like & might be better for some cases! Comments are welcome :-)Olivier
@Olivier Thanks for this additional input.Sadden
@Grimbal No, methodWithSideEffects() will not be skipped, otherwise it would be a spec violation. However, a method without side effects could be optimized out in this case.Passive
There is already a lot of confusion around the meaning of the non-shortcut logical operators. Can you please modify this post so as to not refer to them as bitwise? There are no bitwise calculations in your test.Mutule
@Mutule Thanks for the hint. I changed the naming and added some references to the JLS.Sadden
are the values randomly generated to show the effect of branch mispredictions?Arithmetician
@Arithmetician No. As this was not part of the initial question and the JMH benchmark was not crafted for it.Sadden

I'm going to come at this from a different angle.

Consider these two code fragments,

  if (value >= x && value <= y) {

and

  if (value >= x & value <= y) {

If we assume that value, x, y have a primitive type, then those two (partial) statements will give the same outcome for all possible input values. (If wrapper types are involved, then they are not exactly equivalent because of an implicit null test for y that might fail in the & version and not the && version.)

If the JIT compiler is doing a good job, its optimizer will be able to deduce that those two statements do the same thing:

  • If one is predictably faster than the other, then it should be able to use the faster version ... in the JIT compiled code.

  • If not, then it doesn't matter which version is used at the source code level.

  • Since the JIT compiler gathers path statistics before compiling, it can potentially have more information about the execution characteristics than the programmer(!).

  • If the current generation JIT compiler (on any given platform) doesn't optimize well enough to handle this, the next generation could well do ... depending on whether or not empirical evidence points to this being a worthwhile pattern to optimize.

  • Indeed, if you write your Java code in a way that optimizes for this, there is a chance that by picking the more "obscure" version of the code, you might inhibit the current or future JIT compiler's ability to optimize.

In short, I don't think you should do this kind of micro-optimization at the source code level. And if you accept this argument¹, and follow it to its logical conclusion, the question of which version is faster is ... moot².

1 - I do not claim this is anywhere near being a proof.

2 - Unless you are one of the tiny community of people who actually write Java JIT compilers ...


The "Very Famous Question" is interesting in two respects:

  • On the one hand, that is an example where the kind of optimization required to make a difference is way beyond the capability of a JIT compiler.

  • On the other hand, it would not necessarily be the correct thing to sort the array ... just because a sorted array can be processed faster. The cost of sorting the array could well be (much) greater than the saving.

Doehne answered 20/9, 2016 at 9:48 Comment(7)
Your point about inhibiting future optimizations is very well made! - deliberately putting '&' in a condition would be tantamount to "failing to express intentions clearly in order to trick the system", and when you lie to your computer, it will get its revenge ....Grimbal
Which one is faster is data dependent. This is something the JIT can't know. Or can JVM JITs profile such a thing? In that case this would be entirely feasible.Dugong
Yes. A JIT can do that. And HotSpot JIT compilers do do that, during the phase when the bytecodes are being interpreted ... prior to compilation.Doehne
If x and y are either constants or predictable values, the optimized code will rather look like value-x ≤ᵤ y-x, where ≤ᵤ is an unsigned long comparison and y-x a constant, though even if x and y are not predictable, that single comparison variant might be used, if two branches are considered more expensive than an eagerly performed comparison (a numeric comparison is on par with the minus operation). So thinking about & and && makes indeed no sense.Tab
Future optimizations - love that aspect. Consider how "a+b+c" morphed into using StringBuffers, even when maybe they didn't really matter that much. Then when StringBuilders came out, folks had these big hulking thread-safe StringBuffers where such overhead was unnecessary. Now "a+b+c" turns into StringBuilders on compile, but any explicit StringBuffers obviously still exist due to zealous overoptimization.Squish
I guess I'm used to the .NET JIT which does not profile at all and does not optimize very much. So sad.Dugong
@corsiKa: thankfully, HotSpot is quite good at eliminating unnecessary synchronizing on a purely local object. Problems occur when people were so “clever” to store the StringBuffer instances somewhere for re-use. That backfires in multiple regards.Tab

Using either & or && still requires a condition to be evaluated so it's unlikely it will save any processing time - it might even add to it considering you're evaluating both expressions when you only need to evaluate one.

Using & over && to save a nanosecond (if that) in some very rare situations is pointless; you've already wasted more time contemplating the difference than you would have saved by using & over &&.

Edit

I got curious and decided to run some benchmarks.

I made this class:

public class Main {

    static int x = 22, y = 48;

    public static void main(String[] args) {
        runWithOneAnd(30);
        runWithTwoAnds(30);
    }

    static void runWithOneAnd(int value){
        if(value >= x & value <= y){

        }
    }

    static void runWithTwoAnds(int value){
        if(value >= x && value <= y){

        }
    }
}

and ran some profiling tests with NetBeans. I didn't use any print statements, to save processing time; just know that both conditions evaluate to true.

First test:

[screenshot: the first profiling test]

Second test:

[screenshot: the second profiling test]

Third test:

[screenshot: the third profiling test]

As you can see from the profiling tests, using only one & actually takes 2-3 times longer to run compared to using two &&. This strikes me as somewhat odd, as I did expect better performance from only one &.

I'm not 100% sure why. In both cases, both expressions have to be evaluated because both are true. I suspect that the JVM does some special optimization behind the scenes to speed it up.

Moral of the story: convention is good and premature optimization is bad.


Edit 2

I redid the benchmark code with @SvetlinZarev's comments in mind and a few other improvements. Here is the modified benchmark code:

public class Main {

    static int x = 22, y = 48;

    public static void main(String[] args) {
        oneAndBothTrue();
        oneAndOneTrue();
        oneAndBothFalse();
        twoAndsBothTrue();
        twoAndsOneTrue();
        twoAndsBothFalse();
        System.out.println(b);
    }

    static void oneAndBothTrue() {
        int value = 30;
        for (int i = 0; i < 4000; i++) {
            if (value >= x & value <= y) {
                doSomething();
            }
        }
    }

    static void oneAndOneTrue() {
        int value = 60;
        for (int i = 0; i < 4000; i++) {
            if (value >= x & value <= y) {
                doSomething();
            }
        }
    }

    static void oneAndBothFalse() {
        int value = 100;
        for (int i = 0; i < 4000; i++) {
            if (value >= x & value <= y) {
                doSomething();
            }
        }
    }

    static void twoAndsBothTrue() {
        int value = 30;
        for (int i = 0; i < 4000; i++) {
            if (value >= x && value <= y) {
                doSomething();
            }
        }
    }

    static void twoAndsOneTrue() {
        int value = 60;
        for (int i = 0; i < 4000; i++) {
            if (value >= x && value <= y) {
                doSomething();
            }
        }
    }

    static void twoAndsBothFalse() {
        int value = 100;
        for (int i = 0; i < 4000; i++) {
            if (value >= x && value <= y) {
                doSomething();
            }
        }
    }

    //I wanted to avoid print statements here as they can
    //affect the benchmark results. 
    static StringBuilder b = new StringBuilder();
    static int times = 0;

    static void doSomething(){
        times++;
        b.append("I have run ").append(times).append(" times \n");
    }
}

And here are the performance tests:

Test 1:

[profiling screenshot]

Test 2:

[profiling screenshot]

Test 3:

[profiling screenshot]

This takes into account different values and different conditions as well.

Using one & takes more time to run when both conditions are true, about 60% or 2 milliseconds more time. When either one or both conditions are false, then one & runs faster, but it only runs about 0.30-0.50 milliseconds faster. So & will run faster than && in most circumstances, but the performance difference is still negligible.

Gloat answered 20/9, 2016 at 7:53 Comment(6)
Your micro benchmark is totally flawed. The JIT will optimize away those empty for loops, not to mention that a single execution of the method like in your code can never give any meaningful results.Expert
Thanks for pointing that out, I'll redo the tests with that in mind.Gloat
The only correct way of microbenchmarking is to use a tool like JMH.Expert
Unless you are running on a really old machine, your loops don't execute enough times to get any meaningful results. Also the order of when you call things can make a huge difference. Lastly, if you keep appending to a StringBuilder, it's eventually going to need to allocate a lot of memory and that will take a long time.Mutule
'BothFalse' is invalid. Those methods with 100 test the same thing as 60. You can't be both below the range and above the range at the same time, so BothFalse is unachievable.Sudan
What you really need is a test of value = 10 instead of 100. Then the 60 test becomes SecondFalse and 10 is FirstFalse. The latter test should run faster than the former with &&s but may still be comparable with &s.Sudan

What you are after is something like this:

x <= value & value <= y
value - x >= 0 & y - value >= 0
((value - x) | (y - value)) >= 0  // integer bit-or

Interesting; one would almost like to look at the bytecode. But it's hard to say. I wish this were a C question.
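
Wrapped up as a method, that last form needs only a single test. A sketch (note the caveat: value - x and y - value must not overflow int, which holds e.g. when all three values are known to be non-negative):

// Branch-free range check: if either difference is negative, its sign
// bit is set, so the bitwise OR is negative and the test fails.
static boolean inRange(int value, int x, int y) {
    return ((value - x) | (y - value)) >= 0;
}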

Thirdrate answered 20/9, 2016 at 7:55 Comment(0)

I was curious about the answer as well, so I wrote the following (simple) test for this:

private static final int max = 80000;
private static final int size = 100000;
private static final int x = 1500;
private static final int y = 15000;
private Random random;

@Before
public void setUp() {
    this.random = new Random();
}

@After
public void tearDown() {
    random = null;
}

@Test
public void testSingleOperand() {
    int counter = 0;
    int[] numbers = new int[size];
    for (int j = 0; j < size; j++) {
        numbers[j] = random.nextInt(max);
    }

    long start = System.nanoTime(); //start measuring after an array has been filled
    for (int i = 0; i < numbers.length; i++) {
        if (numbers[i] >= x & numbers[i] <= y) {
            counter++;
        }
    }
    long end = System.nanoTime();
    System.out.println("Duration of single operand: " + (end - start));
}

@Test
public void testDoubleOperand() {
    int counter = 0;
    int[] numbers = new int[size];
    for (int j = 0; j < size; j++) {
        numbers[j] = random.nextInt(max);
    }

    long start = System.nanoTime(); //start measuring after an array has been filled
    for (int i = 0; i < numbers.length; i++) {
        if (numbers[i] >= x && numbers[i] <= y) {
            counter++;
        }
    }
    long end = System.nanoTime();
    System.out.println("Duration of double operand: " + (end - start));
}

With the end result being that the comparison with && always wins in terms of speed, being about 1.5-2 milliseconds quicker than &.

EDIT: As @SvetlinZarev pointed out, I was also measuring the time it took Random to get an integer. Changed it to use a pre-filled array of random numbers, which caused the duration of the single operand test to wildly fluctuate; the differences between several runs were up to 6-7ms.

Thermic answered 20/9, 2016 at 8:8 Comment(8)
Ok, interesting: I can see that the first condition will mostly succeed (generated >= x), which means the predictor will usually get things right (if it works the way I think it does). I'm going to try fiddling with those 'x' and 'y' values - I think x=40000 and y=60000 will be interesting (50% success on each test).Grimbal
With those values, && still beats &. This time the average difference between the two seemed higher too, never falling below 2ms and occasionally it was even above 3ms.Mclaughlin
You are measuring the random.nextInt(), as it takes much more time than the simple && or &. Your test is flawed.Expert
@SvetlinZarev good point - and if there are any conditionals in there, they'll swamp it. Better to build an array of randoms first, and then iterate it in the timed loop? Rats, I want to try this, but I'm at work.Grimbal
Just use JMH. it has nice @Setup methods to make the setup and generates nice reports.Expert
@SvetlinZarev Good point on the random comment; I've changed it to use an array filled with random integers, with the same end result being that && is faster than &.Mclaughlin
@Thermic you are still lacking warmup :)Expert
@SvetlinZarev How much warmup is needed? I ran it manually about ~20 times.Mclaughlin

The way this was explained to me is that && will return false as soon as a check in the series is false, while & checks all items in the series regardless of how many are false. I.e.:

if (x>0 && x <=10 && x ...

Will run faster than

if (x>0 & x <=10 & x ...

If x is greater than 10, because single ampersands will continue to check the rest of the conditions whereas double ampersands will break after the first non-true condition.

Hel answered 29/9, 2016 at 1:16 Comment(1)
Sorry, this misses the point of the question! Look at the first "Note" in the question - I was quite explicit about this. Obviously, if significant time can be saved by not executing the subsequent conditions, then fine, we all know about that. But to do that involves a branch, and modern processor instruction pipelines sometimes make guesses about the direction a branch will take which turn out to be a) wrong and b) quite expensive. Please read the top answer to the (very famous) question I linked to, and then decide whether you want to keep this answer.Grimbal
