Why does Java switch on contiguous ints appear to run faster with added cases?
Asked Answered
C

5

287

I am working on some Java code which needs to be highly optimized as it will run in hot functions that are invoked at many points in my main program logic. Part of this code involves multiplying double variables by 10 raised to arbitrary non-negative int exponents. One fast way (edit: but not the fastest possible, see Update 2 below) to get the multiplied value is to switch on the exponent:

double multiplyByPowerOfTen(final double d, final int exponent) {
   switch (exponent) {
      case 0:
         return d;
      case 1:
         return d*10;
      case 2:
         return d*100;
      // ... same pattern
      case 9:
         return d*1000000000;
      case 10:
         return d*10000000000L;
      // ... same pattern with long literals
      case 18:
         return d*1000000000000000000L;
      default:
         throw new ParseException("Unhandled power of ten " + power, 0);
   }
}

The commented ellipses above indicate that the case int constants continue incrementing by 1, so there are really 19 cases in the above code snippet. Since I wasn't sure whether I would actually need all the powers of 10 in case statements 10 thru 18, I ran some microbenchmarks comparing the time to complete 10 million operations with this switch statement versus a switch with only cases 0 thru 9 (with the exponent limited to 9 or less to avoid breaking the pared-down switch). I got the rather surprising (to me, at least!) result that the longer switch with more case statements actually ran faster.

On a lark, I tried adding even more cases which just returned dummy values, and found that I could get the switch to run even faster with around 22-27 declared cases (even though those dummy cases are never actually hit while the code is running). (Again, cases were added in a contiguous fashion by incrementing the prior case constant by 1.) These execution time differences are not very significant: for a random exponent between 0 and 10, the dummy padded switch statement finishes 10 million executions in 1.49 secs versus 1.54 secs for the unpadded version, for a grand total savings of 5ns per execution. So, not the kind of thing that makes obsessing over padding out a switch statement worth the effort from an optimization standpoint. But I still just find it curious and counter-intuitive that a switch doesn't become slower (or perhaps at best maintain constant O(1) time) to execute as more cases are added to it.

switch benchmarking results

These are the results I obtained from running with various limits on the randomly-generated exponent values. I didn't include the results all the way down to 1 for the exponent limit, but the general shape of the curve remains the same, with a ridge around the 12-17 case mark, and a valley between 18-28. All tests were run in JUnitBenchmarks using shared containers for the random values to ensure identical testing inputs. I also ran the tests both in order from longest switch statement to shortest, and vice-versa, to try and eliminate the possibility of ordering-related test problems. I've put my testing code up on a github repo if anyone wants to try to reproduce these results.

So, what's going on here? Some vagaries of my architecture or micro-benchmark construction? Or is the Java switch really a little faster to execute in the 18 to 28 case range than it is from 11 up to 17?

github test repo "switch-experiment"

UPDATE: I cleaned up the benchmarking library quite a bit and added a text file in /results with some output across a wider range of possible exponent values. I also added an option in the testing code not to throw an Exception from default, but this doesn't appear to affect the results.

UPDATE 2: Found some pretty good discussion of this issue from back in 2009 on the xkcd forum here: http://forums.xkcd.com/viewtopic.php?f=11&t=33524. The OP's discussion of using Array.binarySearch() gave me the idea for a simple array-based implementation of the exponentiation pattern above. There's no need for the binary search since I know what the entries in the array are. It appears to run about 3 times faster than using switch, obviously at the expense of some of the control flow that switch affords. That code has been added to the github repo also.

Cocci answered 25/3, 2013 at 17:28 Comment(20)
The only logical explanation I can think of is Java compiler is trying to optimize the switch statement with more cases, perhaps going into a O(1) logic as most C compilers did in the dark ages.Arrestment
Now all Googlers everywhere will have precisely 22 cases in all switch statements, as it's clearly the most optimal solution. :D (Don't show this to my lead, please.)Pestalozzi
Do you have a simpler SSCCE? This one doesn't compile for me. As weak as I am with Java performance, I wanna take a shot at this.Socher
Benchmark without dependencies would be helpful.Keys
You might find the section "Switches in the JVM" in my answer about string-based cases helpful. I think what's happening here is that you are switching from a lookupswitch to a tableswitch. Disassembling your code with javap would show you for sure.Wyckoff
yes @Wyckoff I remembered your answer a while back - did enough to search it. ThanksArrestment
@erickson, I thought that too, but its always tableswitch.Keys
I added the dependency jars to the /lib folder in the repo. @Socher Sorry, I've kind of already spent too much time going down this rabbit hole! If you take the "extends AbstractBenchmark" off the test classes and get rid of the "com.carrotsearch" imports, you can run with just the JUnit dependency, but the carrotsearch stuff is pretty nice for filtering out some of the noise from the JIT and warmup periods. Unfortunately I don't know how to run these JUnit tests outside of IntelliJ though.Cocci
@AndrewBissell I managed to repro your results with a much simpler benchmark. The branch vs. table for the small vs. medium size performance was a somewhat obvious guess. But I have no better insight than anyone else about the dip going into 30 cases...Socher
@Socher Thanks for reproducing it! Some of the details of the actual practical use for the switch did leak into the test a bit, I will probably try and move those out if this question keeps trucking. ;)Cocci
@Wyckoff Actually all tests compile to tableswitch at the bytecode level but for some reasons the JIT seems to compile tableswitch to a lookupswitch when the number of cases is low.Topliffe
Related answer from another question #10288200Chrystal
@AndrewBissell Note that (i) your new static block could be written: ARRAY_32[i] = Math.pow(1, i); instead of the succession of if/else and (ii) 1000000000000000000L * 10; and following are negative numbers due to Long overflow - you should use a double[] instead.Topliffe
@Topliffe I knew someone would call me out on the overflow. :) I didn't mind having it in there once the goal changed to "throw a lot of stuff at the switch and see how it performs." Once I move this into production (most likely with an array implementation) I'll probably just put an upper limit of 14 or 15 on the exponent and throw an Exception if clients try to exceed it. Your idea of using double for the exponent values is interesting though, since it may save the expense of a cast for the operation. I'll let you know what I find in test.Cocci
So why not get rid of the case all together, and just do something like: final long[] pow = {1, 10, 100, 1000, 10000, etc.}; and return d * pow[exponent];Perineum
@EbbeM.Pedersen That's the solution described in UPDATE 2 (but one would except the JVM to already do something similar with a switch).Topliffe
@Mysticial, you need a couple of JVM options to be as good, namely -XX:+PrintAssembly and +XX:+PrintFlagsFinal - then treat it as C. But the usual case is lack of inline which is trivial to see in the assembly.Fruma
As your question is specifically about switch cases, this is not an answer, but on the subject of performance of the pow() operation have you considered an approximation? I found this cool article: martin.ankerl.com/2007/10/04/…Rusell
A brief update-by-comment, this issue was filed in the Oracle bugs database: bugs.sun.com/bugdatabase/view_bug.do?bug_id=8010941Cocci
Oracle engineers have reduced the minimum size of HotSpot generated jump tables to 10 on x86 and 5 on SPARC: hg.openjdk.java.net/jdk8/jdk8/hotspot/rev/34bd5e86aadbCocci
T
241

As pointed out by the other answer, because the case values are contiguous (as opposed to sparse), the generated bytecode for your various tests uses a switch table (bytecode instruction tableswitch).

However, once the JIT starts its job and compiles the bytecode into assembly, the tableswitch instruction does not always result in an array of pointers: sometimes the switch table is transformed into what looks like a lookupswitch (similar to an if/else if structure).

Decompiling the assembly generated by the JIT (hotspot JDK 1.7) shows that it uses a succession of if/else if when there are 17 cases or less, an array of pointers when there are more than 18 (more efficient).

The reason why this magic number of 18 is used seems to come down to the default value of the MinJumpTableSize JVM flag (around line 352 in the code).

I have raised the issue on the hotspot compiler list and it seems to be a legacy of past testing. Note that this default value has been removed in JDK 8 after more benchmarking was performed.

Finally, when the method becomes too long (> 25 cases in my tests), it is in not inlined any longer with the default JVM settings - that is the likeliest cause for the drop in performance at that point.


With 5 cases, the decompiled code looks like this (notice the cmp/je/jg/jmp instructions, the assembly for if/goto):

[Verified Entry Point]
  # {method} 'multiplyByPowerOfTen' '(DI)D' in 'javaapplication4/Test1'
  # parm0:    xmm0:xmm0   = double
  # parm1:    rdx       = int
  #           [sp+0x20]  (sp of caller)
  0x00000000024f0160: mov    DWORD PTR [rsp-0x6000],eax
                                                ;   {no_reloc}
  0x00000000024f0167: push   rbp
  0x00000000024f0168: sub    rsp,0x10           ;*synchronization entry
                                                ; - javaapplication4.Test1::multiplyByPowerOfTen@-1 (line 56)
  0x00000000024f016c: cmp    edx,0x3
  0x00000000024f016f: je     0x00000000024f01c3
  0x00000000024f0171: cmp    edx,0x3
  0x00000000024f0174: jg     0x00000000024f01a5
  0x00000000024f0176: cmp    edx,0x1
  0x00000000024f0179: je     0x00000000024f019b
  0x00000000024f017b: cmp    edx,0x1
  0x00000000024f017e: jg     0x00000000024f0191
  0x00000000024f0180: test   edx,edx
  0x00000000024f0182: je     0x00000000024f01cb
  0x00000000024f0184: mov    ebp,edx
  0x00000000024f0186: mov    edx,0x17
  0x00000000024f018b: call   0x00000000024c90a0  ; OopMap{off=48}
                                                ;*new  ; - javaapplication4.Test1::multiplyByPowerOfTen@72 (line 83)
                                                ;   {runtime_call}
  0x00000000024f0190: int3                      ;*new  ; - javaapplication4.Test1::multiplyByPowerOfTen@72 (line 83)
  0x00000000024f0191: mulsd  xmm0,QWORD PTR [rip+0xffffffffffffffa7]        # 0x00000000024f0140
                                                ;*dmul
                                                ; - javaapplication4.Test1::multiplyByPowerOfTen@52 (line 62)
                                                ;   {section_word}
  0x00000000024f0199: jmp    0x00000000024f01cb
  0x00000000024f019b: mulsd  xmm0,QWORD PTR [rip+0xffffffffffffff8d]        # 0x00000000024f0130
                                                ;*dmul
                                                ; - javaapplication4.Test1::multiplyByPowerOfTen@46 (line 60)
                                                ;   {section_word}
  0x00000000024f01a3: jmp    0x00000000024f01cb
  0x00000000024f01a5: cmp    edx,0x5
  0x00000000024f01a8: je     0x00000000024f01b9
  0x00000000024f01aa: cmp    edx,0x5
  0x00000000024f01ad: jg     0x00000000024f0184  ;*tableswitch
                                                ; - javaapplication4.Test1::multiplyByPowerOfTen@1 (line 56)
  0x00000000024f01af: mulsd  xmm0,QWORD PTR [rip+0xffffffffffffff81]        # 0x00000000024f0138
                                                ;*dmul
                                                ; - javaapplication4.Test1::multiplyByPowerOfTen@64 (line 66)
                                                ;   {section_word}
  0x00000000024f01b7: jmp    0x00000000024f01cb
  0x00000000024f01b9: mulsd  xmm0,QWORD PTR [rip+0xffffffffffffff67]        # 0x00000000024f0128
                                                ;*dmul
                                                ; - javaapplication4.Test1::multiplyByPowerOfTen@70 (line 68)
                                                ;   {section_word}
  0x00000000024f01c1: jmp    0x00000000024f01cb
  0x00000000024f01c3: mulsd  xmm0,QWORD PTR [rip+0xffffffffffffff55]        # 0x00000000024f0120
                                                ;*tableswitch
                                                ; - javaapplication4.Test1::multiplyByPowerOfTen@1 (line 56)
                                                ;   {section_word}
  0x00000000024f01cb: add    rsp,0x10
  0x00000000024f01cf: pop    rbp
  0x00000000024f01d0: test   DWORD PTR [rip+0xfffffffffdf3fe2a],eax        # 0x0000000000430000
                                                ;   {poll_return}
  0x00000000024f01d6: ret    

With 18 cases, the assembly looks like this (notice the array of pointers which is used and suppresses the need for all the comparisons: jmp QWORD PTR [r8+r10*1] jumps directly to the right multiplication) - that is the likely reason for the performance improvement:

[Verified Entry Point]
  # {method} 'multiplyByPowerOfTen' '(DI)D' in 'javaapplication4/Test1'
  # parm0:    xmm0:xmm0   = double
  # parm1:    rdx       = int
  #           [sp+0x20]  (sp of caller)
  0x000000000287fe20: mov    DWORD PTR [rsp-0x6000],eax
                                                ;   {no_reloc}
  0x000000000287fe27: push   rbp
  0x000000000287fe28: sub    rsp,0x10           ;*synchronization entry
                                                ; - javaapplication4.Test1::multiplyByPowerOfTen@-1 (line 56)
  0x000000000287fe2c: cmp    edx,0x13
  0x000000000287fe2f: jae    0x000000000287fe46
  0x000000000287fe31: movsxd r10,edx
  0x000000000287fe34: shl    r10,0x3
  0x000000000287fe38: movabs r8,0x287fd70       ;   {section_word}
  0x000000000287fe42: jmp    QWORD PTR [r8+r10*1]  ;*tableswitch
                                                ; - javaapplication4.Test1::multiplyByPowerOfTen@1 (line 56)
  0x000000000287fe46: mov    ebp,edx
  0x000000000287fe48: mov    edx,0x31
  0x000000000287fe4d: xchg   ax,ax
  0x000000000287fe4f: call   0x00000000028590a0  ; OopMap{off=52}
                                                ;*new  ; - javaapplication4.Test1::multiplyByPowerOfTen@202 (line 96)
                                                ;   {runtime_call}
  0x000000000287fe54: int3                      ;*new  ; - javaapplication4.Test1::multiplyByPowerOfTen@202 (line 96)
  0x000000000287fe55: mulsd  xmm0,QWORD PTR [rip+0xfffffffffffffe8b]        # 0x000000000287fce8
                                                ;*dmul
                                                ; - javaapplication4.Test1::multiplyByPowerOfTen@194 (line 92)
                                                ;   {section_word}
  0x000000000287fe5d: jmp    0x000000000287ff16
  0x000000000287fe62: mulsd  xmm0,QWORD PTR [rip+0xfffffffffffffe86]        # 0x000000000287fcf0
                                                ;*dmul
                                                ; - javaapplication4.Test1::multiplyByPowerOfTen@188 (line 90)
                                                ;   {section_word}
  0x000000000287fe6a: jmp    0x000000000287ff16
  0x000000000287fe6f: mulsd  xmm0,QWORD PTR [rip+0xfffffffffffffe81]        # 0x000000000287fcf8
                                                ;*dmul
                                                ; - javaapplication4.Test1::multiplyByPowerOfTen@182 (line 88)
                                                ;   {section_word}
  0x000000000287fe77: jmp    0x000000000287ff16
  0x000000000287fe7c: mulsd  xmm0,QWORD PTR [rip+0xfffffffffffffe7c]        # 0x000000000287fd00
                                                ;*dmul
                                                ; - javaapplication4.Test1::multiplyByPowerOfTen@176 (line 86)
                                                ;   {section_word}
  0x000000000287fe84: jmp    0x000000000287ff16
  0x000000000287fe89: mulsd  xmm0,QWORD PTR [rip+0xfffffffffffffe77]        # 0x000000000287fd08
                                                ;*dmul
                                                ; - javaapplication4.Test1::multiplyByPowerOfTen@170 (line 84)
                                                ;   {section_word}
  0x000000000287fe91: jmp    0x000000000287ff16
  0x000000000287fe96: mulsd  xmm0,QWORD PTR [rip+0xfffffffffffffe72]        # 0x000000000287fd10
                                                ;*dmul
                                                ; - javaapplication4.Test1::multiplyByPowerOfTen@164 (line 82)
                                                ;   {section_word}
  0x000000000287fe9e: jmp    0x000000000287ff16
  0x000000000287fea0: mulsd  xmm0,QWORD PTR [rip+0xfffffffffffffe70]        # 0x000000000287fd18
                                                ;*dmul
                                                ; - javaapplication4.Test1::multiplyByPowerOfTen@158 (line 80)
                                                ;   {section_word}
  0x000000000287fea8: jmp    0x000000000287ff16
  0x000000000287feaa: mulsd  xmm0,QWORD PTR [rip+0xfffffffffffffe6e]        # 0x000000000287fd20
                                                ;*dmul
                                                ; - javaapplication4.Test1::multiplyByPowerOfTen@152 (line 78)
                                                ;   {section_word}
  0x000000000287feb2: jmp    0x000000000287ff16
  0x000000000287feb4: mulsd  xmm0,QWORD PTR [rip+0xfffffffffffffe24]        # 0x000000000287fce0
                                                ;*dmul
                                                ; - javaapplication4.Test1::multiplyByPowerOfTen@146 (line 76)
                                                ;   {section_word}
  0x000000000287febc: jmp    0x000000000287ff16
  0x000000000287febe: mulsd  xmm0,QWORD PTR [rip+0xfffffffffffffe6a]        # 0x000000000287fd30
                                                ;*dmul
                                                ; - javaapplication4.Test1::multiplyByPowerOfTen@140 (line 74)
                                                ;   {section_word}
  0x000000000287fec6: jmp    0x000000000287ff16
  0x000000000287fec8: mulsd  xmm0,QWORD PTR [rip+0xfffffffffffffe68]        # 0x000000000287fd38
                                                ;*dmul
                                                ; - javaapplication4.Test1::multiplyByPowerOfTen@134 (line 72)
                                                ;   {section_word}
  0x000000000287fed0: jmp    0x000000000287ff16
  0x000000000287fed2: mulsd  xmm0,QWORD PTR [rip+0xfffffffffffffe66]        # 0x000000000287fd40
                                                ;*dmul
                                                ; - javaapplication4.Test1::multiplyByPowerOfTen@128 (line 70)
                                                ;   {section_word}
  0x000000000287feda: jmp    0x000000000287ff16
  0x000000000287fedc: mulsd  xmm0,QWORD PTR [rip+0xfffffffffffffe64]        # 0x000000000287fd48
                                                ;*dmul
                                                ; - javaapplication4.Test1::multiplyByPowerOfTen@122 (line 68)
                                                ;   {section_word}
  0x000000000287fee4: jmp    0x000000000287ff16
  0x000000000287fee6: mulsd  xmm0,QWORD PTR [rip+0xfffffffffffffe62]        # 0x000000000287fd50
                                                ;*dmul
                                                ; - javaapplication4.Test1::multiplyByPowerOfTen@116 (line 66)
                                                ;   {section_word}
  0x000000000287feee: jmp    0x000000000287ff16
  0x000000000287fef0: mulsd  xmm0,QWORD PTR [rip+0xfffffffffffffe60]        # 0x000000000287fd58
                                                ;*dmul
                                                ; - javaapplication4.Test1::multiplyByPowerOfTen@110 (line 64)
                                                ;   {section_word}
  0x000000000287fef8: jmp    0x000000000287ff16
  0x000000000287fefa: mulsd  xmm0,QWORD PTR [rip+0xfffffffffffffe5e]        # 0x000000000287fd60
                                                ;*dmul
                                                ; - javaapplication4.Test1::multiplyByPowerOfTen@104 (line 62)
                                                ;   {section_word}
  0x000000000287ff02: jmp    0x000000000287ff16
  0x000000000287ff04: mulsd  xmm0,QWORD PTR [rip+0xfffffffffffffe5c]        # 0x000000000287fd68
                                                ;*dmul
                                                ; - javaapplication4.Test1::multiplyByPowerOfTen@98 (line 60)
                                                ;   {section_word}
  0x000000000287ff0c: jmp    0x000000000287ff16
  0x000000000287ff0e: mulsd  xmm0,QWORD PTR [rip+0xfffffffffffffe12]        # 0x000000000287fd28
                                                ;*tableswitch
                                                ; - javaapplication4.Test1::multiplyByPowerOfTen@1 (line 56)
                                                ;   {section_word}
  0x000000000287ff16: add    rsp,0x10
  0x000000000287ff1a: pop    rbp
  0x000000000287ff1b: test   DWORD PTR [rip+0xfffffffffd9b00df],eax        # 0x0000000000230000
                                                ;   {poll_return}
  0x000000000287ff21: ret    

And finally the assembly with 30 cases (below) looks similar to 18 cases, except for the additional movapd xmm0,xmm1 that appears towards the middle of the code, as spotted by @cHao - however the likeliest reason for the drop in performance is that the method is too long to be inlined with the default JVM settings:

[Verified Entry Point]
  # {method} 'multiplyByPowerOfTen' '(DI)D' in 'javaapplication4/Test1'
  # parm0:    xmm0:xmm0   = double
  # parm1:    rdx       = int
  #           [sp+0x20]  (sp of caller)
  0x0000000002524560: mov    DWORD PTR [rsp-0x6000],eax
                                                ;   {no_reloc}
  0x0000000002524567: push   rbp
  0x0000000002524568: sub    rsp,0x10           ;*synchronization entry
                                                ; - javaapplication4.Test1::multiplyByPowerOfTen@-1 (line 56)
  0x000000000252456c: movapd xmm1,xmm0
  0x0000000002524570: cmp    edx,0x1f
  0x0000000002524573: jae    0x0000000002524592  ;*tableswitch
                                                ; - javaapplication4.Test1::multiplyByPowerOfTen@1 (line 56)
  0x0000000002524575: movsxd r10,edx
  0x0000000002524578: shl    r10,0x3
  0x000000000252457c: mulsd  xmm0,QWORD PTR [rip+0xfffffffffffffe3c]        # 0x00000000025243c0
                                                ;*dmul
                                                ; - javaapplication4.Test1::multiplyByPowerOfTen@364 (line 118)
                                                ;   {section_word}
  0x0000000002524584: movabs r8,0x2524450       ;   {section_word}
  0x000000000252458e: jmp    QWORD PTR [r8+r10*1]  ;*tableswitch
                                                ; - javaapplication4.Test1::multiplyByPowerOfTen@1 (line 56)
  0x0000000002524592: mov    ebp,edx
  0x0000000002524594: mov    edx,0x31
  0x0000000002524599: xchg   ax,ax
  0x000000000252459b: call   0x00000000024f90a0  ; OopMap{off=64}
                                                ;*new  ; - javaapplication4.Test1::multiplyByPowerOfTen@370 (line 120)
                                                ;   {runtime_call}
  0x00000000025245a0: int3                      ;*new  ; - javaapplication4.Test1::multiplyByPowerOfTen@370 (line 120)
  0x00000000025245a1: mulsd  xmm0,QWORD PTR [rip+0xfffffffffffffe27]        # 0x00000000025243d0
                                                ;*dmul
                                                ; - javaapplication4.Test1::multiplyByPowerOfTen@358 (line 116)
                                                ;   {section_word}
  0x00000000025245a9: jmp    0x0000000002524744
  0x00000000025245ae: mulsd  xmm0,QWORD PTR [rip+0xfffffffffffffe22]        # 0x00000000025243d8
                                                ;*dmul
                                                ; - javaapplication4.Test1::multiplyByPowerOfTen@348 (line 114)
                                                ;   {section_word}
  0x00000000025245b6: jmp    0x0000000002524744
  0x00000000025245bb: mulsd  xmm0,QWORD PTR [rip+0xfffffffffffffe1d]        # 0x00000000025243e0
                                                ;*dmul
                                                ; - javaapplication4.Test1::multiplyByPowerOfTen@338 (line 112)
                                                ;   {section_word}
  0x00000000025245c3: jmp    0x0000000002524744
  0x00000000025245c8: mulsd  xmm0,QWORD PTR [rip+0xfffffffffffffe18]        # 0x00000000025243e8
                                                ;*dmul
                                                ; - javaapplication4.Test1::multiplyByPowerOfTen@328 (line 110)
                                                ;   {section_word}
  0x00000000025245d0: jmp    0x0000000002524744
  0x00000000025245d5: mulsd  xmm0,QWORD PTR [rip+0xfffffffffffffe13]        # 0x00000000025243f0
                                                ;*dmul
                                                ; - javaapplication4.Test1::multiplyByPowerOfTen@318 (line 108)
                                                ;   {section_word}
  0x00000000025245dd: jmp    0x0000000002524744
  0x00000000025245e2: mulsd  xmm0,QWORD PTR [rip+0xfffffffffffffe0e]        # 0x00000000025243f8
                                                ;*dmul
                                                ; - javaapplication4.Test1::multiplyByPowerOfTen@308 (line 106)
                                                ;   {section_word}
  0x00000000025245ea: jmp    0x0000000002524744
  0x00000000025245ef: mulsd  xmm0,QWORD PTR [rip+0xfffffffffffffe09]        # 0x0000000002524400
                                                ;*dmul
                                                ; - javaapplication4.Test1::multiplyByPowerOfTen@298 (line 104)
                                                ;   {section_word}
  0x00000000025245f7: jmp    0x0000000002524744
  0x00000000025245fc: mulsd  xmm0,QWORD PTR [rip+0xfffffffffffffe04]        # 0x0000000002524408
                                                ;*dmul
                                                ; - javaapplication4.Test1::multiplyByPowerOfTen@288 (line 102)
                                                ;   {section_word}
  0x0000000002524604: jmp    0x0000000002524744
  0x0000000002524609: mulsd  xmm0,QWORD PTR [rip+0xfffffffffffffdff]        # 0x0000000002524410
                                                ;*dmul
                                                ; - javaapplication4.Test1::multiplyByPowerOfTen@278 (line 100)
                                                ;   {section_word}
  0x0000000002524611: jmp    0x0000000002524744
  0x0000000002524616: mulsd  xmm0,QWORD PTR [rip+0xfffffffffffffdfa]        # 0x0000000002524418
                                                ;*dmul
                                                ; - javaapplication4.Test1::multiplyByPowerOfTen@268 (line 98)
                                                ;   {section_word}
  0x000000000252461e: jmp    0x0000000002524744
  0x0000000002524623: mulsd  xmm0,QWORD PTR [rip+0xfffffffffffffd9d]        # 0x00000000025243c8
                                                ;*dmul
                                                ; - javaapplication4.Test1::multiplyByPowerOfTen@258 (line 96)
                                                ;   {section_word}
  0x000000000252462b: jmp    0x0000000002524744
  0x0000000002524630: movapd xmm0,xmm1
  0x0000000002524634: mulsd  xmm0,QWORD PTR [rip+0xfffffffffffffe0c]        # 0x0000000002524448
                                                ;*dmul
                                                ; - javaapplication4.Test1::multiplyByPowerOfTen@242 (line 92)
                                                ;   {section_word}
  0x000000000252463c: jmp    0x0000000002524744
  0x0000000002524641: movapd xmm0,xmm1
  0x0000000002524645: mulsd  xmm0,QWORD PTR [rip+0xfffffffffffffddb]        # 0x0000000002524428
                                                ;*dmul
                                                ; - javaapplication4.Test1::multiplyByPowerOfTen@236 (line 90)
                                                ;   {section_word}
  0x000000000252464d: jmp    0x0000000002524744
  0x0000000002524652: movapd xmm0,xmm1
  0x0000000002524656: mulsd  xmm0,QWORD PTR [rip+0xfffffffffffffdd2]        # 0x0000000002524430
                                                ;*dmul
                                                ; - javaapplication4.Test1::multiplyByPowerOfTen@230 (line 88)
                                                ;   {section_word}
  0x000000000252465e: jmp    0x0000000002524744
  0x0000000002524663: movapd xmm0,xmm1
  0x0000000002524667: mulsd  xmm0,QWORD PTR [rip+0xfffffffffffffdc9]        # 0x0000000002524438
                                                ;*dmul
                                                ; - javaapplication4.Test1::multiplyByPowerOfTen@224 (line 86)
                                                ;   {section_word}

[etc.]

  0x0000000002524744: add    rsp,0x10
  0x0000000002524748: pop    rbp
  0x0000000002524749: test   DWORD PTR [rip+0xfffffffffde1b8b1],eax        # 0x0000000000340000
                                                ;   {poll_return}
  0x000000000252474f: ret    
Topliffe answered 25/3, 2013 at 17:59 Comment(23)
@Abu this is more trustable source. - disassembly.Arrestment
@syb0rg To be honest I don't understand the fine details either ;-)Topliffe
+1 for great answer! Could you disassemble something with 30+ cases to compare for when the performance exits the "dip" in the OP's chart?Pestalozzi
@Topliffe Nice answers to you and Vishal! I guess my next question would be, could this padding trick somehow be incorporated into the JVM to improve performance in the 6-17 case range? (Not that it necessarily is worth the effort, I'm just wondering if it would be possible semantically by proving that there is a limit on the range of values that can come into the switch.)Cocci
@AndrewBissell Can you confirm that the cases are contiguous in all your tests (i.e. 0 to 3, then 0 to 4, 0 to 5 etc. 0 to 30)? Because some of the other answers make the difference between sparse and contiguous.Topliffe
@Keys Yes - the change happens when there are 18 cases or more.Topliffe
@Topliffe Confirmed. There was a LOT of copy-paste involved but it looks like I got it right. Maybe the range from 10-17 or so is treated as sparse by the compiler?Cocci
@assylias, how were you able to generate the JITted code? I'm rusty with my assembly and I'm more familiar with the motorola flavor, but I will take a look at this again after lunch to see why it's doing this. Unless someone else has already figured this out!Sinegold
@VivinPaliath #1503979Topliffe
@AndrewBissell My guess is that the different behaviour is based on either (i) cross-architecture performance tests which have shown that the array of pointers is only efficient when the number of cases is larger than 18 or (ii) the code is profiled as it is run and the profiler determines which approach is better during runtime. I can't find the answer.Topliffe
The 30-case disassembly and the 18-case one look mostly the same. The differences seem mostly limited to an extra bit of extra register shuffling after about the 11th case. Can't say why the JITter does that; it appears unnecessary.Mangrove
I've actually never taken a look at JIT-generated assembly code like this before, and certainly never in the context of an observable performance impact! Thanks again to @Topliffe and cHao for the analysis.Cocci
Any idea why the cut-off for this optimization is 18 cases? Seems like this would be a valuable optimization for as few as about 4 cases, right?Abandon
@BenLee That was my thinking as well. Given that they're somewhat difficult to write and maintain, switch statements are likely to be used (by skilled developers at least) in sections of code where high performance is desired. Since the bytecode calls for a tableswitch in all these cases as shown in the answer below, I'd almost wonder if this isn't some kind of bug or oversight in the JIT compiler.Cocci
@AndrewBissell I have added something we should have thought of earlier: when you have too many cases, the method becomes too long to be inlined, hence the drop in performance when there are too many case statements.Topliffe
@BenLee It is due to the default value of one of the JVM flags (MinJumpTableSize) - see my update.Topliffe
@Topliffe That makes sense, and also accounts for why different testers are observing the drop in performance with a different number of total cases. Thanks for raising this on the HotSpot compiler list, it's good to see the Oracle guys' take on this. Interesting also to see that AMD64 is a little weak with jumptables (or was in 2005/2006, anyway).Cocci
did you try XX:MaxInlineSize=... (default is 35) ?Fruma
@Fruma In this case, you need to use -XX:FreqInlineSize=... which has priority over MaxInlineSize for hot methods (confirmed empirically). Interesting related post: mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2008-April/…Topliffe
Indeed, a good post there; when I have a bit of time I will look at hotspot code regarding maxInlineSize, although spec.org puts Freq higher than Max, so probably true.Fruma
You sir sure know how to answer a JVM question directly and to the point (without that "JIT will do its magic and I don't know how nor why, so don't worry!" answers that always get accepted for some rason).Moderator
@assylias, Shouldn't we add a conclusion or something? Is the conclusion to stay away from switch and use arrays to implement manual jump tables if we ever need guaranteed switch performance?Heartsick
@Heartsick It has been fixed in JDK 8: hg.openjdk.java.net/jdk8/jdk8/hotspot/rev/34bd5e86aadb - as often with the JVM, performance improves over time so unless your switch is the reason for a performance issue, leave it as is and its performance will improve automagically when you switch to Java 8 if you have not already done so!Topliffe
A
46

Switch - case is faster if the case values are placed in a narrow range Eg.

case 1:
case 2:
case 3:
..
..
case n:

Because, in this case the compiler can avoid performing a comparison for every case leg in the switch statement. The compiler make a jump table which contains addresses of the actions to be taken on different legs. The value on which the switch is being performed is manipulated to convert it into an index in to the jump table. In this implementation , the time taken in the switch statement is much less than the time taken in an equivalent if-else-if statement cascade. Also the time taken in the switch statement is independent of the number of case legs in the switch statement.

As stated in wikipedia about switch statement in Compilation section.

If the range of input values is identifiably 'small' and has only a few gaps, some compilers that incorporate an optimizer may actually implement the switch statement as a branch table or an array of indexed function pointers instead of a lengthy series of conditional instructions. This allows the switch statement to determine instantly what branch to execute without having to go through a list of comparisons.

Allisonallissa answered 25/3, 2013 at 17:34 Comment(15)
that's not correct. It will be faster regardless of the case values being narrow or wide in range. It is O(1) - shouldn't matter how apart the case values are.Arrestment
what I am thinking is Java compiler has to do with something that is increasing the performance. Not sure how though.Arrestment
@Aniket: Read this article of wikipedia. en.wikipedia.org/wiki/Branch_tableAllisonallissa
This would explain the intuitive behavior of fewer switch statements being faster. And maybe the upward cliff at 29 cases, but not the valley the OP asked about in the first place.Mellette
@Aniket: It's not O(1) if the range is wide and sparse. There are two kinds of switches, and if the range is too spread out, Java will compile it to a "lookupswitch" rather than a "tableswitch". The former requires a comparison per branch til one's found, while the latter doesn't.Mangrove
I still cannot trust this answer. A look up can also be put into a hashtable and the correct index can be looked up.Arrestment
@Abu I have upvoted the answer too. And I hardly care about reputation of the user. I started with 1 :-/Arrestment
The reason I don't trust this answer is because it is referring constantly to wikipedia. An implementation documentation reference would be more trustable. Or a disassembly as suggested by ericksonArrestment
Wikipedia is a decent place to find references, but should not be considered an authoritative source. Anything you read there is at best second-hand info.Mangrove
I appreciate @Aniket's doubts. Vishal may have answered as to why the sequential values make it faster, but the "dip" in the chart is still not explained. But I'm wondering if it really can be explained.Cardie
@Jessemon this is where I begin to trust the disassembly more - why the dipArrestment
@Abu see assylias's answer? What do you trust more? wikipedia or the disassembly itself?Arrestment
@Aniket, you don't know when to give up, heh?Zemstvo
@Aniket: In all fairness, the disassembly is specific to a given JVM on a specific platform. Others may translate it differently. Some might in fact use a hash table for a lookupswitch. It still won't perform as well as a tableswitch, but it could at least be close. It'd just take longer to JIT, and involve applying a hashing algorithm to the input. So although the resulting assembly code can be enlightening, it's not authoritative either unless you're specifically talking about Hotspot v1.7.whatever on Windows x86_64.Mangrove
@Aniket Bytecode is the canonical representation. JITted code can differ based on various factors.Sinegold
S
32

The answer lies in the bytecode:

SwitchTest10.java

public class SwitchTest10 {

    public static void main(String[] args) {
        int n = 0;

        switcher(n);
    }

    public static void switcher(int n) {
        switch(n) {
            case 0: System.out.println(0);
                    break;

            case 1: System.out.println(1);
                    break;

            case 2: System.out.println(2);
                    break;

            case 3: System.out.println(3);
                    break;

            case 4: System.out.println(4);
                    break;

            case 5: System.out.println(5);
                    break;

            case 6: System.out.println(6);
                    break;

            case 7: System.out.println(7);
                    break;

            case 8: System.out.println(8);
                    break;

            case 9: System.out.println(9);
                    break;

            case 10: System.out.println(10);
                    break;

            default: System.out.println("test");
        }
    }       
}

Corresponding bytecode; only relevant parts shown:

public static void switcher(int);
  Code:
   0:   iload_0
   1:   tableswitch{ //0 to 10
        0: 60;
        1: 70;
        2: 80;
        3: 90;
        4: 100;
        5: 110;
        6: 120;
        7: 131;
        8: 142;
        9: 153;
        10: 164;
        default: 175 }

SwitchTest22.java:

public class SwitchTest22 {

    public static void main(String[] args) {
        int n = 0;

        switcher(n);
    }

    public static void switcher(int n) {
        switch(n) {
            case 0: System.out.println(0);
                    break;

            case 1: System.out.println(1);
                    break;

            case 2: System.out.println(2);
                    break;

            case 3: System.out.println(3);
                    break;

            case 4: System.out.println(4);
                    break;

            case 5: System.out.println(5);
                    break;

            case 6: System.out.println(6);
                    break;

            case 7: System.out.println(7);
                    break;

            case 8: System.out.println(8);
                    break;

            case 9: System.out.println(9);
                    break;

            case 100: System.out.println(10);
                    break;

            case 110: System.out.println(10);
                    break;
            case 120: System.out.println(10);
                    break;
            case 130: System.out.println(10);
                    break;
            case 140: System.out.println(10);
                    break;
            case 150: System.out.println(10);
                    break;
            case 160: System.out.println(10);
                    break;
            case 170: System.out.println(10);
                    break;
            case 180: System.out.println(10);
                    break;
            case 190: System.out.println(10);
                    break;
            case 200: System.out.println(10);
                    break;
            case 210: System.out.println(10);
                    break;

            case 220: System.out.println(10);
                    break;

            default: System.out.println("test");
        }
    }       
}

Corresponding bytecode; again, only relevant parts shown:

public static void switcher(int);
  Code:
   0:   iload_0
   1:   lookupswitch{ //23
        0: 196;
        1: 206;
        2: 216;
        3: 226;
        4: 236;
        5: 246;
        6: 256;
        7: 267;
        8: 278;
        9: 289;
        100: 300;
        110: 311;
        120: 322;
        130: 333;
        140: 344;
        150: 355;
        160: 366;
        170: 377;
        180: 388;
        190: 399;
        200: 410;
        210: 421;
        220: 432;
        default: 443 }

In the first case, with narrow ranges, the compiled bytecode uses a tableswitch. In the second case, the compiled bytecode uses a lookupswitch.

In tableswitch, the integer value on the top of the stack is used to index into the table, to find the branch/jump target. This jump/branch is then performed immediately. Hence, this is an O(1) operation.

A lookupswitch is more complicated. In this case, the integer value needs to be compared against all the keys in the table until the correct key is found. After the key is found, the branch/jump target (that this key is mapped to) is used for the jump. The table that is used in lookupswitch is sorted and a binary-search algorithm can be used to find the correct key. Performance for a binary search is O(log n), and the entire process is also O(log n), because the jump is still O(1). So the reason the performance is lower in the case of sparse ranges is that the correct key must first be looked up because you cannot index into the table directly.

If there are sparse values and you only had a tableswitch to use, table would essentially contain dummy entries that point to the default option. For example, assuming that the last entry in SwitchTest10.java was 21 instead of 10, you get:

public static void switcher(int);
  Code:
   0:   iload_0
   1:   tableswitch{ //0 to 21
        0: 104;
        1: 114;
        2: 124;
        3: 134;
        4: 144;
        5: 154;
        6: 164;
        7: 175;
        8: 186;
        9: 197;
        10: 219;
        11: 219;
        12: 219;
        13: 219;
        14: 219;
        15: 219;
        16: 219;
        17: 219;
        18: 219;
        19: 219;
        20: 219;
        21: 208;
        default: 219 }

So the compiler basically creates this huge table containing dummy entries between the gaps, pointing to the branch target of the default instruction. Even if there isn't a default, it will contain entries pointing to the instruction after the switch block. I did some basic tests, and I found that if the gap between the last index and the previous one (9) is greater than 35, it uses a lookupswitch instead of a tableswitch.

The behavior of the switch statement is defined in Java Virtual Machine Specification (§3.10):

Where the cases of the switch are sparse, the table representation of the tableswitch instruction becomes inefficient in terms of space. The lookupswitch instruction may be used instead. The lookupswitch instruction pairs int keys (the values of the case labels) with target offsets in a table. When a lookupswitch instruction is executed, the value of the expression of the switch is compared against the keys in the table. If one of the keys matches the value of the expression, execution continues at the associated target offset. If no key matches, execution continues at the default target. [...]

Sinegold answered 25/3, 2013 at 18:14 Comment(6)
I understood from the question that the numbers are always contiguous but the range is more or less long - i.e. in one example the cases go from 0 to 5 whereas in another example they go from 0 to 30 - and none of the examples use sparse valuesTopliffe
@Topliffe Hmm, interesting. I guess I misunderstood the question. Let me do some more experimentation. So you're saying that even with a contiguous range from 0-30, the compiler uses a lookupswitch?Sinegold
@VivinPaliath: Yes, in my tests the case constants are always contiguous, so I'm basically testing switches on [0, 1], [0, 1, 2], [0, 1, 2, 3] ... etcCocci
@VivinPaliath No, the bytecode always uses a tableswitch - however the JIT compiler does not seem to compile the tableswitch to assembly the same way depending on how many items it contains.Topliffe
@AndrewBissell, assylias I completely misread the question. I just looked at assylias's answer and it seems to make more sense. The JITted code seems to behave differently based on the number of items. I can leave the answer here for more information, or I can delete it. Either way works!Sinegold
@VivinPaliath I could have worded the question more clearly for sure. I'm sort of out of my depth when it comes to evaluating answers involving this low-level bytecode & assembly stuff. It still seems to me like the tableswitch/lookupswitch distinction is actually important here, and yours is the only answer that employs those terms so far (though the others are probably setting forth the same concept with different terminology). Plus I do like having the JVM Spec link as well.Cocci
F
19

Since the question is already answered (more or less), here is some tip. Use

private static final double[] mul={1d, 10d...};
static double multiplyByPowerOfTen(final double d, final int exponent) {
      if (exponent<0 || exponent>=mul.length) throw new ParseException();//or just leave the IOOBE be
      return mul[exponent]*d;
}

That code uses significantly less IC (instruction cache) and will be always inlined. The array will be in L1 data cache if the code is hot. The lookup table is almost always a win. (esp. on microbenchmarks :D )

Edit: if you wish the method to be hot-inlined, consider the non-fast paths like throw new ParseException() to be as short as minimum or move them to separate static method (hence making them short as minimum). That is throw new ParseException("Unhandled power of ten " + power, 0); is a weak idea b/c it eats a lot of the inlining budget for code that can be just interpreted - string concatenation is quite verbose in bytecode . More info and a real case w/ ArrayList

Fruma answered 27/3, 2013 at 21:58 Comment(0)
R
0

Based on javac source, you can write switch in a way that it uses tableswitch.

We can use the calculation from javac source to calculate cost for your second example.

lo = 0
hi = 220
nlabels = 24

table_space_cost = 4 + hi - lo + 1
table_time_cost = 3
lookup_space_cost = 3 + 2 * nlabels
lookup_time_cost = nlabels

table_cost = table_space_cost + 3 * table_time_cost // 234
lookup_cost = lookup_space_cost + 3 * lookup_time_cos // 123

Here tableswitch cost is higher (234) than the lookupswitch (123) and therefore lookupswitch will be selected as the opcode for this switch statement.

Remanence answered 20/4, 2021 at 15:56 Comment(2)
These "cost" numbers are of course just heuristics that trade off code-size and best vs. worst case for a chain of cmp/branch instructions. Picking the same case repeatedly will make an indirect branch predict well even on a CPU with a weak indirect branch predictor, and the table pointers will stay hot in data cache, so a lookup table does very well on this particular microbenchmark.Trillion
Oh, and that's only in javac governing the choice of bytecode. The JIT will have its own heuristics for how to implement a tableswitch in native machine code, as described in other answers.Trillion

© 2022 - 2024 — McMap. All rights reserved.