Why is a simple Scala tailrec loop for Fibonacci calculation 3x faster than a Java loop?

Scala

code:

@annotation.tailrec
private def fastLoop(n: Int, a: Long = 0, b: Long = 1): Long = 
  if (n > 1) fastLoop(n - 1, b, a + b) else b

bytecode:

  private long fastLoop(int, long, long);
    Code:
       0: iload_1
       1: iconst_1
       2: if_icmple     21
       5: iload_1
       6: iconst_1
       7: isub
       8: lload         4
      10: lload_2
      11: lload         4
      13: ladd
      14: lstore        4
      16: lstore_2
      17: istore_1
      18: goto          0
      21: lload         4
      23: lreturn

result is 53879289.462 ± 6289454.961 ops/s:

https://travis-ci.org/plokhotnyuk/scala-vs-java/jobs/56117116#L2909

Java

code:

private long fastLoop(int n, long a, long b) {
    while (n > 1) {
        long c = a + b;
        a = b;
        b = c;
        n--;
    }
    return b;
}

bytecode:

  private long fastLoop(int, long, long);
    Code:
       0: iload_1
       1: iconst_1
       2: if_icmple     24
       5: lload_2
       6: lload         4
       8: ladd
       9: lstore        6
      11: lload         4
      13: lstore_2
      14: lload         6
      16: lstore        4
      18: iinc          1, -1
      21: goto          0
      24: lload         4
      26: lreturn

result is 17444340.812 ± 9508030.117 ops/s:

https://travis-ci.org/plokhotnyuk/scala-vs-java/jobs/56117116#L2881

Yes, it depends on environment parameters (JDK version, CPU model, and RAM frequency) and on dynamic state. But why can nearly identical bytecode in the same environment produce a stable 2x-3x difference across a range of function arguments?

Here is a list of ops/s numbers for different values of the function argument, measured on my notebook with an Intel(R) Core(TM) i7-2640M CPU @ 2.80GHz (max 3.50GHz), 12 GB of DDR3-1333 RAM, Ubuntu 14.10, and Oracle JDK 1.8.0_40-b25 64-bit:

[info] Benchmark            (n)   Mode  Cnt          Score          Error  Units
[info] JavaFibonacci.loop     2  thrpt    5  171776163.027 ±  4620419.353  ops/s
[info] JavaFibonacci.loop     4  thrpt    5  144793748.362 ± 25506649.671  ops/s
[info] JavaFibonacci.loop     8  thrpt    5   67271848.598 ± 15133193.309  ops/s
[info] JavaFibonacci.loop    16  thrpt    5   54552795.336 ± 17398924.190  ops/s
[info] JavaFibonacci.loop    32  thrpt    5   41156886.101 ± 12905023.289  ops/s
[info] JavaFibonacci.loop    64  thrpt    5   24407771.671 ±  4614357.030  ops/s
[info] ScalaFibonacci.loop    2  thrpt    5  148926292.076 ± 23673126.125  ops/s
[info] ScalaFibonacci.loop    4  thrpt    5  139184195.527 ± 30616384.925  ops/s
[info] ScalaFibonacci.loop    8  thrpt    5  109050091.514 ± 23506756.224  ops/s
[info] ScalaFibonacci.loop   16  thrpt    5   81290743.288 ±  5214733.740  ops/s
[info] ScalaFibonacci.loop   32  thrpt    5   38937420.431 ±  8324732.107  ops/s
[info] ScalaFibonacci.loop   64  thrpt    5   22641295.988 ±  5961435.507  ops/s

An additional question: why do the ops/s values decrease non-linearly, as shown above?

Halleyhalli answered 27/3, 2015 at 20:15 Comment(7)
You'd have to start by examining the bytecode. ISTR a very similar question where the difference was implicit loop unrolling. - Jessen
Can you give more detail about the benchmarking mechanism you used? It seems much more likely that your measuring technique is problematic than that this is actually slower. - Andee
My previous comment was not entirely clear. What I mean to say is that when I benchmark these, I get the result that the two methods are exactly the same as each other. I do not see any 3-fold difference. I don't even see a 10% difference. - Marna
Did you read shipilev.net/blog/2014/java-scala-divided-we-fail? The JVM bytecode does not affect the performance; you have to repeat all the studies from that paper. Please provide stack and perf profiler output. - Indene
Also notice the errors are rather huge. If you use the "proper" units, say, "ops/us", it becomes clearly visible that most differences between the Java and Scala results in the table above are not significant. - Rife
You also have to follow up on what happens before and after you call fastLoop. I dug up your GitHub project, and on my machine, with n=10, only 20% of CPU time is spent in fastLoop; the other 80% is spent dealing with BigInt/BigInteger. - Rife
Thank you, Aleksey! I missed that Java and Scala cache BigInteger/BigInt values with different ranges and performance characteristics. That answers both the main and the additional question. - Halleyhalli
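
The point about units and error sizes in the comments can be checked with a few lines of Java. This is only a rough sketch: it rescales the Score and Error columns of the loop table in the question to ops/us and treats Score ± Error as an interval, flagging a Java/Scala pair as "not clearly different" when the two intervals overlap. The class and method names are made up for illustration, and interval overlap is a heuristic, not a formal significance test.

```java
import java.util.Locale;

class ErrorOverlap {
    // {score, error} in ops/s, copied from the loop table in the question.
    static final int[] N = {2, 4, 8, 16, 32, 64};
    static final double[][] JAVA = {
        {171776163.027, 4620419.353}, {144793748.362, 25506649.671},
        {67271848.598, 15133193.309}, {54552795.336, 17398924.190},
        {41156886.101, 12905023.289}, {24407771.671, 4614357.030}};
    static final double[][] SCALA = {
        {148926292.076, 23673126.125}, {139184195.527, 30616384.925},
        {109050091.514, 23506756.224}, {81290743.288, 5214733.740},
        {38937420.431, 8324732.107}, {22641295.988, 5961435.507}};

    // True when [score - error, score + error] of x and y intersect.
    static boolean overlaps(double[] x, double[] y) {
        return x[0] - x[1] <= y[0] + y[1] && y[0] - y[1] <= x[0] + x[1];
    }

    // Number of rows whose Java and Scala intervals intersect.
    static int countOverlaps() {
        int count = 0;
        for (int i = 0; i < N.length; i++) {
            if (overlaps(JAVA[i], SCALA[i])) count++;
        }
        return count;
    }

    public static void main(String[] args) {
        for (int i = 0; i < N.length; i++) {
            // Divide by 1e6 to convert ops/s to ops/us.
            System.out.printf(Locale.ROOT,
                "n=%2d  Java %7.2f +/- %5.2f  Scala %7.2f +/- %5.2f ops/us  overlap=%b%n",
                N[i], JAVA[i][0] / 1e6, JAVA[i][1] / 1e6,
                SCALA[i][0] / 1e6, SCALA[i][1] / 1e6,
                overlaps(JAVA[i], SCALA[i]));
        }
        System.out.println("rows with overlapping intervals: " + countOverlaps());
    }
}
```

With these numbers, only the n=8 and n=16 rows have intervals that do not overlap.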

Yes, I was wrong: I missed that the benchmarked method was more than just a call to fastLoop:

Scala

  @Benchmark
  def loop(): BigInt =
    if (n > 92) loop(n - 91, 4660046610375530309L, 7540113804746346429L)
    else fastLoop(n)

Java

@Benchmark
public BigInteger loop() {
    return n > 92 ?
            loop(n - 91, BigInteger.valueOf(4660046610375530309L), BigInteger.valueOf(7540113804746346429L)) :
            BigInteger.valueOf(fastLoop(n, 0, 1));
}
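
The threshold of 92 in both wrappers appears to be the largest Fibonacci index whose value still fits in a signed 64-bit long, with the two literals being F(91) and F(92). A minimal sketch to check that reading, reusing fastLoop from the question and adding a hypothetical checkedLoop that uses Math.addExact to detect overflow:

```java
class FibThreshold {
    // Same loop as in the question: returns F(n) for n >= 1 when called with a = 0, b = 1.
    static long fastLoop(int n, long a, long b) {
        while (n > 1) {
            long c = a + b;
            a = b;
            b = c;
            n--;
        }
        return b;
    }

    // Same recurrence, but Math.addExact throws ArithmeticException on long overflow.
    static long checkedLoop(int n) {
        long a = 0, b = 1;
        while (n > 1) {
            long c = Math.addExact(a, b);
            a = b;
            b = c;
            n--;
        }
        return b;
    }

    // True when computing F(n) in a long would overflow.
    static boolean overflows(int n) {
        try {
            checkedLoop(n);
            return false;
        } catch (ArithmeticException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("F(91) = " + fastLoop(91, 0, 1));
        System.out.println("F(92) = " + fastLoop(92, 0, 1));
        System.out.println("F(92) overflows long: " + overflows(92));
        System.out.println("F(93) overflows long: " + overflows(93));
    }
}
```

So for every n used in the benchmarks here (2 to 64), both wrappers take the fastLoop branch and then convert the long result to BigInt/BigInteger.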

As Aleksey noted, a lot of time was spent converting Long/long to BigInt/BigInteger.

I have written separate benchmarks that test just the fastLoop(n, 0, 1) call. Here are their results:

[info] Benchmark                (n)   Mode  Cnt          Score          Error  Units
[info] JavaFibonacci.fastLoop     2  thrpt    5  338071686.910 ± 66146042.535  ops/s
[info] JavaFibonacci.fastLoop     4  thrpt    5  231066635.073 ±  3702419.585  ops/s
[info] JavaFibonacci.fastLoop     8  thrpt    5  174832245.690 ± 36491363.939  ops/s
[info] JavaFibonacci.fastLoop    16  thrpt    5   95162799.968 ± 16151609.596  ops/s
[info] JavaFibonacci.fastLoop    32  thrpt    5   60197918.766 ± 10662747.434  ops/s
[info] JavaFibonacci.fastLoop    64  thrpt    5   29564087.602 ±  3610164.011  ops/s
[info] ScalaFibonacci.fastLoop    2  thrpt    5  336588218.560 ± 56762496.725  ops/s
[info] ScalaFibonacci.fastLoop    4  thrpt    5  224918874.670 ± 35499107.133  ops/s
[info] ScalaFibonacci.fastLoop    8  thrpt    5  121952667.394 ± 17314931.711  ops/s
[info] ScalaFibonacci.fastLoop   16  thrpt    5   96573968.960 ± 12757890.175  ops/s
[info] ScalaFibonacci.fastLoop   32  thrpt    5   59462408.940 ± 14924369.138  ops/s
[info] ScalaFibonacci.fastLoop   64  thrpt    5   28922994.377 ±  7209467.197  ops/s

Lessons that I learned:

  • Scala implicits can cost a lot of performance and are easy to overlook;

  • Caching of BigInt values in Scala can speed up some functions compared with Java's BigInteger.
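
The caching effect on the Java side can be observed directly. The sketch below assumes the OpenJDK implementation, where BigInteger.valueOf returns preallocated instances for magnitudes up to 16 (an implementation detail, not something the Javadoc promises), and assumes that Scala's BigInt caches -1024 to 1024, as in the standard library sources I am aware of. Both limits are hard-coded as labeled assumptions; isCached and fib are illustrative helpers.

```java
import java.math.BigInteger;

class ValueOfCache {
    static final long JAVA_CACHE_MAX = 16;    // assumption: OpenJDK BigInteger.valueOf cache limit
    static final long SCALA_CACHE_MAX = 1024; // assumption: scala.math.BigInt cache limit

    // True when two valueOf calls return the very same object, i.e. a cached instance.
    static boolean isCached(long v) {
        return BigInteger.valueOf(v) == BigInteger.valueOf(v);
    }

    // F(n) for n >= 1, same recurrence as fastLoop in the question.
    static long fib(int n) {
        long a = 0, b = 1;
        while (n > 1) {
            long c = a + b;
            a = b;
            b = c;
            n--;
        }
        return b;
    }

    public static void main(String[] args) {
        for (int n : new int[] {2, 4, 8, 16, 32, 64}) {
            long f = fib(n);
            System.out.println("n=" + n + " F(n)=" + f
                + " observed Java cache hit=" + isCached(f)
                + " within assumed Scala range=" + (f <= SCALA_CACHE_MAX));
        }
    }
}
```

Under those assumptions, F(8) = 21 and F(16) = 987 fall inside Scala's range but outside Java's, which is consistent with n=8 and n=16 being the rows where ScalaFibonacci.loop came out ahead in the question.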

Halleyhalli answered 1/4, 2015 at 12:41 Comment(1)
Please use the "proper" units, for example, "ops/us". It is too hard to read otherwise. - Indene
