Hidden performance cost in Scala?

Asked 22/3, 2014 at 17:38 Answered 24/3, 2014 at 8:20

Solved java performance scala jvm microbenchmark

I came across this old question and did the following experiment with scala 2.10.3.

I rewrote the Scala version to use explicit tail recursion:

import scala.annotation.tailrec

object ScalaMain {
  private val t = 20

  private def run() {
    var i = 10
    while(!isEvenlyDivisible(2, i, t))
      i += 2
    println(i)
  }

  @tailrec private def isEvenlyDivisible(i: Int, a: Int, b: Int): Boolean = {
    if (i > b) true
    else (a % i == 0) && isEvenlyDivisible(i+1, a, b)
  }

  def main(args: Array[String]) {
    val t1 = System.currentTimeMillis()
    var i = 0
    while (i < 20) {
      run()
      i += 1
    }
    val t2 = System.currentTimeMillis()
    println("time: " + (t2 - t1))
  }
}

and compared it to the following Java version. I consciously made the functions non-static for a fair comparison with Scala:

public class JavaMain {
    private final int t = 20;

    private void run() {
        int i = 10;
        while (!isEvenlyDivisible(2, i, t))
            i += 2;
        System.out.println(i);
    }

    private boolean isEvenlyDivisible(int i, int a, int b) {
        if (i > b) return true;
        else return (a % i == 0) && isEvenlyDivisible(i+1, a, b);
    }

    public static void main(String[] args) {
        JavaMain o = new JavaMain();
        long t1 = System.currentTimeMillis();
        for (int i = 0; i < 20; ++i)
          o.run();
        long t2 = System.currentTimeMillis();
        System.out.println("time: " + (t2 - t1));
    }
}

Here are the results on my computer:

> java JavaMain
....
time: 9651
> scala ScalaMain
....
time: 20592

This is scala 2.10.3 on (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_51).

My question is what is the hidden cost with the scala version?

Many thanks.

Melan answered 22/3, 2014 at 17:38 Comment(6)

Could you change it to be exactly equal to the code in scala? An if-else-statement may for example not be seen as exacty equal to the ternary operatort in isEvenlyDivisible. It shouldn't matter for speed comparison, but the code now is not fully equal. Also, did warm up the JVM first and provide an average over X runs afterwards? – Severity 22/3, 2014 at 17:41

Warming up the JVM is important. Also decompiling the byte code may yield some insights. – Fatwitted 22/3, 2014 at 17:42

Edited. I ran each benchmark 20 times and used while in Scala to void closure creation. Empirically, the period between each output from Scala is also longer. – Melan 22/3, 2014 at 17:51

Considering you are testing the interpreter and not JITed performance.. sure it's not a big surprise that the scala code could run slower - some additional indirections. Now if we were testing the actual JITed code? Not so sure that there would be any difference. – Eldred 22/3, 2014 at 20:13

@Voo: What's the proper way to do JITed? I compiled both files and ran the resulting .class files. Anyway, see my answer for the updated result. – Melan 22/3, 2014 at 20:44

@Melan This question will answer at least some of your questions on that point. Sadly Cliff's paper from JavaOne isn't online anymore it seems, but those SO posts itself are pretty good. (Especially the first point of the accepted answer about warming up). – Eldred 22/3, 2014 at 23:37

136

Well, OP's benchmarking is not the ideal one. Tons of effects need to be mitigated, including warmup, dead code elimination, forking, etc. Luckily, JMH already takes care of many things, and has bindings for both Java and Scala. Please follow the procedures on JMH page to get the benchmark project, then you can transplant the benchmarks below there.

This is the sample Java benchmark:

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@State(Scope.Benchmark)
@Fork(3)
@Warmup(iterations = 5)
@Measurement(iterations = 5)
public class JavaBench {

    @Param({"1", "5", "10", "15", "20"})
    int t;

    private int run() {
        int i = 10;
        while(!isEvenlyDivisible(2, i, t))
            i += 2;
        return i;
    }

    private boolean isEvenlyDivisible(int i, int a, int b) {
        if (i > b)
            return true;
        else
            return (a % i == 0) && isEvenlyDivisible(i + 1, a, b);
    }

    @GenerateMicroBenchmark
    public int test() {
        return run();
    }

}

...and this is the sample Scala benchmark:

@BenchmarkMode(Array(Mode.AverageTime))
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@State(Scope.Benchmark)
@Fork(3)
@Warmup(iterations = 5)
@Measurement(iterations = 5)
class ScalaBench {

  @Param(Array("1", "5", "10", "15", "20"))
  var t: Int = _

  private def run(): Int = {
    var i = 10
    while(!isEvenlyDivisible(2, i, t))
      i += 2
    i
  }

  @tailrec private def isEvenlyDivisible(i: Int, a: Int, b: Int): Boolean = {
    if (i > b) true
    else (a % i == 0) && isEvenlyDivisible(i + 1, a, b)
  }

  @GenerateMicroBenchmark
  def test(): Int = {
    run()
  }

}

If you run these on JDK 8 GA, Linux x86_64, then you'll get:

Benchmark             (t)   Mode   Samples         Mean   Mean error    Units
o.s.ScalaBench.test     1   avgt        15        0.005        0.000    us/op
o.s.ScalaBench.test     5   avgt        15        0.489        0.001    us/op
o.s.ScalaBench.test    10   avgt        15       23.672        0.087    us/op
o.s.ScalaBench.test    15   avgt        15     3406.492        9.239    us/op
o.s.ScalaBench.test    20   avgt        15  2483221.694     5973.236    us/op

Benchmark            (t)   Mode   Samples         Mean   Mean error    Units
o.s.JavaBench.test     1   avgt        15        0.002        0.000    us/op
o.s.JavaBench.test     5   avgt        15        0.254        0.007    us/op
o.s.JavaBench.test    10   avgt        15       12.578        0.098    us/op
o.s.JavaBench.test    15   avgt        15     1628.694       11.282    us/op
o.s.JavaBench.test    20   avgt        15  1066113.157    11274.385    us/op

Notice we juggle t to see if the effect is local for the particular value of t. It is not, the effect is systematic, and Java version being twice as fast.

PrintAssembly will shed some light on this. This one is the hottest block in Scala benchmark:

0x00007fe759199d42: test   %r8d,%r8d
0x00007fe759199d45: je     0x00007fe759199d76  ;*irem
                                               ; - org.sample.ScalaBench::isEvenlyDivisible@11 (line 52)
                                               ; - org.sample.ScalaBench::run@10 (line 45)
0x00007fe759199d47: mov    %ecx,%eax
0x00007fe759199d49: cmp    $0x80000000,%eax
0x00007fe759199d4e: jne    0x00007fe759199d58
0x00007fe759199d50: xor    %edx,%edx
0x00007fe759199d52: cmp    $0xffffffffffffffff,%r8d
0x00007fe759199d56: je     0x00007fe759199d5c
0x00007fe759199d58: cltd   
0x00007fe759199d59: idiv   %r8d

...and this is similar block in Java:

0x00007f4a811848cf: movslq %ebp,%r10
0x00007f4a811848d2: mov    %ebp,%r9d
0x00007f4a811848d5: sar    $0x1f,%r9d
0x00007f4a811848d9: imul   $0x55555556,%r10,%r10
0x00007f4a811848e0: sar    $0x20,%r10
0x00007f4a811848e4: mov    %r10d,%r11d
0x00007f4a811848e7: sub    %r9d,%r11d         ;*irem
                                              ; - org.sample.JavaBench::isEvenlyDivisible@9 (line 63)
                                              ; - org.sample.JavaBench::isEvenlyDivisible@19 (line 63)
                                              ; - org.sample.JavaBench::run@10 (line 54)

Notice how in Java version the compiler employed the trick for translating integer remainder calculation into the multiplication and shifting right (see Hacker's Delight, Ch. 10, Sect. 19). This is possible when compiler detects we compute the remainder against the constant, which suggests Java version hit that sweet optimization, but Scala version did not. You can dig into the bytecode disassembly to figure out what quirk in scalac have intervened, but the point of this exercise is that surprising minute differences in code generation are magnified by benchmarks a lot.

P.S. So much for @tailrec...

UPDATE: A more thorough explanation of the effect: http://shipilev.net/blog/2014/java-scala-divided-we-fail/

Waistline answered 24/3, 2014 at 8:20 Comment(1)

Great answer. Would be even greater if you added the bytecode listings of the two methods! – Turgent 24/3, 2014 at 9:11

I changed the val

private val t = 20

to a constant definition

private final val t = 20

and got a significant performance boost, now it seems that both versions perform almost equally [on my system, see update and comments].

I have not looked into into the bytecode, but if you use val t = 20 you can see using javap that there is a method (and that version is as slow as the one with the private val).

So I assume that even a private val involves calling a method, and that's not directly comparable with a final in Java.

Update

On my system I got these results

Java version : time: 14725

Scala version: time: 13228

Using OpenJDK 1.7 on a 32-Bit Linux.

In my experience Oracle's JDK on a 64-Bit system does actually perform better, so this probably explains that other measurements yield even better results in favour of the Scala version.

As for the Scala version performing better I assume that tail recursion optimization does have an effect here (see Phil's answer, if the Java version is rewritten to use a loop instead of recursion, it performs equally again).

Showmanship answered 22/3, 2014 at 18:56 Comment(7)

Actually that your new version (as well as the one with t moved to inside run) runs twice as fast as Java on my computer. It would be great if somebody can explain to me what exactly is going on in the bytecode. I'm not experienced with the JVM and don't know how to decompile bytecode :) – Melan 22/3, 2014 at 19:0

It was my impression as well, that the Scala version performs slightly better, possibly tail recursion optimization has an effect. – Showmanship 22/3, 2014 at 19:8

Making that change I got the following results: scala time: 7852 java time: 14657 It's not only slightly faster, it's twice faster! – Larkins 22/3, 2014 at 19:13

@Larkins I have added my results from my system to the answer. But yes, glad to hear that it has this tremendous effect in other environments. – Showmanship 22/3, 2014 at 19:52

See #13412886 for a related answer on the difference between private final val and private val. – Ambrosial 22/3, 2014 at 20:39

Also, maybe private[this] would help? It eliminates the "getter method", too. – Eppie 8/7, 2014 at 19:39

@VasyaNovikov A quick test showed that private[this] does not help (without JMH, Scala 2.10.3). It shows the same performance like like private val. – Showmanship 9/7, 2014 at 7:53

I looked at this question and edited the Scala version to have t inside run:

object ScalaMain {
  private def run() {
    val t = 20
    var i = 10
    while(!isEvenlyDivisible(2, i, t))
      i += 2
    println(i)
  }

  @tailrec private def isEvenlyDivisible(i: Int, a: Int, b: Int): Boolean = {
    if (i > b) true
    else (a % i == 0) && isEvenlyDivisible(i+1, a, b)
  }

  def main(args: Array[String]) {
    val t1 = System.currentTimeMillis()
    var i = 0
    while (i < 20) {
      run()
      i += 1
    }
    val t2 = System.currentTimeMillis()
    println("time: " + (t2 - t1))
  }
}

The new Scala version now runs twice as fast as the original Java one:

> fsc ScalaMain.scala
> scala ScalaMain
....
time: 6373
> fsc -optimize ScalaMain.scala
....
time: 4703

I figured out it is because Java not having tail calls. The optimized Java one with loop instead of recursion runs just as fast:

public class JavaMain {
    private static final int t = 20;

    private void run() {
        int i = 10;
        while (!isEvenlyDivisible(i, t))
            i += 2;
        System.out.println(i);
    }

    private boolean isEvenlyDivisible(int a, int b) {
        for (int i = 2; i <= b; ++i) {
            if (a % i != 0)
                 return false;
        }
        return true;
    }

    public static void main(String[] args) {
        JavaMain o = new JavaMain();
        long t1 = System.currentTimeMillis();
        for (int i = 0; i < 20; ++i)
            o.run();
        long t2 = System.currentTimeMillis();
        System.out.println("time: " + (t2 - t1));
    }
}

Now my confusion is fully solved:

> java JavaMain
....
time: 4795

In conclusion, the original Scala version was slow because I didn't declare t to be final (directly or indirectly, as Beryllium's answer points out). And the original Java version was slow due to lack of tail calls.

Melan answered 22/3, 2014 at 19:16 Comment(3)

Then it's the tail recursion optimization. Good to know that it works as expected. – Showmanship 22/3, 2014 at 19:28

for completeness, did you also try moving t into run in java? – Sensate 22/3, 2014 at 21:3

@RobStarling I did. It's about the same. It seems changing scope, final-ness and static-ness in Java results in no impact. – Melan 22/3, 2014 at 21:6

To make the Java version completely equivalent to your Scala code you need to change it like this.

private int t = 20;


private int t() {
    return this.t;
}

private void run() {
    int i = 10;
    while (!isEvenlyDivisible(2, i, t()))
        i += 2;
    System.out.println(i);
}

It is slower because the JVM can not optimize the method calls.

Cranmer answered 22/3, 2014 at 19:5 Comment(2)

Hey thanks! This new version of Java is slightly slower than my original Java version (11122ms on my computer), but still faster than my original Scala one. But see updates for the strangely faster Scala version. – Melan 22/3, 2014 at 19:9

Why the downvote here? It's a correct way to put it. – Larkins 1/4, 2014 at 22:10

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Recommended topics

Hot tags