High cost of polymorphism in Java Hotspot server
When I run my timing test program on the HotSpot client VM, I get consistent behavior. However, when I run it on the HotSpot server VM, I get unexpected results. Essentially, the cost of polymorphism is unacceptably high in certain situations, which I've tried to reproduce below.

Is this a known issue/bug with Hotspot server, or am I doing something wrong?

Test program and timings are given below:

Intel i7, Windows 8
Java HotSpot(TM) 64-Bit Server VM (build 24.45-b08, mixed mode)
Mine2: 0.387028831 <--- polymorphic call with expected timing
Trivial: 1.545411765 <--- some more polymorphic calls
Mine: 0.727726371 <--- polymorphic call with unexpected timing. Should be about 0.38
Mine: 0.383132698 <--- direct call with expected timing

The situation gets worse as I add additional tests. The timings of the tests near the end of the list are completely off.

interface canDoIsSquare {
    boolean isSquare(long x);
}

final class Trivial implements canDoIsSquare {
    @Override final public boolean isSquare(long x) {
        if (x > 0) {
            long t = (long) Math.sqrt(x);
            return t * t == x;
        }
        return x == 0;
    }
    @Override public String toString() {return "Trivial";}
}

final class Mine implements canDoIsSquare {
    @Override final public boolean isSquare(long x) {
        if (x > 0) {
            while ((x & 3) == 0)
                x >>= 2;
            if ((x & 2) != 0 || (x & 7) == 5)
                return false;
            final long t = (long) Math.sqrt(x);
            return (t * t == x);
        }
        return x == 0;
    }

    @Override public String toString() {return "Mine";}
}

final class Mine2 implements canDoIsSquare {
    @Override final public boolean isSquare(long x) {
        // just duplicated code for this test
        if (x > 0) {
            while ((x & 3) == 0)
                x >>= 2;
            if ((x & 2) != 0 || (x & 7) == 5)
                return false;
            final long t = (long) Math.sqrt(x);
            return (t * t == x);
        }
        return x == 0;
    }
    @Override final public String toString() {return "Mine2";}
}

public class IsSquared {
    static final long init = (long) (Integer.MAX_VALUE / 8)
            * (Integer.MAX_VALUE / 2) + 1L;

    static long test1(final canDoIsSquare fun) {
        long r = init;
        long startTimeNano = System.nanoTime();
        while (!fun.isSquare(r))
            ++r;
        long taskTimeNano = System.nanoTime() - startTimeNano;
        System.out.println(fun + ": " + taskTimeNano / 1e9);
        return r;
    }

    static public void main(String[] args) {
        Mine mine = new Mine();
        Trivial trivial = new Trivial();
        Mine2 mine2 = new Mine2();

        test1(mine2);
        test1(trivial);
        test1(mine);

        long r = init;
        long startTimeNano = System.nanoTime();
        while (!mine.isSquare(r))
            ++r;
        long taskTimeNano = System.nanoTime() - startTimeNano;
        System.out.println(mine + ": " + taskTimeNano / 1e9);
        System.out.println(r);
    }
}
Semipalmate answered 8/11, 2013 at 6:24 Comment(8)
You should expect that more work takes longer, and having more methods which can be called from a given point means more work. There is a table of results based on the number of possible target methods per call site. vanillajava.blogspot.co.uk/2012/12/…Pilfer
When there is only one possible callee, the JIT can inline the method and just check the type; this is going to be very fast. When there are two, it can inline both implementations, but it has to check two types, which is slower. When you have more possible methods, it cannot inline them all (or it doesn't). Instead it uses a vtable lookup (like C++ does) to find the required method, and this is relatively expensive and actually gets more expensive up to about 6 possible targets.Pilfer
BTW when methods are inlined, they can be optimised in more ways, making a vtable method call even more expensive by comparison in some cases (as you can't optimise as much).Pilfer
In short, if you want your timings to be fast, copy the timing code; it is usually trivial. Also make sure your loop is warmed up, not just the code you are calling, e.g. your loop in main() will be cold to start with.Pilfer
I assume you know you don't need to brute-force search for the next square number; you can just calculate it ;)Pilfer
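The comment above can be sketched as follows; the helper below is purely illustrative (it is not part of the original program) and computes the smallest perfect square ≥ r directly instead of incrementing and retesting:

```java
// Hypothetical helper sketching Peter's suggestion: compute the next
// perfect square >= r directly, rather than brute-force searching.
public class NextSquare {
    static long nextSquare(long r) {
        if (r <= 0) return 0;
        long t = (long) Math.sqrt(r);
        // Math.sqrt on large longs may be off by one, so nudge t into place:
        while (t * t >= r) t--;              // step down until t*t < r
        while ((t + 1) * (t + 1) < r) t++;   // step up until (t+1)^2 >= r
        long s = t + 1;                      // smallest s with s*s >= r
        return s * s;
    }

    public static void main(String[] args) {
        System.out.println(nextSquare(17)); // 25
        System.out.println(nextSquare(16)); // 16
    }
}
```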
It's not predictable at compile time or run time. Looking at the method test1() alone, you can't predict which class will be called. You could predict it if the entire method were inlined, but you are not calling it enough times for that to happen (and I suspect the method is too large anyway). BTW the javac compiler optimises almost nothing; it is the JIT which does this. You can't expect the JVM to perform magic.Pilfer
Object is strongly typed; passing a sub-class of Object doesn't make it more strongly typed. Calling a sub-class implementation of hashCode() behaves exactly the same way. You are passing a reference to a canDoIsSquare, not a concrete type. Create two copies of test1, i.e. test2, and as long as you always pass the same implementation class to each, you will get the best performance.Pilfer
@peter: Please post your thoughts as an answer rather than random comments. Thank you.Semipalmate
The cost is high, indeed, but your benchmark doesn't measure anything really relevant. The JIT can optimize away most of the overhead, but you didn't give it any chance. See e.g. here.

In any case, there's no benchmark warm-up, and On-Stack Replacement complicates the timings further.

The explanation is probably that the server HotSpot optimizes better but more slowly. It assumes it has enough time, and spends longer collecting the necessary statistics. So while the client HotSpot had already optimized your program, the server HotSpot was still preparing itself to produce better code.

The reason for the worsening with additional tests is that the initially monomorphic call site became bimorphic and then megamorphic.

In reality it's possible that only one of the methods ever gets called. If you want to benchmark this, you have to run each test in its own JVM. This is a real pain, but existing benchmarking frameworks do it for you.

Or you may want to measure the polymorphic case, but then you need to warm up the code with all cases first. This way you can find out which method is faster even in a single JVM (though each will be slowed down by the megamorphic call overhead).
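A minimal sketch of that warm-up idea (the interface, lambdas, and loop shape here are illustrative, not the OP's program): exercise the measurement loop with every implementation before timing, so the call site has already seen all the types it will meet during measurement.

```java
// Sketch: warm up the timing loop with all implementations before measuring,
// so the call site's profile (bimorphic here) is settled before the clock starts.
public class WarmedBenchmark {
    interface Op { long apply(long x); }

    static long runLoop(Op op, long iterations) {
        long acc = 0;
        for (long i = 0; i < iterations; i++)
            acc += op.apply(i);          // the polymorphic call site under test
        return acc;
    }

    public static void main(String[] args) {
        Op a = x -> x + 1, b = x -> x * 2;
        // Warm-up: run the loop with every implementation first.
        for (int i = 0; i < 10_000; i++) { runLoop(a, 10); runLoop(b, 10); }
        // Measurement: the call site is already compiled with its final profile.
        long t0 = System.nanoTime();
        long r = runLoop(a, 1_000_000);
        long t1 = System.nanoTime();
        System.out.println(r + " in " + (t1 - t0) / 1e9 + " s");
    }
}
```

In practice a harness such as JMH handles the warm-up, forking, and statistics for you; this only illustrates the principle.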

Update

The explanation seems to be the change from monomorphic to megamorphic. When the first test was run, the JVM knew all the classes (as the instances had already been created), but optimistically assumed that only Mine2 occurred at the call site. So it did a quick check (translated as a conditional branch, which was always correctly predicted and thus very fast) and called the proper method. As it later saw the other two instances used there, it had to create a branch table for them (branch prediction still works, but the overhead is higher).

Question

What's unclear: the JVM could hoist this test out of the loop and thus reduce its cost to nearly nothing. I can't tell why that doesn't happen.

Watermark answered 8/11, 2013 at 6:59 Comment(12)
@user2943324: I updated my answer, and you're right, it's not exactly clear what happens. Did you try -XX:+PrintAssembly?Watermark
@maaatinus: No, I have not looked at the assembly code. I agree with your assessment that the cost should not double when there are only 3 choices to be made. If that were the case, the whole concept of polymorphism would become rather useless. Of course, a small overhead is acceptable for its convenience and clarity. I wonder if this is limited to the Windows version of HotSpot server, or whether the Linux version suffers the same fate.Semipalmate
@Semipalmate polymorphism is not about making systems go faster. BTW developers often use multi-threaded code which is actually much slower (and more complicated) than single-threaded code, but that is another story. ;)Pilfer
@Semipalmate Random comments are only suitable as comments. An answer should be concise and address the question. If we come to some agreement on what the fundamental issue is, I could write an answer to that. The question appears to be: why doesn't it work the way I think it should, or why doesn't the program know what I know? The problem is I don't yet know what you believe could happen or don't know. I only know what the JVM does now.Pilfer
@Peter Lawrey: I disagree... even if you call them random comments, they'd be better placed in an answer. You could edit or extend them and the comment space would stay clear. I guess your comments would make a good answer, even if you couldn't explain everything (I see now I couldn't do it either).Watermark
@Watermark Ok, I have added an answer.Pilfer
@maaartinus: What I really want is an O(log log n) algorithm for isSquared(n). I don't care about any constant multipliers of the order function (yes, it is log of log). So if you know of any published work, that'll save me a lot of time and earn you my many thanks.Semipalmate
@user2943324: There are a lot of algorithms for int here, and some of them could be adapted to arbitrary numbers. But O(log(log(n))) is clearly impossible, as the input length is O(log(n)), so you're looking for an algorithm faster than reading the whole input.Watermark
@maaartinus: I am surprised by your comment, since I assume you came up with the test in the post you reference. At any rate, I have posted the question here: cstheory.stackexchange.com/questions/19734/… Please feel free to comment there.Semipalmate
To be clear, I am referring to test for a 6-bit number, not the entire function call.Semipalmate
You lost me with the 6-bit number. I'm using the lowest 6 bits for accessing the table goodMask, but this test gets used for all numbers.Watermark
No worries. I have lost myself too. It was just my misunderstanding.Semipalmate
In short, the JIT can optimise a single method call, and two method calls, in ways it cannot when more implementations are possible. What matters is the number of possible methods which might be called from any given line, and the JIT builds up this picture over time. When a method is inlined, further optimisations are possible, but in your case the line in question in test1 sees its number of possible targets grow over the life of the run, and so it gets slower.

The way I get around this is to duplicate the short test code so each class is tested equally (assuming this is realistic). If your program will be multi-polymorphic when it is running, this is what you should test to be realistic, as you can see it changes the results.

When you run the method from a fresh loop you see the benefit of only calling one method from that line of code.
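The duplication workaround can be sketched like this (the classes and method names below are illustrative stand-ins, not the OP's code): each concrete type gets its own copy of the timing loop, so every call site sees exactly one receiver class and stays monomorphic.

```java
// Sketch of the duplication workaround: one loop per concrete type keeps
// each call site monomorphic, so the JIT can inline the callee.
public class MonomorphicLoops {
    interface Sq { boolean isSquare(long x); }

    static final class A implements Sq {
        public boolean isSquare(long x) {
            long t = (long) Math.sqrt(x);
            return t * t == x;
        }
    }
    static final class B implements Sq {
        public boolean isSquare(long x) {
            long t = (long) Math.sqrt(x);
            return t * t == x;
        }
    }

    // Duplicated loops: testA only ever calls A.isSquare, testB only B's.
    static long testA(A fun) { long r = 2; while (!fun.isSquare(r)) ++r; return r; }
    static long testB(B fun) { long r = 2; while (!fun.isSquare(r)) ++r; return r; }

    public static void main(String[] args) {
        System.out.println(testA(new A())); // first square >= 2, i.e. 4
        System.out.println(testB(new B()));
    }
}
```

The price is code duplication; a generic `test(Sq fun)` called with both classes would become bimorphic at its call site, which is exactly what this avoids.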

Here is a table of different costs you might see depending on the number of possible methods any individual line can call. http://vanillajava.blogspot.co.uk/2012/12/performance-of-inlined-virtual-method.html

Polymorphism is not designed to improve performance, and for me it is entirely reasonable that as the complexity of the polymorphism increases, it should get slower.

BTW making methods final doesn't improve the performance any further. The JIT works out whether a sub-class has been called on a line-by-line basis (as discussed).


EDIT As you can see, the client JVM doesn't optimise the code as much, as it is designed for relatively lightweight startup times. This means the client JVM is more consistent, but consistently slower. If you want the best performance, you need to consider a number of optimisation strategies, which leads to multiple possible outcomes depending on whether each optimisation is applied or not.

Pilfer answered 8/11, 2013 at 9:54 Comment(2)
@Semipalmate Added to my answer talking about the client JVM.Pilfer
@Semipalmate I must have misunderstood what you meant by "the cost should not be doubled when there is only 3 choices to be made. If that were the case, the whole concept of polymorphism becomes rather useless." I thought you were implying that the value of polymorphism was reduced by increased cost.Pilfer
