First time a Java loop is run SLOW, why? [Sun HotSpot 1.5, sparc]
Asked Answered
I

11

11

In benchmarking some Java code on a Solaris SPARC box, I noticed that the first time I call the benchmarked function it runs EXTREMELY slowly (10x difference):

  • First | 1 | 25295.979 ms
  • Second | 1 | 2256.990 ms
  • Third | 1 | 2250.575 ms

Why is this? I suspect the JIT compiler, is there any way to verify this?

Edit: In light of some answers I wanted to clarify that this code is the simplest possible test-case I could find exhibiting this behavior. So my goal isn't to get it to run fast, but to understand what's going on so I can avoid it in my real benchmarks.

Solved: Tom Hawtin correctly pointed out that my "SLOW" time was actually reasonable. Following this observation, I attached a debugger to the Java process. During the first, the inner loop looks like this:

0xf9037218:     cmp      %l0, 100
0xf903721c:     bge,pn   %icc,0xf90371f4        ! 0xf90371f4
0xf9037220:     nop
0xf9037224:     ld       [%l3 + 92], %l2
0xf9037228:     ld       [%l2 + 8], %l6
0xf903722c:     add      %l6, 1, %l5
0xf9037230:     st       %l5, [%l2 + 8]
0xf9037234:     inc      %l0
0xf9037238:     ld       [%l1], %g0
0xf903723c:     ba,pt    %icc,0xf9037218        ! 0xf9037218

On the following iterations, the loop looks like this:

0xf90377d4:     sub      %l2, %l0, %l3
0xf90377d8:     add      %l3, %l0, %l2
0xf90377dc:     add      %l2, 1, %l4
0xf90377e0:     inc      %l0
0xf90377e4:     cmp      %l0, 100
0xf90377e8:     bl,pn    %icc,0xf90377d8        ! 0xf90377d8

So HotSpot removed memory accesses from the inner loop, speeding it up by an order of magnitude.

Lesson: Do the math! I should have done Tom's calculation myself.

Benchmark Java code:

    private int counter;
    private int nThreads;

    private void measure(String tag) throws Exception {
            MyThread threads[] = new MyThread[nThreads];
            int i;

            counter = 0;

            for (i = 0; i < nThreads; i++)
                    threads[i] = new MyThread();

            long start = System.nanoTime();

            for (i = 0; i < nThreads; i++)
                    threads[i].start();

            for (i = 0; i < nThreads; i++)
                    threads[i].join();

            if (tag != null)
                    System.out.format("%-20s | %-2d | %.3f ms \n", tag, nThreads,
                                     new Double((System.nanoTime() - start) / 1000000.0));
    }
    public MyBench() {
            try {
                    this.nThreads = 1;
                    measure("First");
                    measure("Second");
                    measure("Third");
            } catch (Exception e) {
                    System.out.println("Error: " + e);
            }
    }

    private class MyThread extends Thread {
            public void run() {
                    while (counter < 10000000) {
                            // work
                            for (int j = 0; j < 100; j++)
                                    counter++;
                            counter -= 99;
                    }
            }
    }
Illustrator answered 13/5, 2009 at 20:28 Comment(5)
I can't think of an easy way to verify that it is the JIT compiler, but I would say that it almost certainly is the Hotspot going into action compiling the loop into native code.Cramp
By the way, this benchmark will be even more unreliable if you have more than one thread. Because you're using a global variable to control the outer loop, each thread will likely loop a different number of times; maybe more than 10000000, maybe less.Minerva
By the way (again), this may be a silly question, but have you run this more than once with similar results? A single test is basically meaningless.Minerva
By my calculation on 1 GHz processor, even the "fast" times would be taking 200 cycles every increment, which seems suspiciously slow to me (I wouldn't be surprised to see -server optimise the lot away).Trundle
Ah. My mistake I misread the code...Trundle
T
8

Some ugly, unrealistic code (the stuff of microbenchmarks):

                while (counter < 10000000) {
                        // work
                        for (int j = 0; j < 100; j++)
                                counter++;
                        counter -= 99;
                }

So what is this doing and how fast should it run.

The inner loop increments counter 100 times, then the counter is decremented by 99. So an increment of 1. Note counter is a member variable of an outer class, so some overhead there. This is then run 10,000,000 times. So the inner loop is run 1,000,000,000 times.

A loop using to accessor methods, call it 25 cycles. 1,000,000,000 times at 1 GHz, gives 25s.

Hey, we predicted the SLOW time. The slow time is fast. The fast times are after the benchmark has been broken in some way - 2.5 cycles an iteration? Use -server and you might find it gets even more silly.

Trundle answered 13/5, 2009 at 21:17 Comment(5)
accesor methods can be inlined, eliminating the overhead. loops can be unrooled, reducing the overhead and modern processors have branch prediction and may be able to do several operations at once, so an average of 2,5 cycles per iteration is normal.Beaird
and loops can be removed - easier than inlining virtual methods. so an average of 0.0 cycles per iteration is normal.Trundle
Tom, thanks. Following your hint I also benchmarked this code in C (verifying the assembly code) and it took on the order of 20 seconds as well. So I attached a debugger to the Java process. Turns out that after the first measure() call, the JIT compiler optimizes out the memory accesses in the inner loop and makes the loop work on registers.Illustrator
I wouldn't be surprised even if JIT would eventually optimize that code into "count = 10000000;"Sgraffito
Yeah, it's the sort of thing server HotSpot might do. Might well do it if you assign counter to zero. Currently it would have to do something like if (this$0.count < 10000000) this$0.count = 10000000;. Not the best benchmark.Trundle
S
5

Probably it's class loading or dynamic linking of native methods. If you run Java with the following JVM parameters (see here for full list), it will print information about what is taking the time:

-verbose:class -verbose:jni -verbose:gc -XX:+PrintCompilation

To find out exactly where each of the measure() calls start and end, add initializations of some new classes between those methods as markers, so that -verbose:class will show at what point in the logs the marker class is loaded. See this answer for a similar measurement.

To find out exactly what your code does, I modified it like this:

public MyBench() {
    try {
        this.nThreads = 1;
        new Mark1();
        measure("First");
        new Mark2();
        measure("Second");
        new Mark3();
        measure("Third");
        new Mark4();
    } catch (Exception e) {
        System.out.println("Error: " + e);
    }
}

private static class Mark1 {
}
private static class Mark2 {
}
private static class Mark3 {
}
private static class Mark4 {
}

Then by looking at when the JVM loaded those Mark1 etc. classes, here are the results.

During the first call to measure(), a total of 85 classes were loaded, 11 native methods were dynamically linked and 5 methods were JIT compiled:

[Loaded MyBench$Mark1 from file:/D:/DEVEL/Test/classes/]
[Loaded java.net.InetSocketAddress from shared objects file]
[Loaded java.net.InetAddress from shared objects file]
[Loaded MyBench$MyThread from file:/D:/DEVEL/Test/classes/]
[Loaded sun.security.action.GetBooleanAction from shared objects file]
[Dynamic-linking native method java.net.InetAddress.init ... JNI]
[Loaded java.net.InetAddress$Cache from shared objects file]
[Loaded java.lang.Enum from shared objects file]
[Loaded java.net.InetAddress$Cache$Type from shared objects file]
[Loaded java.net.InetAddressImplFactory from shared objects file]
[Dynamic-linking native method java.net.InetAddressImplFactory.isIPv6Supported ... JNI]
 22       MyBench::access$508 (12 bytes)
[Loaded java.net.InetAddressImpl from shared objects file]
[Loaded java.net.Inet4AddressImpl from shared objects file  1%      MyBench$MyThread::run @ 14 (48 bytes)
]
[Loaded sun.net.spi.nameservice.NameService from shared objects file]
[Loaded java.net.InetAddress$1 from shared objects file]
[Loaded java.net.Inet4Address from shared objects file]
[Dynamic-linking native method java.net.Inet4Address.init ... JNI]
[Dynamic-linking native method java.net.PlainSocketImpl.socketCreate ... JNI]
[Dynamic-linking native method java.net.PlainSocketImpl.socketBind ... JNI]
[Dynamic-linking native method java.net.PlainSocketImpl.socketListen ... JNI]
[Loaded java.net.Socket from shared objects file]
[Dynamic-linking native method java.net.PlainSocketImpl.socketAccept ... JNI]
[Loaded java.lang.Integer$IntegerCache from shared objects file]
[Loaded java.util.Formatter from C:\Program Files\Java\jdk1.6.0_11\jre\lib\rt.jar]
[Loaded java.util.regex.Pattern$6 from C:\Program Files\Java\jdk1.6.0_11\jre\lib\rt.jar]
[Loaded java.text.DecimalFormatSymbols from shared objects file]
[Loaded java.util.spi.LocaleServiceProvider from shared objects file]
[Loaded java.text.spi.DecimalFormatSymbolsProvider from shared objects file]
[Loaded sun.util.LocaleServiceProviderPool from shared objects file]
[Loaded java.util.LinkedHashSet from shared objects file]
[Loaded sun.util.LocaleServiceProviderPool$1 from shared objects file]
[Loaded java.util.ServiceLoader from shared objects file]
[Loaded java.util.ServiceLoader$LazyIterator from shared objects file]
[Loaded java.util.ServiceLoader$1 from shared objects file]
[Loaded java.util.HashMap$EntrySet from shared objects file]
[Loaded java.util.LinkedHashMap$LinkedHashIterator from shared objects file]
[Loaded java.util.LinkedHashMap$EntryIterator from shared objects file]
[Loaded sun.misc.Launcher$1 from shared objects file]
 23  !    java.io.BufferedReader::readLine (304 bytes)
[Loaded sun.misc.Launcher$2 from shared objects file]
[Loaded sun.misc.URLClassPath$2 from shared objects file]
[Loaded java.lang.ClassLoader$2 from shared objects file]
[Loaded sun.misc.URLClassPath$1 from shared objects file]
[Loaded java.net.URLClassLoader$3 from shared objects file]
[Loaded sun.misc.CompoundEnumeration from shared objects file]
 24       sun.nio.cs.UTF_8$Decoder::decodeArrayLoop (553 bytes)
[Loaded java.io.FileNotFoundException from shared objects file]
[Loaded java.net.URLClassLoader$3$1 from shared objects file]
[Dynamic-linking native method java.security.AccessController.doPrivileged ... JNI]
[Loaded sun.util.resources.LocaleData from shared objects file]
[Loaded sun.util.resources.LocaleData$1 from shared objects file]
[Loaded java.util.ResourceBundle$Control from shared objects file]
[Loaded sun.util.resources.LocaleData$LocaleDataResourceBundleControl from shared objects file]
[Loaded java.util.Arrays$ArrayList from shared objects file]
[Loaded java.util.Collections$UnmodifiableCollection from shared objects file]
 25       java.lang.String::startsWith (78 bytes)
[Loaded java.util.Collections$UnmodifiableList from shared objects file]
[Loaded java.util.Collections$UnmodifiableRandomAccessList from shared objects file]
[Loaded java.util.ResourceBundle from shared objects file]
[Loaded java.util.ResourceBundle$1 from shared objects file]
[Dynamic-linking native method java.util.ResourceBundle.getClassContext ... JNI]
[Loaded java.util.ResourceBundle$RBClassLoader from shared objects file]
[Loaded java.util.ResourceBundle$RBClassLoader$1 from shared objects file]
[Loaded java.util.ResourceBundle$CacheKey from shared objects file]
[Loaded java.util.ResourceBundle$CacheKeyReference from shared objects file]
[Loaded java.util.ResourceBundle$LoaderReference from shared objects file]
[Loaded java.util.ResourceBundle$SingleFormatControl from shared objects file]
[Loaded sun.util.LocaleDataMetaInfo from shared objects file]
[Loaded java.util.AbstractList$Itr from shared objects file]
[Loaded java.util.ListResourceBundle from shared objects file]
[Loaded sun.text.resources.FormatData from shared objects file]
[Dynamic-linking native method java.lang.Class.isAssignableFrom ... JNI]
[Loaded java.util.ResourceBundle$BundleReference from shared objects file]
[Loaded sun.text.resources.FormatData_fi from C:\Program Files\Java\jdk1.6.0_11\jre\lib\rt.jar]
[Loaded sun.text.resources.FormatData_fi_FI from C:\Program Files\Java\jdk1.6.0_11\jre\lib\rt.jar]
[Loaded java.util.Currency from shared objects file]
[Loaded java.util.Currency$1 from shared objects file]
[Loaded java.util.CurrencyData from shared objects file]
[Loaded sun.reflect.UnsafeFieldAccessorFactory from shared objects file]
[Loaded sun.reflect.UnsafeQualifiedStaticFieldAccessorImpl from shared objects file]
[Loaded sun.reflect.UnsafeQualifiedStaticObjectFieldAccessorImpl from shared objects file]
[Loaded java.util.spi.CurrencyNameProvider from shared objects file]
[Loaded sun.util.resources.OpenListResourceBundle from shared objects file]
[Loaded sun.util.resources.LocaleNamesBundle from shared objects file]
[Loaded sun.util.resources.CurrencyNames from shared objects file]
[Loaded sun.util.resources.CurrencyNames_fi_FI from C:\Program Files\Java\jdk1.6.0_11\jre\lib\rt.jar]
[Loaded java.util.regex.MatchResult from shared objects file]
[Loaded java.util.regex.Matcher from shared objects file]
[Loaded java.util.regex.ASCII from shared objects file]
[Loaded java.util.Formatter$FormatString from C:\Program Files\Java\jdk1.6.0_11\jre\lib\rt.jar]
[Loaded java.util.Formatter$FormatSpecifier from C:\Program Files\Java\jdk1.6.0_11\jre\lib\rt.jar]
[Loaded java.util.Formatter$Flags from C:\Program Files\Java\jdk1.6.0_11\jre\lib\rt.jar]
[Loaded java.util.Formatter$Conversion from C:\Program Files\Java\jdk1.6.0_11\jre\lib\rt.jar]
[Loaded java.util.Formatter$FixedString from C:\Program Files\Java\jdk1.6.0_11\jre\lib\rt.jar]
[Loaded java.util.Formattable from C:\Program Files\Java\jdk1.6.0_11\jre\lib\rt.jar]
[Dynamic-linking native method java.io.FileOutputStream.writeBytes ... JNI]
First                | 1  | [Loaded sun.misc.FormattedFloatingDecimal from C:\Program Files\Java\jdk1.6.0_11\jre\lib\rt.jar]
[Loaded sun.misc.FormattedFloatingDecimal$1 from C:\Program Files\Java\jdk1.6.0_11\jre\lib\rt.jar]
[Loaded sun.misc.FormattedFloatingDecimal$Form from C:\Program Files\Java\jdk1.6.0_11\jre\lib\rt.jar]
[Loaded sun.misc.FormattedFloatingDecimal$2 from C:\Program Files\Java\jdk1.6.0_11\jre\lib\rt.jar]
2072,825 ms 

During the second call, only one method was JIT compiled:

[Loaded MyBench$Mark2 from file:/D:/DEVEL/Test/classes/]
 26       MyBench$MyThread::run (48 bytes)
Second               | 1  | 2058,669 ms 

During the third call, there was no additional work happening:

[Loaded MyBench$Mark3 from file:/D:/DEVEL/Test/classes/]
Third                | 1  | 2093,659 ms 

This was run on Windows with jdk1.6.0_11, so your system might do things a little bit differently. For example, maybe one of those linkings of dynamic methods is exceptionally slow on your system. Or then all of the class loading is slower. Try to have a look at the logs, whether there is an unusually long pause, or whether all of those operations are equally slow.

Sgraffito answered 13/5, 2009 at 21:31 Comment(1)
True, it would be unusual for that to take so long, unless the system was something seriously broken. Tom and Adam found out the real reason.Sgraffito
D
3

Add class loading in as a suspect. Classes are loaded lazily on first reference. So the first time the code runs, you're probably referencing some classes for the first time.

Duress answered 13/5, 2009 at 20:35 Comment(0)
C
3

The best way to verify if the JIT compiler is the reason for the speedup in later iterations is to run the benchmark with the JIT compiler turned off. To do this, specify the system property java.compiler=NONE (the word "none" must be in upper case).

Time spent doing class loading can also cause the benchmarked code to run slower the first time. Finally, there is a nondeterministic delay between calling Thread.start() and the Thread's run() method being called.

You might want to consider finding a benchmark framework. A good framework will "warm up" the code by running several iterations, then do multiple timings with a different number of iterations. See Java theory and practice: Anatomy of a flawed microbenchmark.

Cumin answered 13/5, 2009 at 21:4 Comment(0)
M
1

That's an interesting question. I'd suspect the JIT compiler, but these are my numbers:

First                | 1  | 2399.233 ms 
Second               | 1  | 2322.359 ms 
Third                | 1  | 2408.342 ms 

Possibly Solaris is doing something funny with threads; have you tried with nThreads = 10 or so?

Minerva answered 13/5, 2009 at 20:34 Comment(1)
This is a stripped down version of the code, though it exhibits the same behavior. I just wanted to post the simplest possible case here. In other words, I'm not trying to make this specific benchmark as fast as possible, but rather understand what went wrong here so it won't impact my real benchmarking.Illustrator
E
1

I suggest you make nThread = Runtime.getRuntime().availableProcessors() This will give you the optimal number of threads to use all the cores in your system.

You can try turning off the JIT to see what difference it makes.

Equal answered 13/5, 2009 at 20:43 Comment(2)
Thanks, I'm not trying to make this specific code fast, but rather trying to understand what is going on behind this scene to make it exhibit this behavior.Illustrator
When Java first runs code it does so in an interperated mode. It does this as most code is only executed once and its not worth coding code in this situations. When code is used repeatedly it is optimised and can get further optimised based on changes in how the code is used. By default, full optimisation waits until something is called 10,000 times. It does this to collect statistics on how the code is used before optimises the code. e.g. Reducing this number can lead to poorer performance in the long run.Equal
N
1

You can get the VM to log information about classloading and compilation, try the following VM args: -XX:+PrintCompilation -XX:+TraceClassLoading This might give some further clues as to what's happening under the hood.

EDIT: I'm not sure those options work in java 1.5 (I've used them in 1.6). I'll try to check... EDIT again: It works in java 1.5 (note you need +, not -, or you turn the option off...)

Nelle answered 13/5, 2009 at 20:54 Comment(0)
T
1

I believe you can also use the non-standard option for the java command of -Xint to disable HotSpot and have your code interpreted only. This could at least take HotSpot out of the equation for interpreting your timing.

Turntable answered 13/5, 2009 at 21:11 Comment(0)
K
0

It's the hotspot compiler at work. AFAIK, the first time it runs the function, runs "interpreted" and the execution path is analyzed, then the JIT compiler can optimize the subsequent function calls.

Kalevala answered 13/5, 2009 at 20:34 Comment(1)
It should replace the code on the stack midway through (that was introduced in 2000). Perhaps there is something in this example that is preventing it.Trundle
S
0

It's most certainly the hotspot compiler. If you're running on 64 bit solaris it defaults to the server VM and hotspot just start optimizing on first execution. On the client VM the code may need to run a few times before hotspot kicks in. (i believe solaris only has the server vm but I may be wrong)

Stingy answered 13/5, 2009 at 20:47 Comment(0)
T
0

See http://java.sun.com/javase/6/docs/technotes/guides/vm/server-class.html for how the launcher selects between client and server VM, and what is supported on the different processors and OSes.

Teddy answered 13/5, 2009 at 21:1 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.