How can a for loop with no object allocations cause JVM jitter?

I have been micro-benchmarking the following code and noticed something interesting that I hope somebody can shed more light on. It leads to a situation where it looks like a for loop can carry on running fast while blocking other threads within the JVM. If that is true then I would like to understand why; if it is not true, then any insight into what I may be missing would be appreciated.

To build up the situation, let me walk you through the benchmark that I am running and its results.

The code is pretty simple: iterate over every element in an array, summing its contents, and repeat 'targetCount' times.

public class UncontendedByteArrayReadBM extends Benchmark {

    private int    arraySize;
    private byte[] array;

    public UncontendedByteArrayReadBM( int arraySize ) {
        super( "array reads" );

        this.arraySize = arraySize;
    }

    @Override
    public void setUp() {
        super.setUp();

        array = new byte[arraySize];
    }

    @Override
    public void tearDown() {
        array = null;
    }

    @Override
    public BenchmarkResult invoke( int targetCount ) {
        long sum = 0;
        for ( int i=0; i<targetCount; i++ ) {
            for ( int j=0; j<arraySize; j++ ) {
                sum += array[j];
            }
        }

        // sum is fed into the result so that the loops cannot be removed as dead code
        return new BenchmarkResult( ((long)targetCount)*arraySize, "uncontended byte array reads", sum );
    }

}
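
For reference, the harness drives this class roughly as follows (a sketch only; the array size here is illustrative, and the real harness also handles the warm-up runs and timing):

UncontendedByteArrayReadBM bm = new UncontendedByteArrayReadBM( 16*1024*1024 );  // illustrative size

bm.setUp();                                  // allocate the byte array once
BenchmarkResult result = bm.invoke( 1000 );  // 1000 passes over the array
bm.tearDown();                               // release the array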

On my 4-core 2GHz Intel Sandy Bridge i7 laptop, running Java 6 (Oracle JVM) on OS X, this code repeatedly runs to:

2.626852686364034 uncontended byte array reads/ns [totalTestRun=3806.837ms]

(I have culled the repeated runs used to warm up the JVM.)

This result appears reasonable to me.

Where this got interesting is when I started measuring JVM jitter. To do that I start a background daemon thread that sleeps for 1ms and then works out how much longer than 1ms it was actually asleep. I also changed the report to print out the max jitter for each repeated test run.

2.6109858273078306 uncontended byte array reads/ns [maxJitter=0.411ms totalTestRun=3829.971ms]

To get an idea of 'normal' jitter for my environment, I monitor the jitter with no work going on before starting the actual test runs; readings such as the following are typical (all in ms). Thus a jitter of 0.411ms is normal and not so interesting.

getMaxJitterMillis() = 0.599
getMaxJitterMillis() = 0.37
getMaxJitterMillis() = 0.352

I have included the code for how I measured Jitter at the end of this question.

The interesting part, however, is the following. Yes, it happens during the 'JVM warmup' period and is thus not 'normal', but it is what I would like to understand in more detail:

2.4519521584902644 uncontended byte array reads/ns  [maxJitter=2561.222ms totalTestRun=4078.383ms]

Notice that the jitter is over 2.5 seconds. Usually I would put this down to GC. However, I did trigger a System.gc() before the test runs, AND -XX:+PrintGCDetails shows no GC at this time. In fact there is no GC during any of the test runs, as next to no object allocation happens in this test of summing up pre-allocated bytes. It also happens every time I run a new test, so I do not suspect interference from some other process happening randomly.
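
For reference, the pre-run GC check amounted to something like this (a sketch; exception handling and the surrounding driver are omitted):

System.gc();   // request a full collection before the timed runs start
// ...then run with -XX:+PrintGCDetails and confirm that no GC lines
// appear in the log while the tests execute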

My curiosity rocketed when I noticed that, while the jitter is very high, the total runtime, and indeed the number of array element reads per nanosecond, remained more or less unchanged. So here is a situation where one thread is getting heavily lagged on a 4-core machine, while the worker thread itself is not lagged and there is no GC going on.

Investigating further I looked into what the Hotspot compiler was doing and found the following via -XX:+PrintCompilation:

2632   2%      com.mosaic.benchmark.datastructures.array.UncontendedByteArrayReadBM::invoke @ 14 (65 bytes)
6709   2%     made not entrant  com.mosaic.benchmark.datastructures.array.UncontendedByteArrayReadBM::invoke @ -2 (65 bytes)

The gap between these two lines being printed was about 2.5 seconds, right when the optimised code for the method containing the big for loops was marked as no longer entrant.

My understanding was that HotSpot compiles on a background thread, and when it is ready to swap in a new version of the code it waits for the already-running code to reach a safepoint and then swaps it in. In the case of a big for loop, that safepoint would be at the end of each loop body (which may have been unrolled somewhat). I would not expect a 2.5s delay unless this swap had to perform a stop-the-world event across the JVM. Does the JVM do that when de-optimising previously compiled code?

So my first question for the JVM internal experts is, am I on the right track here? Could the 2.5s delay be due to marking the method as 'made not entrant'; and if so, why does it have such an extreme impact on other threads? If that is not likely to be the cause, then any ideas on what else to investigate would be great.

(for completeness, here is the code that I use for measuring the Jitter)

private static class MeasureJitter extends Thread {
    // requires: import java.util.concurrent.atomic.AtomicLong;
    private AtomicLong maxJitterWitnessedNS = new AtomicLong(0);

    public MeasureJitter() {
        setDaemon( true );
    }

    public void reset() {
        maxJitterWitnessedNS.set( 0 );
    }

    public double getMaxJitterMillis() {
        return maxJitterWitnessedNS.get()/1000000.0;
    }

    public void printMaxJitterMillis() {
        System.out.println( "getMaxJitterMillis() = " + getMaxJitterMillis() );
    }

    @Override
    public void run() {
        long preSleepNS = System.nanoTime();

        while( true ) {
            try {
                Thread.sleep( 1 );
            } catch (InterruptedException e) {
                e.printStackTrace();
            }

            long wakeupNS = System.nanoTime();

            // jitter = how much longer than the requested 1ms we actually slept
            long jitterNS = Math.max( 0, wakeupNS - (preSleepNS+1000000) );

            // only this thread writes, so lazySet is a cheap way to publish the max
            long max = Math.max( maxJitterWitnessedNS.get(), jitterNS );
            maxJitterWitnessedNS.lazySet( max );

            preSleepNS = wakeupNS;
        }
    }
}
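
The thread is driven roughly like this (a sketch; runTest() is a hypothetical stand-in for the benchmark run being measured):

MeasureJitter mj = new MeasureJitter();
mj.start();                  // daemon thread begins sampling immediately

mj.reset();                  // discard any jitter witnessed during warm-up
runTest();                   // hypothetical: the work being measured
mj.printMaxJitterMillis();   // max oversleep witnessed during the run
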
Broaddus answered 11/4, 2013 at 9:8

This took some digging to find the smoking gun, but the lessons have been valuable, especially on how to prove and isolate the cause. So I thought it good to document them here.

The JVM was indeed waiting to perform a Stop The World event. Alexey Ragozin has a very good blog post on this topic at http://blog.ragozin.info/2012/10/safepoints-in-hotspot-jvm.html, and it was this post that got me onto the right track. He points out that safepoints sit on JNI method boundaries and Java method calls, so the big for loop that I have here has no safepoint in it.

To see stop-the-world events in Java, use the following JVM flags: -XX:+PrintGCApplicationStoppedTime -XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1
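
For example (the main class here is a placeholder):

java -XX:+PrintGCApplicationStoppedTime \
     -XX:+PrintSafepointStatistics \
     -XX:PrintSafepointStatisticsCount=1 \
     <benchmark-main-class>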

The first prints out the total duration of each stop-the-world event; it is not limited to GC. In my case it printed out:

Total time for which application threads were stopped: 2.5880809 seconds

This proved that I had a problem with threads waiting to reach a safepoint. The next two flags print out why the JVM wanted to bring all threads to a global safepoint:

         vmop                    [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
4.144: EnableBiasedLocking              [      10          1              1    ]      [  2678     0  2678     0     0    ]  0   
Total time for which application threads were stopped: 2.6788891 seconds

So this says that the JVM waited 2678ms while trying to enable biased locking. Why is this a stop-the-world event? Thankfully Martin Thompson has also hit this problem in the past, and he has documented it here. It turns out that the Oracle JVM has quite a lot of thread contention during startup, and biased locking has a very high cost during that period, so the JVM delays turning the optimisation on for four seconds. What happened here was that my micro benchmark ran for more than four seconds, and its loop had no safepoints in it, so when the JVM tried to turn biased locking on it had to wait.

Candidate solutions that all worked for me were:

  1. -XX:-UseBiasedLocking (turn biased locking off)
  2. -XX:BiasedLockingStartupDelay=0 (enable biased locking immediately)
  3. Change the loop to have a safepoint within it (e.g. a method call that does not get optimised out or inlined); see the sketch below for one variation
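
As an illustration of option 3, here is a sketch of one variation. As I understand HotSpot's counted-loop optimisation, safepoint polls are only stripped from int-counted loops, so widening the inner loop counter to a long should keep a poll on the loop's back-edge (this is an assumption about the JIT's behaviour, not something verified against its generated code):

@Override
public BenchmarkResult invoke( int targetCount ) {
    long sum = 0;
    for ( int i=0; i<targetCount; i++ ) {
        // a long induction variable stops this being a 'counted' int loop,
        // so the safepoint poll on the back-edge should survive
        for ( long j=0; j<arraySize; j++ ) {
            sum += array[(int)j];
        }
    }

    return new BenchmarkResult( ((long)targetCount)*arraySize, "uncontended byte array reads", sum );
}
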
Broaddus answered 14/4, 2013 at 9:55

There are many causes of jitter:

  • sleep is very unreliable at the millisecond level.
  • context switches
  • interrupts
  • cache misses due to other programs running

Even if you busy wait, bind the thread to a CPU which has been isolated (e.g. with isolcpus), and move all the interrupts you can off that CPU, you can still see small amounts of jitter. All you can do is reduce it.
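
For example, a busy-wait probe looks something like this (a sketch; unlike a sleep-based probe it excludes scheduler wake-up latency, so any large gap means the thread was preempted or stalled):

long prevNS   = System.nanoTime();
long maxGapNS = 0;
while ( !Thread.currentThread().isInterrupted() ) {
    long nowNS = System.nanoTime();
    // any gap well above the cost of nanoTime() itself is jitter:
    // preemption, an interrupt, or a stop-the-world pause
    maxGapNS = Math.max( maxGapNS, nowNS - prevNS );
    prevNS   = nowNS;
}
System.out.println( "max gap = " + (maxGapNS/1000000.0) + "ms" );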

BTW: jHiccup does exactly what you are doing to measure the jitter of your system.

Tester answered 11/4, 2013 at 9:22
Agreed, and this advice has served me well over the last fifteen years :) But now I have an itch to push my understanding harder. I am aware that measuring time in Java is open to fairly large clock granularity problems as well as delays, perhaps as high as 16 or 32ms for System.currentTimeMillis(), which is why a common jitter of 0.4ms did not pique my interest. 2.5 seconds is major in comparison to that. – Broaddus
I am currently thinking that context switches and interrupts from external events would be more random than I am seeing; this 2.5s delay is like clockwork. And wouldn't CPU cache misses tend to show up more in the throughput of the array reads rather than as jitter on another thread? Unless perhaps the scheduler assigned the jitter thread to the same core as the one running the for loop (which is not yielding very often) and has yet to move it to another core... hmmm, how to test that? :) – Broaddus
BTW if you have two+ sockets and an old version of Windows, you will see that the CPU uptime on the two sockets is different. This could explain a fixed apparent delay of 2.5 seconds. I suggest you also print out negative delays; I would expect a -2.4 to -2.5 delay to appear in between your 2.5 second delays. – Tester
On newer OSes the kernel corrects for the difference in uptime between sockets, but old versions of Linux and Windows do not. – Tester
That is an interesting fact for multi-socket machines. This is running on a single-socket (quad core) laptop running OS X 10.8.3. But just to make certain, I added an extra AtomicLong to capture the Math.min of the jitter latencies (and made sure the call to max(0,jitter) was removed) and found that it always came through as 0ms. I think this weekend is going to be spent reading OpenJDK code :) – Broaddus
It is a common misconception that the JVM has control over how threading works. In general it just wraps the OS system calls, which do all the real work. – Tester
