Allocation latency seems high, why?
I have a (Java) application that runs in a low latency environment; it typically processes instructions in ~600micros (+/- 100). Naturally, as we've moved further into the microsecond space, the things that cost latency change, and right now we've noticed that 2/3 of that time is spent in the allocation of the two core domain objects.

Benchmarking has isolated the offending sections of code to literally the construction of the objects from existing references, i.e. basically a load of references (~15 in each class) and a couple of lists new'ed, though see note below on exactly what is measured here.

Each one consistently takes ~100micros, which is inexplicable to me, and I'm trying to track down why. A quick benchmark suggests a similarly sized object full of strings takes about 2-3micros to new up; obviously this sort of benchmark is fraught with difficulty, but I thought it might be useful as a baseline.
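For concreteness, a naive baseline of the sort mentioned above might look like this (the class shape is hypothetical: roughly 15 reference fields plus two lists, not the real domain object):

```java
import java.util.ArrayList;
import java.util.List;

// Naive allocation baseline: a hypothetical object of roughly the shape
// described in the question (~15 reference fields plus two ArrayLists).
public class AllocBaseline {
    static final class Holder {
        final Object a, b, c, d, e, f, g, h, i, j, k, l, m, n, o;
        final List<Object> list1 = new ArrayList<Object>();
        final List<Object> list2 = new ArrayList<Object>();
        Holder(Object ref) {
            // wire every field to an existing reference, no other work
            a = b = c = d = e = f = g = h = i = j = k = l = m = n = o = ref;
        }
    }

    static volatile Object sink; // let the object escape so allocation isn't elided

    public static long averageNanos(Object ref, int iterations) {
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            sink = new Holder(ref);
        }
        return (System.nanoTime() - start) / iterations; // ns per allocation
    }

    public static void main(String[] args) {
        Object ref = new Object();
        averageNanos(ref, 100000); // warm-up pass so the loop gets compiled
        System.out.println(averageNanos(ref, 100000) + " ns per allocation");
    }
}
```

The `volatile` sink is there because a purely local, never-read allocation can be optimised away entirely, which would make the "baseline" meaningless.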

There are two questions here:

  • how does one investigate this sort of behaviour?
  • what explanations are there for slow allocation?

Note: the hardware involved is Solaris 10 x86 on Sun X4600s with 8 dual-core Opterons @ 3.2GHz.

Things we've looked at include:

  • checking the PrintTLAB stats, which show very few slow allocations, so there should be no contention there.
  • PrintCompilation suggests one of these bits of code is not JIT friendly, though Solaris seems to have some unusual behaviour here (vis-à-vis a modern Linux; I don't have a Linux of similar vintage to Solaris 10 to benchmark on right now)
  • LogCompilation... a bit harder to parse, to say the least, so that is an ongoing job; nothing obvious so far
  • JVM versions... consistent across 6u6 and 6u14; not tried 6u18 or the latest 7 yet

Any & all thoughts appreciated

A summary of the comments on assorted posts, to try and make things clearer:

  • the cost I'm measuring is the total cost of creating the object, which is built via a Builder (like one of these) and whose private constructor invokes new ArrayList a few times as well as setting references to existing objects. The cost measured covers both setting up the builder and converting the builder to a domain object
  • compilation (by hotspot) has a marked impact but it's still relatively slow (compilation in this case takes it from 100micros down to ~60)
  • compilation (by hotspot) on my naive benchmark takes allocation time down from ~2micros to ~300ns
  • latency does not vary with young gen collection algo (ParNew or Parallel scavenge)
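For context, the Builder-plus-private-constructor shape described above looks roughly like this (all names here are illustrative, not the real domain class):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the pattern described: a Builder that collects
// pre-existing references, and a private constructor that copies them and
// news up a couple of empty lists. No real "work" happens in construction.
public final class Order {
    private final String id;
    private final Object counterparty;          // stands in for other domain refs
    private final List<Object> legs = new ArrayList<Object>();
    private final List<Object> events = new ArrayList<Object>();

    private Order(Builder b) {                  // just wiring, nothing else
        this.id = b.id;
        this.counterparty = b.counterparty;
    }

    public String id() { return id; }

    public static final class Builder {
        private String id;
        private Object counterparty;

        public Builder id(String id) { this.id = id; return this; }
        public Builder counterparty(Object c) { this.counterparty = c; return this; }
        public Order build() { return new Order(this); }
    }
}
```

Usage is then `Order o = new Order.Builder().id("42").counterparty(cpty).build();`, mirroring the "set references, new up lists, no work" description above.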
Weir answered 17/11, 2009 at 20:36 Comment(9)
Your question is extremely verbose, but do I understand it correctly that you wonder why it takes 150µs to instantiate a list? If so, which list implementation? And what exactly is a "load of reference"?Glyptodont
Could you post (part of) the offending code?Asymptote
"Benchmarking has isolated the offending sections of code to literally the construction of the objects from existing references, i.e. basically a load of references" - can you supply the code?Ube
I mean that I have a class that has a number of attributes (a few Strings, a couple of enums, a few longs, some other domain objects) that have already been allocated/completely constructed and all the constructor (really a builder that calls a private ctor) does is set the class members to point at those references & also creates a couple of empty ArrayLists. There is no "work" being done in the construction of this object.Weir
not sure how to post the code in a meaningful way tbhWeir
Have you tried benchmarking construction of your domain object with existing ArrayList refs instead of allocating new ones?Epagoge
As a comparison, instantiating an ArrayList takes about 30ns on my system, which is a factor of roughly 4000 off the results you are referring to. Without you providing more details on your setup (are you perhaps using some kind of object instrumentation, aspects or similar), I suppose no one is really able to help you.Glyptodont
Could garbage collection be playing a part in this?Kellner
no instrumentation or aspects involved, a naive benchmark (allocating an object that is consumed by another thread hence escapes) of something like a similar size shows time to allocate as about 2micros dropping to <300ns when the call is compiled.Weir

Since your question was more about how to investigate the problem than 'what is my problem', I'll stick to some tools to try out.

A very useful tool for getting a better idea of what is going on, and when, is BTrace. It is similar to DTrace but is a pure Java tool. On that note, I'm assuming you know DTrace; if not, that is also useful, if somewhat obtuse. These will give you some visibility into what is happening, and when, in the JVM and the OS.

Oh, another thing to clarify in your original posting: which collector are you running? Given the latency sensitivity, I'm assuming you are using a low-pause collector like CMS. If so, have you tried any tuning?
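For what it's worth, a typical starting point for watching CMS behaviour on a 6u-era JVM looks something like this (the flag set is a suggestion only, and `app.jar` stands in for the real launcher):

```shell
# CMS with ParNew, plus GC logging to see pauses and stopped time
java -XX:+UseConcMarkSweepGC -XX:+UseParNewGC \
     -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
     -XX:+PrintGCApplicationStoppedTime \
     -Xloggc:gc.log \
     -jar app.jar   # hypothetical launcher
```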

Capillary answered 18/11, 2009 at 13:45 Comment(1)
yes, we're using CMS. I haven't benched/tuned for allocation performance so far, only for overall pause times. Another round of GC tuning with allocation performance in mind is on the list. Btrace looks interesting, will try this out. DTrace sadly is not allowed in our environment thanks to our ever useful engineering group. Sad & bizarre but true.Weir

When you are repeating the same task many times, your CPU tends to run very efficiently, because cache misses and CPU warm-up no longer appear as a factor. It is also possible you are not accounting for JVM warm-up time either.

If you try the same thing when the JVM and/or the CPU is not warmed up, you will get very different results.

Try doing the same thing, say, 25 times (less than your compile threshold) and sleep(100) between tests. You should expect to see much higher times, closer to what you see in the real application.

The behaviour of your app will differ, but to illustrate my point: I have found that waiting for IO can be more disruptive than a plain sleep.

When performing your benchmark, you should try to ensure you are comparing like with like.

import java.io.*;
import java.util.Date;

/**
Cold JVM with a Hot CPU took 123 us average
Cold JVM with a Cold CPU took 403 us average
Cold JVM with a Hot CPU took 314 us average
Cold JVM with a Cold CPU took 510 us average
Cold JVM with a Hot CPU took 316 us average
Cold JVM with a Cold CPU took 514 us average
Cold JVM with a Hot CPU took 315 us average
Cold JVM with a Cold CPU took 545 us average
Cold JVM with a Hot CPU took 321 us average
Cold JVM with a Cold CPU took 542 us average
Hot JVM with a Hot CPU took 44 us average
Hot JVM with a Cold CPU took 111 us average
Hot JVM with a Hot CPU took 32 us average
Hot JVM with a Cold CPU took 96 us average
Hot JVM with a Hot CPU took 26 us average
Hot JVM with a Cold CPU took 80 us average
Hot JVM with a Hot CPU took 26 us average
Hot JVM with a Cold CPU took 90 us average
Hot JVM with a Hot CPU took 25 us average
Hot JVM with a Cold CPU took 98 us average
 */
public class HotColdBenchmark {
    public static void main(String... args) {
        // load all the classes.
        performTest(null, 25, false);
        for (int i = 0; i < 5; i++) {
            // still pretty cold
            performTest("Cold JVM with a Hot CPU", 25, false);
            // still pretty cold
            performTest("Cold JVM with a Cold CPU", 25, true);
        }

        // warmup the JVM
        performTest(null, 10000, false);
        for (int i = 0; i < 5; i++) {
            // warmed up.
            performTest("Hot JVM with a Hot CPU", 25, false);
            // bit cold
            performTest("Hot JVM with a Cold CPU", 25, true);
        }
    }

    public static long performTest(String report, int n, boolean sleep) {
        long time = 0;
        long ret = 0;
        for (int i = 0; i < n; i++) {
            long start = System.nanoTime();
            try {
                ByteArrayOutputStream baos = new ByteArrayOutputStream();
                ObjectOutputStream oos = new ObjectOutputStream(baos);
                oos.writeObject(new Date());
                oos.close();
                ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(baos.toByteArray()));
                Date d = (Date) ois.readObject();
                ret += d.getTime();
                time += System.nanoTime() - start;
                if (sleep) Thread.sleep(100);
            } catch (Exception e) {
                throw new AssertionError(e);
            }
        }
        if (report != null) {
            System.out.printf("%s took %,d us average%n", report, time / n / 1000);
        }
        return ret;
    }
}
Koa answered 6/1, 2010 at 21:1 Comment(8)
completely agree but am confident this is not the issue in this case as I've done long benchmarks (>24hrs) and repeated benchmarks. This means n * 30 runs of 20mins greater than compilation time and reporting on the aggregated results, i.e. reporting on performance every n s and then summarising the results per period (which means an equivalent stage in the run) with min/max/avg and so on. The results are consistent once compilation has kicked in. Also this is an application benchmark not a synthetic one hence load is representative.Weir
Is your CPU hot or cold for the timed period? I.e. does it block or sleep before the timed period?Koa
the server(s) uptime is usually into the months & this is an application benchmark, so it's running in normal conditions; hence I'm not directly controlling the state of the cores before the test starts. They will be mostly idle. Given that there are 16 cores on the box, no processor sets in place, and the Solaris scheduler tends to move lwps around as it feels like, it's pretty much impossible to control the state of any single core.Weir
So if your CPU is typically cold, then you need to benchmark how your application runs from a cold start to get comparable results. I have found latency increases 2-5x in this case. BTW, don't assume there is nothing you can do about it. ;)Koa
the results are repeatable across many runs when the app is subjected to normal load. It never calls Thread.sleep (except at startup); in general the processing threads will either be active or waiting for work (aka Unsafe.park via a LinkedTransferQueue.take). I may well be missing your point, but your example says to me "do your best to force yourself onto the cpu and never yield to keep things spinning". I don't see how this is viable in a real app unless you have the whole box to a single process, sadly not a luxury I have!Weir
also worth bearing in mind that the particularly bit of code I'm concerned about is in the middle of some other processing hence the thread is actively on the cpu (given that we don't overload the box then it's rare to have an active thread bumped) and is doing a variety of (related) work as opposed to just the specific bit I'm talking about hereWeir
You have to decide whether it is worth dedicating a thread/CPU to a task by busy waiting or not. If you dedicate the CPU, you will get better latency timings, if you don't, the CPU can be used to perform other tasks (from what you say it suggests this isn't really required) If latency is a real concern, you shouldn't be passing work around between threads in any case.Koa
it's not passing units of work across threads but I do have to balance latency against throughput so tying up a thread like that isn't really feasible.Weir

Memory allocation can cause side-effects. Is it possible that the memory allocation is causing the heap to get compacted? Have you looked to see if your memory allocation is causing the GC to run at the same time?

Have you separately timed how long it takes to create the new ArrayLists?
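Timing the ArrayList allocations in isolation is straightforward; a rough sketch (with the usual micro-benchmark caveats about warm-up and dead-code elimination):

```java
import java.util.ArrayList;

// Rough sketch: average time to construct an empty ArrayList.
public class ArrayListTiming {
    static volatile Object sink; // prevent the allocation being optimised away

    public static long averageNanos(int iterations) {
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            sink = new ArrayList<Object>();
        }
        return (System.nanoTime() - start) / iterations; // ns per allocation
    }

    public static void main(String[] args) {
        averageNanos(1000000); // warm-up so the loop gets compiled
        System.out.println(averageNanos(1000000) + " ns per ArrayList");
    }
}
```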

Toluol answered 18/11, 2009 at 11:6 Comment(0)

There is probably no hope of getting microsecond-latency guarantees from a general-purpose VM running on a general-purpose OS, even with such great hardware. Massive throughput is the best you can hope for. How about switching to a real-time VM if you need one? (I'm talking RTSJ and all that...)

...my two cents

Disfavor answered 18/11, 2009 at 11:12 Comment(1)
this is true, however we don't really have hard realtime needs (typically as fast as possible is good enough) & RTSJ is rather invasive afaik. In general, as long as I can explain where time is spent then I'm happy. In this case, it is inexplicably slow so I need to understand it, otherwise we have unknown behaviour. IME unknown behaviour tends to result in problems further down the line.Weir

Just some wild guesses:

It's my understanding that Java VMs handle the memory of short-lived objects differently from long-term objects. It would seem reasonable to me that the point where an object goes from having one function-local reference to having references in the global heap would be a big event. Instead of being available for cleanup at function exit, it now has to be tracked by the GC.

Or it could be that going from one reference to multiple references to a single object has to change the GC accounting. As long as an object has a single reference it is easy to clean up. Multiple references can have reference loops and/or the GC may have to search for the reference in all other objects.

Deserted answered 23/12, 2009 at 15:54 Comment(3)
every object is managed by GC & allocation should be pretty cheap given that at the point the object is allocated it is allocated from a preallocated thread local buffer (TLAB) hence is essentially a bump the pointer event. There is a cost to an intergenerational reference but this is, AFAIK, small at the point of creation and incurs a cost at collect time due to the increased cost of determining what is live or not. on the 2nd point, it sounds like you're talking about reference counting which JVMs do not do.Weir
Matt: regarding reference counting: there is reference counting, and there is reference counting... A GC that knows the difference between 1 and Many references can make things much easier on itself. When cleaning objects that hold the one and only reference to another object, that object can also be immediately collected.Deserted
right but the cost (or saving) in that case would be at collect time as opposed to allocation timeWeir
