JVM OutOfMemory error "death spiral" (not memory leak)

We have recently been migrating a number of applications from running under Red Hat Linux JDK 1.6.0_03 to Solaris 10u8 JDK 1.6.0_16 (much higher-spec machines) and we have noticed what seems to be a rather pressing problem: under certain loads our JVMs get themselves into a "death spiral" and eventually run out of memory. Things to note:

  • this is not a case of a memory leak. These are applications which have been running just fine (in one case for over 3 years), and the out-of-memory errors do not occur deterministically: sometimes the applications work, sometimes they don't
  • this is not us moving to a 64-bit VM - we are still running 32 bit
  • In one case, using the latest G1 garbage collector on 1.6.0_18 seems to have solved the problem. In another, moving back to 1.6.0_03 has worked
  • Sometimes our apps are falling over with HotSpot SIGSEGV errors
  • This is affecting applications written in Java as well as Scala

The most important point is this: the behaviour manifests itself in those applications which suddenly get a deluge of data (usually via TCP). It's as if the VM decides to keep adding more data (possibly promoting it to the tenured generation) rather than running a GC on the young generation ("newspace"), until it realises that it has to do a full GC and then, despite practically everything in the VM being garbage, it somehow decides not to collect it!

It sounds crazy but I just don't see what else it is. How else can you explain an app which one minute falls over with a max heap of 1Gb and the next works just fine (never going above 256M when the app is doing exactly the same thing)?
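
To give a flavour of the allocation pattern (a hypothetical, stripped-down sketch, not our actual application, and the sizes are invented):

    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical sketch of the bursty pattern: each burst allocates
    // ~100MB of short-lived "messages" and then drops every reference,
    // so almost all of it is garbage by the next iteration.
    public class BurstTest {
        public static void main(String[] args) {
            Runtime rt = Runtime.getRuntime();
            for (int burst = 0; burst < 100; burst++) {
                List<byte[]> messages = new ArrayList<byte[]>();
                for (int i = 0; i < 10000; i++) {
                    messages.add(new byte[10 * 1024]); // ~10KB per "message"
                }
                messages = null; // the whole burst is now unreachable
                long usedMb = (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024);
                System.out.println("burst " + burst + ": used=" + usedMb + "M");
            }
        }
    }

Nothing like this should ever need more than a couple of hundred megabytes live at once, yet our apps sometimes die with far bigger heaps.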

So my questions are:

  1. Has anyone else observed this kind of behaviour?
  2. Has anyone any suggestions as to how I might debug the JVM itself (as opposed to my app)? How do I prove this is a VM issue?
  3. Are there any VM-specialist forums out there where I can ask the VM's authors (assuming they aren't on SO)? (We have no support contract)
  4. If this is a bug in the latest versions of the VM, how come no-one else has noticed it?
Dorathydorca answered 19/2, 2010 at 16:34 Comment(3)
Presuming you can reproduce the problem: (1) create the smallest test case that causes the failure; (2) run the test case under another JVM (openjdk.java.net); (3) send the test case to Sun/Oracle. The JVM should never segfault. – Heins
Yes - I can imagine it taking ages to reproduce. I mean they must have tests for this stuff, right? – Dorathydorca
I've got a similar issue: it also only happens when a lot of data is fed to the app, but then it always triggers a SIGSEGV and I don't know what to do with it, so I created a new question: #2299750 My workaround is to use a 1.5 JVM meanwhile. – Heindrick

Interesting problem. It sounds like one of the garbage collectors works poorly in your particular situation.

Have you tried changing the garbage collector being used? There are a LOT of GC options, and figuring out which ones are optimal seems to be a bit of a black art, but I wonder if a basic change would work for you.

I know there is a "Server" GC that tends to work a lot better than the default ones. Are you using that?

Threaded GC (which I believe is the default) is probably the worst for your particular situation; I've noticed that it tends to be much less aggressive when the machine is busy.

One thing I've noticed is that it often takes two GCs to convince Java to actually take out the trash. I think the first one tends to unlink a bunch of objects and the second actually deletes them. What you might want to do is occasionally force two garbage collections. This WILL cause a significant GC pause, but I've never seen a case where it took more than two to clean out the entire heap.
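
In code, the idea is something like this (a rough sketch: the threshold and throttle values are invented, and System.gc() is only a hint that the VM is free to ignore):

    // Rough sketch: watch heap usage and force two back-to-back
    // collections when it gets high, throttled so it can't fire too often.
    public class GcNudger {
        private static final double THRESHOLD = 0.75;       // invented: 3/4 full
        private static final long MIN_INTERVAL_MS = 60000;  // invented throttle
        private long lastForced = 0;

        public void maybeForceGc() {
            Runtime rt = Runtime.getRuntime();
            double used = (double) (rt.totalMemory() - rt.freeMemory()) / rt.maxMemory();
            long now = System.currentTimeMillis();
            if (used > THRESHOLD && now - lastForced > MIN_INTERVAL_MS) {
                System.gc(); // first pass: unlink/finalize
                System.gc(); // second pass: actually reclaim
                lastForced = now;
            }
        }
    }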

Deirdredeism answered 19/2, 2010 at 18:9 Comment(3)
Well, server is the default on a machine like this. The problem is that this is only happening on our production servers, so it's not like I can just tinker around on my PC (where everything works as normal). Also, previous messing with GC options did not leave me thinking that anything was any better than the default. It would be a massive task to start blindly going through options, but I tried what was recommended here for high throughput to no avail: java.sun.com/performance/reference/whitepapers/tuning.html – Dorathydorca
How about monitoring your memory usage and running System.gc() twice whenever it gets, say, 3/4 full? You'd have to also include a mechanism to ensure this doesn't happen too often, but if you only have a problem when your data is bursty, it may be a workable solution. You might also want to set your min memory to the same as your max so that it allocates it all at once instead of during the burst--that makes your 3/4 full measurement more reliable. – Deirdredeism
I tried setting Xms to be the same as Xmx and it didn't make any difference. To one of the apps (which is batch-based) I have added an explicit call to gc, and I'll see how that pans out early next week. – Dorathydorca
  1. Yes, I've observed this behavior before, and usually after countless hours of tweaking JVM parameters it starts working.
  2. Garbage collection, especially in multithreaded situations, is nondeterministic. Pinning down a bug in nondeterministic code can be a challenge. But you could try DTrace if you are using Solaris, and there are a lot of JVM options for peering into HotSpot (see the sketch after this list).
  3. Go on Scala IRC and see if Ismael Juma is hanging around (ijuma). He's helped me before, but I think real in-depth help requires paying for it.
  4. I think most people doing this kind of stuff accept that they either need to be JVM tuning experts, have one on staff, or hire a consultant. There are people who specialize in JVM tuning.
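
For the HotSpot visibility mentioned in point 2, this is roughly where I would start (flag names are from memory and yourapp.jar is a placeholder, so double-check against your exact JDK build):

    # verbose GC logging, so you can see what each collection reclaims
    java -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
         -Xloggc:/var/tmp/gc.log -jar yourapp.jar

    # on Solaris, the hotspot DTrace provider can show GC begin/end events
    dtrace -p <PID> -n 'hotspot$target:::gc-begin,hotspot$target:::gc-end
        { printf("%s\n", probename); }'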

In order to solve these problems, I think you need to be able to replicate them in a controlled environment where you can precisely duplicate runs with different tuning parameters and/or code changes. If you can't do that, hiring an expert probably isn't going to do you any good, and the cheapest way out of the problem is probably buying more RAM.

Daffie answered 19/2, 2010 at 16:34 Comment(0)

I have had the same issue on Solaris machines, and I solved it by decreasing the maximum size of the JVM heap. The 32-bit Solaris implementation apparently needs some headroom beyond what you allocate for the heap when doing garbage collections. So, for example, with -Xmx3580M I'd get the errors you describe, but with -Xmx3072M it would be fine.
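
My reasoning, with back-of-the-envelope numbers (the non-heap figures are assumptions, not measurements): a 32-bit process has roughly 4Gb of address space in total, and the Java heap is only one consumer of it:

    3580M heap + ~256M permgen + thread stacks + native/code  ->  ~4Gb   (no headroom)
    3072M heap + ~256M permgen + thread stacks + native/code  ->  ~3.5Gb (room to spare)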

Humanoid answered 19/2, 2010 at 17:11 Comment(3)
But these apps are really not very big - usually 256Mb - and the machines they are on are beasts (24Gb of RAM) and currently under-utilized. I see no reason why Solaris would have problems finding any extra memory for housekeeping! – Dorathydorca
Maybe it's proportional to data throughput and/or GC load, and yours is just that much higher than mine? What did you set the maximum heap size to? – Humanoid
I eventually ramped it up to 1Gb in desperation and watched as the app happily started, never going above 256Mb! But this was not deterministic (it didn't work first time) - it failed, it failed, it failed, it failed, it worked! – Dorathydorca

What kind of OutOfMemoryError are you getting? Is the heap space exhausted, or is the problem related to one of the other memory pools? (The Error usually has a message giving more details on its cause.)

If the heap is exhausted and the problem can be reproduced (it sounds as if it can), I would first of all configure the VM to produce a heap dump on OutOfMemoryError. You can then analyze the heap and make sure that it's not filled with objects that are still reachable through some unexpected references.
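
The flags I have in mind (the dump path and jar name here are just examples):

    java -XX:+HeapDumpOnOutOfMemoryError \
         -XX:HeapDumpPath=/var/tmp/dumps \
         -jar yourapp.jar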

It's of course not impossible that you are running into a VM bug, but if your application is relying on implementation-specific behaviour in 1.6.0_03, it may for some reason or another end up as a memory hog when running on 1.6.0_16. Such problems may also be found if you are using some kind of server container for your application. Some developers are obviously unable to read documentation and tend to observe the API behaviour and draw their own conclusions about how something is supposed to work. This is of course not always correct, and I've run into similar problems both with Tomcat and with JBoss (both products at least used to work only with specific VMs).

Coleman answered 19/2, 2010 at 17:19 Comment(6)
jhat doesn't seem to be capable of analyzing any heaps >= 256Mb in size, unfortunately, because it goes out of memory itself! It's a mixture of "heap exhausted" and "GC overhead limit exceeded" errors, and it's not running in a container (other than Spring). – Dorathydorca
I've been told to try YourKit but I'm reluctant to spend time on this approach. After all, if the app runs on 1.6.0_03/linux but not on 1.6.0_18/solaris then the issue is surely with the VM - how will profiling my heap help? – Dorathydorca
I use the Eclipse Memory Analyzer (eclipse.org/mat), perhaps you want to take a look at it? I already explained in my answer why your problem is not necessarily caused by a VM bug and why I think you should take a closer look at the heap dump. – Coleman
@Jarnbjo - I'm not sure you explained anything of the sort: an application runs fine for 3 years and then starts falling over when migrated to a new VM, and this is a memory leak? – Dorathydorca
Short summary: if your application depends on VM implementation behaviour (instead of documented behaviour), the problem you are seeing may not be a VM bug, but a bug in your application. – Coleman
What does "depends on JVM implementation behaviour" mean in this case? I expect that if I have garbage, the VM will collect it (which is documented behaviour any VM should have). I suppose I do depend on the virtual machine not having a bug in it, but that is hardly an unreasonable dependency. – Dorathydorca

Also make sure it's not a hardware fault (try running MemTest86 or similar on the server).

Boodle answered 19/2, 2010 at 18:33 Comment(0)

Which kind of SIGSEGV errors exactly do you encounter?

If you run a 32bit VM, it could be what I described here: http://janvanbesien.blogspot.com/2009/08/mysterious-jvm-crashes-explained.html

Preterhuman answered 19/2, 2010 at 19:21 Comment(0)
