Java ConcurrentMarkSweep garbage collector not removing all garbage

Short form: The CMS garbage collector appears to be failing to collect an ever-increasing amount of garbage; eventually, our JVM fills up, and the application becomes unresponsive. Forcing a GC via an external tool (JConsole or jmap -histo:live) cleans it up once.

UPDATE: The problem appears to be related to the JTop plugin for JConsole; if we don't run JConsole, or run it without the JTop plugin, the behavior goes away.

(Technical notes: we're running Sun JDK 1.6.0_07, 32-bit, on a Linux 2.6.9 box. Upgrading the JDK version is not really an option, unless there's an unavoidable, major reason. Also, our system is not hooked up to an Internet-accessible machine, so screenshots of JConsole, etc., aren't an option.)

We're currently running our JVM with the following flags:

-server -Xms3072m -Xmx3072m -XX:NewSize=512m -XX:MaxNewSize=512m 
-XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+CMSParallelRemarkEnabled 
-XX:CMSInitiatingOccupancyFraction=70 
-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps 
-XX:+DisableExplicitGC

Observing the memory graph in JConsole, there's a full GC that runs every ~15 minutes or so during the first several hours of our application's lifespan; after each full GC, there's more and more memory still in use. After a few hours, the system hits a steady state where there's approximately 2GB of used memory in the CMS old gen.

Which sounds like a classic memory leak, except that if we use any tool that forces a full GC (hitting the "collect garbage" button in JConsole, or running jmap -histo:live, etc), the old gen suddenly drops to ~500MB used, and our application becomes responsive again for the next several hours (during which time the same pattern continues - after each full GC, more and more of the old gen is full.)

One thing of note: in JConsole, the reported ConcurrentMarkSweep GC count will stay at 0 until we force a GC with jconsole/jmap/etc.
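
The zero collection count can also be cross-checked in-process via the platform MXBeans, without JConsole. A minimal sketch (collector names vary by JVM; under the flags above they are typically "ParNew" and "ConcurrentMarkSweep"):

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcCounts {
    public static void main(String[] args) {
        // Print each collector's cumulative collection count and time.
        // A CMS old-gen collector that never runs will show count=0 here,
        // matching what JConsole reports.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(gc.getName() + ": count=" + gc.getCollectionCount()
                    + " time=" + gc.getCollectionTime() + "ms");
        }
    }
}
```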

Using jmap -histo and jmap -histo:live in sequence, I am able to determine that the apparently uncollected objects consist of:

  • several million HashMaps and arrays of HashMap$Entry (in a 1:1 ratio)
  • several million Vectors and Object arrays (1:1 ratio, and about the same as the number of HashMaps)
  • several million HashSet, Hashtable, and com.sun.jmx.remote.util.OrderClassLoaders, as well as arrays of Hashtable$Entry (about the same number of each; about half as many as the HashMaps and Vectors)

There are some excerpts from the GC output below; my interpretation of them appears to be that the CMS GC is getting aborted without failing over to the stop-the-world GC. Am I misinterpreting this output somehow? Is there something that would cause that?

During the normal runtime, the CMS GC output blocks look about like this:

36301.827: [GC [1 CMS-initial-mark: 1856321K(2621330K)] 1879456K(3093312K), 1.7634200 secs] [Times: user=0.17 sys=0.00, real=0.18 secs]
36303.638: [CMS-concurrent-mark-start]
36314.903: [CMS-concurrent-mark: 7.804/11.264 secs] [Times: user=2.13 sys=0.06, real=1.13 secs]
36314.903: [CMS-concurrent-preclean-start]
36314.963: [CMS-concurrent-preclean: 0.037/0.060 secs] [Times: user=0.01 sys=0.00, real=0.01 secs]
36314.963: [CMS-concurrent-abortable-preclean-start]
36315.195: [GC 36315.195: [ParNew: 428092K->40832K(471872K), 1.1705760 secs] 2284414K->1897153K(3093312K), 1.1710560 secs] [Times: user=0.13 sys=0.02, real=0.12 secs]
CMS: abort preclean due to time 36320.059: [CMS-concurrent-abortable-preclean: 0.844/5.095 secs] [Times: user=0.74 sys=0.05, real=0.51 secs]
36320.062: [GC[YG occupancy: 146166 K (471872 K)]36320.062: [Rescan (parallel), 1.54078550 secs]36321.603: [weak refs processing, 0.0042640 secs] [1 CMS-remark: 1856321K(2621440K)] 2002488K(3093312K), 1.5456150 secs] [Times: user=0.18 sys=0.03, real=0.15 secs]
36321.608: [CMS-concurrent-sweep-start]
36324.650: [CMS-concurrent-sweep: 2.686/3.042 secs] [Times: user=0.66 sys=0.02, real=0.30 secs]
36324.651: [CMS-concurrent-reset-start]
36324.700: [CMS-concurrent-reset: 0.050/0.050 secs] [Times: user=0.01 sys=0.00, real=0.01 secs]

and that's it; the next line will be the next ParNew GC.

When we force a GC using jmap -histo:live, we instead get:

48004.088: [CMS-concurrent-mark: 8.012/8.647 secs] [Times: user=1.15 sys=0.02, real=0.87 secs]
(concurrent mode interrupted)

followed by ~125 lines of the form below: (some GeneratedMethodAccessor, some GeneratedSerializationConstructorAccessor, some GeneratedConstructorAccessor, etc)

[Unloading class sun.reflect.GeneratedMethodAccessor3]

followed by:

: 1911295K->562232K(2621440K), 15.6886180 secs] 2366440K->562232K(3093312K), [CMS Perm: 52729K->51864K(65536K)], 15.6892270 secs] [Times: user=1.55 sys=0.01, real=1.57 secs]

Thanks in advance!

Selfcontent answered 6/10, 2010 at 14:36 Comment(5)
have you at least tried the latest JVM to see if the problem is fixed? it might be a good data point.Amathiste
I don't know the CMS collector well enough to offer good advice, but one thing that jumps out at me is that the New generation does not seem to reduce in size. Normally, I would expect a full collection to move objects from New to Tenured.Rectangle
@SB I will try a newer JVM - as you say, a good data point. @Rectangle - isn't that backwards? I thought new->tenured happened on a young GC.Selfcontent
There's some wording in the Sun whitepaper (oracle.com/technetwork/java/javase/tech/…) that would seem to indicate promotion happens in a full collection: "Objects that survive some number of young generation collections are eventually promoted, or tenured, to the old generation." But regardless, I would expect any GC to collect the young generation, particularly if it's almost full.Rectangle
instead of using jmap to ask the jvm for a histogram, it might be less disruptive to ask for a heap dump. There are some great offline hprof analysis tools (eclipse mat and yourkit) that can give you a lot more info. The other benefit of this approach is you don't pollute the heap you're studying (I dunno if -histo does or not, but I'm just saying).Unstick
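
On a Sun/Oracle JVM, the heap dump the comment suggests can also be triggered from inside the process via the HotSpot diagnostic MXBean (a sketch; the output file name is arbitrary):

```java
import java.lang.management.ManagementFactory;
import java.nio.file.Files;
import java.nio.file.Paths;
import javax.management.MBeanServer;
import com.sun.management.HotSpotDiagnosticMXBean;

public class HeapDump {
    public static void main(String[] args) throws Exception {
        // dumpHeap refuses to overwrite, so clear any stale dump first.
        Files.deleteIfExists(Paths.get("heap.hprof"));

        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        HotSpotDiagnosticMXBean bean = ManagementFactory.newPlatformMXBeanProxy(
                server, "com.sun.management:type=HotSpotDiagnostic",
                HotSpotDiagnosticMXBean.class);

        // live=true dumps only reachable objects, which forces a full GC
        // first -- the same side effect as jmap -histo:live.
        bean.dumpHeap("heap.hprof", true);
        System.out.println("dump written");
    }
}
```

The resulting .hprof file can then be loaded offline into Eclipse MAT or YourKit, as the comment describes.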

com.sun.jmx.remote.util.OrderClassLoader is used in the remoting layer for JMX; a quick review of the code suggests these loaders are created as part of the unmarshalling process for remote requests inside the JVM. The lifetime of each classloader is tied directly to the lifetime of the thing that was unmarshalled: once there are no longer any references to that thing, the classloader can be released.

I wouldn't be surprised if in this case the presence of these instances was a direct result of you using JConsole to examine the goings on in the JVM. And it looks like they'd just be cleaned up by GC as part of normal operation.

I suppose it's possible there's a bug in the JMX implementation (which seems unlikely in a relatively up-to-date JVM), or perhaps you have some custom MBeans or custom JMX tools that are causing the problem. But ultimately, I suspect the OrderClassLoader is a red herring and the issue lies elsewhere (broken GC or some other leak).

Faldstool answered 6/10, 2010 at 18:1 Comment(2)
You appear to be right - the objects and behavior in question disappear entirely if we don't use JConsole, or if we don't use the JTop plugin for JConsole. I still don't know why they weren't getting cleaned up correctly, but at least the system is working again.Selfcontent
1.7.0_05 still has the same bugWow

Technical notes: we're running Sun JDK 1.6.0_07, 32-bit, on a Linux 2.6.9 box. Upgrading the JDK version is not really an option, unless there's an unavoidable, major reason.

Several newer Java versions have had updates to the CMS garbage collector. Notably 6u12, 6u14, and 6u18.

I'm not an expert on GC internals, but I'd guess the preclean fixes in 6u14 may address the issue you're seeing; the same could be said of 6u18's class-unloading fixes.

There are fixes for:

  • 6u10: (affects 6u4+) CMS never clears referents when -XX:+ParallelRefProcEnabled
  • 6u12: CMS: Incorrect encoding of overflown object arrays during concurrent precleaning
  • 6u12: CMS: Incorrect overflow handling when using parallel concurrent marking
  • 6u14: CMS: assertion failure "is_cms_thread == Thread::current()->is_ConcurrentGC_thread()"
  • 6u14: CMS: Need CMSInitiatingPermOccupancyFraction for perm, divorcing from CMSInitiatingOccupancyFraction
  • 6u14: CMS assert: _concurrent_iteration_safe_limit update missed
  • 6u14: CMS: Incorrect overflow handling during precleaning of Reference lists
  • 6u14: SIGSEGV or (!is_null(v),"oop value can never be zero") assertion when running with CMS and COOPs
  • 6u14: CMS: Livelock in CompactibleFreeListSpace::block_size().
  • 6u14: Make CMS work with compressed oops
  • 6u18: CMS: core dump with -XX:+UseCompressedOops
  • 6u18: CMS: bugs related to class unloading
  • 6u18: CMS: ReduceInitialCardMarks unsafe in the presence of cms precleaning
  • 6u18: [Regression] -XX:NewRatio with -XX:+UseConcMarkSweepGC causes fatal error
  • 6u20: card marks can be deferred too long

In addition to all of the above, 6u14 also introduced the G1 garbage collector, although it is still in testing. G1 is intended to replace CMS in Java 7.

G1 can be used in Java 6u14 and newer with the following command-line switches:

-XX:+UnlockExperimentalVMOptions -XX:+UseG1GC

Glossa answered 6/10, 2010 at 15:17 Comment(0)

I would start with something far simpler, like:

-server -Xms3072m -Xmx3072m -XX:+UseParallelOldGC -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps 

And see if this meets your needs.

Unstick answered 7/10, 2010 at 3:44 Comment(0)

It looks like you are building objects which point back to their owners (A points to B, which points to A). This keeps the reference counts above zero, so the garbage collector can't clean them up. You need to break the cycle when you release the objects; nulling the reference in either A or B will solve the problem. This works even in longer reference loops (A -> B -> C -> D -> A). The Vectors and Object arrays may be used internally by your HashMaps.

The presence of the remote loaders may indicate a failure to clean up and close references to objects loaded via JNDI or some other remote-access mechanism.

EDIT: I took a second look at your last line. You may want to increase the perm allocation.

Cesta answered 6/10, 2010 at 14:52 Comment(2)
You understand that the Java garbage collectors don't use reference counting, right?Rectangle
As an addition to Anon's notes, concurrent mark and sweep's simple explanation is that it "marks all objects reachable from the roots" as live and collects the objects it considers dead. Granted, it only does this for the tenured generation; Java garbage collection has two generations: young and tenured.Glossa
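
As the comments note, HotSpot's collectors trace reachability from GC roots rather than counting references, so an unreachable cycle is collected. A small standalone demonstration (the retry loop is needed because System.gc() is only a hint):

```java
import java.lang.ref.WeakReference;

public class CycleCollection {
    static class Node { Node next; }

    public static void main(String[] args) throws InterruptedException {
        Node a = new Node();
        Node b = new Node();
        a.next = b;
        b.next = a;                          // a and b now form a reference cycle
        WeakReference<Node> probe = new WeakReference<>(a);
        a = null;
        b = null;                            // the cycle is no longer reachable from any root
        // Request collections until the weak probe clears.
        for (int i = 0; i < 10 && probe.get() != null; i++) {
            System.gc();
            Thread.sleep(50);
        }
        // A tracing collector reclaims the whole cycle; pure reference
        // counting never could, since each node still has count 1.
        System.out.println(probe.get() == null ? "cycle collected" : "cycle still live");
    }
}
```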
