Why concurrent mark and sweep (CMS) is not cleaning up the same amount of memory as Full GC?

Asked 23/12, 2015 at 8:15 Answered 1/2, 2016 at 6:56

java garbage-collection heap-memory concurrent-mark-sweep

I have a strange issue with one of my production machines. It hosts a Java application that does CMS (concurrent mark and sweep), but it cleans up just a small part of the old generation. I suspected a memory leak and tried a heap dump. But the Full GC preceding the heap dump cleans up almost all of the old generation. What's happening? I've never seen this behavior of Java garbage collection. Normally CMS and Full GC should collect about the same amount of garbage, now CMS keeps about 10GB more.

Java 1.7.0_75
Linux Cent OS 7

GC logs:

**2016-01-04T07:37:40.196+0000: 431200.698: [GC [1 CMS-initial-mark: 21633423K(27336704K)] 22826703K(30101504K), 4.3910840 secs] [Times: user=4.40 sys=0.01, real=4.39 secs] 
2016-01-04T07:37:44.588+0000: 431205.090: [CMS-concurrent-mark-start]
2016-01-04T07:38:08.718+0000: 431229.220: [CMS-concurrent-mark: 18.213/24.131 secs] [Times: user=126.00 sys=2.22, real=24.13 secs] 
2016-01-04T07:38:08.718+0000: 431229.220: [CMS-concurrent-preclean-start]
2016-01-04T07:38:08.843+0000: 431229.345: [CMS-concurrent-preclean: 0.118/0.125 secs] [Times: user=0.29 sys=0.00, real=0.12 secs] 
2016-01-04T07:38:08.843+0000: 431229.345: [CMS-concurrent-abortable-preclean-start]
 CMS: abort preclean due to time 2016-01-04T07:38:13.906+0000: 431234.408: [CMS-concurrent-abortable-preclean: 4.975/5.063 secs] [Times: user=10.18 sys=0.03, real=5.06 secs] 
2016-01-04T07:38:14.001+0000: 431234.503: [GC[YG occupancy: 1312993 K (2764800 K)]2016-01-04T07:38:14.001+0000: 431234.503: [Rescan (parallel) , 4.6981760 secs]2016-01-04T07:38:18.699+0000: 431239.202: [weak refs processing, 0.0002970 secs]2016-01-04T07:38:18.700+0000: 431239.202: [scrub string table, 0.0006900 secs] [1 CMS-remark: 21633423K(27336704K)] 22946417K(30101504K), 4.6993310 secs] [Times: user=105.40 sys=0.36, real=4.70 secs] 
2016-01-04T07:38:18.701+0000: 431239.203: [CMS-concurrent-sweep-start]
2016-01-04T07:38:27.967+0000: 431248.469: [CMS-concurrent-sweep: 9.160/9.267 secs] [Times: user=17.91 sys=0.10, real=9.26 secs] 
2016-01-04T07:38:27.968+0000: 431248.470: [CMS-concurrent-reset-start]
2016-01-04T07:38:28.028+0000: 431248.531: [CMS-concurrent-reset: 0.061/0.061 secs] [Times: user=0.14 sys=0.00, real=0.06 secs]** 
2016-01-04T07:38:30.801+0000: 431251.303: [GC [1 CMS-initial-mark: 21633105K(27336704K)] 23039228K(30101504K), 5.6079370 secs] [Times: user=5.60 sys=0.01, real=5.61 secs] 
2016-01-04T07:38:36.409+0000: 431256.911: [CMS-concurrent-mark-start]
2016-01-04T07:38:54.673+0000: 431275.175: [CMS-concurrent-mark: 17.807/18.264 secs] [Times: user=119.97 sys=1.66, real=18.26 secs] 
2016-01-04T07:38:54.673+0000: 431275.175: [CMS-concurrent-preclean-start]
2016-01-04T07:38:54.799+0000: 431275.301: [CMS-concurrent-preclean: 0.119/0.126 secs] [Times: user=0.25 sys=0.00, real=0.13 secs] 
2016-01-04T07:38:54.799+0000: 431275.301: [CMS-concurrent-abortable-preclean-start]
 CMS: abort preclean due to time 2016-01-04T07:39:01.382+0000: 431281.884: [CMS-concurrent-abortable-preclean: 6.478/6.583 secs] [Times: user=12.23 sys=0.06, real=6.58 secs] 
2016-01-04T07:39:01.487+0000: 431281.989: [GC[YG occupancy: 1596183 K (2764800 K)]2016-01-04T07:39:01.487+0000: 431281.989: [Rescan (parallel) , 3.5737630 secs]2016-01-04T07:39:05.061+0000: 431285.563: [weak refs processing, 0.0002690 secs]2016-01-04T07:39:05.061+0000: 431285.563: [scrub string table, 0.0005740 secs] [1 CMS-remark: 21633105K(27336704K)] 23229288K(30101504K), 3.5747910 secs] [Times: user=80.26 sys=0.27, real=3.58 secs] 
2016-01-04T07:39:05.062+0000: 431285.564: [CMS-concurrent-sweep-start]
2016-01-04T07:39:21.895+0000: 431302.397: [CMS-concurrent-sweep: 9.449/16.834 secs] [Times: user=27.52 sys=0.16, real=16.83 secs] 
2016-01-04T07:39:21.895+0000: 431302.397: [CMS-concurrent-reset-start]
2016-01-04T07:39:21.995+0000: 431302.497: [CMS-concurrent-reset: 0.099/0.099 secs] [Times: user=0.23 sys=0.01, real=0.10 secs] 
2016-01-04T07:39:24.104+0000: 431304.606: [GC [1 CMS-initial-mark: **21631742K(27336704K)**] 23455261K(30101504K), 5.6592940 secs] [Times: user=5.67 sys=0.00, real=5.66 secs] 
2016-01-04T07:39:29.764+0000: 431310.266: [CMS-concurrent-mark-start]
2016-01-04T07:39:56.859+0000: 431337.361: [CMS-concurrent-mark: 19.813/27.096 secs] [Times: user=140.17 sys=2.62, real=27.10 secs] 
2016-01-04T07:39:56.859+0000: 431337.361: [CMS-concurrent-preclean-start]
2016-01-04T07:39:57.102+0000: 431337.604: [CMS-concurrent-preclean: 0.152/0.242 secs] [Times: user=0.35 sys=0.00, real=0.24 secs] 
2016-01-04T07:39:57.102+0000: 431337.604: [CMS-concurrent-abortable-preclean-start]
 CMS: abort preclean due to time 2016-01-04T07:40:03.016+0000: 431343.518: [CMS-concurrent-abortable-preclean: 5.376/5.914 secs] [Times: user=11.65 sys=0.05, real=5.91 secs] 
2016-01-04T07:40:03.016+0000: 431343.518: [GC[YG occupancy: 2045175 K (2764800 K)]2016-01-04T07:40:03.016+0000: 431343.518: [Rescan (parallel) , 3.4843400 secs]2016-01-04T07:40:06.501+0000: 431347.003: [weak refs processing, 0.0002510 secs]2016-01-04T07:40:06.501+0000: 431347.003: [scrub string table, 0.0006220 secs] [1 CMS-remark: **21631742K(27336704K)**] 23676918K(30101504K), 3.4853760 secs] [Times: user=78.31 sys=0.27, real=3.49 secs] 
2016-01-04T07:40:06.502+0000: 431347.004: [CMS-concurrent-sweep-start]
{Heap before GC invocations=1832 (full 5077):
 par new generation   total 2764800K, used 2166647K [0x00000000bae00000, 0x0000000176600000, 0x0000000176600000)
  eden space 2457600K,  77% used [0x00000000bae00000, 0x000000012e6a4c90, 0x0000000150e00000)
  from space 307200K,  89% used [0x0000000150e00000, 0x0000000161939318, 0x0000000163a00000)
  to   space 307200K,   0% used [0x0000000163a00000, 0x0000000163a00000, 0x0000000176600000)
 concurrent mark-sweep generation total 27336704K, used 21631644K [0x0000000176600000, 0x00000007fae00000, 0x00000007fae00000)
 concurrent-mark-sweep perm gen total 49956K, used 30116K [0x00000007fae00000, 0x00000007fdec9000, 0x0000000800000000)
**2016-01-04T07:40:12.775+0000: 431353.277: [Full GC2016-01-04T07:40:12.775+0000: 431353.277: [CMS2016-01-04T07:40:17.924+0000: 431358.426: [CMS-concurrent-sweep: 9.211/11.422 secs] [Times: user=16.59 sys=0.15, real=11.42 secs] 
 (concurrent mode interrupted): 21631644K->4907878K(27336704K), 39.2467600 secs] 23798292K->4907878K(30101504K), [CMS Perm : 30116K->28023K(49956K)], 39.2468730 secs] [Times: user=39.24 sys=0.05, real=39.25 secs]**

The same application is running normally on another machine using Cent OS 5, java 7.

java -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=1581
-Dcom.sun.management.jmxremote.authenticate=false
-Dcom.sun.management.jmxremote.ssl=false -verbose:gc -XX:+PrintGCDetails
-XX:+UseCompressedOops -XX:+PrintGCDateStamps -XX:+PrintHeapAtGC
-XX:+PrintTenuringDistribution -XX:+UseConcMarkSweepGC
-Xloggc:/usr/local/app/logs/current-gc.log -Xms29g -Xmx29g -XX:NewSize=3000m
-XX:SurvivorRatio=8 -XX:CMSInitiatingOccupancyFraction=70
-XX:+UseCMSInitiatingOccupancyOnly
-Dsun.rmi.dgc.client.gcInterval=0x7FFFFFFFFFFFFFFE
-Dsun.rmi.dgc.server.gcInterval=0x7FFFFFFFFFFFFFFE
-cp /usr/local/app/conf:/usr/local/app/app.jar:/usr/local/app/lib/* -Xdebug
-Xrunjdwp:transport=dt_socket,address=8099,server=y,suspend=n
-Dvisualvm.display.name=App -XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/usr/local/app/logs/
-XX:ErrorFile=/usr/local/app/logs/hs_err_pid%p.log
-Djgroups.tcpgossip.initial_host=localhost
-Dlog4j.hostname=host7.company.com com.company.app.service.App

UPDATE: The problem is still not solved. I've tried everything: update OS packages and kernel, update Java to latest version Java 1.7.0_80, rollback the application version, but without success.

I've also verified previous GC logs and found that this issue was not forever. It started about a month ago, after a deployment.

Chinkiang answered 23/12, 2015 at 8:15 Comment(14)

@ravindra The question is not related to G1. – Chinkiang 23/12, 2015 at 9:57

Have a look at this one:#3874135 – Decembrist 23/12, 2015 at 10:2

I thought that you were looking for explanation on why this happening. Have you tried to ssh and profile it directly on the machine? Regarding updated question I agree with @ravindra with his link that it could be a bug with jmx. – Speak 23/12, 2015 at 10:10

I have tried many combinations in CMS and finally moved to G1gc. For 29gb memory, I can't see any other algorithm better than G1GC. – Decembrist 23/12, 2015 at 10:10

@ravindra Interesting, but there was no remote jconsole connected to that jvm and there are no "Unloading ..." messages in the gc logs. – Chinkiang 23/12, 2015 at 10:11

@ravindra yes. But it is still very young algorithm and in JDK8 they were not sure that it should be default one. But anyway in practice for large heaps especially it seems to be the best one. – Speak 23/12, 2015 at 10:12

I have found one more interesting one :googleweblight.com/?lite_url=http://stackoverflow.com/questions/… – Decembrist 23/12, 2015 at 10:19

Full GC starts when the tenured space is full, or the suvivor space is exhaused (e.g. there are too many object copied from the eden space) or the CMS desices now is a good tile to try and do a concurrent cleanup. – Decembrist 23/12, 2015 at 10:20

Take a look also at Minor GV vs Major GC vs Full GC post – Speak 23/12, 2015 at 10:26

is CMS class unloading enabled? i don't recall whether that was available/enabled by default in 7. – Meteorology 23/12, 2015 at 12:29

1.7.0_80 is not the latest java version – Meteorology 27/1, 2016 at 11:20

Is there any way you could upload more logs. I'm not advocating for pasting it into the question, itself, but this Q/A on meta (meta.stackexchange.com/questions/15821/…) gives some decent suggestions for what I'm talking about. – Carousal 29/1, 2016 at 13:47

@Chinkiang Apart from the differing versions of the OS, are the environments on both hosts (and the hosts themselves) equivalent? – Carousal 30/1, 2016 at 2:34

The hardware is also different. They are 2 physical, not virtual machines. The one with the issue is newer. But no other differences. – Chinkiang 30/1, 2016 at 9:28

CMS does not unload classes by default in JDK 7.
You have to specify -XX:+CMSClassUnloadingEnabled, or better switch to JDK 8.

BTW, if you run jmap without live argument, it won't call Full GC before creating a heap dump.

Whereby answered 29/1, 2016 at 23:59 Comment(1)

Indeed you can take the dump without full gc and it seems it contains unreachable objects. Here is the log from Eclipse MAT [INFO] Removed 308,720,881 unreachable objects using 12,603,944,736 bytes. – Chinkiang 1/2, 2016 at 21:2

So there are a few possible causes to your woes and I'll state the likely quick fix first, and then state some best practices below that you should consider doing in the long term.

Likely quick fix: The "CMS: abort preclean due to time" entry of your logs suggest that your CMS collector isn't tuned properly to do it's job before the ultimate stop-the-world full GC phase. Basically what this log message means is that the preclean phase is timing out and further steps of the CMS collector aren't able to run. The default for the time to abort the preclean phase is 5 seconds, which appears to be the case from your logs. Consider increasing this time to something like 10 seconds, and observe what happens. You should be able to do this by adding the parameter -XX:CMSMaxAbortablePrecleanTime=10000. See Misamitsu's blog entry on the CMSMaxAbortablePrecleanTime tuning parameter. Also, based on what I can see from Stas' blog entry on the subject, the unit is milliseconds, not seconds.

Try these longer term things in the long term:

If you can, try the G1GC collector. You are using Java 7, so if you can in your current environment, give it a try. It's meant for performance with big heaps.
The fact that this happened as a result of a deployment and the rollback didn't work suggests to me that:
- Your entire app (such as JVM tuning parameters) aren't captured as part of your deployment, or...
- The cause of the performance issues are unrelated to your app. Perhaps the differences between your two production hosts and an unrelated app running on one host such as a different antivirus version is making life more difficult for your garbage collector. So again, if you can, deploy to exactly the same hardware and software.
Evaluate whether you really need such a large heap. Remember, the bigger the heap, the longer the garbage collector takes, irrespective of whether you are using a concurrent collector such as CMS.
You have a lot of tuning parameters and I would challenge that you need the majority of them. Consider running a benchmark experiment where you just obliterate the -XX:NewSize=3000m, -XX:SurvivorRatio=8, -XX:CMSInitiatingOccupancyFraction=70, -XX:+UseCMSInitiatingOccupancyOnly, -Dsun.rmi.dgc.client.gcInterval=0x7FFFFFFFFFFFFFFE, and -Dsun.rmi.dgc.server.gcInterval=0x7FFFFFFFFFFFFFFE parameters and see if your situation improves.

I hope that helps, please comment with follow up questions if needed.

Carousal answered 1/2, 2016 at 6:56 Comment(8)

Regarding the quick fix: the other machine that works normally displays in the logs the same 'CMS: abort preclean due to time'. But yes, it might be an issue under different circumstances. – Chinkiang 1/2, 2016 at 21:10

Regarding 1. CMS proved more predictable and we have no major issues with it until now. – Chinkiang 1/2, 2016 at 21:12

Regarding 2. Most likely the second option is the case because I had no issue on any other machine than this one. – Chinkiang 1/2, 2016 at 21:13

Regarding 3. Under present design, yes, the heap holds a large cache for fast access. And again, the gc times were always good. – Chinkiang 1/2, 2016 at 21:15

Regarding 4. These parameters are used on every deployment, they were tuned at some point and proved good. – Chinkiang 1/2, 2016 at 21:16

This answer show the best understanding of the problem and seems most helpful. I'll play a little with the -XX:CMSMaxAbortablePrecleanTime parameter to see if it makes any difference. – Chinkiang 1/2, 2016 at 21:19

Let me know if you have any further issues and/or questions. I'd be happy to help - irrespective of points. – Carousal 1/2, 2016 at 21:28

You write: "Basically what this log message means is that the preclean phase is timing out and further steps of the CMS collector aren't able to run." Do you have a reference for this claim? Everybody else says that "CMS: abort preclean due to time" is no big deal, preclean is designed to be aborted, and CMS will continue with the next phase. Which one is it then? I'm also seeing this problem, and it looks like CMS doesn't finish the cleanup (esp. weak references), old collection doesn't go down at all. – Sleek 28/4, 2017 at 11:57

There is the reason why this happening. It seems that when you are trying to do heap dump it is calling method dumpHeap(). For example VisualVM triggers full GC before making heap dump too.

From documentation:

void dumpHeap(String outputFile, boolean live) throws IOException

Dumps the heap to the outputFile file in the same format as the hprof heap dump. If this method is called remotely from another process, the heap dump output is written to a file named outputFile on the machine where the target VM is running. If outputFile is a relative path, it is relative to the working directory where the target VM was started.

Parameters:

outputFile - the system-dependent filename

live - if true dump only live objects i.e. objects that are reachable from others

Why?

When you are trying to find memory leak, you don't want to get objects which have no references (was not garbage-collected).

Recommendation

I'm seeing a pretty large heap, probably it is better to use G1.

The first focus of G1 is to provide a solution for users running applications that require large heaps with limited GC latency. This means heap sizes of around 6GB or larger, and stable and predictable pause time below 0.5 seconds.

Take a look at recommendation section also in documentation.

Speak answered 23/12, 2015 at 9:31 Comment(2)

I know this. What I'm asking here is why CMS and Full GC are not cleaning up the same amount of memory? Full GC is freeing 10GB more. – Chinkiang 23/12, 2015 at 9:50

I too agree with you on G1gc front. – Decembrist 23/12, 2015 at 9:54

I suspect that need of Major collection has not been come up in your application yet due to very large heap : Neither tenured Old Gen nor Survivor Space has been exhausted.

If you think otherwise, ( Major collection has not been triggered even though either of conditions have been met ), share you gclog statements.

On different note : since you are using large heap, G1GC has been preferred.

EDIT:

-XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly

By default CMS GC uses set of heuristic rules to trigger garbage collection. This makes GC less predictable and usually tends to delay collection until old generation is almost occupied.

-XX:+UseCMSInitiatingOccupancyOnly prevent usage of GC heuristics.

-XX:CMSInitiatingOccupancyFraction informs Java VM when CMS should be triggered.

You have configured it as 70%. If your OldGen [Max heap (29G) - new Gen (3G)] touches 70% of limit, GC will be triggered. As part of GC, first minor GC is triggered. If minor GC did not release enough space to allocate new objects at that juncture, then Major GC or Full GC is triggered.

Have a look at this article for more details.

I have found one more related SE question : CMS garbage collector - when does it run?. Have a look at this one too.

Decembrist answered 24/12, 2015 at 11:44 Comment(4)

From your logs, concurrent mark-sweep generation total 27336704K, used 21631644K implies 21G out of 27G has been used. Still 6G of memory is available. So FullGC is not required. – Decembrist 4/1, 2016 at 10:47

The idea is that the automatically triggered CMS is not cleaning up memory, but the manually invoked Full GC does. Why? – Chinkiang 4/1, 2016 at 10:52

I am assuming that FullGC log is generated when you manually invoked FullGC. Without manual intervention, CMS did not trigger FullGC at all. Correct me if I am wrong. – Decembrist 4/1, 2016 at 11:3

CMS did not trigger full gc since free memory of 6GB is already available. CMS algorithm has been optimised to reduce FullGC calls. – Decembrist 4/1, 2016 at 11:21

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Why?

Recommendation

Recommended topics

Hot tags