I am having a really weird issue with a Java application.
Essentially it is a web page that uses magnolia (a cms system), there are 4 instances available on production environment. Sometimes the CPU goes to 100% in a java process.
So, first approach was to make a thread dump, and check the offending thread, what I found was weird:
"GC task thread#0 (ParallelGC)" prio=10 tid=0x000000000ce37800 nid=0x7dcb runnable
"GC task thread#1 (ParallelGC)" prio=10 tid=0x000000000ce39000 nid=0x7dcc runnable
Ok, that is pretty weird, I have never had a problem with the garbage collector like that, so the next thing we did was to activate JMX and using jvisualvm inspect the machine: the heap memory usage was really high (95%).
Naive approach: Increase memory, so the problem takes more time to appear, result, on the restarted server with increased memory (6 GB!) the problem appeared 20 hours after restart while on other servers with less memory (4GB!) that had been running for 10 days, the problem took still a few more days to reappear. Also, I tried to use the apache access log from the server failing and use JMeter to replay the requests into a local server in an attemp to reproduce the error... it did not work either.
Then I investigated the logs a little bit more to find this errors
info.magnolia.module.data.importer.ImportException: Error while importing with handler [brightcoveplaylist]:GC overhead limit exceeded
at info.magnolia.module.data.importer.ImportHandler.execute(ImportHandler.java:464)
at info.magnolia.module.data.commands.ImportCommand.execute(ImportCommand.java:83)
at info.magnolia.commands.MgnlCommand.executePooledOrSynchronized(MgnlCommand.java:174)
at info.magnolia.commands.MgnlCommand.execute(MgnlCommand.java:161)
at info.magnolia.module.scheduler.CommandJob.execute(CommandJob.java:91)
at org.quartz.core.JobRunShell.run(JobRunShell.java:216)
at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:549)
Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
Another example
Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.util.Arrays.copyOf(Arrays.java:2894)
at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:117)
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:407)
at java.lang.StringBuilder.append(StringBuilder.java:136)
at java.lang.StackTraceElement.toString(StackTraceElement.java:175)
at java.lang.String.valueOf(String.java:2838)
at java.lang.StringBuilder.append(StringBuilder.java:132)
at java.lang.Throwable.printStackTrace(Throwable.java:529)
at org.apache.log4j.DefaultThrowableRenderer.render(DefaultThrowableRenderer.java:60)
at org.apache.log4j.spi.ThrowableInformation.getThrowableStrRep(ThrowableInformation.java:87)
at org.apache.log4j.spi.LoggingEvent.getThrowableStrRep(LoggingEvent.java:413)
at org.apache.log4j.AsyncAppender.append(AsyncAppender.java:162)
at org.apache.log4j.AppenderSkeleton.doAppend(AppenderSkeleton.java:251)
at org.apache.log4j.helpers.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:66)
at org.apache.log4j.Category.callAppenders(Category.java:206)
at org.apache.log4j.Category.forcedLog(Category.java:391)
at org.apache.log4j.Category.log(Category.java:856)
at org.slf4j.impl.Log4jLoggerAdapter.error(Log4jLoggerAdapter.java:576)
at info.magnolia.module.templatingkit.functions.STKTemplatingFunctions.getReferencedContent(STKTemplatingFunctions.java:417)
at info.magnolia.module.templatingkit.templates.components.InternalLinkModel.getLinkNode(InternalLinkModel.java:90)
at info.magnolia.module.templatingkit.templates.components.InternalLinkModel.getLink(InternalLinkModel.java:66)
at sun.reflect.GeneratedMethodAccessor174.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:622)
at freemarker.ext.beans.BeansWrapper.invokeMethod(BeansWrapper.java:866)
at freemarker.ext.beans.BeanModel.invokeThroughDescriptor(BeanModel.java:277)
at freemarker.ext.beans.BeanModel.get(BeanModel.java:184)
at freemarker.core.Dot._getAsTemplateModel(Dot.java:76)
at freemarker.core.Expression.getAsTemplateModel(Expression.java:89)
at freemarker.core.BuiltIn$existsBI._getAsTemplateModel(BuiltIn.java:709)
at freemarker.core.BuiltIn$existsBI.isTrue(BuiltIn.java:720)
at freemarker.core.OrExpression.isTrue(OrExpression.java:68)
Then I find out that such problem is due to the garbage collector using a ton of CPU but not able to free much memory
Ok, so it is a problem with the MEMORY that manifests itself in the CPU, so If the memory usage problem is solved, then the CPU should be fine, so I took a heapdump, unfortunatelly it was just too big to open it (the file was 10GB), anyway I run the server locallym loaded it a little bit and took a heapdump, after opening it, I found something interesting:
There are a TON of instances of
AbstractReferenceMap$WeakRef ==> Takes 21.6% of the memory, 9 million instances
AbstractReferenceMap$ReferenceEntry ==> Takes 9.6% of the memory, 3 million instances
In addition, I have found a Map which seems to be used as a "cache" (horrible but true), the problem is that such map is NOT synchronized and it is shared among threads (being static), the problem could be not only concurrent writes but also the fact that with lack of synchronization, there is no guarantee that thread A will see the changes done to the map by thread B, however, I am unable to figure out how to link this suspicious map using the memory eclipse analyzer, as it does not use the AbstracReferenceMap, it is just a normal HashMap.
Unfortunately, we do not use those classes directly (obviously the code uses them, but not directly), so I have seem to hit a dead end.
Problems for me are
- I cannot reproduce the error
- I cannot figure out where the hell the memory is leaking (if that is the case)
Any ideas at all?