Understanding why AWS Elasticsearch GC time (young and old) keeps rising while memory pressure does not

I am trying to understand whether I have an issue with my AWS Elasticsearch garbage collection time, but all the memory-related issues I can find relate to memory pressure, which looks OK in my case.

So while I run a load test on the environment, I observe a constant rise in all GC collection time metrics, for example:

GC Young collection time

But when looking at memory pressure, I see that I am not passing the 75% mark (though I am getting close), which, according to the documentation, is what triggers a concurrent mark and sweep.

JVM Memory pressure

So I fear that once I add more load or run a longer test, I might start seeing real issues that will have an impact on my environment. So, do I have an issue here? How should I approach rising GC time when I can't take memory dumps and see what's going on?

Varuna answered 16/12, 2021 at 9:14 Comment(0)

I sent a query to AWS technical support and, counter to intuition, the values of the Young and Old Collection time and count in Elasticsearch are cumulative. This means the value keeps increasing and does not drop back to 0 until a node drops or restarts.

Varuna answered 21/12, 2021 at 11:56 Comment(0)

The top graph reports aggregate GC collection time, which is what's available from GarbageCollectorMXBean. It continues to increase because every young generation collection adds to it. And in the bottom graph, you can see lots of young generation collections happening.

Young generation collections are expected in any web-app (which is what an OpenSearch cluster is): you're constantly making requests (queries or updates), and those requests create garbage.
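As a rough illustration of why the raw numbers only ever rise (not AWS's actual code, just the JMX semantics the metric is built on): GarbageCollectorMXBean reports collection count and collection time as totals since JVM start.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcTotals {
    public static void main(String[] args) {
        // Both getCollectionCount() and getCollectionTime() are cumulative
        // since JVM start, so a graph of the raw values can only ever rise.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%s: collections=%d, total time=%d ms%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
    }
}
```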

I recommend looking at the major collection statistics. In my experience with OpenSearch, these happen when you're performing large numbers of updates, perhaps as a result of coalescing indexes. However, they should be infrequent unless you're constantly updating your cluster.

If you do experience memory pressure, the only real solution is to move to a larger node size. Adding nodes probably won't help, due to the way that indexes are sharded across nodes.

Indignant answered 16/12, 2021 at 14:14 Comment(2)
Hi Parisifal and thank you for your answer! Note that in the first graph I am referring to collection time (the actual AWS metric is JVMGCYoungCollectionTime), and according to the AWS documentation this metric is "The amount of time, in milliseconds, that the cluster has spent performing 'young generation' garbage collection." So unless it keeps adding collection time on each new GC, the graph should stay flat. Regarding looking at major collections as you suggested, unfortunately AWS doesn't expose that metric; it only exposes young and old GC. – Varuna
@Varuna – (1) The metrics exposed are from JMX, so that is the best place to look for documentation. (2) I interpret "that the cluster has spent" to mean aggregate time, and therefore yes, it should be ever increasing. I would expect the documentation to be very explicit if it were actually "the time of the most recent young generation collection." – Indignant

The OpenSearch metrics presented for Elasticsearch / indexing performance in the AWS console and the automatic CloudWatch dashboard for Young and Old Garbage Collection (GC) events and time are CUMULATIVE.

This makes them largely useless for performance monitoring as-is, but you can work around it by adding calculated metrics to your dashboard.

For COUNTS:

  1. Add the metric you would like to measure
  2. Use the "Add math" option to create a new calculated metric
  3. Set the calculated metric's Details to "DIFF(m1)", where m1 is the ID of the cumulative metric (see the scripted sketch below).

Example EVENT COUNTS (JVM GC Young) per time period.
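The same DIFF() metric-math trick works outside the console too. Here is a minimal sketch using the AWS SDK for Java v2 and GetMetricData; it assumes the AWS/ES namespace with DomainName/ClientId dimensions, and "my-domain" / "123456789012" are placeholder values to replace with your own domain and account ID.

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;
import software.amazon.awssdk.services.cloudwatch.CloudWatchClient;
import software.amazon.awssdk.services.cloudwatch.model.Dimension;
import software.amazon.awssdk.services.cloudwatch.model.GetMetricDataRequest;
import software.amazon.awssdk.services.cloudwatch.model.GetMetricDataResponse;
import software.amazon.awssdk.services.cloudwatch.model.Metric;
import software.amazon.awssdk.services.cloudwatch.model.MetricDataQuery;
import software.amazon.awssdk.services.cloudwatch.model.MetricStat;

public class YoungGcPerPeriod {
    public static void main(String[] args) {
        CloudWatchClient cw = CloudWatchClient.create();

        // m1: the raw, cumulative young-GC count (hidden from the result,
        // like hiding the source metric in the console).
        MetricDataQuery m1 = MetricDataQuery.builder()
                .id("m1")
                .returnData(false)
                .metricStat(MetricStat.builder()
                        .metric(Metric.builder()
                                .namespace("AWS/ES")
                                .metricName("JVMGCYoungCollectionCount")
                                .dimensions(
                                        Dimension.builder().name("DomainName").value("my-domain").build(),
                                        Dimension.builder().name("ClientId").value("123456789012").build())
                                .build())
                        .period(60)
                        .stat("Maximum")
                        .build())
                .build();

        // DIFF(m1): collections per period instead of the ever-growing total.
        MetricDataQuery perPeriod = MetricDataQuery.builder()
                .id("youngGcPerPeriod")
                .expression("DIFF(m1)")
                .label("Young GC collections per period")
                .build();

        GetMetricDataResponse resp = cw.getMetricData(GetMetricDataRequest.builder()
                .metricDataQueries(m1, perPeriod)
                .startTime(Instant.now().minus(1, ChronoUnit.HOURS))
                .endTime(Instant.now())
                .build());

        resp.metricDataResults().forEach(r -> System.out.println(r.label() + ": " + r.values()));
    }
}
```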

For EXECUTION TIME:

  1. Add the cumulative COUNT and TIME measures
  2. Use the "Add math" option to create a new calculated metric
  3. Set the calculated metric's Details to "DIFF(m2) / DIFF(m1)", where m1 is the COUNT metric ID and m2 is the TIME metric ID (see the scripted sketch at the end of this answer).

That will give "Time per Garbage Collection Event"

You can hide the COUNT and TIME metrics, set the scale appropriately, and add a horizontal annotation / threshold for easy performance monitoring.

Example Time of JVM GC Collection Events in ms with threshold
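For completeness, here is a hedged sketch of the "time per GC event" expression built as GetMetricData queries, under the same assumptions as the earlier sketch (AWS SDK for Java v2, AWS/ES namespace, your own DomainName and ClientId); the returned list can be passed to CloudWatchClient.getMetricData(...) exactly as above.

```java
import java.util.List;
import software.amazon.awssdk.services.cloudwatch.model.Dimension;
import software.amazon.awssdk.services.cloudwatch.model.Metric;
import software.amazon.awssdk.services.cloudwatch.model.MetricDataQuery;
import software.amazon.awssdk.services.cloudwatch.model.MetricStat;

public class YoungGcTimePerEvent {

    // Builds three queries: cumulative count (m1), cumulative time (m2),
    // and the derived "ms per young GC" series.
    static List<MetricDataQuery> queries(String domainName, String clientId) {
        MetricDataQuery count = cumulative("m1", "JVMGCYoungCollectionCount", domainName, clientId);
        MetricDataQuery time = cumulative("m2", "JVMGCYoungCollectionTime", domainName, clientId);

        // Delta of cumulative time divided by delta of cumulative count.
        MetricDataQuery msPerEvent = MetricDataQuery.builder()
                .id("msPerYoungGc")
                .expression("DIFF(m2) / DIFF(m1)")
                .label("Time per young GC event (ms)")
                .build();

        return List.of(count, time, msPerEvent);
    }

    private static MetricDataQuery cumulative(String id, String metricName,
                                              String domainName, String clientId) {
        return MetricDataQuery.builder()
                .id(id)
                .returnData(false) // hide the raw cumulative inputs
                .metricStat(MetricStat.builder()
                        .metric(Metric.builder()
                                .namespace("AWS/ES")
                                .metricName(metricName)
                                .dimensions(
                                        Dimension.builder().name("DomainName").value(domainName).build(),
                                        Dimension.builder().name("ClientId").value(clientId).build())
                                .build())
                        .period(60)
                        .stat("Maximum")
                        .build())
                .build();
    }
}
```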

Especially answered 3/10, 2023 at 5:12 Comment(0)
