Cassandra: Long ParNew GC pauses when bootstrapping new nodes to the cluster

I've seen an issue that happens fairly often when bootstrapping new nodes to a DataStax Enterprise Cassandra cluster (version 2.0.10.71).

When starting the new node to be bootstrapped, the bootstrap process starts to stream data from other nodes in the cluster. After a short period of time (usually a minute or less), other nodes in the cluster show high ParNew GC pause times and then drop off from the cluster, failing the stream session.

INFO [main] 2015-04-27 16:59:58,644 StreamResultFuture.java (line 91) [Stream #d42dfef0-ecfe-11e4-8099-5be75b0950b8] Beginning stream session with /10.1.214.186

INFO [GossipTasks:1] 2015-04-27 17:01:06,342 Gossiper.java (line 890) InetAddress /10.1.214.186 is now DOWN

INFO [HANDSHAKE-/10.1.214.186] 2015-04-27 17:01:21,400 OutboundTcpConnection.java (line 386) Handshaking version with /10.1.214.186

INFO [RequestResponseStage:11] 2015-04-27 17:01:23,439 Gossiper.java (line 876) InetAddress /10.1.214.186 is now UP

Then on the other node:

10.1.214.186 ERROR [STREAM-IN-/10.1.212.233] 2015-04-27 17:02:07,007 StreamSession.java (line 454) [Stream #d42dfef0-ecfe-11e4-8099-5be75b0950b8] Streaming error occurred

I also see entries like this in the logs:

10.1.219.232 INFO [ScheduledTasks:1] 2015-04-27 18:20:19,987 GCInspector.java (line 116) GC for ParNew: 118272 ms for 2 collections, 980357368 used; max is 12801015808

10.1.221.146 INFO [ScheduledTasks:1] 2015-04-27 18:20:29,468 GCInspector.java (line 116) GC for ParNew: 154911 ms for 1 collections, 1287263224 used; max is 12801015808

It seems that it happens on different nodes each time we try to bootstrap a new node.

I've found this related ticket: https://issues.apache.org/jira/browse/CASSANDRA-6653

My only guess is that when the new node comes up, a lot of compactions fire off, and that might be causing the GC pause times. I had considered setting concurrent_compactors to half my total CPU count.
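
Something like this in cassandra.yaml is what I had in mind (assuming the 8-core i2.2xlarge, so 8 / 2 = 4; this is just my guess at a starting point, not a confirmed fix):

# cassandra.yaml - hypothetical starting value, half the core count
concurrent_compactors: 4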

Anyone have an idea?

Edit: More details on the GC settings. We're using i2.2xlarge nodes on EC2:

MAX_HEAP_SIZE="12G"

HEAP_NEWSIZE="800M"

Also

JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC"

JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC"

JVM_OPTS="$JVM_OPTS -XX:+CMSParallelRemarkEnabled"

JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=8"

JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=1"

JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=75"

JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly"

JVM_OPTS="$JVM_OPTS -XX:+UseTLAB"

Primordium asked 27/4, 2015 at 20:36. Comments (10):
Increasing concurrent compactors is usually a good idea; core_count / 2 is a good starting point. – Glint
@Glint I think the default is commented out in cassandra.yaml – does that mean the default is unbounded, or a single compactor? – Primordium
Although I'm noticing now that we have multithreaded_compaction: false – Primordium
Leave multithreaded compaction off. Can you paste the output of nodetool tpstats on a node with the original settings? I would love to see what the flush writers are doing. – Hoxie
@PatrickMcFadin – here is the nodetool tpstats output: gist.github.com/petecheslock/dcc72f15799c08d130e5 All the nodes are pretty similar. FWIW, I've replaced/restreamed many of the nodes over the past few months. – Primordium
Yeah, that is a blood bath. You are blocking 41627 times out of 208025 requests. That should be zero. So why all the GC? The flush writer is what moves the partition data in the memtable to disk. Memtables consume a lot of heap, so keeping them writing out is a crucial operation for keeping the heap in good health. Once that process is blocked, GC is inevitable and, in the worst case, OOM. I2 instances have plenty of disk. Upping the writers is a good plan. – Hoxie
@PatrickMcFadin Which writers am I increasing? These GCs only happen for us when bootstrapping, not during normal operation. We've added/replaced most of the nodes over the last few months, so lots of bootstrapping has been going on. – Primordium
Sorry, memtable_flush_writers, which defaults to 1. You can easily raise this to 4 or even 8. The concurrent compactors change is also a good idea. All are preferred with the bandwidth i2 instances provide. When you bootstrap, data from other nodes is streamed in. That data will hit the memtables first, and at the high rate streams work, you'll need to be flushing to disk quickly. This will also happen when you run a repair. Given all the problems you've had with blocking, I would recommend running a repair as soon as you get this sorted out. – Hoxie
@PatrickMcFadin Thanks! Going to give this a shot! – Primordium
I agree with Patrick's thoughts on flushing. You may want to take a look at this JIRA: issues.apache.org/jira/browse/CASSANDRA-8485. Are you on a version > 2.0.12? – Runyan

With help from the DSE crew, the following settings resolved this for us.

On an i2.2xlarge node (8 CPUs, 60 GB of RAM, local SSD only):

Increasing heap new size to 512M * num CPUs (in our case 4G)

Setting memtable_flush_writers = 8

Setting concurrent_compactors = total CPUs / 2 (in our case 4)
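
Concretely, those changes amount to something like the following (values are what worked for our 8-core i2.2xlarge nodes, so adjust for your hardware; the heap setting goes in cassandra-env.sh, the rest in cassandra.yaml):

# cassandra.yaml
memtable_flush_writers: 8
concurrent_compactors: 4

# cassandra-env.sh (new gen sized at 512M * 8 cores)
MAX_HEAP_SIZE="12G"
HEAP_NEWSIZE="4G"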

After making these changes we no longer see ParNew GC times exceeding 1 second on bootstrap (previously we were seeing 50-100 SECOND GC times). FWIW, we don't see any long ParNew GC pauses during normal operation, only during bootstrap.
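
To confirm the flush writers are no longer blocking (per Patrick's comments above), we watch the flush writer pool in nodetool tpstats on each node; on 2.0.x the pool is named FlushWriter (later versions rename it MemtableFlushWriter), and the "All time blocked" column should stay at zero once things are healthy. A quick check looks something like:

# show the tpstats header plus the flush writer pool on this node
nodetool tpstats | grep -E 'Pool Name|FlushWriter'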

Primordium answered 28/4, 2015 at 13:52
