High disk I/O on Cassandra nodes

Setup:
We have a 3-node Cassandra cluster with around 850G of data on each node. The Cassandra data directory sits on an LVM volume (currently spanning 3 drives: 800G + 100G + 100G), and cassandra_logs is on a separate, non-LVM volume.

Versions:
Cassandra v2.0.14.425
DSE v4.6.6-1

Issue:
After adding the 3rd (100G) volume to the LVM on each node, all nodes show very high disk I/O and go down quite often. The servers also become inaccessible and have to be rebooted; they never stabilize, so we end up rebooting every 10 - 15 minutes.

Other Info:
We have the DSE-recommended server settings (vm.max_map_count, file descriptor limits) configured on all nodes (see the quick check sketched below this list).
RAM on each node : 24G
CPU on each node : 6 cores / 2600MHz
Disk on each node : 1000G (Data dir) / 8G (Logs)
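
For reference, a quick way to double-check those settings on a node is something like the minimal sketch below; the recommended values in it are assumptions and should be verified against the DataStax docs for the DSE version in use.

import resource

# Assumed DSE-recommended values; check the DataStax docs for your version.
RECOMMENDED_MAX_MAP_COUNT = 1048575
RECOMMENDED_NOFILE = 100000

# vm.max_map_count as currently set by the kernel
with open("/proc/sys/vm/max_map_count") as f:
    max_map_count = int(f.read().strip())

# Per-process open file descriptor limits (soft, hard)
soft_nofile, hard_nofile = resource.getrlimit(resource.RLIMIT_NOFILE)

print("vm.max_map_count = %d (%s)" % (
    max_map_count,
    "OK" if max_map_count >= RECOMMENDED_MAX_MAP_COUNT else "too low"))
print("nofile limit     = %d/%d (%s)" % (
    soft_nofile, hard_nofile,
    "OK" if soft_nofile >= RECOMMENDED_NOFILE else "too low"))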

Declinometer answered 7/4, 2016 at 15:35 Comment(15)
What operations are you doing on Cassandra? – Simulate
Mostly write operations. We have Cassandra running with Solr, so we index the data we want to read and read it from the Solr indexes. – Declinometer
Is there a particular state in which the servers stop responding? – Contagion
Whenever I start the DSE service it starts compaction on one of the biggest keyspaces, which leads to high disk I/O, and later the node goes down. – Declinometer
What kind of disks are these? By the sounds of it you are running out of disk bandwidth. – Fanaticism
Our Cassandra nodes are VMs, and the disks on the hypervisors look like this: NL-SAS + SSD (for write caching) -> Ceph -> VM – Declinometer
@PatrickMcFadin I have been running this cluster on these disks for 4 months with more or less the same amount of data. We only started having this issue 2 days ago, after extending the LVM with an additional 100G volume on each node. Can LVM be the cause of this issue? – Declinometer
I suspect you are going past the limit on throughput, which is increasing your atime. Can you paste the output of a tpstats on an affected node? – Fanaticism
@PatrickMcFadin We have this issue on all 3 nodes of the cluster, but the output I am pasting here is from the node that is most affected: pastebin.com/Pd9EpzQX – Declinometer
Other nodes: Node02: pastebin.com/ZkYQ9N98, Node03: pastebin.com/jf8S29bh – Declinometer
Ok, I see your problem, or rather the reason your node is dying. I'll answer in the answer section. – Fanaticism
@PatrickMcFadin The issue does NOT occur right after starting the nodes. They work normally for 15 minutes to an hour, and then go down within minutes. Shouldn't they show the behaviour you're describing below from the very beginning, i.e. as soon as the DSE node starts? – Contagion
No, because a flush only happens after writes occur on the database. Compactions can start when the nodes start, but flush only happens after it's been online a bit. – Fanaticism
@PatrickMcFadin I've monitored the logs of those nodes, and what comes up shortly before the instance "dies" is this: pastebin.com/xgxAE8iN Would it be a solution to a) increase the 600000 millis to something else, OR b) start C* without Solr enabled first and, once flush/compaction is done, restart the nodes with Solr activated again? – Contagion
This line, "Timeout while waiting for workers when flushing pool Index", means Solr is also backing up because the disk has stopped responding. Increasing the timeout will only create more back pressure. All of your processes are starving for disk time and not getting it. Really your only short-term solution is adding more nodes to spread out the load, then working towards real disks on the nodes. – Fanaticism

As I suspected, you are having throughput problems on your disk. Here's what I looked at to give you background. The nodetool tpstats output from your three nodes had these lines:

Pool Name                    Active   Pending      Completed   Blocked  All time blocked
FlushWriter                       0         0             22         0                 8
FlushWriter                       0         0             80         0                 6
FlushWriter                       0         0             38         0                 9 

The column I'm concerned about is All time blocked. As a ratio to Completed, you have a lot of blocking. The FlushWriter pool is responsible for flushing memtables to disk to keep the JVM from running out of memory or creating massive GC problems. A memtable is an in-memory representation of your tables. As your nodes take more writes, the memtables start to fill and need to be flushed. That operation is a long sequential write to disk. Bookmark that. I'll come back to it.
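
To make that ratio easy to eyeball across nodes, something like this rough sketch (not part of nodetool; the file path and threshold are assumptions) can scan a saved tpstats dump and flag heavily blocked pools:

import sys

# Arbitrary threshold: flag pools blocked more than 5% as often as they complete.
THRESHOLD = 0.05

def parse_tpstats(path):
    """Yield (pool name, completed, all-time-blocked) from a saved tpstats dump."""
    with open(path) as f:
        for line in f:
            parts = line.split()
            # Thread-pool rows end with five integer columns:
            # Active, Pending, Completed, Blocked, All time blocked
            if len(parts) >= 6 and all(p.isdigit() for p in parts[-5:]):
                name = " ".join(parts[:-5])
                completed = int(parts[-3])
                all_time_blocked = int(parts[-1])
                yield name, completed, all_time_blocked

for name, completed, blocked in parse_tpstats(sys.argv[1]):
    if completed:
        ratio = float(blocked) / completed
    else:
        ratio = float("inf") if blocked else 0.0
    if ratio > THRESHOLD:
        print("%s: %d blocked vs %d completed (ratio %.2f)" % (name, blocked, completed, ratio))

On the three FlushWriter lines above the ratios work out to 8/22, 6/80 and 9/38, so every node trips that threshold.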

When flushwriters are blocked, the heap starts to fill. If they stay blocked, you will see the requests starting to queue up and eventually the node will OOM.

Compaction might be running as well. Compaction is a long sequential read of SSTables into memory and then a long sequential flush of the merge-sorted results. More sequential IO.

So all these operations on disk are sequential. Not random IOPs. If your disk is not able to handle simultaneous sequential read and write, IOWait shoots up, requests get blocked and then Cassandra has a really bad day.
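
If you want to see how the volume copes with that kind of mixed sequential load, a crude test along these lines can be telling. This is only a sketch, the paths and sizes are assumptions, and because it uses buffered I/O the page cache will flatter the read numbers compared to what flush and compaction actually see.

import os
import threading
import time

DATA_DIR = "/var/lib/cassandra/data"   # assumed mount point to test
FILE_SIZE = 2 * 1024 ** 3              # 2 GiB per file
CHUNK = 4 * 1024 ** 2                  # 4 MiB sequential chunks

def seq_write(path, size):
    # Long sequential write, roughly what a memtable flush looks like.
    start = time.time()
    buf = os.urandom(CHUNK)
    with open(path, "wb") as f:
        written = 0
        while written < size:
            f.write(buf)
            written += len(buf)
        f.flush()
        os.fsync(f.fileno())
    print("write: %.0f MiB/s" % (size / (time.time() - start) / 1024 ** 2))

def seq_read(path):
    # Long sequential read, roughly what compaction looks like on its input.
    start = time.time()
    total = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(CHUNK)
            if not chunk:
                break
            total += len(chunk)
    print("read : %.0f MiB/s" % (total / (time.time() - start) / 1024 ** 2))

read_path = os.path.join(DATA_DIR, "seqtest_read.bin")
write_path = os.path.join(DATA_DIR, "seqtest_write.bin")

# Lay down a file first so the read has something sequential to stream.
seq_write(read_path, FILE_SIZE)

# Now run a sequential read and a sequential write at the same time.
r = threading.Thread(target=seq_read, args=(read_path,))
w = threading.Thread(target=seq_write, args=(write_path, FILE_SIZE))
r.start(); w.start()
r.join(); w.join()

# Clean up the scratch files.
os.remove(read_path)
os.remove(write_path)

If throughput collapses once the read and the write run together, that is the same contention the flush and compaction paths are hitting.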

You mentioned you are using Ceph. I haven't seen a successful deployment of Cassandra on Ceph yet. It will hold up for a while and then tip over under sequential load. Your easiest solution in the short term is to add more nodes to spread out the load. The medium-term fix is to find ways to optimize your stack for sequential disk loads, but that will eventually fail too. The long-term fix is to get your data onto real disks and off shared storage.

I have told consulting clients this for years when they use Cassandra: "If your storage has an ethernet plug, you are doing it wrong." It's a good rule of thumb.

Fanaticism answered 8/4, 2016 at 2:12 Comment(0)
