Simultaneous repairs cause repair to hang
I'm running Cassandra 3.7 in a 24 node cluster with 3 data centers and 256 vnodes per node, and each node uses a cron job to run "nodetool repair -pr" once a day during a different hour of the day from the other nodes.

Sometimes the repair takes more than one hour to complete and the repairs overlap. When this happens, repair starts to get exceptions and can hang in a bad state. This leads to a cascading failure where each hour another node will try to start a repair and it will also hang.

Recovering from this is difficult. The only way I have found is to restart not just the nodes with a stuck repair, but all the nodes in the cluster.

The only idea I have for dealing with this is to build some kind of service that checks if any other node is running repair before it starts a repair, maybe by publishing in a Cassandra table when a repair is in progress.
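For illustration, that idea could be sketched as a cluster-wide lock taken with a lightweight transaction. Everything here is hypothetical and untested: the `ops` keyspace, the table name, the TTL, and the replication settings are all made up.

```shell
# Hypothetical sketch: a cluster-wide repair "lock" held in a Cassandra table.
# Keyspace/table names, TTL, and replication are illustrative, not a tested recipe.
cqlsh -e "CREATE KEYSPACE IF NOT EXISTS ops
            WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};"
cqlsh -e "CREATE TABLE IF NOT EXISTS ops.repair_lock (name text PRIMARY KEY, holder text);"

# Before repairing, try to take the lock with a lightweight transaction.
# If another node already holds it, [applied] comes back False and we skip
# this run; the TTL eventually releases a lock left behind by a crashed repair.
cqlsh -e "INSERT INTO ops.repair_lock (name, holder)
            VALUES ('repair', '$(hostname)')
            USING TTL 7200 IF NOT EXISTS;" | grep -q True && nodetool repair -pr
```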

I'm not sure how I will be able to repair all the nodes if the cluster gets bigger since there soon won't be enough hours in the day to run repair on all the nodes one by one.

So my main question is, am I running repair incorrectly and what is the recommended way to regularly repair all the nodes of a large cluster?

Is there a way to repair more than one node at a time? The documentation hints that there is, but it isn't clear how to do that. Is it normal that repair would crash and burn when run on more than one node at a time? Is there an easier way to kill the stuck repairs than restarting all the nodes?

Some things I tried:

  1. Running "nodetool repair" without -pr, but this also hangs if run on multiple nodes at once.
  2. Running "nodetool repair -dcpar" - this seems to repair the token ranges owned by the node it is run on in all the data centers, but it also hangs if run on multiple nodes at once.

My keyspace keeps only one replica per data center, so I don't think I can use the -local option.

Some of the exceptions I see when repair hangs are:

ERROR [ValidationExecutor:4] 2016-07-07 12:00:31,938 CassandraDaemon.java (line 227) Exception in thread Thread[ValidationExecutor:4,1,main]
java.lang.NullPointerException: null
        at org.apache.cassandra.service.ActiveRepairService$ParentRepairSession.getActiveSSTables(ActiveRepairService.java:495) ~[main/:na]
        at org.apache.cassandra.service.ActiveRepairService$ParentRepairSession.access$300(ActiveRepairService.java:451) ~[main/:na]
        at org.apache.cassandra.service.ActiveRepairService.currentlyRepairing(ActiveRepairService.java:338) ~[main/:na]
        at org.apache.cassandra.db.compaction.CompactionManager.getSSTablesToValidate(CompactionManager.java:1320) ~[main/:na]

ERROR [Repair#6:1] 2016-07-07 12:00:35,221 CassandraDaemon.java (line 227) Exception in thread Thread[Repair#6:1,5,RMI Runtime]
com.google.common.util.concurrent.UncheckedExecutionException: org.apache.cassandra.exceptions.RepairException: [repair #67bd9b10-...
]]] Validation failed in /198.18.87.51
        at com.google.common.util.concurrent.Futures.wrapAndThrowUnchecked(Futures.java:1525) ~[guava-18.0.jar:na]
        at com.google.common.util.concurrent.Futures.getUnchecked(Futures.java:1511) ~[guava-18.0.jar:na]
        at org.apache.cassandra.repair.RepairJob.run(RepairJob.java:160) ~[main/:na]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[na:1.8.0_71]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ~[na:1.8.0_71]
        at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_71]
Caused by: org.apache.cassandra.exceptions.RepairException: [repair #67bd9b10...
]]] Validation failed in /198.18.87.51
        at org.apache.cassandra.repair.ValidationTask.treesReceived(ValidationTask.java:68) ~[main/:na]
        at org.apache.cassandra.repair.RepairSession.validationComplete(RepairSession.java:183) ~[main/:na]
        at org.apache.cassandra.service.ActiveRepairService.handleMessage(ActiveRepairService.java:439) ~[main/:na]
        at org.apache.cassandra.repair.RepairMessageVerbHandler.doVerb(RepairMessageVerbHandler.java:169) ~[main/:na]

ERROR [ValidationExecutor:3] 2016-07-07 12:42:01,298 CassandraDaemon.java (line 227) Exception in thread Thread[ValidationExecutor:3,1,main]
java.lang.RuntimeException: Cannot start multiple repair sessions over the same sstables
        at org.apache.cassandra.db.compaction.CompactionManager.getSSTablesToValidate(CompactionManager.java:1325) ~[main/:na]
        at org.apache.cassandra.db.compaction.CompactionManager.doValidationCompaction(CompactionManager.java:1215) ~[main/:na]
        at org.apache.cassandra.db.compaction.CompactionManager.access$700(CompactionManager.java:81) ~[main/:na]
        at org.apache.cassandra.db.compaction.CompactionManager$11.call(CompactionManager.java:844) ~[main/:na]
Studley answered 7/7, 2016 at 15:15 Comment(7)
May I ask why "repairs" are continually necessary? It superficially seems to me that this is patching over some underlying problem.Commonly
Because our network often requires restarting machines, and this causes Cassandra nodes to miss data and become inconsistent. Sometimes we also have network outages to nodes.Studley
Ahh, I see. Restarts are always a bugaboo.Commonly
Have you been able to solve this? If so, how did you do it? I am running a growing cluster of Cassandra nodes and getting concerned about the same issue.Ingroup
@Ingroup No I haven't found a solution. It is an ongoing headache. If it would just abort cleanly it wouldn't be that bad, but the only way to clear it seems to be a rolling restart of all nodes. Also I sometimes see repair hang if one node is running repair and another node in the cluster is restarted during the repair.Studley
Any solution for this problem? We have a smaller number of nodes and are still running into the same odd situation.Spoilt
@Spoilt I haven't found a solution. The problem seems to happen most often if you use a schema where a lot of data is in one table. It seems that repair can't repair the same table on more than one node at a time, and if you try it will crash and burn and leave you in this hung state. I've tried to mitigate the problem by splitting my data into multiple tables to reduce the odds of the same table being repaired at the same time.Studley
I see some confusion regarding the repair process, so let's see if I can help.

If you care about the data, you need to repair. A variety of scenarios can create inconsistencies in a Cassandra database: file corruption, a network partition, a process crash before flushing to disk, human error, and so on. These data replica inconsistencies, often referred to as entropy, are resolved with an anti-entropy process: repair.

Data usually needs to be repaired at least once within the grace period controlled by the per-table gc_grace_seconds setting, which defaults to 864000 seconds (10 days). If this isn't done, you might encounter zombie data.

What's zombie data? When a node gets a delete command for a record it manages, it marks the record for deletion using a special value known as a tombstone. The node then tries to share this tombstone with other nodes that hold copies of the same record. If one of these replica nodes is inactive during this process, it doesn't receive the tombstone right away. As a result, it still holds the version of the record before it was deleted. When the inactive node becomes active again, if the tombstone has already been removed from the rest of the network, the database treats the record on the reactivated node as new information and spreads it across the network. This resurrected deleted record is referred to as a zombie.

To stop zombie data from emerging, the database gives each tombstone a grace period. This gives unresponsive nodes time to recover and handle tombstones as usual. When a read request involves multiple replica responses that differ from each other, the most recent value takes priority. For instance, if there's a newer tombstone and an older record, the database will go with the recent tombstone.

Due to this, if a node remains inactive for longer than the grace period, it's recommended not to revive and repair it. Instead, it's better to rebuild it.
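As a concrete illustration of the grace-period setting mentioned above (the keyspace and table names here are hypothetical):

```shell
# Adjust the tombstone grace period on a (hypothetical) table.
# 864000 seconds = 10 days, which is the default.
cqlsh -e "ALTER TABLE my_keyspace.my_table WITH gc_grace_seconds = 864000;"
```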

Is there a way to repair more than one node at a time? Yes, although how depends on the tool you use to repair the data.

A quick summary of the tools you can use to repair a Cassandra cluster:

AxonOps

In a production environment, especially for medium to large clusters, it's recommended to use a tool like AxonOps. AxonOps is a user-friendly yet highly customizable solution that handles various tasks such as backups, restores, repairs, configurations, orchestration, monitoring, and alerting.

I used AxonOps on a production cluster of roughly 150 nodes, each holding around 400-500 GB of data. For repair in particular it offers an Adaptive Repair feature, which automatically throttles the repair load when the cluster is busy.

NodeSync

NodeSync functions as an ongoing repair system running in the background, and it comes pre-installed in DSE 6.0 and later, needing only activation. You can toggle it per keyspace or table. When active, it effectively replaces anti-entropy repair with minimal resource overhead, addressing entropy without affecting performance during peak usage.

NodeSync can be used alongside nodetool repair. Nodetool repair will conveniently ignore tables with active NodeSync, allowing you to enable it for tables that experience lower traffic while running nodetool repair -pr for others when it's more suitable for your traffic patterns.

Let's walk through a quick example:

First, confirm that the service is enabled.

$ nodetool nodesyncservice status

Next, activate NodeSync for a table.

$ nodesync enable -v keyspace.table
Nodesync enabled for keyspace.table 

Reaper

The tool originally had the name "Repair," but due to the developer's strong Swedish accent, it ended up being called "Reaper."

Cassandra Reaper serves as a remote repair management tool.

Reaper lets you schedule, pause, and resume repair tasks efficiently.

It's worth noting that Cassandra Reaper requires communication with Cassandra's JMX port, which can involve authentication. Some administrators find this controversial because exposing JMX on a public port can pose unnecessary security risks.

Additionally, Cassandra Reaper comes with built-in integrations with monitoring systems like Prometheus and Graphite.

Lemmon answered 1/9, 2023 at 16:24 Comment(0)

Depending on your data size, how the schema spreads data across keyspaces and tables, and the number of tokens per node, you could run multiple smaller repairs targeting those dimensions. For large keyspaces and tables you can also use the start/end token options on repair; you can find each node's tokens by running the nodetool ring command. Another way to keep repairs smaller in scope is to run incremental and parallel repairs; check the options of nodetool repair.
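As a sketch of the start/end-token idea, the helper below just splits a token range into equal sub-ranges; the range bounds would come from nodetool ring, and each printed pair would drive one small repair. Names and numbers here are illustrative.

```shell
# Split a token range into N equal sub-ranges so each repair covers less data.
# Start/end tokens would come from "nodetool ring"; this helper is illustrative.
split_range() {
  st=$1; et=$2; n=$3
  span=$(( (et - st) / n ))
  i=1; cur=$st
  while [ "$i" -le "$n" ]; do
    if [ "$i" -eq "$n" ]; then next=$et; else next=$(( cur + span )); fi
    echo "$cur $next"
    cur=$next
    i=$(( i + 1 ))
  done
}

# Each pair could then drive one sub-range repair, e.g.:
#   split_range <start_token> <end_token> 16 | while read st et; do
#     nodetool repair -st "$st" -et "$et" my_keyspace
#   done
split_range 0 1000 4   # prints: 0 250 / 250 500 / 500 750 / 750 1000
```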

Misericord answered 8/7, 2016 at 6:55 Comment(1)
I don't understand what you're suggesting as a solution. Are you saying I should make a central repair controller that somehow analyzes the output of nodetool ring and runs some repairs in parallel? How do you do that? The default in 3.7 is for incremental and parallel repairs, so I'm already using those settings. That does not prevent repair from hanging if repair is run on more than one node at a time.Studley

I think @viorel was suggesting sub-range repair. The DataStax documentation for Cassandra 3.0 describes it as a fast repair. Basically, instead of computing the Merkle tree for a whole partitioner range, you break the range down into sub-ranges and compare those. Because the Merkle tree has a bounded depth, smaller ranges give finer-grained hashing, so less data is needlessly streamed, which is why it can be faster.

Mescal answered 14/7, 2016 at 19:0 Comment(3)
The speed of the repair doesn't really matter to me, I just want to avoid repair throwing exceptions and hanging. If I could patch Cassandra so that repair would gracefully exit if it detected another repair was already in progress, then at least I wouldn't need an operator to manually restart the nodes to clean up the stuck repairs.Studley
I was thinking of avoiding the overlapping repairs by having them complete faster. I suppose that eventually the load will be large enough to make this only a temporary solution.Mescal
Yes, that's been my experience. I first ran into this problem when one of the nodes was paging memory heavily (due to a non-Cassandra application running on the same machine), and this seemed to slow down repair considerably, causing it to overlap with another node. So some kind of coordination mechanism or central planner seems to be needed since the nodes do not protect themselves from simultaneous repairs.Studley

You can try cassandra-reaper: Software to run automated repairs of Cassandra https://github.com/thelastpickle/cassandra-reaper

Erne answered 15/7, 2016 at 3:12 Comment(6)
It doesn't look like that tool is maintained anymore, and one of the open issues is that it doesn't work with newer releases like 3.x. I'd prefer not to use a centralized repair scheduler since that would become a single point of failure. The documentation talks about "opportunistically running multiple parallel repairs at the same time on different nodes", but I'm not clear on what can be repaired in parallel versus what will trigger repair to throw exceptions and hang. It seems that repair doesn't protect itself against attempts to repair the same table by multiple nodes.Studley
None of the answers were quite what I was looking for, but your answer was the most useful, so I'll award the bounty to you. Thanks for the information.Studley
I maintain a 50+ node cluster with 2 DCs. I set up repair tasks on one node and repair a different keyspace every day to make sure every keyspace gets repaired once a week. I do not use the -pr flag and everything works fine. Hope this helps.Erne
Are you saying you run repair on a single node without -pr and it repairs the data on all 50 nodes in the cluster? I thought repair needed to be run on each node to repair the entire cluster. Or on each node in one DC using the -dcpar option.Studley
Yes, you are right. Since our data model never deletes data, we choose one node to repair every week and eventually all nodes get repaired. We plan to switch to sub-range repair soon. I think you can also try the DSE repair service: docs.datastax.com/en/opscenter/5.1/opsc/online_help/services/…Erne
What I think he meant is that he uses the -h argument to repair a different node every day, so at the end of the week all token ranges are repaired; thus he only needs to run the command from the one node where his script is located.Nightrider

That may be the tip of another problem. For example, you may have a very large SSTable file that cannot be repaired within your one-hour window. Such a file may contain data from several token ranges, which causes contention between repair tasks.

You may find it easier to launch repairs from a central point so they run strictly sequentially, rather than starting one per node every hour. Possible solutions:

  • cassandra-reaper if possible
  • a script on one node that would ssh to all nodes to run the nodetool command
  • a script on one node that runs nodetool -h x.x.x.x repair -pr against each host in turn, though this requires allowing remote JMX connections on each host
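The script options above can be sketched like this. Addresses are hypothetical, and the real nodetool call is left as a comment so the sketch stands on its own:

```shell
# Minimal sketch of a sequential repair driver: repair one host at a time so
# repairs never overlap, and stop at the first failure instead of piling up.
repair_each() {
  for host in "$@"; do
    echo "repairing $host"
    # Real call (requires remote JMX on each host), e.g.:
    # nodetool -h "$host" repair -pr || { echo "failed on $host" >&2; return 1; }
  done
}

# Hypothetical host list; one repair at a time, in order.
repair_each 10.0.0.1 10.0.0.2 10.0.0.3
```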
Toomay answered 11/5, 2021 at 12:7 Comment(0)

© 2022 - 2025 — McMap. All rights reserved.