Can't replace dead Cassandra node because it doesn't exist in gossip

One of the nodes in a cassandra cluster has died.

I'm using cassandra 2.0.7 throughout.

When I do a nodetool status this is what I see (real addresses have been replaced with fake 10 nets)

[root@beta-new:/opt] #nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address     Load       Tokens  Owns   Host ID                               Rack
UN  10.10.1.94  171.02 KB  256     49.4%  fd2f76ae-8dcf-4e93-a37f-bf1e9088696e  rack1
DN  10.10.1.98  ?          256     50.6%  f2a48fc7-a362-43f5-9061-4bb3739fdeaf  rack1

I tried to get the token for the down node by running nodetool ring, grepping for the IP, and using head -1 to get the first one.

[root@beta-new:/opt] #nodetool ring | grep 10.10.1.98 | head -1
10.10.1.98     rack1       Down   Normal  ?               50.59%              -9042969066862165996

I then started following this documentation on how to replace the node:

http://www.datastax.com/documentation/cassandra/2.0/cassandra/operations/ops_replace_node_t.html?scroll=task_ds_aks_15q_gk

So I installed cassandra on a new node but did not start it.

Set the following options:

cluster_name: 'Jokefire Cluster'
seed_provider:
      - seeds: "10.10.1.94"
listen_address: 10.10.1.94
endpoint_snitch: SimpleSnitch

And set the initial token of the new install as the token -1 of the node I'm trying to replace in cassandra.yaml:

initial_token: -9042969066862165995

After making sure there was no data yet in /var/lib/cassandra, I started up the database:

[root@web2:/etc/alternatives/cassandrahome] #./bin/cassandra -f -Dcassandra.replace_address=10.10.1.98

The documentation linked above says to use the replace_address directive on the command line rather than in cassandra-env.sh if you have a tarball install (which we do) as opposed to a package install.

After I start it up, cassandra fails with the following message:

Exception encountered during startup: Cannot replace_address /10.10.1.98 because it doesn't exist in gossip

So I'm wondering at this point if I've missed any steps, or if there is anything else I can try to replace this dead Cassandra node?

Underthrust answered 1/6, 2014 at 16:52

Has the rest of your cluster been restarted since the node failure, by chance? Most gossip information does not survive a full restart, so you may genuinely not have gossip information for the down node.
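You can check this directly (a quick sanity check, assuming you run it from any live node; the IP is the dead node's address from the question):

```shell
# Check whether the dead node still appears in the cluster's gossip state.
# Run from any live node; 10.10.1.98 is the failed node's address.
nodetool gossipinfo | grep -A 5 '10.10.1.98'

# No output means the cluster holds no gossip entry for that node,
# in which case replace_address cannot work and removenode is the way forward.
```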

This issue was reported as bug CASSANDRA-8138, and the answer there was:

I think I'd much rather say that the edge case of a node dying, and then a full cluster restart (rolling would still work) is just not supported, rather than make such invasive changes to support replacement under such strange and rare conditions. If that happens, it's time to assassinate the node and bootstrap another one.

So rather than replacing your node, you need to remove the failed node from the cluster and start up a new one. If using vnodes, it's quite straightforward.

Discover the node ID of the failed node (from another node in the cluster)

nodetool status | grep DN

And remove it from the cluster:

nodetool removenode (node ID)

Now you can clear out the data directory of the failed node, and bootstrap it as a brand-new one.
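The steps above can be sketched end to end (a hedged outline, not a definitive runbook; the host ID comes from the question's nodetool status output, and the data paths assume the default /var/lib/cassandra layout):

```shell
# 1. From a live node, find the host ID of the dead (DN) node:
nodetool status | grep DN
# DN  10.10.1.98  ?  256  50.6%  f2a48fc7-a362-43f5-9061-4bb3739fdeaf  rack1

# 2. Remove it from the ring; the remaining replicas re-stream its ranges:
nodetool removenode f2a48fc7-a362-43f5-9061-4bb3739fdeaf

# If the removal hangs (e.g. streaming stalls), you can check progress
# and, as a last resort, force completion:
nodetool removenode status
nodetool removenode force

# 3. On the replacement machine, wipe any old state so it bootstraps fresh
#    (paths assume the default data directory layout):
rm -rf /var/lib/cassandra/data /var/lib/cassandra/commitlog /var/lib/cassandra/saved_caches

# 4. Drop any initial_token / replace_address settings from the earlier
#    attempt, then start it as an ordinary new node:
./bin/cassandra
```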

Assignment answered 3/2, 2015 at 3:30

Some lesser-known issues with replacing a dead Cassandra node, based on my experience, are captured at the link below:

https://github.com/laxmikant99/cassandra-single-node-disater-recovery-lessons

Honewort answered 12/3, 2018 at 4:41
