etcd cluster ID mismatch

Hey, I have a cluster ID mismatch for some reason. I had it on one node, then it disappeared after clearing the data dir a few times and changing the cluster token and node names, but it appeared on another node.

Here is the script I use:

IP0=10.150.0.1
IP1=10.150.0.2
IP2=10.150.0.3
IP3=10.150.0.4
NODENAME0=node0
NODENAME1=node1
NODENAME2=node2
NODENAME3=node3

# changing these on each box
THISIP=$IP2
THISNODENAME=$NODENAME2

etcd --name $THISNODENAME --initial-advertise-peer-urls http://$THISIP:2380 \
 --data-dir /root/etcd-data \
 --listen-peer-urls http://$THISIP:2380 \
 --listen-client-urls http://$THISIP:2379,http://127.0.0.1:2379 \
 --advertise-client-urls http://$THISIP:2379 \
 --initial-cluster-token etcd-cluster-2 \
 --initial-cluster $NODENAME0=http://$IP0:2380,$NODENAME1=http://$IP1:2380,$NODENAME2=http://$IP2:2380,$NODENAME3=http://$IP3:2380 \
 --initial-cluster-state new

I get

2016-11-11 22:13:12.090515 I | etcdmain: etcd Version: 2.3.7   
2016-11-11 22:13:12.090643 N | etcdmain: the server is already initialized as member before, starting as etcd member...
2016-11-11 22:13:12.090713 I | etcdmain: listening for peers on http://10.150.0.3:2380
2016-11-11 22:13:12.090745 I | etcdmain: listening for client requests on http://10.150.0.3:2379
2016-11-11 22:13:12.090771 I | etcdmain: listening for client requests on http://127.0.0.1:2379
2016-11-11 22:13:12.090960 I | etcdserver: name = node2
2016-11-11 22:13:12.090976 I | etcdserver: data dir = /root/etcd-data
2016-11-11 22:13:12.090983 I | etcdserver: member dir = /root/etcd-data/member
2016-11-11 22:13:12.090990 I | etcdserver: heartbeat = 100ms
2016-11-11 22:13:12.090995 I | etcdserver: election = 1000ms
2016-11-11 22:13:12.091001 I | etcdserver: snapshot count = 10000
2016-11-11 22:13:12.091011 I | etcdserver: advertise client URLs = http://10.150.0.3:2379
2016-11-11 22:13:12.091269 I | etcdserver: restarting member 7fbd572038b372f6 in cluster 4e73d7b9b94fe83b at commit index 4
2016-11-11 22:13:12.091317 I | raft: 7fbd572038b372f6 became follower at term 8
2016-11-11 22:13:12.091346 I | raft: newRaft 7fbd572038b372f6 [peers: [], term: 8, commit: 4, applied: 0, lastindex: 4, lastterm: 1]
2016-11-11 22:13:12.091516 I | etcdserver: starting server... [version: 2.3.7, cluster version: to_be_decided]
2016-11-11 22:13:12.091869 E | etcdmain: failed to notify systemd for readiness: No socket
2016-11-11 22:13:12.091894 E | etcdmain: forgot to set Type=notify in systemd service file?
2016-11-11 22:13:12.096380 N | etcdserver: added member 7508b3e625cfed5 [http://10.150.0.4:2380] to cluster 4e73d7b9b94fe83b
2016-11-11 22:13:12.099800 N | etcdserver: added member 14c76eb5d27acbc5 [http://10.150.0.1:2380] to cluster 4e73d7b9b94fe83b
2016-11-11 22:13:12.100957 N | etcdserver: added local member 7fbd572038b372f6 [http://10.150.0.2:2380] to cluster 4e73d7b9b94fe83b
2016-11-11 22:13:12.102711 N | etcdserver: added member d416fca114f17871 [http://10.150.0.3:2380] to cluster 4e73d7b9b94fe83b
2016-11-11 22:13:12.134330 E | rafthttp: request cluster ID mismatch (got cfd5ef74b3dcf6fe want 4e73d7b9b94fe83b)

The other members are not even running; how is that possible?

Thank you

Bigamous answered 14/11, 2016 at 9:56 Comment(0)

For all those who stumble upon this from Google:

The error is about a peer member that tries to join the cluster with the same name as another member (probably an old instance) that already exists in the cluster: same peer name, but a different ID, and that is the problem.

You should delete the peer and re-add it, as shown in this helpful post:

In order to fix this it was pretty simple: first we had to log into an existing working server on the rest of the cluster and remove server00 from its member list:

etcdctl member remove <UID>

This frees up the ability for the new server00 to join, but we needed to tell the cluster it could by issuing the add command:

etcdctl member add server00 http://1.2.3.4:2380

If you follow the logs on server00 you'll then see everything spring into life. You can confirm this with the commands:

etcdctl member list

etcdctl cluster-health

Use "etcdctl member list" to find what are the IDs of current members, and find the one which tries to join cluster with wrong ID, then delete that peer from "members" with "etcdctl member remove " and try to rejoin him. Hope it helps.

Flesher answered 12/3, 2017 at 11:52 Comment(0)

In my case I got the error

rafthttp: request cluster ID mismatch (got 1b3a88599e79f82b want b33939d80a381a57)

due to an incorrect config on one node.

Two of my nodes had this in their config:

env ETCD_INITIAL_CLUSTER="etcd-01=http://172.16.50.101:2380,etcd-02=http://172.16.50.102:2380,etcd-03=http://172.16.50.103:2380"

and one node had

env ETCD_INITIAL_CLUSTER="etcd-01=http://172.16.50.101:2380"

To resolve the problem I stopped etcd on all nodes, edited the incorrect config, deleted the /var/lib/etcd/member folder on all nodes, restarted etcd on all nodes, and voilà!

p.s.

/var/lib/etcd is the folder where etcd saves its data in my case.
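
A rough sketch of those steps, assuming etcd runs under systemd and reads ETCD_INITIAL_CLUSTER from an environment file such as /etc/default/etcd (the service name and config path are guesses; use whatever your installation actually has):

    # run on every node
    systemctl stop etcd

    # make ETCD_INITIAL_CLUSTER identical on all nodes, listing the full member set
    vi /etc/default/etcd

    # drop the stale member metadata and restart
    rm -rf /var/lib/etcd/member
    systemctl start etcd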

Footlight answered 12/11, 2017 at 12:42 Comment(2)
prob the best answer lol (Gracegraceful)
Key is to clear the old data directory - if this is not cleared, the node just ignores directives to join a cluster and continues on its own. Most tutorials don't mention this as they assume a clean-sheet install. (Ginn)

My --data-dir is /var/etcd/data; removing and recreating it works for me. It seems that something from a previous etcd cluster I made was left in this directory, which affected the etcd settings.
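
A minimal sketch of that, assuming a systemd-managed etcd and the data dir from this answer; moving the directory aside rather than deleting it keeps a backup:

    systemctl stop etcd
    mv /var/etcd/data /var/etcd/data.bak
    systemctl start etcd    # etcd re-creates the data dir from its current flags/config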

Knee answered 8/8, 2017 at 7:40 Comment(2)
This is also applicable when you run a node for even a moment in standalone mode (e.g. to test it) and then reconfigure it to join a cluster - it will simply continue on in standalone mode and ignore directives to join the cluster until data is cleared. (Ginn)
This works for me. I had first set an incorrect config and later corrected it; I needed to remove the data directory before the node would join. (Performing)

I just ran into this same issue, 2 years later. Dmitry's answer is fine but misses what the OP likely did wrong in the first place when setting up an etcd cluster.

Running an etcd instance with "--initial-cluster-state new" at any point will generate a cluster ID in the data directory. If you then/later try to join an existing cluster, it will use that old generated cluster ID (which is when the mismatch error occurs). Yes, technically the OP had an "old cluster", but more likely, and 100% common, is that someone standing up their first cluster doesn't notice the procedure has to change. I find that etcd kind of generally fails in providing a good usage model.

So, removing the member (you don't really need to if the new node never joined successfully) and/or deleting the new node's data directory will "fix" the issue, but it's how the OP set up the 2nd cluster node that is the problem.

Here's an example of the setup nuance: (sigh... thanks for that etcd...)

# On the 1st node (I used Centos7 minimal, with etcd installed)
sudo firewall-cmd --permanent --add-port=2379/tcp
sudo firewall-cmd --permanent --add-port=2380/tcp
sudo firewall-cmd --reload

export CL_NAME=etcd1
export HOST=$(hostname)
export IP_ADDR=$(ip -4 addr show ens33 | grep -oP '(?<=inet\s)\d+(\.\d+){3}')

# turn on etcdctl v3 api support, why is this not default?!
export ETCDCTL_API=3

sudo etcd --name $CL_NAME --data-dir ~/data --advertise-client-urls=http://127.0.0.1:2379,https://$IP_ADDR:2379 --listen-client-urls=https://0.0.0.0:2379 --initial-advertise-peer-urls https://$IP_ADDR:2380 --listen-peer-urls https://$IP_ADDR:2380 --initial-cluster-state new

Ok, the first node is running. The cluster data is in the ~/data directory. In future runs you only need the following (note that --initial-cluster-state isn't needed):

sudo etcd --name $CL_NAME --data-dir ~/data --advertise-client-urls=http://127.0.0.1:2379,https://$IP_ADDR:2379 --listen-client-urls=https://0.0.0.0:2379 --initial-advertise-peer-urls https://$IP_ADDR:2380 --listen-peer-urls https://$IP_ADDR:2380

Next, add your 2nd node's expected member name and peer URL:

etcdctl --endpoints="https://127.0.0.1:2379" member add etcd2 --peer-urls="http://<next node's IP address>:2380"

Adding the member is important. You won't be able to successfully join without doing it first.
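
As a side note, when the v3 etcdctl registers the member it prints suggested settings for the new node; the exact output varies by version, and the ID and addresses below are made up, but it looks roughly like this:

    Member 9bf1b35fc7761a23 added to cluster 8c4f56d2a7e83b21

    ETCD_NAME="etcd2"
    ETCD_INITIAL_CLUSTER="etcd1=http://10.0.0.1:2380,etcd2=http://10.0.0.2:2380"
    ETCD_INITIAL_ADVERTISE_PEER_URLS="http://10.0.0.2:2380"
    ETCD_INITIAL_CLUSTER_STATE="existing"

Copying those values over to the new node is the least error-prone way to get the flags in the next step right.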

# Next on the 2nd/new node
export CL_NAME=etcd2
export HOST=$(hostname)
export IP_ADDR=$(ip -4 addr show ens33 | grep -oP '(?<=inet\s)\d+(\.\d+){3}')

sudo etcd --name $CL_NAME --data-dir ~/data --advertise-client-urls=https://127.0.0.1:2379,https://$IP_ADDR:2379 --listen-client-urls=https://0.0.0.0:2379 --initial-advertise-peer-urls https://$IP_ADDR:2380 --listen-peer-urls https://$IP_ADDR:2380 --initial-cluster-state existing --initial-cluster="etcd1=http://<IP of 1st node>:2380,etcd2=http://$IP_ADDR:2380"

Note the annoying extra arguments here. --initial-cluster must have 100% of the nodes in the cluster identified... which doesn't matter after you join the cluster because cluster data will be replicated anyway... Also "--initial-cluster-state existing" is needed.

Again, after the 1st time the 2nd node runs/joins, you can run it without any cluster arguments:

sudo etcd --name $CL_NAME --data-dir ~/data --advertise-client-urls=http://127.0.0.1:2379,https://$IP_ADDR:2379 --listen-client-urls=https://0.0.0.0:2379 --initial-advertise-peer-urls https://$IP_ADDR:2380 --listen-peer-urls https://$IP_ADDR:2380

Sure, you could keep running etcd with all the cluster settings in there, but they "might" get ignored in favor of what's in the data directory. Remember that if you join a 3rd node, knowledge of the new member is replicated to the remaining nodes, and those "initial" cluster settings could be completely false/misleading in the future when your cluster changes. So run your joined nodes with no initial cluster settings unless you are actually joining one.

Also, one last bit to impart: you should/must run at least 3 nodes in a cluster, otherwise the Raft leader election process will break everything. With 2 nodes, when 1 node goes down or they get disconnected, the remaining node cannot elect itself and spins in an election loop. Clients can't talk to an etcd service that's in election mode... Great availability! You need a minimum of 3 nodes to tolerate 1 going down.
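
For reference, Raft needs a quorum of floor(n/2) + 1 members to elect a leader and commit writes: with 3 nodes that is 2 (one node can fail), with 5 it is 3 (two can fail), but with 2 nodes the quorum is also 2, so losing either node stalls the whole cluster.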

Gunshy answered 21/8, 2019 at 17:38 Comment(0)

I faced the same problem: our leader etcd server went down, and after replacing it with a new one we were getting the error

 rafthttp: request sent was ignored (cluster ID mismatch)

It was looking for the old cluster ID and, because of the misconfiguration, was generating some random local cluster.

I followed these steps to fix the issue:

  1. Log in to a working member of the cluster and remove the unreachable member from the cluster:

    etcdctl cluster-health
    etcdctl member remove <member-id>

  2. Log in to the new server and stop the etcd process if it is running:

    systemctl stop etcd2

  3. Remove data from the data directory (keep a backup of this data somewhere in another folder before deleting it):

    rm -rf /var/etcd2/data

  4. Now start etcd with the --initial-cluster-state existing parameter; don't use --initial-cluster-state new if you are adding the server to an existing cluster.

  5. Now go back to one of the running etcd servers and add this new member to the cluster:

    etcdctl member add node0 http://$IP:2380

I spent a lot of time debugging this issue, and now my cluster is running healthy with all members. Hope this information helps.

Salk answered 8/9, 2017 at 13:24 Comment(0)

  1. Add a new node to an existing etcd cluster:

    etcdctl member add <new_node_name> --peer-urls="http://<new_node_ip>:2380"

Attention: if you enable TLS, replace http with https.

  2. Run etcd on the new node. It is important to add "--initial-cluster-state existing"; this tells the new node to join the existing cluster instead of creating a new cluster.

    etcd --name <new_node_name> --initial-cluster-state existing ...

  3. Check the result:

    etcdctl member list

Savdeep answered 8/7, 2021 at 10:58 Comment(0)
