Recovering from Consul "No Cluster leader" state

I have:

  • one mesos-master on which I configured a Consul server;
  • one mesos-slave on which I configured a Consul client, and;
  • one bootstrap server for Consul.

When I hit start I am seeing the following error:

2016/04/21 19:31:31 [ERR] agent: failed to sync remote state: rpc error: No cluster leader
2016/04/21 19:31:44 [ERR] agent: coordinate update error: rpc error: No cluster leader

How do I recover from this state?

Hypnoanalysis answered 21/4, 2016 at 14:5 Comment(0)
10

Did you look at the Consul docs?

It looks like you performed an ungraceful stop and now need to clean your raft/peers.json file by removing all entries in it to perform an outage recovery. See the link above for more details.
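In those older Consul versions, raft/peers.json was a live file: a plain JSON array of "ip:port" strings. A minimal sketch of the cleanup, using a scratch directory and placeholder addresses (in production the file lives under your data_dir/raft and the agent must be stopped before you edit it):

```shell
# Sketch only: placeholder paths and IPs. In production, stop the agent
# first and edit <data_dir>/raft/peers.json on each surviving server.
RAFT_DIR="$(mktemp -d)"   # stands in for e.g. /var/lib/consul/raft

# The pre-0.7 format is a plain JSON array of "ip:port" strings.
echo '["10.0.1.1:8300","10.0.1.2:8300","10.0.1.3:8300"]' > "$RAFT_DIR/peers.json"

# Keep only the servers that still exist (here: just one).
echo '["10.0.1.1:8300"]' > "$RAFT_DIR/peers.json"

cat "$RAFT_DIR/peers.json"
```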

Sociability answered 28/4, 2016 at 21:11 Comment(1)
Looks like this is a dead link now... I think the new version is learn.hashicorp.com/consul/day-2-operations/outage (Palinode)
9

As of Consul 0.7 things work differently from Keyan P's answer. raft/peers.json (in the Consul data dir) has become a manual recovery mechanism. It doesn't exist unless you create it, and then when Consul starts it loads the file and deletes it from the filesystem so it won't be read on future starts. There are instructions in raft/peers.info. Note that if you delete raft/peers.info it won't read raft/peers.json but it will delete it anyway, and it will recreate raft/peers.info. The log will indicate when it's reading and deleting the file separately.

Assuming you've already tried the bootstrap or bootstrap_expect settings, that file might help. The Outage Recovery guide in Keyan P's answer is a helpful link. You create raft/peers.json in the data dir and start Consul, and the log should indicate that it's reading/deleting the file and then it should say something like "cluster leadership acquired". The file contents are:

[
  {
    "id": "<node-id>",
    "address": "<node-ip>:8300",
    "non_voter": false
  }
]

where <node-id> can be found in the node-id file in the data dir.
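As a concrete sketch of building that recovery file: the snippet below writes a single-voter peers.json from the node-id file. It runs in a scratch directory with a made-up node ID and IP; in production DATA_DIR would be your real data dir (e.g. /opt/consul/data), node-id already exists there, and Consul must be stopped before you write the file.

```shell
# Runnable sketch in a scratch dir; in production point DATA_DIR at your
# real Consul data_dir and stop the agent before writing peers.json.
DATA_DIR="$(mktemp -d)"
mkdir -p "$DATA_DIR/raft"
echo "adf4238a-882b-9ddc-4a9d-5b6758e4159e" > "$DATA_DIR/node-id"  # fake ID for the demo

NODE_ID="$(cat "$DATA_DIR/node-id")"
NODE_IP="10.0.0.1"   # this server's advertise address (placeholder)

cat > "$DATA_DIR/raft/peers.json" <<EOF
[
  { "id": "$NODE_ID", "address": "$NODE_IP:8300", "non_voter": false }
]
EOF

cat "$DATA_DIR/raft/peers.json"
```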

Outdoor answered 14/3, 2019 at 9:30 Comment(0)
2

I will describe what I did. A little background: we scaled down the AWS Auto Scaling group, so we lost the leader, but one server was still running without any leader.
What I did was:

  1. Scale up to 3 servers (use an odd number, not 2 or 4).
  2. Stop Consul on all 3 servers: sudo service consul stop (you can run status/stop/start).
  3. Create a peers.json file and put it on the old server (in /opt/consul/data/raft).
  4. Start the 3 servers (peers.json should be placed on 1 server only).
  5. Join the other 2 servers to the leader using consul join 10.201.8.XXX.
  6. Check that the peers are connected to the leader using consul operator raft list-peers.

Sample peers.json file

[
  {
    "id": "306efa34-1c9c-acff-1226-538vvvvvv",
    "address": "10.201.n.vvv:8300",
    "non_voter": false
  },
  {
    "id": "dbeeffce-c93e-8678-de97-b7",
    "address": "10.201.X.XXX:8300",
    "non_voter": false
  },
  {
    "id": "62d77513-e016-946b-e9bf-0149",
    "address": "10.201.X.XXX:8300",
    "non_voter": false
  }
]

You can get these IDs from the node-id file on each server, under /opt/consul/data/:

[root@ip-10-20 data]# ls
checkpoint-signature  node-id  raft  serf
[root@ip-10-1 data]# cat node-id
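Collecting those three IDs by hand is error-prone; as a sketch, a peers.json like the sample above can be generated from "id ip" pairs with a small loop. The IDs and addresses below are placeholders, and the file is written to a scratch path instead of /opt/consul/data/raft/peers.json:

```shell
# Sketch: generate a 3-voter raft/peers.json from "id ip" pairs.
# IDs/IPs are placeholders; substitute each server's node-id and address.
OUT="$(mktemp -d)/peers.json"   # in production: /opt/consul/data/raft/peers.json

PEERS='306efa34-1c9c-acff-1226-538000000001 10.201.1.10
dbeeffce-c93e-8678-de97-b70000000002 10.201.1.11
62d77513-e016-946b-e9bf-014900000003 10.201.1.12'

{
  echo '['
  total=$(echo "$PEERS" | wc -l)   # number of servers (3 here)
  n=0
  echo "$PEERS" | while read -r id ip; do
    n=$((n + 1))
    sep=','
    [ "$n" -eq "$total" ] && sep=''   # no comma after the last entry
    printf '  { "id": "%s", "address": "%s:8300", "non_voter": false }%s\n' "$id" "$ip" "$sep"
  done
  echo ']'
} > "$OUT"

cat "$OUT"
```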

Some useful commands:

consul members
curl http://ip:8500/v1/status/peers
curl http://ip:8500/v1/status/leader
consul operator raft list-peers
cd /opt/consul/data/raft/
consul info
sudo service consul status
consul catalog services
Schrader answered 14/9, 2021 at 20:18 Comment(0)
1

If your Raft protocol version is greater than 2, use the format with node IDs:

[
  {
    "id": "e3a30829-9849-bad7-32bc-11be85a49200",
    "address": "10.88.0.59:8300",
    "non_voter": false
  },
  {
    "id": "326d7d5c-1c78-7d38-a306-e65988d5e9a3",
    "address": "10.88.0.45:8300",
    "non_voter": false
  },
  {
    "id": "a8d60750-4b33-99d7-1185-b3c6d7458d4f",
    "address": "10.233.103.119:8300",
    "non_voter": false
  }
]

Shutin answered 12/12, 2019 at 6:26 Comment(0)
1

In my case I had 2 worker nodes in the k8s cluster; after adding another node, the Consul servers could elect a leader and everything came up and running.

Est answered 3/1, 2021 at 21:39 Comment(0)
-2

You may also ensure that the bootstrap parameter is set in your Consul configuration file, config.json, on the first node:

# /etc/consul/config.json
{
    "bootstrap": true,
    ...
}

or start the consul agent with the -bootstrap option, as described in the official "Failure of a Single Server Cluster" Consul documentation.

Strawboard answered 1/10, 2021 at 18:4 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.