I set up a Kubernetes cluster with a single master node and two worker nodes using `kubeadm`, and I am trying to figure out how to recover from node failure.
When a worker node fails, recovery is straightforward: I create a new worker node from scratch, run `kubeadm join`, and everything's fine.
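For context, the worker recovery amounts to roughly the following (the IP, token, and hash below are placeholders; I generate a fresh join command on the master rather than reusing the original one):

```bash
# On the master: print a fresh join command with a new bootstrap token
kubeadm token create --print-join-command

# On the rebuilt worker: join using the values printed above
# (placeholder API server address, token, and CA cert hash)
kubeadm join 10.0.0.10:6443 \
  --token abcdef.0123456789abcdef \
  --discovery-token-ca-cert-hash sha256:<hash-from-the-master>
```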
However, I cannot figure out how to recover from master node failure (without interrupting the deployments running on the worker nodes). Do I need to back up and restore the original certificates, or can I just run `kubeadm init` to create a new master from scratch? And how do I join the existing worker nodes to the new master?
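To make the first option concrete, what I mean by backing up and restoring the certificates is roughly the sketch below (paths are the kubeadm defaults; I have not verified that this alone preserves cluster state, since presumably the etcd data would also need to be restored):

```bash
# On the original master (while it is still healthy): save the kubeadm PKI
# directory so a rebuilt master can reuse the same cluster CA
tar czf k8s-pki-backup.tar.gz -C /etc/kubernetes pki

# On the rebuilt master: restore the PKI before initializing, so that
# kubeadm init picks up the existing certificates instead of generating new ones
mkdir -p /etc/kubernetes
tar xzf k8s-pki-backup.tar.gz -C /etc/kubernetes
kubeadm init
```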
I ran `kubeadm reset`, but after I do the steps to restore and run `kubeadm init` with the appropriate flags, the network seems to be broken. I can delete and then recreate the Calico pods, and everything appears to be running, but the pods cannot reach anything over the network... Have you encountered this kind of issue before? – Terhune