How to recover from master failure with kubeadm
I set up a Kubernetes cluster with a single master node and two worker nodes using kubeadm, and I am trying to figure out how to recover from node failure.

When a worker node fails, recovery is straightforward: I create a new worker node from scratch, run kubeadm join, and everything's fine.

However, I cannot figure out how to recover from master node failure (without interrupting the deployments running on the worker nodes). Do I need to back up and restore the original certificates, or can I just run kubeadm init to create a new master from scratch? How do I re-join the existing worker nodes?

Binford answered 25/3, 2018 at 20:39 Comment(0)
I ended up writing a Kubernetes CronJob backing up the etcd data. If you are interested, I wrote a blog post about it: https://labs.consol.de/kubernetes/2018/05/25/kubeadm-backup.html
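The idea can be sketched as a CronJob that runs etcdctl against the local etcd member. This is not the exact manifest from the blog post; the name, schedule, image version, and backup path are assumptions, and it assumes a kubeadm-provisioned master where etcd runs as a static pod with TLS client certificates under /etc/kubernetes/pki/etcd:

```yaml
apiVersion: batch/v1beta1          # use batch/v1 on Kubernetes >= 1.21
kind: CronJob
metadata:
  name: etcd-backup                # hypothetical name
  namespace: kube-system
spec:
  schedule: "0 */6 * * *"          # every six hours
  jobTemplate:
    spec:
      template:
        spec:
          # must run on the master, where etcd and its client certs live
          nodeSelector:
            node-role.kubernetes.io/master: ""
          tolerations:
          - key: node-role.kubernetes.io/master
            effect: NoSchedule
          hostNetwork: true
          restartPolicy: OnFailure
          containers:
          - name: backup
            image: k8s.gcr.io/etcd:3.2.18   # match your cluster's etcd version
            command:
            - /bin/sh
            - -c
            - >
              ETCDCTL_API=3 etcdctl
              --endpoints=https://127.0.0.1:2379
              --cacert=/etc/kubernetes/pki/etcd/ca.crt
              --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt
              --key=/etc/kubernetes/pki/etcd/healthcheck-client.key
              snapshot save /backup/etcd-$(date +%Y%m%d-%H%M).db
            volumeMounts:
            - name: etcd-certs
              mountPath: /etc/kubernetes/pki/etcd
              readOnly: true
            - name: backup-dir
              mountPath: /backup
          volumes:
          - name: etcd-certs
            hostPath:
              path: /etc/kubernetes/pki/etcd
          - name: backup-dir
            hostPath:
              path: /var/backups/etcd       # hypothetical host directory
```

The snapshot lands on the master's local disk, so you still need to copy it off the node for a real disaster-recovery setup.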

In addition to that, you may want to back up all of /etc/kubernetes/pki to avoid issues with secrets (tokens) having to be renewed.

For example, kube-proxy uses a secret to store a token, and this token becomes invalid if only the etcd data is backed up: the tokens stored in etcd are signed with the service account key in /etc/kubernetes/pki, so if a rebuilt master generates a new key, every existing token stops verifying.
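A minimal sketch of that PKI backup; on a real master the source would be /etc/kubernetes/pki and you would run this as root. A mock directory stands in here so the commands can be tried anywhere:

```shell
# Archive the kubeadm PKI directory so the CA, service-account key, and
# etcd certs survive a master rebuild. PKI_DIR is a mock stand-in for
# /etc/kubernetes/pki.
PKI_DIR="$(mktemp -d)/pki"
mkdir -p "$PKI_DIR/etcd"
echo "dummy" > "$PKI_DIR/ca.crt"
echo "dummy" > "$PKI_DIR/sa.key"        # losing this invalidates all tokens
echo "dummy" > "$PKI_DIR/etcd/ca.crt"

# Create the archive and list its contents as a sanity check.
tar -czf /tmp/pki-backup.tar.gz -C "$(dirname "$PKI_DIR")" pki
tar -tzf /tmp/pki-backup.tar.gz
```

On the replacement master, extracting this archive back to /etc/kubernetes/pki before running kubeadm init makes kubeadm reuse the existing CA and keys instead of generating new ones.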

Binford answered 25/5, 2018 at 9:20 Comment(6)
I tried following these steps and "simulated" a master crash with kubeadm reset, but after I restored and ran kubeadm init with the appropriate flags, the network seems to be broken. I can delete and then recreate the Calico pods, so everything seems to be working, but the pods cannot reach anything over the network... Have you encountered this kind of issue before? – Terhune
I am using Flannel and it works fine. I don't know much about Calico, but there could be multiple reasons for this. Maybe Calico updates the etcd state very frequently and cannot deal with resetting etcd to the state of a previous backup. Or maybe Calico stores state somewhere else on the master, outside of etcd. – Binford
Thank you for the response. It looks like it's not a Calico issue; I looked into the logs and most pods were spitting out unauthorized errors. This led me to github.com/rancher/rancher/issues/8388, and I tried the described steps of deleting the service account tokens for Calico and the other components that had these errors, and it helped. I'm not sure why this happens, though, and it gets annoying since I had to do this for quite a few apps. I don't suppose you know what might be causing such trouble? – Terhune
No, sorry. I'm not much into Calico. Glad that you found a workaround though. – Binford
Wow, you made that! I thought it was very cool when I saw it (found it before this post). By the way, another cool thing to check out is Heptio Ark (a unique way of backing up etcd and PVs). – Parlance
Thanks for the article, Fabian! My master node should be safe now. – Ermina
Regarding backing up the master: as far as I know, backup procedures in the traditional/legacy sense (backup tools and techniques) aren't mentioned directly in the official documentation, but you can take precautions with a few options/workarounds:

Eliseelisee answered 27/3, 2018 at 11:13 Comment(0)
kubeadm init will definitely not work out of the box, as it will create a new cluster altogether: new credentials, IP space, etc.

At a minimum, restoring the master node requires a backup of your etcd data, which typically lives in the /var/lib/etcd directory.
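A rough sketch of that file-level backup and restore round trip, using mock directories so the commands can be exercised anywhere; on real machines the source would be /var/lib/etcd, and you would stop the kubelet (and the static etcd pod) before archiving so the database is not written to mid-copy:

```shell
# Mock stand-ins for /var/lib/etcd on the failed and replacement masters.
OLD_ETCD="$(mktemp -d)"
NEW_ETCD="/tmp/etcd-restore"
mkdir -p "$NEW_ETCD" "$OLD_ETCD/member/snap"
echo "etcd-db-bytes" > "$OLD_ETCD/member/snap/db"

# Backup: archive the entire etcd data directory.
tar -czf /tmp/etcd-data.tar.gz -C "$OLD_ETCD" .

# Restore: unpack it on the new master before running kubeadm init, so
# kubeadm's static etcd pod starts against the pre-existing data (newer
# kubeadm versions may need --ignore-preflight-errors for the non-empty
# directory).
tar -xzf /tmp/etcd-data.tar.gz -C "$NEW_ETCD"
ls "$NEW_ETCD/member/snap"
```

Note this only covers the etcd data; without the matching certificates from /etc/kubernetes/pki, the restored state may contain tokens the new master cannot validate.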

You will also need the kubeadm config from the cluster; running kubeadm config view should output this (available in v1.8 and later).

The step-by-step procedure to restore a master node really isn't clean-cut, which is why HA (High Availability) was introduced. It is a much safer way of maintaining redundancy and uptime, particularly because restoring anything from etcd can be a real pain (in my humble opinion and experience).

If I may go a bit off topic from your question: if you are still getting started with Kubernetes and not deeply invested in kubeadm, I would suggest you consider creating your cluster with kops instead. It already supports HA, and I found kops to be more robust and easier to use than either kubeadm or kube-aws (the CoreOS cluster builder). https://kubernetes.io/docs/getting-started-guides/kops/

Lamarckism answered 25/3, 2018 at 21:54 Comment(0)