Master can't connect to cluster
Asked Answered

After a cluster upgrade, one of my three masters can't connect back to the cluster. I have an HA cluster running in us-east-1a, us-east-1b, and us-east-1c; the master running in us-east-1a can't rejoin the cluster.

I tried scaling the master-us-east-1a instance group down to zero nodes and then back up to one node, but the new EC2 machine starts with the same problem and can't rejoin the cluster; it seems to start from a backup or something.
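
Roughly what I did to cycle the instance group (a sketch; cluster name and state-store flags omitted):

> kops edit ig master-us-east-1a     # set minSize/maxSize to 0
> kops update cluster --yes
  # wait for the old EC2 instance to terminate,
  # then edit the IG back to minSize/maxSize 1 and apply again
> kops edit ig master-us-east-1a
> kops update cluster --yes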

I also connected to the master to restart services (maybe protokube or docker), but that didn't solve the problem either.
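
For example (a sketch; I'm assuming protokube is managed by a systemd unit on this image):

> sudo systemctl restart docker
> sudo systemctl restart protokube    # assuming a protokube.service unit exists on this AMI
> sudo systemctl status kubelet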

Connecting to the master via SSH, I noticed that flannel is not running on this machine. I tried to start it manually via docker, without success. It seems flannel is the network service that should be running but isn't.
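
Checks I ran (a sketch; exact container names may differ depending on the networking addon version):

> sudo docker ps -a | grep flannel    # nothing: no flannel container on this master at all
> sudo docker ps | grep apiserver     # only the pause/POD container, no kube-apiserver running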

  • Can I reset the us-east-1a master and recreate it from scratch?
  • Any ideas on how to get the flannel service running on this master?

Thanks in advance.

Attachments

> kubectl get nodes
NAME                             STATUS     ROLES    AGE   VERSION
ip-xxx-xxx-xxx-xxx.ec2.internal  Ready      node     33d   v1.11.9
ip-xxx-xxx-xxx-xxx.ec2.internal  Ready      master   33d   v1.11.9
ip-xxx-xxx-xxx-xxx.ec2.internal  Ready      node     33d   v1.11.9
ip-xxx-xxx-xxx-xxx.ec2.internal  Ready      master   33d   v1.11.9
ip-xxx-xxx-xxx-xxx.ec2.internal  Ready      node     33d   v1.11.9

-

> sudo systemctl status kubelet

Jan 10 21:00:55 ip-xxx-xxx-xxx-xxx kubelet[2502]: I0110 21:00:55.026553    2502 kubelet_node_status.go:441] Recording NodeHasSufficientPID event message for node ip-xxx-xxx-xxx-xxx.ec2.internal
Jan 10 21:00:55 ip-xxx-xxx-xxx-xxx kubelet[2502]: I0110 21:00:55.027005    2502 kubelet_node_status.go:79] Attempting to register node ip-xxx-xxx-xxx-xxx.ec2.internal
Jan 10 21:00:55 ip-xxx-xxx-xxx-xxx kubelet[2502]: E0110 21:00:55.027764    2502 kubelet_node_status.go:103] Unable to register node "ip-xxx-xxx-xxx-xxx.ec2.internal" with API server: Post https://127.0.0.1/api/v1/nodes: dial tcp 127.0.0.1:443: connect: connection refused

-

> sudo docker logs k8s_kube-apiserver_kube-apiserver-ip-xxx-xxx-xxx-xxx.ec2.internal_kube-system_134d55c1b1c3bf3583911989a14353da_16

F0110 20:59:35.581865       1 storage_decorator.go:57] Unable to create storage backend: config (&{etcd3 /registry [http://127.0.0.1:4001]    true false 1000 0xc42013c480 <nil> 5m0s 1m0s}), err (dial tcp 127.0.0.1:4001: connect: connection refused)

-

> sudo docker version

Client:
 Version:      17.03.2-ce
 API version:  1.27
 Go version:   go1.7.5
 Git commit:   f5ec1e2
 Built:        Tue Jun 27 02:31:19 2017
 OS/Arch:      linux/amd64

Server:
 Version:      17.03.2-ce
 API version:  1.27 (minimum version 1.12)
 Go version:   go1.7.5
 Git commit:   f5ec1e2
 Built:        Tue Jun 27 02:31:19 2017
 OS/Arch:      linux/amd64
 Experimental: false

-

> kubectl version

Client Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.9", GitCommit:"16236ce91790d4c75b79f6ce96841db1c843e7d2", GitTreeState:"clean", BuildDate:"2019-03-25T06:40:24Z", GoVersion:"go1.10.8", Compiler:"gc", Platform:"linux/amd64"}
The connection to the server 127.0.0.1 was refused - did you specify the right host or port?

-

> sudo docker images

REPOSITORY                           TAG                 IMAGE ID            CREATED             SIZE
protokube                            1.15.0              6b00e7216827        7 weeks ago         288 MB
k8s.gcr.io/kube-proxy                v1.11.9             e18fcce798b8        9 months ago        98.1 MB
k8s.gcr.io/kube-controller-manager   v1.11.9             634ccbd18a0f        9 months ago        155 MB
k8s.gcr.io/kube-apiserver            v1.11.9             ef9a84756d40        9 months ago        187 MB
k8s.gcr.io/kube-scheduler            v1.11.9             e00d30bd3a71        9 months ago        56.9 MB
k8s.gcr.io/pause-amd64               3.0                 99e59f495ffa        3 years ago         747 kB
kopeio/etcd-manager                  3.0.20190930        7937b67f722f        50 years ago        656 MB

-

> sudo docker ps

CONTAINER ID        IMAGE                                                                                                        COMMAND                  CREATED             STATUS              PORTS               NAMES
b4eb0ec9e6a2        k8s.gcr.io/kube-scheduler@sha256:372ab1014701f60b67a65d94f94d30d19335294d98746edcdfcb8808ed5aee3c            "/bin/sh -c 'mkfif..."   15 hours ago        Up 15 hours                             k8s_kube-scheduler_kube-scheduler-ip-xxx-xxx-xxx-xxx.ec2.internal_kube-system_105cd5bac4edf48f265f31eb756b971a_0
8f827dc0eade        kopeio/etcd-manager@sha256:cb0ed7c56dadbc0f4cd515906d72b30094229d6e0a9fcb7aa44e23680bf9a3a8                  "/bin/sh -c 'mkfif..."   15 hours ago        Up 15 hours                             k8s_etcd-manager_etcd-manager-main-ip-xxx-xxx-xxx-xxx.ec2.internal_kube-system_a6a467f6b78a7c7bc15ec1f64799516d_0
5bebb169b8b3        k8s.gcr.io/kube-controller-manager@sha256:aa9b9dac085a65c47746fa8739cf70e9d7e9a356a836ad2ef073da0d7b136db2   "/bin/sh -c 'mkfif..."   15 hours ago        Up 15 hours                             k8s_kube-controller-manager_kube-controller-manager-ip-xxx-xxx-xxx-xxx.ec2.internal_kube-system_564bccf38cd14aa0f647593e69b159ab_0
4467d550824e        k8s.gcr.io/kube-proxy@sha256:a63c81fe4d3e9575cc0a29c4866a2975b01a07c0f473ab2cf1e88ebf78739f80                "/bin/sh -c 'mkfif..."   15 hours ago        Up 15 hours                             k8s_kube-proxy_kube-proxy-ip-xxx-xxx-xxx-xxx.ec2.internal_kube-system_22cd6fe287e6f4bae556504b3245f385_0
0a5c23006e18        kopeio/etcd-manager@sha256:cb0ed7c56dadbc0f4cd515906d72b30094229d6e0a9fcb7aa44e23680bf9a3a8                  "/bin/sh -c 'mkfif..."   15 hours ago        Up 15 hours                             k8s_etcd-manager_etcd-manager-events-ip-xxx-xxx-xxx-xxx.ec2.internal_kube-system_9f2a8de168741a0263161532f42e97b4_0
3efa9ae55618        k8s.gcr.io/pause-amd64:3.0                                                                                   "/pause"                 15 hours ago        Up 15 hours                             k8s_POD_kube-proxy-ip-xxx-xxx-xxx-xxx.ec2.internal_kube-system_22cd6fe287e6f4bae556504b3245f385_0
4e451bc007ac        k8s.gcr.io/pause-amd64:3.0                                                                                   "/pause"                 15 hours ago        Up 15 hours                             k8s_POD_kube-scheduler-ip-xxx-xxx-xxx-xxx.ec2.internal_kube-system_105cd5bac4edf48f265f31eb756b971a_0
7c5c301e034a        k8s.gcr.io/pause-amd64:3.0                                                                                   "/pause"                 15 hours ago        Up 15 hours                             k8s_POD_kube-apiserver-ip-xxx-xxx-xxx-xxx.ec2.internal_kube-system_134d55c1b1c3bf3583911989a14353da_0
d88f075fa61f        k8s.gcr.io/pause-amd64:3.0                                                                                   "/pause"                 15 hours ago        Up 15 hours                             k8s_POD_etcd-manager-main-ip-xxx-xxx-xxx-xxx.ec2.internal_kube-system_a6a467f6b78a7c7bc15ec1f64799516d_0
69e8844e9c14        k8s.gcr.io/pause-amd64:3.0                                                                                   "/pause"                 15 hours ago        Up 15 hours                             k8s_POD_kube-controller-manager-ip-xxx-xxx-xxx-xxx.ec2.internal_kube-system_564bccf38cd14aa0f647593e69b159ab_0
05e67c2e8f98        k8s.gcr.io/pause-amd64:3.0                                                                                   "/pause"                 15 hours ago        Up 15 hours                             k8s_POD_etcd-manager-events-ip-xxx-xxx-xxx-xxx.ec2.internal_kube-system_9f2a8de168741a0263161532f42e97b4_0
eee0a4d563c0        protokube:1.15.0                                                                                             "/usr/bin/protokub..."   15 hours ago        Up 15 hours                             hungry_shirley
Ozoniferous answered 10/1, 2020 at 18:43 Comment(0)

The kubelet is trying to register the master node in us-east-1a against the API server endpoint https://127.0.0.1:443. I believe this should be the API server endpoint of one of the other two masters. The kubelet uses the kubelet.conf file to talk to the API server when registering the node. Change the server in the kubelet.conf file located at /etc/kubernetes to point to one of the below:

  1. The Elastic IP or public IP of the master node in us-east-1b or us-east-1c, e.g. https://xx.xx.xx.xx:6443
  2. The private IP of the master node in us-east-1b or us-east-1c, e.g. https://xx.xx.xx.xx:6443
  3. The FQDN of the API server, if you have a load balancer in front of the master nodes running the Kubernetes API server.

After changing kubelet.conf, restart the kubelet.
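
A rough sketch of the change (paths and the target address are assumptions: a kops-built master may keep the kubelet's kubeconfig at /var/lib/kubelet/kubeconfig instead, and the port depends on how your API server is exposed):

> sudo vi /etc/kubernetes/kubelet.conf    # or /var/lib/kubelet/kubeconfig on kops-built masters
  # change:   server: https://127.0.0.1
  # to e.g.:  server: https://xx.xx.xx.xx:6443
> sudo systemctl restart kubelet
> sudo journalctl -u kubelet -f           # watch for the node registering successfully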

Edit: Since you are using etcd-manager, can you try the "Kubernetes service unavailable / flannel issues" troubleshooting steps described here?
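
In the meantime you can confirm on the us-east-1a master that etcd is what's blocking the API server (a sketch; the container name filter comes from your docker ps output, and the ports from the apiserver error):

> sudo docker logs --tail 50 $(sudo docker ps -q --filter name=etcd-manager-main)
> sudo ss -tlnp | grep -E '4001|2379'    # nothing listening here matches the "connection refused" from kube-apiserver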

Yugoslavia answered 11/1, 2020 at 4:48 Comment(9)
I'll try this solution, but shouldn't the us-east-1a master have an API server running in docker too? – Ozoniferous
The kubelet runs the API server as a container, but since the kubelet is failing to start, it's not running. – Yugoslavia
I found a related issue in the kops GitHub repo: github.com/kubernetes/kops/issues/6605. It seems to be the exact problem I'm facing with my cluster. In the kops release notes, the upgrade to 1.12 talks about etcd migration and says the masters should be rolled as quickly as possible. I remember that when I upgraded my cluster I didn't roll all masters at once; my command rolled each master one by one, but it failed at the first one, which never came back to the cluster. – Ozoniferous
Is the kubelet the component that brings up the flannel container? – Ozoniferous
I was unable to find the file /etc/kubernetes/kubelet.conf – Ozoniferous
Yes, the kubelet will run flannel. The location may be different for kops; see if you can find the file. – Yugoslavia
I found a YAML file at ~/.kube/config that has a server: https://127.0.0.1 setting. Should I edit this line and replace it with a master node IP, and then restart the kubelet service? – Ozoniferous
It seems etcd is the issue. Can you try the troubleshooting steps I added in the edited answer? – Yugoslavia
Hey, thanks for the help; your last tip worked for me. – Ozoniferous

Can you verify whether the etcd service is running and online on us-east-1a?

Inerney answered 14/1, 2020 at 9:47 Comment(1)
I updated the question with the sudo docker ps output; there seem to be two etcd containers running on this instance, a main and an events container. – Ozoniferous
