Kube flannel in CrashLoopBackOff status

We have just started building our cluster on Kubernetes.

Now we are trying to deploy Tiller, but we get an error:

NetworkPlugin cni failed to set up pod "tiller-deploy-64c9d747bd-br9j7_kube-system" network: open /run/flannel/subnet.env: no such file or directory

After that I run:

kubectl get pods --all-namespaces -o wide

and get this output:

NAMESPACE     NAME                                   READY     STATUS              RESTARTS   AGE       IP              NODE          NOMINATED NODE
kube-system   coredns-78fcdf6894-ksdvt               1/1       Running             2          7d        192.168.0.4     kube-master   <none>
kube-system   coredns-78fcdf6894-p4l9q               1/1       Running             2          7d        192.168.0.5     kube-master   <none>
kube-system   etcd-kube-master                       1/1       Running             2          7d        10.168.209.20   kube-master   <none>
kube-system   kube-apiserver-kube-master             1/1       Running             2          7d        10.168.209.20   kube-master   <none>
kube-system   kube-controller-manager-kube-master    1/1       Running             2          7d        10.168.209.20   kube-master   <none>
kube-system   kube-flannel-ds-amd64-42rl7            0/1       CrashLoopBackOff    2135       7d        10.168.209.17   node5         <none>
kube-system   kube-flannel-ds-amd64-5fx2p            0/1       CrashLoopBackOff    2164       7d        10.168.209.14   node2         <none>
kube-system   kube-flannel-ds-amd64-6bw5g            0/1       CrashLoopBackOff    2166       7d        10.168.209.15   node3         <none>
kube-system   kube-flannel-ds-amd64-hm826            1/1       Running             1          7d        10.168.209.20   kube-master   <none>
kube-system   kube-flannel-ds-amd64-thjps            0/1       CrashLoopBackOff    2160       7d        10.168.209.16   node4         <none>
kube-system   kube-flannel-ds-amd64-w99ch            0/1       CrashLoopBackOff    2166       7d        10.168.209.13   node1         <none>
kube-system   kube-proxy-d6v2n                       1/1       Running             0          7d        10.168.209.13   node1         <none>
kube-system   kube-proxy-lcckg                       1/1       Running             0          7d        10.168.209.16   node4         <none>
kube-system   kube-proxy-pgblx                       1/1       Running             1          7d        10.168.209.20   kube-master   <none>
kube-system   kube-proxy-rnqq5                       1/1       Running             0          7d        10.168.209.14   node2         <none>
kube-system   kube-proxy-wc959                       1/1       Running             0          7d        10.168.209.15   node3         <none>
kube-system   kube-proxy-wfqqs                       1/1       Running             0          7d        10.168.209.17   node5         <none>
kube-system   kube-scheduler-kube-master             1/1       Running             2          7d        10.168.209.20   kube-master   <none>
kube-system   kubernetes-dashboard-6948bdb78-97qcq   0/1       ContainerCreating   0          7d        <none>          node5         <none>
kube-system   tiller-deploy-64c9d747bd-br9j7         0/1       ContainerCreating   0          45m       <none>          node4         <none>

Some of the flannel pods are in CrashLoopBackOff status, for example kube-flannel-ds-amd64-42rl7.

When I run:

kubectl describe pod -n kube-system kube-flannel-ds-amd64-42rl7

I get status Running:

Name:               kube-flannel-ds-amd64-42rl7
Namespace:          kube-system
Priority:           0
PriorityClassName:  <none>
Node:               node5/10.168.209.17
Start Time:         Wed, 22 Aug 2018 16:47:10 +0300
Labels:             app=flannel
                    controller-revision-hash=911701653
                    pod-template-generation=1
                    tier=node
Annotations:        <none>
Status:             Running
IP:                 10.168.209.17
Controlled By:      DaemonSet/kube-flannel-ds-amd64
Init Containers:
  install-cni:
    Container ID:  docker://eb7ee47459a54d401969b1770ff45b39dc5768b0627eec79e189249790270169
    Image:         quay.io/coreos/flannel:v0.10.0-amd64
    Image ID:      docker-pullable://quay.io/coreos/flannel@sha256:88f2b4d96fae34bfff3d46293f7f18d1f9f3ca026b4a4d288f28347fcb6580ac
    Port:          <none>
    Host Port:     <none>
    Command:
      cp
    Args:
      -f
      /etc/kube-flannel/cni-conf.json
      /etc/cni/net.d/10-flannel.conflist
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Wed, 22 Aug 2018 16:47:24 +0300
      Finished:     Wed, 22 Aug 2018 16:47:24 +0300
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /etc/cni/net.d from cni (rw)
      /etc/kube-flannel/ from flannel-cfg (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from flannel-token-9wmch (ro)
Containers:
  kube-flannel:
    Container ID:  docker://521b457c648baf10f01e26dd867b8628c0f0a0cc0ea416731de658e67628d54e
    Image:         quay.io/coreos/flannel:v0.10.0-amd64
    Image ID:      docker-pullable://quay.io/coreos/flannel@sha256:88f2b4d96fae34bfff3d46293f7f18d1f9f3ca026b4a4d288f28347fcb6580ac
    Port:          <none>
    Host Port:     <none>
    Command:
      /opt/bin/flanneld
    Args:
      --ip-masq
      --kube-subnet-mgr
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Thu, 30 Aug 2018 10:15:04 +0300
      Finished:     Thu, 30 Aug 2018 10:15:08 +0300
    Ready:          False
    Restart Count:  2136
    Limits:
      cpu:     100m
      memory:  50Mi
    Requests:
      cpu:     100m
      memory:  50Mi
    Environment:
      POD_NAME:       kube-flannel-ds-amd64-42rl7 (v1:metadata.name)
      POD_NAMESPACE:  kube-system (v1:metadata.namespace)
    Mounts:
      /etc/kube-flannel/ from flannel-cfg (rw)
      /run from run (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from flannel-token-9wmch (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  run:
    Type:          HostPath (bare host directory volume)
    Path:          /run
    HostPathType:
  cni:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/cni/net.d
    HostPathType:
  flannel-cfg:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      kube-flannel-cfg
    Optional:  false
  flannel-token-9wmch:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  flannel-token-9wmch
    Optional:    false
QoS Class:       Guaranteed
Node-Selectors:  beta.kubernetes.io/arch=amd64
Tolerations:     node-role.kubernetes.io/master:NoSchedule
                 node.kubernetes.io/disk-pressure:NoSchedule
                 node.kubernetes.io/memory-pressure:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute
                 node.kubernetes.io/unreachable:NoExecute
Events:
  Type     Reason   Age                  From            Message
  ----     ------   ----                 ----            -------
  Normal   Pulled   51m (x2128 over 7d)  kubelet, node5  Container image "quay.io/coreos/flannel:v0.10.0-amd64" already present on machine
  Warning  BackOff  1m (x48936 over 7d)  kubelet, node5  Back-off restarting failed container

Here is kube-controller-manager.yaml:

apiVersion: v1
kind: Pod
metadata:
  annotations:
    scheduler.alpha.kubernetes.io/critical-pod: ""
  creationTimestamp: null
  labels:
    component: kube-controller-manager
    tier: control-plane
  name: kube-controller-manager
  namespace: kube-system
spec:
  containers:
  - command:
    - kube-controller-manager
    - --address=127.0.0.1
    - --allocate-node-cidrs=true
    - --cluster-cidr=192.168.0.0/24
    - --cluster-signing-cert-file=/etc/kubernetes/pki/ca.crt
    - --cluster-signing-key-file=/etc/kubernetes/pki/ca.key
    - --controllers=*,bootstrapsigner,tokencleaner
    - --kubeconfig=/etc/kubernetes/controller-manager.conf
    - --leader-elect=true
    - --node-cidr-mask-size=24
    - --root-ca-file=/etc/kubernetes/pki/ca.crt
    - --service-account-private-key-file=/etc/kubernetes/pki/sa.key
    - --use-service-account-credentials=true
    image: k8s.gcr.io/kube-controller-manager-amd64:v1.11.2
    imagePullPolicy: IfNotPresent
    livenessProbe:
      failureThreshold: 8
      httpGet:
        host: 127.0.0.1
        path: /healthz
        port: 10252
        scheme: HTTP
      initialDelaySeconds: 15
      timeoutSeconds: 15
    name: kube-controller-manager
    resources:
      requests:
        cpu: 200m
    volumeMounts:
    - mountPath: /etc/ssl/certs
      name: ca-certs
      readOnly: true
    - mountPath: /etc/kubernetes/controller-manager.conf
      name: kubeconfig
      readOnly: true
    - mountPath: /usr/libexec/kubernetes/kubelet-plugins/volume/exec
      name: flexvolume-dir
    - mountPath: /etc/pki
      name: etc-pki
      readOnly: true
    - mountPath: /etc/kubernetes/pki
      name: k8s-certs
      readOnly: true
  hostNetwork: true
  priorityClassName: system-cluster-critical
  volumes:
  - hostPath:
      path: /etc/ssl/certs
      type: DirectoryOrCreate
    name: ca-certs
  - hostPath:
      path: /etc/kubernetes/controller-manager.conf
      type: FileOrCreate
    name: kubeconfig
  - hostPath:
      path: /usr/libexec/kubernetes/kubelet-plugins/volume/exec
      type: DirectoryOrCreate
    name: flexvolume-dir
  - hostPath:
      path: /etc/pki
      type: DirectoryOrCreate
    name: etc-pki
  - hostPath:
      path: /etc/kubernetes/pki
      type: DirectoryOrCreate
    name: k8s-certs
status: {}

The OS is CentOS Linux release 7.5.1804.

Logs from one of the failing pods:

# kubectl logs --namespace kube-system kube-flannel-ds-amd64-5fx2p

main.go:475] Determining IP address of default interface
main.go:488] Using interface with name eth0 and address 10.168.209.14
main.go:505] Defaulting external address to interface address (10.168.209.14)
kube.go:131] Waiting 10m0s for node controller to sync
kube.go:294] Starting kube subnet manager
kube.go:138] Node controller sync successful
main.go:235] Created subnet manager: Kubernetes Subnet Manager - node2
main.go:238] Installing signal handlers
main.go:353] Found network config - Backend type: vxlan
vxlan.go:120] VXLAN config: VNI=1 Port=0 GBP=false DirectRouting=false
main.go:280] Error registering network: failed to acquire lease: node "node2" pod cidr not assigned
main.go:333] Stopping shutdownHandler...

Where is the error?

Sheaves answered 30/8, 2018 at 13:23 Comment(3)
On the node where the pod is located, check /etc/cni/net.d; if you have more than one .conf there, you can delete one and restart the node. Source: github.com/kubernetes/kubeadm/issues/578. Note this is only one of the possible solutions. – Lyonnais
To me, the error suggests that your nodes are somehow not getting a pod CIDR assigned to them: Error registering network: failed to acquire lease: node "node2" pod cidr not assigned. The suggestions at coreos.com/flannel/docs/latest/kubernetes.html and coreos.com/flannel/docs/latest/troubleshooting.html might help. – Husk
Thanks for the help. We just reset the cluster and restarted it. – Sheaves

For flannel to work correctly, you must pass --pod-network-cidr=10.244.0.0/16 to kubeadm init.
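
As a rough sanity check on why the CIDR size matters: the controller manager carves one subnet per node out of the cluster CIDR, sized by --node-cidr-mask-size (24 in the question's manifest). A sketch of that arithmetic, using the question's /24 cluster CIDR versus flannel's usual /16:

```shell
# Number of per-node podCIDRs a cluster CIDR can yield, given the
# controller manager's --node-cidr-mask-size (24 here, as in the question):
node_subnets() {   # node_subnets <cluster_prefix> <node_mask>
  echo $(( 1 << ($2 - $1) ))
}

node_subnets 24 24   # 192.168.0.0/24 -> 1: only one node can get a podCIDR
node_subnets 16 24   # 10.244.0.0/16  -> 256: one /24 subnet per node
```

With only one /24 available, every node after the first fails with the "pod cidr not assigned" error seen in the flannel logs.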

Gerfalcon answered 17/4, 2019 at 14:9 Comment(1)
This is only true for the master node; worker nodes should not run this command. Look at the other answer, supplied by @pande below; that is what fixed this problem for me. – Cultch

Try this:

"Failed to acquire lease" simply means the pod didn't get a podCIDR. This happened to me as well: the manifest on the master node showed the podCIDR, but it still wasn't working and flannel kept going into CrashLoopBackOff. This is what I did to fix it.

From the master node, first find the cluster CIDR:

sudo cat /etc/kubernetes/manifests/kube-controller-manager.yaml | grep -i cluster-cidr

Output:

- --cluster-cidr=172.168.10.0/24
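
The same value can also be extracted programmatically; a small sed sketch, run here against a sample manifest line (on the master, feed it the real /etc/kubernetes/manifests/kube-controller-manager.yaml instead):

```shell
# Pull the value out of the --cluster-cidr flag; the echoed line stands in
# for the real manifest file.
echo '    - --cluster-cidr=172.168.10.0/24' |
  sed -n 's/.*--cluster-cidr=\([0-9./]*\).*/\1/p'   # prints 172.168.10.0/24
```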

Then run the following from the master node:

kubectl patch node slave-node-1 -p '{"spec":{"podCIDR":"172.168.10.0/24"}}'

where slave-node-1 is the node on which acquiring the lease is failing, and the podCIDR is the CIDR you found with the previous command.

Hope this helps.

Cacie answered 30/10, 2019 at 3:43 Comment(0)

The reasons are:

  1. flannel must run with CIDR=10.244.0.0/16, NOT 10.244.0.0/24!
  2. CNI conflicts, because the node has multiple CNI plugins installed under /etc/cni/net.d/.
  3. The two interfaces flannel.1 and cni0 do not match each other. For example:

flannel.1=10.244.0.0 with cni0=10.244.1.1 will fail; it should be
flannel.1=10.244.0.0 with cni0=10.244.0.1
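
The mismatch in point 3 can be checked numerically rather than by eye. A minimal sketch with the two example addresses hard-coded; on a real node you would read them from `ip -4 addr show flannel.1` and `ip -4 addr show cni0`:

```shell
# Succeeds iff <ip> lies inside <network/prefix> (IPv4 only).
ip_to_int() {
  local IFS=.
  set -- $1
  echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

in_subnet() {   # in_subnet <ip> <network/prefix>
  local net=${2%/*} prefix=${2#*/}
  local mask=$(( (0xFFFFFFFF << (32 - prefix)) & 0xFFFFFFFF ))
  [ $(( $(ip_to_int "$1") & mask )) -eq $(( $(ip_to_int "$net") & mask )) ]
}

# cni0 should sit inside the node's flannel subnet:
in_subnet 10.244.1.1 10.244.0.0/24 && echo match || echo mismatch   # prints mismatch
in_subnet 10.244.0.1 10.244.0.0/24 && echo match || echo mismatch   # prints match
```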

To fix this, follow the steps below:

  • Step 0: Reset all nodes in your cluster. On every node, run:

kubeadm reset --force;

  • Step 1: Bring down the interfaces cni0 and flannel.1.

sudo ifconfig cni0 down;
sudo ifconfig flannel.1 down;

  • Step 2: Delete the interfaces cni0 and flannel.1.

sudo ip link delete cni0;
sudo ip link delete flannel.1;

  • Step 3: Remove everything under /etc/cni/net.d/.

sudo rm -rf /etc/cni/net.d/;

  • Step 4: Re-bootstrap your Kubernetes cluster.

kubeadm init --control-plane-endpoint="..." --pod-network-cidr=10.244.0.0/16;

  • Step 5: Re-deploy the CNI.

kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml;

  • Step 6: Restart your container runtime; here I use containerd.

systemctl restart containerd;


This should get CoreDNS working properly again.

Insidious answered 27/6, 2022 at 9:55 Comment(2)
This should be marked as the right answer. – Aphaeresis
This really works, but I'm not sure it can be used in a production environment. It solved my issue; I just added a step 7 to join the new cluster. Thanks. – Rowena

I had a similar problem. I did the following steps to make it work:

  • Delete the node from the master and run kubeadm reset on the worker node.

  • Clear the iptables rules with iptables -F && iptables -t nat -F && iptables -t mangle -F && iptables -X.

  • Clear the config file with rm -rf $HOME/.kube/config.

  • Reboot the worker node.

  • Disable swap on the worker node with swapoff -a.

  • Join the node to the master again.

Flatfish answered 8/9, 2019 at 14:24 Comment(0)

Also ensure SELinux is set to Permissive or disabled:

# getenforce
Permissive
Cowman answered 26/7, 2020 at 20:54 Comment(0)

I had the same issue. When I followed the solution mentioned by @PanDe, I got the following error:

[root@xxxxxx]# kubectl patch node myslavenode -p '{"spec":{"podCIDR":"10.244.0.0/16"}}'
The Node "myslavenode" is invalid:

  • spec.podCIDRs: Forbidden: node updates may not change podCIDR except from "" to valid
  • []: Forbidden: node updates may only change labels, taints, or capacity (or configSource, if the DynamicKubeletConfig feature gate is enabled).

In the end, when SELinux was checked, it was enabled. Setting it to permissive resolved the issue. Thanks @senthil murugan.

Regards, Vivek

Chaplin answered 2/6, 2022 at 14:1 Comment(0)

Swapping off memory and rebooting your machine might help. Also check that the node is Ready: kubectl get nodes.
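
The swap step can be made permanent by commenting the swap entry out of /etc/fstab, so the node stays swap-free after the reboot. A sketch against a sample file (on a real node, run swapoff -a as root and edit /etc/fstab itself):

```shell
# Sample fstab standing in for /etc/fstab:
printf '%s\n' \
  'UUID=abcd / ext4 defaults 0 1' \
  'UUID=ef01 none swap sw 0 0' > /tmp/fstab.sample

# Comment out the swap line so it is not mounted at boot:
sed -i '/\bswap\b/ s/^/#/' /tmp/fstab.sample
cat /tmp/fstab.sample
```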

Deficient answered 16/7 at 22:38 Comment(0)
