kube-dns keeps restarting with Kubernetes on CoreOS

I have Kubernetes installed on Container Linux by CoreOS alpha (1353.1.0) using hyperkube v1.5.5_coreos.0 using my fork of coreos-kubernetes install scripts at https://github.com/kfirufk/coreos-kubernetes.

I have two Container Linux machines:

  • coreos-2.tux-in.com, resolving to 192.168.1.2, as the controller
  • coreos-3.tux-in.com, resolving to 192.168.1.3, as the worker

kubectl get pods --all-namespaces returns

NAMESPACE       NAME                                       READY     STATUS    RESTARTS   AGE
ceph            ceph-mds-2743106415-rkww4                  0/1       Pending   0          1d
ceph            ceph-mon-check-3856521781-bd6k5            1/1       Running   0          1d
kube-lego       kube-lego-3323932148-g2tf4                 1/1       Running   0          1d
kube-system     calico-node-xq6j7                          2/2       Running   0          1d
kube-system     calico-node-xzpp2                          2/2       Running   4560       1d
kube-system     calico-policy-controller-610849172-b7xjr   1/1       Running   0          1d
kube-system     heapster-v1.3.0-beta.0-2754576759-v1f50    2/2       Running   0          1d
kube-system     kube-apiserver-192.168.1.2                 1/1       Running   0          1d
kube-system     kube-controller-manager-192.168.1.2        1/1       Running   1          1d
kube-system     kube-dns-3675956729-r7hhf                  3/4       Running   3924       1d
kube-system     kube-dns-autoscaler-505723555-l2pph        1/1       Running   0          1d
kube-system     kube-proxy-192.168.1.2                     1/1       Running   0          1d
kube-system     kube-proxy-192.168.1.3                     1/1       Running   0          1d
kube-system     kube-scheduler-192.168.1.2                 1/1       Running   1          1d
kube-system     kubernetes-dashboard-3697905830-vdz23      1/1       Running   1246       1d
kube-system     monitoring-grafana-4013973156-m2r2v        1/1       Running   0          1d
kube-system     monitoring-influxdb-651061958-2mdtf        1/1       Running   0          1d
nginx-ingress   default-http-backend-150165654-s4z04       1/1       Running   2          1d

so I can see that kube-dns-3675956729-r7hhf keeps restarting (3924 restarts so far).

kubectl describe pod kube-dns-3675956729-r7hhf --namespace=kube-system returns:

Name:       kube-dns-3675956729-r7hhf
Namespace:  kube-system
Node:       192.168.1.2/192.168.1.2
Start Time: Sat, 11 Mar 2017 17:54:14 +0000
Labels:     k8s-app=kube-dns
        pod-template-hash=3675956729
Status:     Running
IP:     10.2.67.243
Controllers:    ReplicaSet/kube-dns-3675956729
Containers:
  kubedns:
    Container ID:   rkt://f6480fe7-4316-4e0e-9483-0944feb85ea3:kubedns
    Image:      gcr.io/google_containers/kubedns-amd64:1.9
    Image ID:       rkt://sha512-c7b7c9c4393bea5f9dc5bcbe1acf1036c2aca36ac14b5e17fd3c675a396c4219
    Ports:      10053/UDP, 10053/TCP, 10055/TCP
    Args:
      --domain=cluster.local.
      --dns-port=10053
      --config-map=kube-dns
      --v=2
    Limits:
      memory:   170Mi
    Requests:
      cpu:      100m
      memory:       70Mi
    State:      Running
      Started:      Sun, 12 Mar 2017 17:47:41 +0000
    Last State:     Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Sun, 12 Mar 2017 17:46:28 +0000
      Finished:     Sun, 12 Mar 2017 17:47:02 +0000
    Ready:      False
    Restart Count:  981
    Liveness:       http-get http://:8080/healthz-kubedns delay=60s timeout=5s period=10s #success=1 #failure=5
    Readiness:      http-get http://:8081/readiness delay=3s timeout=5s period=10s #success=1 #failure=3
    Volume Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-zqbdp (ro)
    Environment Variables:
      PROMETHEUS_PORT:  10055
  dnsmasq:
    Container ID:   rkt://f6480fe7-4316-4e0e-9483-0944feb85ea3:dnsmasq
    Image:      gcr.io/google_containers/kube-dnsmasq-amd64:1.4.1
    Image ID:       rkt://sha512-8c5f8b40f6813bb676ce04cd545c55add0dc8af5a3be642320244b74ea03f872
    Ports:      53/UDP, 53/TCP
    Args:
      --cache-size=1000
      --no-resolv
      --server=127.0.0.1#10053
      --log-facility=-
    Requests:
      cpu:      150m
      memory:       10Mi
    State:      Running
      Started:      Sun, 12 Mar 2017 17:47:41 +0000
    Last State:     Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Sun, 12 Mar 2017 17:46:28 +0000
      Finished:     Sun, 12 Mar 2017 17:47:02 +0000
    Ready:      True
    Restart Count:  981
    Liveness:       http-get http://:8080/healthz-dnsmasq delay=60s timeout=5s period=10s #success=1 #failure=5
    Volume Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-zqbdp (ro)
    Environment Variables:  <none>
  dnsmasq-metrics:
    Container ID:   rkt://f6480fe7-4316-4e0e-9483-0944feb85ea3:dnsmasq-metrics
    Image:      gcr.io/google_containers/dnsmasq-metrics-amd64:1.0.1
    Image ID:       rkt://sha512-ceb3b6af1cd67389358be14af36b5e8fb6925e78ca137b28b93e0d8af134585b
    Port:       10054/TCP
    Args:
      --v=2
      --logtostderr
    Requests:
      memory:       10Mi
    State:      Running
      Started:      Sun, 12 Mar 2017 17:47:41 +0000
    Last State:     Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Sun, 12 Mar 2017 17:46:28 +0000
      Finished:     Sun, 12 Mar 2017 17:47:02 +0000
    Ready:      True
    Restart Count:  981
    Liveness:       http-get http://:10054/metrics delay=60s timeout=5s period=10s #success=1 #failure=5
    Volume Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-zqbdp (ro)
    Environment Variables:  <none>
  healthz:
    Container ID:   rkt://f6480fe7-4316-4e0e-9483-0944feb85ea3:healthz
    Image:      gcr.io/google_containers/exechealthz-amd64:v1.2.0
    Image ID:       rkt://sha512-3a85b0533dfba81b5083a93c7e091377123dac0942f46883a4c10c25cf0ad177
    Port:       8080/TCP
    Args:
      --cmd=nslookup kubernetes.default.svc.cluster.local 127.0.0.1 >/dev/null
      --url=/healthz-dnsmasq
      --cmd=nslookup kubernetes.default.svc.cluster.local 127.0.0.1:10053 >/dev/null
      --url=/healthz-kubedns
      --port=8080
      --quiet
    Limits:
      memory:   50Mi
    Requests:
      cpu:      10m
      memory:       50Mi
    State:      Running
      Started:      Sun, 12 Mar 2017 17:47:41 +0000
    Last State:     Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Sun, 12 Mar 2017 17:46:28 +0000
      Finished:     Sun, 12 Mar 2017 17:47:02 +0000
    Ready:      True
    Restart Count:  981
    Volume Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-zqbdp (ro)
    Environment Variables:  <none>
Conditions:
  Type      Status
  Initialized   True 
  Ready     False 
  PodScheduled  True 
Volumes:
  default-token-zqbdp:
    Type:   Secret (a volume populated by a Secret)
    SecretName: default-token-zqbdp
QoS Class:  Burstable
Tolerations:    CriticalAddonsOnly=:Exists
No events.

which shows that the kubedns container (kubedns-amd64:1.9) is not Ready.
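
(For reference, one way to see why the kubedns container keeps dying is to pull its logs, including those of the previously crashed instance; a sketch, assuming kubectl can reach the cluster:)

kubectl logs kube-dns-3675956729-r7hhf -c kubedns --namespace=kube-system
# logs of the instance that just crashed, if any
kubectl logs kube-dns-3675956729-r7hhf -c kubedns --namespace=kube-system --previous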

this is my kube-dns-de.yaml file:

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: kube-dns
  namespace: kube-system
  labels:
    k8s-app: kube-dns
    kubernetes.io/cluster-service: "true"
spec:
  strategy:
    rollingUpdate:
      maxSurge: 10%
      maxUnavailable: 0
  selector:
    matchLabels:
      k8s-app: kube-dns
  template:
    metadata:
      labels:
        k8s-app: kube-dns
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ''
        scheduler.alpha.kubernetes.io/tolerations: '[{"key":"CriticalAddonsOnly", "operator":"Exists"}]'
    spec:
      containers:
      - name: kubedns
        image: gcr.io/google_containers/kubedns-amd64:1.9
        resources:
          limits:
            memory: 170Mi
          requests:
            cpu: 100m
            memory: 70Mi
        livenessProbe:
          httpGet:
            path: /healthz-kubedns
            port: 8080
            scheme: HTTP
          initialDelaySeconds: 60
          timeoutSeconds: 5
          successThreshold: 1
          failureThreshold: 5
        readinessProbe:
          httpGet:
            path: /readiness
            port: 8081
            scheme: HTTP
          initialDelaySeconds: 3
          timeoutSeconds: 5
        args:
        - --domain=cluster.local.
        - --dns-port=10053
        - --config-map=kube-dns
        # This should be set to v=2 only after the new image (cut from 1.5) has
        # been released, otherwise we will flood the logs.
        - --v=2
        env:
        - name: PROMETHEUS_PORT
          value: "10055"
        ports:
        - containerPort: 10053
          name: dns-local
          protocol: UDP
        - containerPort: 10053
          name: dns-tcp-local
          protocol: TCP
        - containerPort: 10055
          name: metrics
          protocol: TCP
      - name: dnsmasq
        image: gcr.io/google_containers/kube-dnsmasq-amd64:1.4.1
        livenessProbe:
          httpGet:
            path: /healthz-dnsmasq
            port: 8080
            scheme: HTTP
          initialDelaySeconds: 60
          timeoutSeconds: 5
          successThreshold: 1
          failureThreshold: 5
        args:
        - --cache-size=1000
        - --no-resolv
        - --server=127.0.0.1#10053
        - --log-facility=-
        ports:
        - containerPort: 53
          name: dns
          protocol: UDP
        - containerPort: 53
          name: dns-tcp
          protocol: TCP
        # see: https://github.com/kubernetes/kubernetes/issues/29055 for details
        resources:
          requests:
            cpu: 150m
            memory: 10Mi
      - name: dnsmasq-metrics
        image: gcr.io/google_containers/dnsmasq-metrics-amd64:1.0.1
        livenessProbe:
          httpGet:
            path: /metrics
            port: 10054
            scheme: HTTP
          initialDelaySeconds: 60
          timeoutSeconds: 5
          successThreshold: 1
          failureThreshold: 5
        args:
        - --v=2
        - --logtostderr
        ports:
        - containerPort: 10054
          name: metrics
          protocol: TCP
        resources:
          requests:
            memory: 10Mi
      - name: healthz
        image: gcr.io/google_containers/exechealthz-amd64:v1.2.0
        resources:
          limits:
            memory: 50Mi
          requests:
            cpu: 10m
            memory: 50Mi
        args:
        - --cmd=nslookup kubernetes.default.svc.cluster.local 127.0.0.1 >/dev/null
        - --url=/healthz-dnsmasq
        - --cmd=nslookup kubernetes.default.svc.cluster.local 127.0.0.1:10053 >/dev/null
        - --url=/healthz-kubedns
        - --port=8080
        - --quiet
        ports:
        - containerPort: 8080
          protocol: TCP
      dnsPolicy: Default

and this is my kube-dns-svc.yaml:

apiVersion: v1
kind: Service
metadata:
  name: kube-dns
  namespace: kube-system
  labels:
    k8s-app: kube-dns
    kubernetes.io/cluster-service: "true"
    kubernetes.io/name: "KubeDNS"
spec:
  selector:
    k8s-app: kube-dns
  clusterIP: 10.3.0.10
  ports:
  - name: dns
    port: 53
    protocol: UDP
  - name: dns-tcp
    port: 53
    protocol: TCP
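
(One quick sanity check for a Service like this is whether it actually has endpoints behind it; a sketch, assuming the selector matches the pod labels:)

kubectl get endpoints kube-dns --namespace=kube-system
# an empty ENDPOINTS column means no ready pod is backing the Service,
# which is exactly what a failing readiness probe would cause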

any information regarding the issue would be greatly appreciated!

update

rkt list --full 2> /dev/null | grep kubedns shows:

744a4579-0849-4fae-b1f5-cb05d40f3734    kubedns             gcr.io/google_containers/kubedns-amd64:1.9      sha512-c7b7c9c4393b running 2017-03-22 22:14:55.801 +0000 UTC   2017-03-22 22:14:56.814 +0000 UTC

journalctl -m _MACHINE_ID=744a457908494faeb1f5cb05d40f3734 provides:

Mar 22 22:17:58 kube-dns-3675956729-sthcv kubedns[8]: E0322 22:17:58.619254       8 reflector.go:199] pkg/dns/dns.go:145: Failed to list *api.Endpoints: Get https://10.3.0.1:443/api/v1/endpoints?resourceVersion=0: dial tcp 10.3.0.1:443: connect: network is unreachable

I tried to add - --proxy-mode=userspace to /etc/kubernetes/manifests/kube-proxy.yaml but the results are the same.
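
(Since the error is "network is unreachable" towards the service IP, a hedged check would be whether kube-proxy has actually programmed iptables rules for the service IPs on the node running kube-dns:)

# run on the node hosting the kube-dns pod
iptables-save | grep 10.3.0.1     # the kubernetes API service IP
iptables-save | grep 10.3.0.10    # the kube-dns service IP
# with kube-proxy in iptables mode there should be KUBE-SERVICES / KUBE-SVC-* rules for both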

kubectl get svc --all-namespaces provides:

NAMESPACE       NAME                   CLUSTER-IP   EXTERNAL-IP   PORT(S)         AGE
ceph            ceph-mon               None         <none>        6789/TCP        1h
default         kubernetes             10.3.0.1     <none>        443/TCP         1h
kube-system     heapster               10.3.0.2     <none>        80/TCP          1h
kube-system     kube-dns               10.3.0.10    <none>        53/UDP,53/TCP   1h
kube-system     kubernetes-dashboard   10.3.0.116   <none>        80/TCP          1h
kube-system     monitoring-grafana     10.3.0.187   <none>        80/TCP          1h
kube-system     monitoring-influxdb    10.3.0.214   <none>        8086/TCP        1h
nginx-ingress   default-http-backend   10.3.0.233   <none>        80/TCP          1h

kubectl get cs provides:

NAME                 STATUS    MESSAGE              ERROR
controller-manager   Healthy   ok
scheduler            Healthy   ok
etcd-0               Healthy   {"health": "true"}

my kube-proxy.yaml has the following content:

apiVersion: v1
kind: Pod
metadata:
  name: kube-proxy
  namespace: kube-system
  annotations:
    rkt.alpha.kubernetes.io/stage1-name-override: coreos.com/rkt/stage1-fly
spec:
  hostNetwork: true
  containers:
  - name: kube-proxy
    image: quay.io/coreos/hyperkube:v1.5.5_coreos.0
    command:
    - /hyperkube
    - proxy
    - --cluster-cidr=10.2.0.0/16
    - --kubeconfig=/etc/kubernetes/controller-kubeconfig.yaml
    securityContext:
      privileged: true
    volumeMounts:
    - mountPath: /etc/ssl/certs
      name: "ssl-certs"
    - mountPath: /etc/kubernetes/controller-kubeconfig.yaml
      name: "kubeconfig"
      readOnly: true
    - mountPath: /etc/kubernetes/ssl
      name: "etc-kube-ssl"
      readOnly: true
    - mountPath: /var/run/dbus
      name: dbus
      readOnly: false
  volumes:
  - hostPath:
      path: "/usr/share/ca-certificates"
    name: "ssl-certs"
  - hostPath:
      path: "/etc/kubernetes/controller-kubeconfig.yaml"
    name: "kubeconfig"
  - hostPath:
      path: "/etc/kubernetes/ssl"
    name: "etc-kube-ssl"
  - hostPath:
      path: /var/run/dbus
    name: dbus
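
(As a side note, a quick way to confirm kube-proxy itself isn't logging errors, assuming kubectl can read the static/mirror pods, would be:)

kubectl logs kube-proxy-192.168.1.2 --namespace=kube-system | tail -n 50
kubectl logs kube-proxy-192.168.1.3 --namespace=kube-system | tail -n 50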

This is all the valuable information I could find. Any ideas? :)

update 2

output of iptables-save on the controller Container Linux node is at http://pastebin.com/2GApCj0n

update 3

I ran curl on the controller node

# curl https://10.3.0.1 --insecure
Unauthorized

which means the service IP is reachable; I just didn't pass any client credentials, so the Unauthorized response is expected, right?
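
(For completeness, a sketch of an authorized request, using the service-account token and CA certificate that are mounted into every pod at the standard serviceaccount path:)

# run from inside a pod that has the default service account mounted
SA=/var/run/secrets/kubernetes.io/serviceaccount
TOKEN=$(cat $SA/token)
curl --cacert $SA/ca.crt -H "Authorization: Bearer $TOKEN" https://10.3.0.1/version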

update 4

thanks to @jaxxstorm I removed the Calico manifests, updated their quay.io/calico/cni and quay.io/calico/node image versions, and reinstalled them.

now kube-dns keeps restarting, but I think Calico now works, because for the first time it tries to schedule kube-dns on the worker node rather than the controller node, and also when I rkt enter the kube-dns pod and try to wget https://10.3.0.1 I get:

# wget https://10.3.0.1
Connecting to 10.3.0.1 (10.3.0.1:443)
wget: can't execute 'ssl_helper': No such file or directory
wget: error getting response: Connection reset by peer

which clearly shows that there is some kind of response from 10.3.0.1, which is good, right?

now kubectl get pods --all-namespaces shows:

kube-system     kube-dns-3675956729-ljz2w                  4/4       Running             88         42m

so... 4/4 ready, but it keeps restarting.

the kubectl describe pod kube-dns-3675956729-ljz2w --namespace=kube-system output is at http://pastebin.com/Z70U331G

so it can't connect to http://10.2.47.19:8081/readiness; I'm guessing this is the pod IP of kube-dns, since the readiness probe uses port 8081. I don't know how to investigate this issue further.
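
(A couple of hedged next steps, assuming 10.2.47.19 really is the pod IP: hit the readiness endpoint directly and look at what the kubedns container logs while the probe fails:)

# from the worker node that runs the pod
curl -v http://10.2.47.19:8081/readiness
# and the kubedns container logs for the failing sync loop
kubectl logs kube-dns-3675956729-ljz2w -c kubedns --namespace=kube-system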

thanks for everything!

Kynan answered 6/3, 2017 at 23:27 Comment(15)
That is the readiness check on the kube-dns pod. The pod isn't answering so it is marked as unhealthy and removed. Do the logs of the pod provide any additional information?Same
@Same - thanks. updated main postKynan
@Kynan We've had a similar problem before. The SSL certs/token that kube-dns was picking up had expired. github.com/kubernetes/kubernetes/issues/32480. Can you check whether 10.3.0.1 is the IP of the kubernetes API server service? It seems that kube-dns can't connect to the API server.Burgher
@Burgher - the K8S_SERVICE_IP is set to 10.3.0.1 (while the LAN IP address of the controller is 192.168.1.2), and SERVICE_IP_RANGE is set to 10.3.0.0/24 while my LAN IP range is 192.168.0.0/16. Is that ok, or have I misconfigured it? I'll add the env settings of my Kubernetes installation to the question.Kynan
Your certs need to include the service IP (10.3.0.1), but that doesn't seem to be the problem here. Your service IP isn't answering at all, like ufk mentioned. There should be a kubernetes service with that IP. You should check to make sure it is in place and that pods can reach it. Could be a number of network issues at this point, so it is hard to be more specific.Same
thanks for your support @lmickh, updated main post with more info.Kynan
Can you print the logs of one of the killed kube-dns containers? Login to the host it's scheduled on, find the killed container with docker ps -a and then do docker logs <containerid> - this'll give more indication of what's happening.Urata
@Frap - thanks for your response. I use rkt not docker, and can't figure out how to fetch the log information of that problematic pod.Kynan
@ufk check this github comment; it seems to be the way to get the rkt logs of an exited container: github.com/coreos/rkt/issues/1528#issuecomment-145462621Urata
@Frap - I read those docs with no success. I created another stackoverflow question for that at https://mcmap.net/q/1471688/-viewing-log-of-exited-podKynan
@Frap - I'm finally able to view rkt logs. updated main postKynan
@Kynan can you add the output of iptables-save as well?Urata
@Urata - updated main post. thanksKynan
What happens when, from a host node, you try curl https://10.3.0.1Urata
@Urata - updated main post. thanksKynan

Lots of great debugging info here, thanks!

This is the clincher:

# curl https://10.3.0.1 --insecure
Unauthorized

You got an unauthorized response because you didn't pass a client cert, but that's fine, it's not what we're after. This proves that kube-proxy is working as expected and is accessible. Your rkt logs:

Mar 22 22:17:58 kube-dns-3675956729-sthcv kubedns[8]: E0322 22:17:58.619254       8 reflector.go:199] pkg/dns/dns.go:145: Failed to list *api.Endpoints: Get https://10.3.0.1:443/api/v1/endpoints?resourceVersion=0: dial tcp 10.3.0.1:443: connect: network is unreachable

are indicating that there are network connectivity issues inside the containers, which suggests to me that you haven't configured container networking/CNI correctly.

Please have a read through this document: https://coreos.com/rkt/docs/latest/networking/overview.html

You may also have to reconfigure Calico; there's some more information here: http://docs.projectcalico.org/master/getting-started/rkt/
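
As a quick sanity check on each host, you could also verify that the kubelet is running with the CNI network plugin and that a Calico CNI config is actually present (the paths below are the common defaults; a CoreOS install may use /etc/kubernetes/cni/net.d instead):

# is the kubelet configured for CNI?
ps aux | grep '[k]ubelet' | tr ' ' '\n' | grep -E 'network-plugin|cni'
# is there a CNI config and are the plugin binaries installed?
ls -l /etc/cni/net.d/ /opt/cni/bin/ 2>/dev/null
ls -l /etc/kubernetes/cni/net.d/ 2>/dev/null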

Urata answered 24/3, 2017 at 13:34 Comment(3)
thank you so much, updated main post with new informationKynan
If you've deleted and reconfigured your pod networking, you'll need to relaunch the kube-dns pod. Delete it and it'll be recreated and it should hopefully work. If not, I think it'll be another question as it seems like a new problemUrata
I did reload kube-dns, forgot to specify. I'll close it and open another one. ThanksKynan

kube-dns has a readiness probe that tries resolving through the Service IP of kube-dns. Is it possible that there is a problem with your Service network?

Check out the answer and solution here: kubernetes service IPs not reachable
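
A quick way to test that, assuming you can get a shell inside any pod (or onto a node that can route to the service network), is to resolve something against the kube-dns Service IP and against the kube-dns pod IP directly:

# via the Service IP (10.3.0.10 in your kube-dns-svc.yaml)
nslookup kubernetes.default.svc.cluster.local 10.3.0.10
# bypassing the Service, straight at the kube-dns pod IP (placeholder, take it from kubectl describe)
nslookup kubernetes.default.svc.cluster.local <kube-dns-pod-ip>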

Vesiculate answered 13/3, 2017 at 18:6 Comment(1)
Hi! Thanks for your response. Still not able to resolve the issue; updated the main post with more information.Kynan