What is the reason for Back-off restarting failed container for elasticsearch kubernetes pod?

When I try to run my Elasticsearch container through a Kubernetes Deployment, the pod fails after some time, while it runs perfectly fine when run directly as a Docker container using docker-compose or a Dockerfile. This is what I get from kubectl get pods:

NAME                  READY     STATUS    RESTARTS   AGE
es-764bd45bb6-w4ckn   0/1       Error     4          3m

Below is the result of kubectl describe pod:

Name:           es-764bd45bb6-w4ckn
Namespace:      default
Node:           administrator-thinkpad-l480/<node_ip>
Start Time:     Thu, 30 Aug 2018 16:38:08 +0530
Labels:         io.kompose.service=es
            pod-template-hash=3206801662
Annotations:    <none> 
Status:         Running
IP:             10.32.0.8
Controlled By:  ReplicaSet/es-764bd45bb6
Containers:
es:
Container ID:   docker://9be2f7d6eb5d7793908852423716152b8cefa22ee2bb06fbbe69faee6f6aa3c3
Image:          docker.elastic.co/elasticsearch/elasticsearch:6.2.4
Image ID:       docker-pullable://docker.elastic.co/elasticsearch/elasticsearch@sha256:9ae20c753f18e27d1dd167b8675ba95de20b1f1ae5999aae5077fa2daf38919e
Port:           9200/TCP
State:          Waiting
  Reason:       CrashLoopBackOff
Last State:     Terminated
  Reason:       Error
  Exit Code:    78
  Started:      Thu, 30 Aug 2018 16:42:56 +0530
  Finished:     Thu, 30 Aug 2018 16:43:07 +0530
Ready:          False
Restart Count:  5
Environment:
  ELASTICSEARCH_ADVERTISED_HOST_NAME:  es
  ES_JAVA_OPTS:                        -Xms2g -Xmx2g
  ES_HEAP_SIZE:                        2GB
Mounts:
  /var/run/secrets/kubernetes.io/serviceaccount from default-token-nhb9z (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  default-token-nhb9z:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-nhb9z
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
             node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason     Age               From           Message
  ----     ------     ----              ----           -------
 Normal   Scheduled  6m                default-scheduler                     Successfully assigned default/es-764bd45bb6-w4ckn to administrator-thinkpad-l480
 Normal   Pulled     3m (x5 over 6m)   kubelet, administrator-thinkpad-l480  Container image "docker.elastic.co/elasticsearch/elasticsearch:6.2.4" already present on machine
 Normal   Created    3m (x5 over 6m)   kubelet, administrator-thinkpad-l480  Created container
 Normal   Started    3m (x5 over 6m)   kubelet, administrator-thinkpad-l480  Started container
 Warning  BackOff    1m (x15 over 5m)  kubelet, administrator-thinkpad-l480  Back-off restarting failed container

Here is my elasticsearch-deployment.yaml:

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  annotations:
    kompose.cmd: kompose convert
    kompose.version: 1.1.0 (36652f6)
  creationTimestamp: null
  labels:
    io.kompose.service: es
  name: es
spec:
  replicas: 1
  strategy: {}
  template:
    metadata:
      creationTimestamp: null
      labels:
        io.kompose.service: es
    spec:
      containers:
      - env:
        - name: ELASTICSEARCH_ADVERTISED_HOST_NAME
          value: es
        - name: ES_JAVA_OPTS
          value: -Xms2g -Xmx2g
        - name: ES_HEAP_SIZE
          value: 2GB
        image: docker.elastic.co/elasticsearch/elasticsearch:6.2.4
        name: es
        ports:
        - containerPort: 9200
        resources: {}
      restartPolicy: Always
status: {}

When I try to get logs using kubectl logs -f es-764bd45bb6-w4ckn, I get:

Error from server: Get https://<slave node ip>:10250/containerLogs/default/es-764bd45bb6-w4ckn/es?previous=true: dial tcp <slave node ip>:10250: i/o timeout 

What could be the reason and solution for this problem?

Horatius answered 30/8, 2018 at 11:24 Comment(14)
Try printing out some logs; I don't see that much info in the pod describe output. – Logo
@Logo I get this response when I try to get logs: Error from server: Get https://<slave node ip>:10250/containerLogs/default/es-764bd45bb6-w4ckn/es?previous=true: dial tcp <slave node ip>:10250: i/o timeout – Horatius
Please update your question with the output from kubectl logs -f es-764bd45bb6-w4ckn – Alansen
@UroshT. Already did. – Horatius
Can you see anything from kubectl logs -f <yourpod> --previous? – Logo
@Logo No, the output is the same as kubectl logs -f <podname>. – Horatius
The apiserver tried to connect to the kubelet of the host that is running the es pod, but failed. You can log into that host and use docker logs to get the logs. – Pollerd
See if this helps: github.com/kubernetes/kubernetes/issues/4891 – Graziano
@KunLi It worked, thanks. Though it is a workaround, I am able to see the logs. Could you explain why I am not able to see the logs through the kubectl logs command? – Horatius
@HarshalShah Nah, that didn't help. – Horatius
@Lakshya The slave node is abnormal, I suppose; kubelet is not listening on port 10250? Check the kubelet log for more detail. – Pollerd
@KunLi Yeah, kubelet is not listening on port 10250, but there is no reason showing for that in the logs either. – Horatius
CrashLoopBackOff just means the pod keeps crashing and k8s has given up on it. You need to determine what is causing the crash. You can use watch kubectl describe [pod_name] to view events as the pod is being created; this is useful if there is an issue during creation. If the pod crashes after it starts up, you'll need to get the container logs, which you can get using docker as mentioned above. – Fullmouthed
We had a similar issue with es sometimes not starting properly. It was diagnosed as being tied to the liveness/readiness probes: reloading the indexes took too much time, so the pod was deemed not ready mid-flight and restarted. Tweaking the probe parameters (initialDelaySeconds and such, see Probes) to postpone the probes until the index is properly loaded helped in our case; can you give it a go? – Insidious
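
Following up on the probe suggestion in the last comment, here is a minimal sketch of what such tuning could look like for the es container; the probe types, endpoint and numbers are illustrative assumptions, not values from the question's manifest:

readinessProbe:
  httpGet:
    path: /_cluster/health     # Elasticsearch cluster health endpoint on the container port
    port: 9200
  initialDelaySeconds: 60      # give Elasticsearch time to load indexes before the first check
  periodSeconds: 10
  failureThreshold: 6
livenessProbe:
  tcpSocket:
    port: 9200
  initialDelaySeconds: 120     # postpone liveness checks so a slow startup is not treated as a crash
  periodSeconds: 20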

I found the logs using docker logs for the es container and found that es was not starting because vm.max_map_count was set to a very low value. I changed vm.max_map_count to the desired value using sysctl -w vm.max_map_count=262144 and the pod started after that.
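
Note that sysctl -w only changes the value until the node reboots, and it has to be applied on every node that can run the pod. A common pattern for Elasticsearch on Kubernetes is a privileged init container that raises the setting before Elasticsearch starts; a minimal sketch, assuming the cluster allows privileged containers (this goes under the pod spec in the Deployment):

initContainers:
- name: sysctl-max-map-count
  image: busybox:1.36                                    # any small image that ships a sysctl applet
  command: ["sysctl", "-w", "vm.max_map_count=262144"]
  securityContext:
    privileged: true                                     # needed to change a node-level kernel setting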

Horatius answered 5/9, 2018 at 10:29 Comment(1)
sysctl: unknown oid 'vm.max_map_count' – Strict

I had the same problem; there can be a couple of reasons for this issue. In my case the jar file was missing. @Lakshya has already answered this problem; I would like to add the steps that you can take to troubleshoot it:

  1. Get the pod status: kubectl get pods
  2. Describe the pod to take a closer look: kubectl describe pod "pod-name". The last few lines of the output give you the events and show where your deployment failed.
  3. Get logs for more details: kubectl logs "pod-name"
  4. Get container logs: kubectl logs "pod-name" -c "container-name". Get the container name from the output of the describe pod command.

If your container is up, you can use the kubectl exec -it command to analyse the container further.

Hope this helps community members with future issues.

Cycle answered 14/11, 2018 at 8:34 Comment(1)
Thanks. I guess you did not notice that I answered my own question recently when I found the solution. Nevertheless, the steps above might help people in the future. – Horatius

In my case, I just ran kubectl run ubuntu --image=ubuntu, got a similar error, and kubectl logs was empty.

I guess the reason is that the ubuntu image powers off automatically when it has no command to run, so the solution is:

Output the k8s ubuntu pod config.

In the config, give the container a command that keeps it from powering off (for example, add "sleep infinity"). The following is a working config:

{
  "kind": "Pod",
  "apiVersion": "v1",
  "metadata": {
    "name": "ubuntu",
    "creationTimestamp": null,
    "labels": {
      "run": "ubuntu"
    }
  },
  "spec": {
    "containers": [
      {
        "name": "ubuntu",
        "image": "ubuntu:20.04",
        "command": [
          "sleep",
          "infinity"
        ],
        "resources": {},
        "imagePullPolicy": "IfNotPresent"
      }
    ],
    "restartPolicy": "Always"
  },
  "status": {}
}
Butyraceous answered 25/12, 2022 at 9:26 Comment(0)

For me it was a simple memory issue, as many pods were running. I deleted old pods and was then able to run the new deployment.
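
One related note: the Deployment in the question sets a 2g heap via ES_JAVA_OPTS but leaves resources: {} empty, so the pod runs as BestEffort and the scheduler cannot account for its memory. A minimal sketch of a resources stanza for the es container; the numbers are illustrative assumptions, not a recommendation:

resources:
  requests:
    memory: "2560Mi"   # leave headroom above the 2g JVM heap
    cpu: "500m"
  limits:
    memory: "3Gi"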

Outrageous answered 1/10, 2023 at 13:45 Comment(2)
I don't think this will help. – Collimator
Your answer could be improved with additional supporting information. Please edit to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers in the help center. – Teratogenic

Maybe the config is incorrect but still valid; read the pod logs and find the error message. Fix the configs and redeploy the app.

Avail answered 11/1, 2022 at 13:33 Comment(0)
