Kubernetes Pods Terminated - Exit Code 137

I need some advice on an issue I am facing with k8s 1.14 and running GitLab pipelines on it. Many jobs are failing with exit code 137 errors, and I found that it means the container was terminated abruptly.


Cluster information:

Kubernetes version: 1.14
Cloud being used: AWS EKS
Node: c5.4xlarge


After digging in, I found the below logs:

**kubelet: I0114 03:37:08.639450**  4721 image_gc_manager.go:300] [imageGCManager]: Disk usage on image filesystem is at 95% which is over the high threshold (85%). Trying to free 3022784921 bytes down to the low threshold (80%).

**kubelet: E0114 03:37:08.653132**  4721 kubelet.go:1282] Image garbage collection failed once. Stats initialization may not have completed yet: failed to garbage collect required amount of images. Wanted to free 3022784921 bytes, but freed 0 bytes

**kubelet: W0114 03:37:23.240990**  4721 eviction_manager.go:397] eviction manager: timed out waiting for pods runner-u4zrz1by-project-12123209-concurrent-4zz892_gitlab-managed-apps(d9331870-367e-11ea-b638-0673fa95f662) to be cleaned up

**kubelet: W0114 00:15:51.106881**   4781 eviction_manager.go:333] eviction manager: attempting to reclaim ephemeral-storage

**kubelet: I0114 00:15:51.106907**   4781 container_gc.go:85] attempting to delete unused containers

**kubelet: I0114 00:15:51.116286**   4781 image_gc_manager.go:317] attempting to delete unused images

**kubelet: I0114 00:15:51.130499**   4781 eviction_manager.go:344] eviction manager: must evict pod(s) to reclaim ephemeral-storage 

**kubelet: I0114 00:15:51.130648**   4781 eviction_manager.go:362] eviction manager: pods ranked for eviction:

 1. runner-u4zrz1by-project-10310692-concurrent-1mqrmt_gitlab-managed-apps(d16238f0-3661-11ea-b638-0673fa95f662)
 2. runner-u4zrz1by-project-10310692-concurrent-0hnnlm_gitlab-managed-apps(d1017c51-3661-11ea-b638-0673fa95f662)
 3. runner-u4zrz1by-project-13074486-concurrent-0dlcxb_gitlab-managed-apps(63d78af9-3662-11ea-b638-0673fa95f662)
 4. prometheus-deployment-66885d86f-6j9vt_prometheus(da2788bb-3651-11ea-b638-0673fa95f662)
 5. nginx-ingress-controller-7dcc95dfbf-ld67q_ingress-nginx(6bf8d8e0-35ca-11ea-b638-0673fa95f662)

The pods then get terminated, resulting in the exit code 137 errors.

Can anyone help me understand the cause and suggest a possible solution?

Thank you :)

Toddle answered 14/1, 2020 at 8:24 Comment(4)
>> Exit code 137 represents "Out of memory". From the above log, garbage collection is being called where the default threshold is being breached: --image-gc-high-threshold=90 and --image-gc-low-threshold=80Shornick
Hey @D.T. Yes. Could you explain how to avoid the pods being terminated? I checked and they have 20G of space, and I checked the memory and disk pressure of the nodes and they have plenty of headroom. I don't understand why the pods are being terminated to reclaim ephemeral space.Toddle
Disk usage on image filesystem is at 95% which is over the high threshold (85%). Trying to free 3022784921 bytes down to the low threshold (80%). > Failed to garbage collect required amount of images. Wanted to free 3022784921 bytes, but freed 0 bytes. Can you add some disk space? Also do you have any quotas? kubectl describe quotaKristoforo
@Kristoforo No quotas or LimitRanges have been applied. I already increased the disk space to 50GB. I confirmed that there is no disk pressure by looking at the "taints" and "events" in the output of "kubectl describe nodes". I checked the output of "kubectl top nodes" to see whether memory and CPU were under stress, but they seemed under control.Toddle

Was able to solve the problem.

The nodes initially had a 20G EBS volume on a c5.4xlarge instance type. I increased the EBS volume to 50G and then 100G, but that did not help, as I kept seeing the error below:

"Disk usage on image filesystem is at 95% which is over the high threshold (85%). Trying to free 3022784921 bytes down to the low threshold (80%). "

I then changed the instance type to c5d.4xlarge, which has 400GB of local cache storage, and gave it 300GB of EBS. This solved the error.

Some of the GitLab jobs were for Java applications that were eating up a lot of cache space and writing a lot of logs.
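
If you would rather tune the thresholds that show up in those kubelet logs than only add disk, here is a minimal, illustrative KubeletConfiguration sketch. The percentages mirror what the log above reports and the Kubernetes defaults; they are examples, not the values this cluster actually used:

    # Illustrative KubeletConfiguration snippet; the values below are examples only.
    apiVersion: kubelet.config.k8s.io/v1beta1
    kind: KubeletConfiguration
    # Image GC starts above the high threshold and frees images until disk usage
    # on the image filesystem drops back down to the low threshold.
    imageGCHighThresholdPercent: 85
    imageGCLowThresholdPercent: 80
    # The eviction manager evicts pods when node resources fall below these limits.
    evictionHard:
      nodefs.available: "10%"
      imagefs.available: "15%"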

Toddle answered 16/1, 2020 at 6:12 Comment(0)

Exit code 137 does not necessarily mean OOMKilled. It indicates that the container received SIGKILL, either from some external interrupt or from the 'oom-killer' (out of memory).

If the pod got OOMKilled, you will see the lines below when you describe the pod:

      State:        Terminated
      Reason:       OOMKilled

Edit on 2/2/2022: I see that you added **kubelet: I0114 03:37:08.639450** 4721 image_gc_manager.go:300] [imageGCManager]: Disk usage on image filesystem is at 95% which is over the high threshold (85%). Trying to free 3022784921 bytes down to the low threshold (80%). and must evict pod(s) to reclaim ephemeral-storage from the log. This usually happens when application pods write something to disk, such as log files. Admins can configure when (at what disk usage %) eviction happens.
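
A quick way to tell the two cases apart (the pod name and namespace below are placeholders):

    # Replace <pod> and <ns> with your pod name and namespace.
    # "OOMKilled" here means the kernel killed the container for exceeding its memory limit.
    kubectl get pod <pod> -n <ns> \
      -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'

    # Evicted pods show up as eviction events rather than OOMKilled.
    kubectl get events -n <ns> --field-selector reason=Evicted

    # On the node itself, the kubelet log shows whether the eviction manager or the OOM killer acted.
    journalctl -u kubelet | grep -i -e evict -e oom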

Preheat answered 16/1, 2020 at 6:20 Comment(3)
Hey Rocks! Yes, I agree that the state would show OOMKilled, but the weird part was that the evicted pods were no longer visible, so I could not inspect their state. The eviction manager was terminating and deleting the pods to reclaim ephemeral storage. My mistake was assuming ephemeral storage meant RAM, which led me to think that if it was reclaiming memory it could be an OOM termination. But upon further inspection, the logs said "Disk usage", as shown in the first log line. That helped me try the solution mentioned above.Toddle
I got the same issue: Last State: Terminated, Reason: Error, Exit Code: 137. Where can we find the actual reason for this interrupt?Hyde
In my experience it looks like Kubernetes, or at least k3s, doesn't set the reason to OOMKilled when the app is getting OOMKilled. It is showing Error as the reason but, after checking the logs with journalctl, it does show Memory cgroup out of memory: Killed process 3481761 (java)Mayotte

137 means that k8s killed the container for some reason (maybe it didn't pass the liveness probe).

Code 137 is 128 + 9 (SIGKILL): the process was killed by an external signal.
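
For the liveness probe case, a minimal illustrative probe sketch (the path, port, and timings are placeholder values); if the probe keeps failing, the kubelet kills and restarts the container, which also shows up as exit code 137:

    # Illustrative container spec fragment; path, port, and timings are examples.
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
      initialDelaySeconds: 30   # give the app time to start before probing
      periodSeconds: 10         # probe every 10 seconds
      timeoutSeconds: 5         # fail the probe if no response within 5 seconds
      failureThreshold: 3       # kill and restart the container after 3 consecutive failures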

Bleachers answered 23/10, 2020 at 7:12 Comment(1)
Do you have any further insight on why a container might fail the liveness probe?Sweeten

The typical causes of this exit code are the system running out of RAM or a failed health check.
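
On the out-of-RAM side, the usual trigger is a container exceeding its memory limit; a minimal illustrative fragment (the image, name, and sizes are placeholders):

    # Illustrative pod spec fragment; image, name, and sizes are placeholders.
    containers:
      - name: app
        image: example/app:latest
        resources:
          requests:
            memory: "512Mi"   # what the scheduler reserves for the container
          limits:
            memory: "1Gi"     # exceeding this gets the container OOMKilled (exit code 137)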

Rovelli answered 13/10, 2020 at 1:30 Comment(0)

Exit code 137 in detail:

  1. It denotes that the process was terminated by an external signal.
  2. The number 137 is the sum of two numbers: 128 + x, where x is the signal number sent to the process that caused it to terminate.
  3. In this case, x equals 9, which is the number of the SIGKILL signal, meaning the process was killed forcibly (a quick shell check is shown below).
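
A quick shell check of the 128 + 9 arithmetic (any long-running command works in place of sleep):

    # Start a long-running process, kill it with SIGKILL (signal 9),
    # then look at the exit status the shell reports: 128 + 9 = 137.
    sleep 300 &
    kill -9 $!
    wait $!
    echo $?    # prints 137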

Hope this helps.

Sporule answered 23/9, 2022 at 14:6 Comment(0)

Check the Jenkins master node's memory and CPU profile. In my case, the master was under high memory and CPU utilization, and the slaves were getting restarted with 137.

Towney answered 19/8, 2021 at 10:45 Comment(0)

I encountered the problem Last state: Terminated with 137: Error and noticed in the recent events that there was a liveness probe failure: Liveness probe failed: Get "http://<ip:port>/actuator/health": context deadline exceeded (Client.Timeout exceeded while awaiting headers). So it restarted because of the health check failure, which happened because I was debugging the service and everything was blocked, including the health check endpoint :D

Oleviaolfaction answered 29/5 at 2:50 Comment(0)

I also got 'command terminated with exit code 137' when running python3 from a pod. The problem was related to the antivirus, which was killing the process when the Python script files were edited.

Grosgrain answered 4/6 at 13:2 Comment(1)
Did you fix the problem or did you just attribute it to the virus scanner? You might be more specific as to which process got killed and what type of text-file edit triggered this behaviour by the virus scanner.Elijah
