Gunicorn worker terminated with signal 9

I am running a Flask application and hosting it on Kubernetes from a Docker container. Gunicorn is managing workers that reply to API requests.

The following warning message is a regular occurrence, and it seems like requests are being canceled for some reason. On Kubernetes, the pod is showing no odd behavior or restarts and stays within 80% of its memory and CPU limits.

[2021-03-31 16:30:31 +0200] [1] [WARNING] Worker with pid 26 was terminated due to signal 9

How can we find out why these workers are killed?

Hollister asked 21/5, 2021 at 12:37 Comment(8)
Did you manage to find out why? Having the same issue, and tried specifying --shm-size, but to no avail. – Easterner
Our problems seem to have gone away since we started using --worker-class gevent. I suspect Simon is right and this was either an out-of-memory error, or a background process running for too long and the main process (pid 1) deciding to kill it. – Hollister
Meta: I'm not sure why this question is being downvoted. Please drop a comment if you feel it needs further clarification. – Hollister
I have the same problem, and gevent did not solve it. Does anyone know why this started all of a sudden? Was there a change in Gunicorn or in Kubernetes? – Backgammon
Also related to an unanswered question: #57745600 – Backgammon
@Backgammon - my issue was OOM-related. I had to use a larger instance with more RAM, and gave the Docker container access to that RAM. – Easterner
@Easterner Yes, eventually that's exactly what I did as well. Just adding another 1 GB fixed the problem; no need to change to gevent. – Backgammon
I faced the same issue and solved it by switching from Python 3.8 to Python 3.7. – Melva

I encountered the same warning message.

[WARNING] Worker with pid 71 was terminated due to signal 9

I came across this FAQ, which says that "A common cause of SIGKILL is when OOM killer terminates a process due to low memory condition."

I used dmesg and realized that the worker was indeed killed because it was running out of memory.

Out of memory: Killed process 776660 (gunicorn)
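
If you are in a similar situation, something along these lines may surface the kill; this is just a sketch (the exact wording of the kernel message varies by kernel and distribution, and you may need root or host access for it to show anything):

# kernel ring buffer, with human-readable timestamps where supported
dmesg -T | grep -iE 'out of memory|killed process'

# on systemd hosts the kernel log is also available via the journal
journalctl -k | grep -i oom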
Yukikoyukio answered 27/5, 2021 at 10:1 Comment(2)
Our problems seem to have gone away since we started using --worker-class gevent. I can't verify this answer, but it seems that dmesg is a good way to get more information and diagnose the problem. Thanks for your answer! – Hollister
I noticed this happen when I didn't provide enough memory to Docker Desktop, which was running the Gunicorn workers within a container. Increasing the memory for Docker Desktop solved the problem. – Chemist

In our case the application was taking around 5-7 minutes to load ML models and dictionaries into memory, so adding a timeout of 600 seconds solved the problem for us.

gunicorn main:app \
   --workers 1 \
   --worker-class uvicorn.workers.UvicornWorker \
   --bind 0.0.0.0:8443 \
   --timeout 600
Nucleotidase answered 12/2, 2022 at 14:15 Comment(4)
That was it in my case as well. Many thanks for the pointer. – Tipper
While this solves the immediate issue, you might want to consider using a worker queue service such as Celery for long-running tasks. – Ballyhoo
And by "you" I mean future readers. – Ballyhoo
+1 for this. In my case I was importing large CSV datasets to a database, and simply had my timeouts set too low. After trying other things like refactoring my parsers for better memory performance, it was the timeouts that helped. – Mitrailleuse

It may be that your liveness check in Kubernetes is killing your workers.

If your liveness check is configured as an http request to an endpoint in your service, your main request may block the health check request, and the worker gets killed by your platform because the platform thinks that the worker is unresponsive.

That was my case. I have a Gunicorn app with a single uvicorn worker, which only handles one request at a time. It worked fine locally but would have the worker sporadically killed when deployed to Kubernetes. It would only happen during a call that takes about 25 seconds, and not every time.

It turned out that my liveness check was configured to hit the /health route every 10 seconds, time out in 1 second, and retry 3 times. So this call would sometimes time out, but not always.

If this is your case, a possible solution is to reconfigure your liveness check (or whatever health check mechanism your platform uses) so it can wait until your typical request finishes, or to allow more threads, so that the health check is never blocked long enough to trigger a worker kill.

You can see why adding more workers may help with (or hide) the problem.
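
If you go the "more threads" route with a sync Flask app, the sketch below shows the general idea; the worker class, counts, and bind address are assumptions rather than values from the original setup, and with an async worker such as uvicorn you would typically add workers instead.

# gthread workers can serve a cheap /health request even while
# another thread is busy with a slow request (illustrative values)
gunicorn main:app \
   --workers 2 \
   --threads 4 \
   --worker-class gthread \
   --bind 0.0.0.0:8000

Alternatively, relaxing the probe itself (for example raising timeoutSeconds, periodSeconds, or failureThreshold on the livenessProbe) achieves the same goal from the Kubernetes side.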

Also, see this reply to a similar question: https://mcmap.net/q/243355/-why-are-my-gunicorn-python-flask-workers-exiting-from-signal-term

Ambitendency answered 10/10, 2022 at 16:8 Comment(0)

I encountered the same warning message when I limited the Docker container's memory, e.g. with -m 3000m.

See docker-memory and "gunicorn - Why are Workers Silently Killed?".

The simple way to avoid this is to set a higher memory limit for Docker, or not to set one at all.
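
If you start the container by hand, raising or dropping the limit might look like the sketch below; the image name and sizes are placeholders, not the actual setup from this thread.

# run with a 4 GiB hard limit instead of 3000m
docker run -m 4g --memory-swap 4g my-flask-image

# or omit -m entirely so the container can use the host's memory
docker run my-flask-image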

Terti answered 9/2, 2022 at 3:28 Comment(1)
I also encountered the same error; after changing the HPA metrics it started working fine. – Annaleeannaliese

I was using AWS Elastic Beanstalk to deploy my Flask application and I had a similar error.

In the log I saw:
  • web: MemoryError
  • [CRITICAL] WORKER TIMEOUT
  • [WARNING] Worker with pid XXXXX was terminated due to signal 9

I was using the t2.micro instance and when I changed it to t2.medium my app worked fine. In addition to this, I changed the timeouts in my nginx config file.

Ziegler answered 20/4, 2022 at 18:25 Comment(2)
Mind sharing the timeout variable name? – Bethannbethanne
Below are the contents of my timeout.conf file under the nginx/conf.d folder: keepalive_timeout 600s; proxy_connect_timeout 600s; proxy_send_timeout 600s; proxy_read_timeout 600s; fastcgi_send_timeout 600s; fastcgi_read_timeout 600s; client_max_body_size 20M; – Ziegler

In my case, I first noticed that decreasing the number of workers from 4 to 2 worked. However, I believe the problem was related to the connection to the database: I went back to -w 4, restarted the server that hosts the database, and it worked perfectly.

Tempa answered 31/3, 2023 at 19:42 Comment(1)
Thanks a lot for your comment, it worked in my projects. – Mima

I encountered the same problem too, and it was because Docker's memory usage was limited to 2 GB. If you are using Docker Desktop you just need to go to Resources and increase the portion of memory dedicated to Docker (if not, you need to find the equivalent Docker command-line option to do that).

If that doesn't solve the problem, then it might be the timeout that kills the worker; you will need to add the timeout argument to the gunicorn command:

CMD ["gunicorn","--workers", "3", "--timeout", "1000", "--bind", "0.0.0.0:8000", "wsgi:app"]
Dowser answered 17/10, 2022 at 14:53 Comment(0)

Check memory usage

In my case, I could not use the dmesg command, so I checked memory usage with a Docker command:

sudo docker stats <container-id>

CONTAINER ID   NAME               CPU %     MEM USAGE / LIMIT   MEM %     NET I/O        BLOCK I/O         PIDS
289e1ad7bd1d   funny_sutherland   0.01%     169MiB / 1.908GiB   8.65%     151kB / 96kB   8.23MB / 21.5kB   5

In my case, the workers were not being terminated because of memory.

Noellanoelle answered 8/2, 2023 at 7:5 Comment(2)
Hey, did you find anything other than memory that could kill your workers? – Professionalize
@SamiBoudoukha Actually my case was not because of a memory issue. I use Django and it failed to connect to the database internally with no failure log, nothing else. – Noellanoelle

In my case, I needed to connect to a remote database on a private network that requires connecting to a VPN first, and I had forgotten that.

So, check your database connection or anything else that causes your app to wait for a long time.

Yulan answered 26/3, 2023 at 3:54 Comment(2)
Please phrase this as an explained conditional answer, in order to avoid the impression of asking a clarification question instead of answering (for which a comment should be used instead of an answer, compare meta.stackexchange.com/questions/214173/… ). For example like "If your problem is ... then the solution is to .... because .... ." – Verniavernice
This does not provide an answer to the question. Once you have sufficient reputation you will be able to comment on any post; instead, provide answers that don't require clarification from the asker. - From Review – Roband

In my case the problem was a long application startup caused by ML model warm-up (over 3s).

Prevot answered 13/12, 2021 at 15:29 Comment(2)
How did you fix it? – Clougher
Got rid of the warm-up. Looking for ways to do it right after app start now. – Prevot

I had the same error, with pods restarting due to signal 9. The HPA wasn't working correctly because I had missed adding pod resource limits for memory and CPU in deployment.yaml. Once I added them, the HPA worked correctly and the pods were able to scale.

The reason for this error is your pod running out of memory (at least in my case).
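
If you would rather not edit deployment.yaml by hand, the same requests and limits can be set from the command line; this is a sketch with placeholder names and values (the HPA needs the requests in order to compute utilization).

# set requests/limits on an existing deployment (placeholder values)
kubectl set resources deployment my-app \
  --requests=cpu=250m,memory=512Mi \
  --limits=cpu=500m,memory=1Gi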

Luke answered 15/9, 2023 at 18:18 Comment(0)
