gunicorn doesn't use all CPU, resulting in a lot of timed-out requests

I am load-testing a gunicorn server (Uvicorn workers with FastAPI) on an AWS EC2 machine. I SSHed into the machine with port forwarding (ssh -L 8000:localhost:8000), so all requests to port 8000 on my local machine are routed to the EC2 machine.

I am using k6 on my local machine to generate artificial traffic (a load test) against the gunicorn server on the EC2 instance. With only 500-800 VUs, upwards of 46% of requests always fail, yet the CPU usage of the EC2 machine never goes past 30% on any of the 8 cores (according to htop). The instance is a c5a.2xlarge (4 cores, 8 threads).

Here's how I am launching gunicorn from the terminal (because of the config, gunicorn launches with 4 workers):

$ gunicorn api.main:app --worker-class uvicorn.workers.UvicornWorker --user dockerd --capture-output --keep-alive 0 --port 8000

The configuration file I am using is from tiangolo's uvicorn-gunicorn-docker.
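For readers unfamiliar with it, a gunicorn configuration file is plain Python whose module-level names are read as settings. The following is a minimal sketch with illustrative values, not a copy of the tiangolo file; the file name and the one-worker-per-logical-CPU choice are assumptions.

```python
# gunicorn_conf.py (hypothetical file name)
# Load with: gunicorn -c gunicorn_conf.py api.main:app
import multiprocessing

# Assumption: one worker process per logical CPU for CPU-bound work.
workers = multiprocessing.cpu_count()

# Same worker class as in the command above.
worker_class = "uvicorn.workers.UvicornWorker"

# Address and port the server listens on.
bind = "0.0.0.0:8000"

# Mirrors --keep-alive 0 from the command above.
keepalive = 0

# Seconds before a silent worker is killed and restarted (30 is gunicorn's default).
timeout = 30
```

Comparing the resulting `workers` value with the number of worker processes actually running is a quick way to check whether the configuration in use matches the machine.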

This is a FastAPI app serving a scikit-learn model, without any calls to a database or anything like that. So this is a completely CPU-bound app.

I am happy to provide more information as required.

Where and what changes should I make in uvicorn or gunicorn to serve lots of requests with as low a failure rate as possible, while using all resources to the maximum (or to the extent needed)?

Acetum answered 30/1, 2022 at 8:3 Comment(0)

I think the problem is UvicornWorker. Serving a scikit-learn model is CPU-bound work, so it is better to use a worker class suited to that. Change it to --worker-class=gthread

Messing answered 30/8, 2022 at 4:30 Comment(0)

Please check the load average on your instance. It is possible that the CPU is not being maxed out because the disk is becoming the bottleneck. If your load average indicates that several jobs are piled up but the CPU % is not going up, it might mean disk latency.

It is better to remove all logging. The load average is shown in "sudo htop", so you can look it up there. I am quite sure now that your problem is the disk.
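The load average is easier to interpret when divided by the number of logical CPUs. Below is a minimal sketch of that check for a Unix host; the helper name `load_per_cpu` is hypothetical. On Linux the load average also counts tasks in uninterruptible sleep (typically waiting on I/O), so values that stay above 1.0 per CPU while CPU utilisation is low are consistent with, though not proof of, an I/O bottleneck.

```python
import os


def load_per_cpu():
    """Return the 1, 5 and 15 minute load averages divided by the logical CPU count."""
    cpus = os.cpu_count() or 1  # os.cpu_count() may return None
    return tuple(avg / cpus for avg in os.getloadavg())


if __name__ == "__main__":
    one, five, fifteen = load_per_cpu()
    print(f"load per CPU: 1m={one:.2f} 5m={five:.2f} 15m={fifteen:.2f}")
```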

Affaire answered 31/1, 2022 at 9:6 Comment(2)
After running the tests, the load average of the last 15 minutes has been more than 88% (the 3rd number from the uptime command), but the % of requests that fail is still more or less the same. Is that an indication of disk latency? - Acetum
I am writing gunicorn logs at info level to a log file, and I also get a 'too many files open' warning printed to the console multiple times. Is that an indication of a slower disk? - Acetum
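A "too many open files" message usually means a process reached its file-descriptor limit, and every open socket counts against that limit as well as every open log file. Below is a minimal sketch for inspecting and raising the soft limit from Python on a Unix host; the helper names are hypothetical, and whether this limit explains the failed requests here is an assumption to verify.

```python
import resource


def nofile_limits():
    """Return the (soft, hard) limits on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def raise_soft_limit(target):
    """Raise the soft limit towards target, never above the hard limit."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY:
        return soft, hard  # already unlimited, nothing to do
    if hard != resource.RLIM_INFINITY:
        target = min(target, hard)
    if target > soft:
        resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)


if __name__ == "__main__":
    print("open-file limits (soft, hard):", nofile_limits())
```

The limit is inherited by child processes, so it has to be raised in the shell or service that launches gunicorn (or early in the gunicorn configuration file) to affect the workers.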
