Why isn't work being distributed across gunicorn workers evenly?

I'm running a large, public-facing web application. It's a Python HTTP back-end server that responds to thousands of HTTP requests per minute. It is written with Flask and SQLAlchemy. The application runs on an EC2 instance in AWS. The instance type is c3.2xlarge (it has 8 CPUs).

I'm using Gunicorn as my webserver. Gunicorn has 17 worker processes and 1 master process. Below you can see the 17 gunicorn workers:

$ sudo ps -aefF | grep gunicorn | grep worker | wc -l
17

$ sudo ps -aefF --sort -rss | grep gunicorn | grep worker
UID       PID  PPID  C      SZ     RSS PSR STIME TTY     TIME                           CMD
my-user 15708 26468  6 1000306 3648504   1 Oct06   ? 08:46:19 gunicorn: worker [my-service]
my-user 23004 26468  1  320150  927524   0 Oct07   ? 02:07:55 gunicorn: worker [my-service]
my-user 26564 26468  0  273339  740200   3 Oct04   ? 01:43:20 gunicorn: worker [my-service]
my-user 26562 26468  0  135113  260468   4 Oct04   ? 00:29:40 gunicorn: worker [my-service]
my-user 26558 26468  0  109946  159696   7 Oct04   ? 00:15:14 gunicorn: worker [my-service]
my-user 26556 26468  0  125294  148180   6 Oct04   ? 00:13:07 gunicorn: worker [my-service]
my-user 26554 26468  0  120434  128016   5 Oct04   ? 00:10:13 gunicorn: worker [my-service]
my-user 26552 26468  0   99233  116832   5 Oct04   ? 00:08:24 gunicorn: worker [my-service]
my-user 26550 26468  0   94334   96784   0 Oct04   ? 00:05:28 gunicorn: worker [my-service]
my-user 26548 26468  0   92865   90512   2 Oct04   ? 00:04:47 gunicorn: worker [my-service]
my-user 27887 26468  1   91945   86564   0 17:44   ? 00:02:57 gunicorn: worker [my-service]
my-user 26546 26468  0  127841   84464   5 Oct04   ? 00:03:39 gunicorn: worker [my-service]
my-user 26544 26468  0   90290   80736   2 Oct04   ? 00:03:12 gunicorn: worker [my-service]
my-user 26540 26468  0  107669   78176   5 Oct04   ? 00:02:33 gunicorn: worker [my-service]
my-user 26542 26468  0   89446   76616   5 Oct04   ? 00:02:49 gunicorn: worker [my-service]
my-user 26538 26468  0   88056   72028   5 Oct04   ? 00:02:02 gunicorn: worker [my-service]
my-user 26510 26468  0  106046   70836   2 Oct04   ? 00:01:49 gunicorn: worker [my-service]

I'm examining logs of all the HTTP requests that came in over the past 7 days. I have grouped and summed the requests by process ID (the PIDs shown in my ps output above). Below you can see the resultant graph.

As you can see, 5 gunicorn workers are doing almost 100% of the work. The remaining 12 are basically idle. And out of those 5, one worker (PID #15708) is doing by far the most work.

Why is this happening? I would like to understand the algorithm that gunicorn uses to distribute the work amongst its workers. It's definitely not round-robin. Where can I see the strategy it uses, and how can I tweak it? What might explain the rises and falls in this graph? (For example, PID #332 was doing the most work until October 7th, when it started declining and was overtaken by the rising PID #15708.)

A clear explanation would be helpful and/or links to relevant documentation.

[Graph: HTTP requests handled per gunicorn worker PID over the past 7 days]

Overtrick answered 12/10, 2016 at 21:8 Comment(4)
What is the average response time? If I'm not wrong, on the 8th of October it works out to an average of 11.5 rps. If the requests are reasonably short-lived, the load could be taken by a couple of workers. Also, as raised below, are these just sync workers? If not, that would change what each worker is handling. - Entrance
What is rps? Is it requests per second? I calculate 14.1 requests per second on October 8th: (628929 + 150725 + 7317 + 7949 + 11581 + 13532 + 13972 + 84253 + 285848 + 14405) / (60 * 60 * 24). Over the duration of the graph shown, the average response time was 0.041 seconds. The worker class is gevent_pywsgi. - Overtrick
I must have missed a number when adding them up, sorry. But yes, that's what it is: requests per second. A couple of workers could easily handle that load, based on the time scale we are looking at. Bursts of requests are probably what causes the others to be used. - Entrance
I experienced exactly the same thing. Sometimes the master process delivers several requests to a single worker at the same time. Has anyone found a solution or workaround? - Chiarra
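The request-rate arithmetic in the comments above can be checked with a short Python sketch. It also applies Little's law (average requests in flight = arrival rate x mean response time), which is an addition here and not something the commenters computed. It assumes the ten per-worker counts quoted for October 8th, and it treats the 0.041-second mean (measured over the whole graph) as representative of that day.

```python
# Per-worker request counts for October 8th, as quoted in the comment above.
daily_counts = [628929, 150725, 7317, 7949, 11581,
                13532, 13972, 84253, 285848, 14405]

SECONDS_PER_DAY = 60 * 60 * 24
MEAN_RESPONSE_SECONDS = 0.041  # averaged over the whole graph, not only Oct 8

# Average arrival rate in requests per second.
requests_per_second = sum(daily_counts) / SECONDS_PER_DAY

# Little's law: mean number of requests being served at any instant.
mean_in_flight = requests_per_second * MEAN_RESPONSE_SECONDS

print(f"{requests_per_second:.1f} requests/second")
print(f"{mean_in_flight:.2f} requests in flight on average")
```

Under those assumptions fewer than one request is in flight on average, which is consistent with the suggestion that a couple of workers can absorb the steady load and the others are only used during bursts.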

According to the documentation:

The default synchronous workers assume that your application is resource-bound in terms of CPU and network bandwidth.

And:

Gunicorn relies on the operating system to provide all of the load balancing when handling requests.

Based on those two statements I would say that the single worker doing the majority of the work is hardly ever resource bound. Once it is resource bound, the other four workers are enough to handle the additional load without needing to call on the others.
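To make "the operating system provides the load balancing" concrete, here is a minimal sketch of the pre-fork pattern: several forked processes block in accept() on one shared listening socket, and the kernel alone decides which of them receives each connection. This is an illustration and not Gunicorn's actual code; `run_demo` and its parameters are hypothetical names, and it needs a Unix-like system because it uses `os.fork`.

```python
import collections
import os
import signal
import socket


def run_demo(num_workers=4, num_requests=200):
    """Fork workers sharing one listening socket; return {worker pid: requests served}."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(128)
    address = listener.getsockname()

    worker_pids = []
    for _ in range(num_workers):
        pid = os.fork()
        if pid == 0:
            # Child: every worker waits on the same inherited socket.
            # No user-space code chooses who gets the next connection.
            try:
                while True:
                    conn, _ = listener.accept()
                    with conn:
                        conn.sendall(str(os.getpid()).encode())
            finally:
                os._exit(0)
        worker_pids.append(pid)

    served = collections.Counter()
    try:
        for _ in range(num_requests):
            with socket.create_connection(address) as client:
                # Read until the worker closes the connection.
                reply = b"".join(iter(lambda: client.recv(64), b""))
                served[int(reply)] += 1
    finally:
        for pid in worker_pids:
            os.kill(pid, signal.SIGTERM)
            os.waitpid(pid, 0)
        listener.close()
    return served


if __name__ == "__main__":
    for pid, count in run_demo().most_common():
        print(pid, count)
```

The split between workers depends on the kernel and on how each worker waits for connections, so the counts may or may not come out even. Nothing in the sketch enforces round-robin, and the same holds for any server built on this pattern.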

You can probably safely lower the number of workers ((2 x num cores) + 1 is only the starting recommendation). This will reduce the possibility of resource thrashing and could improve the performance of your application.
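As a sketch of where the worker count is set, a Gunicorn configuration file is plain Python, and `workers` and `worker_class` are real Gunicorn settings. The helper name `recommended_workers` is hypothetical, and the formula is only the starting recommendation mentioned above (it yields the 17 workers in the question on an 8-CPU instance).

```python
# gunicorn.conf.py (a sketch; tune the numbers against real measurements)
import multiprocessing


def recommended_workers(cores):
    """Starting recommendation only: (2 x num cores) + 1."""
    return 2 * cores + 1


# Start from the recommendation, then lower it if most workers sit idle.
workers = recommended_workers(multiprocessing.cpu_count())

# The worker class the asker reports using.
worker_class = "gevent_pywsgi"
```

It could be loaded with something like `gunicorn -c gunicorn.conf.py my_service:app`, where `my_service:app` is a placeholder for the actual application module.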

Svelte answered 12/10, 2016 at 21:21 Comment(1)
The worker class is gevent_pywsgi, not the default synchronous workers. When they say "resource-bound", what exactly does that mean? I'm also unclear on what it means to say "the OS provides all the load balancing". Does that mean that when a job is waiting, the worker that takes it is decided by the OS, completely independently of gunicorn code and configuration? - Overtrick
