Why isn't work being distributed across gunicorn workers evenly?

I'm running a large, public-facing web application. It's a Python HTTP back-end server that responds to thousands of HTTP requests per minute. It is written with Flask & SQLAlchemy. The application is running on an EC2 instance in AWS. The instance type is c3.2xlarge (it has 8 CPUs).

I'm using Gunicorn as my webserver. Gunicorn has 17 worker processes and 1 master process. Below you can see the 17 gunicorn workers:

$ sudo ps -aefF | grep gunicorn | grep worker | wc -l
17

$ sudo ps -aefF --sort -rss | grep gunicorn | grep worker
UID       PID  PPID  C      SZ     RSS PSR STIME TTY     TIME                           CMD
my-user 15708 26468  6 1000306 3648504   1 Oct06   ? 08:46:19 gunicorn: worker [my-service]
my-user 23004 26468  1  320150  927524   0 Oct07   ? 02:07:55 gunicorn: worker [my-service]
my-user 26564 26468  0  273339  740200   3 Oct04   ? 01:43:20 gunicorn: worker [my-service]
my-user 26562 26468  0  135113  260468   4 Oct04   ? 00:29:40 gunicorn: worker [my-service]
my-user 26558 26468  0  109946  159696   7 Oct04   ? 00:15:14 gunicorn: worker [my-service]
my-user 26556 26468  0  125294  148180   6 Oct04   ? 00:13:07 gunicorn: worker [my-service]
my-user 26554 26468  0  120434  128016   5 Oct04   ? 00:10:13 gunicorn: worker [my-service]
my-user 26552 26468  0   99233  116832   5 Oct04   ? 00:08:24 gunicorn: worker [my-service]
my-user 26550 26468  0   94334   96784   0 Oct04   ? 00:05:28 gunicorn: worker [my-service]
my-user 26548 26468  0   92865   90512   2 Oct04   ? 00:04:47 gunicorn: worker [my-service]
my-user 27887 26468  1   91945   86564   0 17:44   ? 00:02:57 gunicorn: worker [my-service]
my-user 26546 26468  0  127841   84464   5 Oct04   ? 00:03:39 gunicorn: worker [my-service]
my-user 26544 26468  0   90290   80736   2 Oct04   ? 00:03:12 gunicorn: worker [my-service]
my-user 26540 26468  0  107669   78176   5 Oct04   ? 00:02:33 gunicorn: worker [my-service]
my-user 26542 26468  0   89446   76616   5 Oct04   ? 00:02:49 gunicorn: worker [my-service]
my-user 26538 26468  0   88056   72028   5 Oct04   ? 00:02:02 gunicorn: worker [my-service]
my-user 26510 26468  0  106046   70836   2 Oct04   ? 00:01:49 gunicorn: worker [my-service]

I'm examining logs of all the HTTP requests that came in over the past 7 days. I have grouped and summed the requests by process ID, which you can see in my ps command above. Below you can see the resultant graph.

As you can see, 5 gunicorn workers are doing almost 100% of the work. The remaining 12 are basically idle. And out of those 5, one worker (PID #15708) is doing by far the most work.

Why is this happening? I would like to understand the algorithm that gunicorn uses to distribute the work amongst its workers. It's definitely not round-robin? Where can I see the strategy it uses and how can I tweak it? What might explain the rises and falls in this graph? (For example PID #332 was doing the most work until October 7th when it started declining and was overtaken by rising PID #15708)

A clear explanation would be helpful and/or links to relevant documentation.

According to the documentation:

The default synchronous workers assume that your application is resource-bound in terms of CPU and network bandwidth.

And:

Gunicorn relies on the operating system to provide all of the load balancing when handling requests.

Based on those two statements I would say that the single worker doing the majority of the work is hardly ever resource bound. Once it is resource bound, the other four workers are enough to handle the additional load without needing to call on the others.
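To make "the operating system does the load balancing" more concrete, here is a minimal pre-fork sketch. It is not gunicorn's actual code: the function name `run_prefork_demo` and its parameters are made up for illustration, and it uses a plain blocking accept(), which is a simplification of what a real worker does. Several forked workers share one listening socket, and the kernel alone decides which worker receives each connection. It needs a Unix-like system because it relies on os.fork().

```python
import collections
import os
import signal
import socket


def run_prefork_demo(num_workers=4, num_requests=200):
    """Fork workers that all accept() on one shared socket and tally who served what."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(128)
    address = listener.getsockname()

    worker_pids = []
    for _ in range(num_workers):
        pid = os.fork()
        if pid == 0:
            # Worker process: there is no dispatcher. Every worker blocks in
            # accept() on the same socket and the kernel picks one per connection.
            try:
                while True:
                    conn, _ = listener.accept()
                    with conn:
                        conn.sendall(str(os.getpid()).encode())
            finally:
                os._exit(0)
        worker_pids.append(pid)

    # The parent never accepts; it only plays the role of the HTTP clients.
    listener.close()

    served = collections.Counter()
    try:
        for _ in range(num_requests):
            with socket.create_connection(address) as client:
                # The worker closes the connection after replying, so read to EOF.
                reply = client.makefile("rb").read()
                served[int(reply)] += 1
    finally:
        for pid in worker_pids:
            os.kill(pid, signal.SIGTERM)
            os.waitpid(pid, 0)
    return served


if __name__ == "__main__":
    for pid, count in run_prefork_demo().most_common():
        print(pid, count)
```

How the counts split between workers depends on the kernel and on how the workers wait for connections, so treat the output as an observation about your own machine rather than as a specification. The point is only that nothing in the server process enforces round-robin.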

You can probably safely lower the number of workers ((2 x num cores) + 1 is only the starting recommendation). This will reduce the possibility of resource thrashing and could improve the performance of your application.
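As a rough sketch, the worker count can be set in a gunicorn configuration file, which is ordinary Python. `workers` is a real gunicorn setting; the helper `suggested_workers` is a hypothetical name used here only to spell out the starting heuristic.

```python
# gunicorn.conf.py (sketch)
import multiprocessing


def suggested_workers(cores):
    """Starting-point heuristic from the gunicorn docs: (2 x cores) + 1."""
    return 2 * cores + 1


# With 8 cores this evaluates to 17, which matches the setup in the question.
workers = suggested_workers(multiprocessing.cpu_count())
```

To try a lower count, replace the last line with a fixed number (for example `workers = 8`, an arbitrary value chosen for illustration) and compare latency and memory use under real traffic before settling on it.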
