Flask application scaling on Kubernetes and Gunicorn

We have a Flask application that is served via gunicorn, using the eventlet worker. We're deploying the application in a Kubernetes pod, with the idea of scaling the number of pods depending on workload.

The recommended setting for the number of workers in gunicorn is 2-4 x $NUM_CPUS. See docs. I've previously deployed services on dedicated physical hardware where such calculations made sense. On a 4-core machine, having 16 workers sounds OK, and we eventually bumped it to 32 workers.

Does this calculation still apply in a Kubernetes pod using an async worker, particularly as:

  1. There could be multiple pods on a single node.
  2. The same service will be run in multiple pods.

How should I set the number of gunicorn workers?

  1. Set it to -w 1 and let Kubernetes handle the scaling via pods?
  2. Set it to 2-4 x $NUM_CPU on the Kubernetes nodes. On one pod or multiple?
  3. Something else entirely?

Update

We decided to go with the 1st option, which is our current approach: set the number of gunicorn workers to 1, and scale horizontally by increasing the number of pods. Otherwise there would be too many moving parts, and we wouldn't be leveraging Kubernetes to its full potential.
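For reference, a minimal sketch of what this looks like in practice; the names, image, and port are hypothetical, but the gunicorn flags match the setup described above:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: flask-app              # hypothetical name
spec:
  replicas: 3                  # in practice driven by an HPA
  selector:
    matchLabels:
      app: flask-app
  template:
    metadata:
      labels:
        app: flask-app
    spec:
      containers:
      - name: web
        image: registry.example.com/flask-app:latest  # hypothetical image
        # One async worker per pod: concurrency inside the pod comes
        # from eventlet; throughput scaling comes from adding pods.
        command: ["gunicorn", "-w", "1", "-k", "eventlet",
                  "-b", "0.0.0.0:8000", "app:app"]
        ports:
        - containerPort: 8000
```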

Nasal answered 25/6, 2019 at 7:21 Comment(1)
You can also set the number of worker Pods per node using the scheduler topology feature to avoid overcommitting resources: kubernetes.io/docs/concepts/workloads/pods/…Make
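As a hedged sketch of what that comment points at, here is a pod-spec fragment using the Pod topology spread constraints feature (added to Kubernetes after this question was asked); the label is hypothetical:

```yaml
# Pod spec fragment: spread replicas across nodes so that no single
# node accumulates too many copies of the same service.
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: kubernetes.io/hostname
  whenUnsatisfiable: DoNotSchedule
  labelSelector:
    matchLabels:
      app: flask-app   # hypothetical label matching the sketch above
```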

For better visibility, here is the final solution chosen by the original author of this question as of 2019:

Set the number of gunicorn workers to 1 (-w 1), and scale horizontally by increasing the number of pods (using Kubernetes HPA).

Note that it might not remain applicable for long, given the fast growth of workload-related features in the Kubernetes platform; for example, some distributions of Kubernetes now offer Vertical Pod Autoscaling (VPA) and Multidimensional Pod Autoscaling (MPA) alongside HPA. I therefore propose continuing this thread as a community wiki post.
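As a concrete sketch of that approach, a basic HPA manifest of the era (autoscaling/v1); the target name and thresholds are hypothetical:

```yaml
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: flask-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: flask-app          # hypothetical deployment name
  minReplicas: 2
  maxReplicas: 10
  # Add pods when average CPU across pods exceeds 80% of the
  # per-pod CPU request; remove pods when it falls back below.
  targetCPUUtilizationPercentage: 80
```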

Inland answered 25/6, 2019 at 7:21 Comment(0)

I'm not a developer and this doesn't seem like a simple task, but for your consideration, please follow the best practices from Better performance by optimizing Gunicorn config.

In addition, Kubernetes offers different mechanisms to scale your deployment, such as HPA based on CPU utilization (see also: How is Python scaling with Gunicorn and Kubernetes?).

You can also use Resource requests and limits of Pod and Container.
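For illustration, a hedged container-spec fragment with hypothetical values (in Kubernetes CPU units, 500m is half a core):

```yaml
# Container spec fragment: the scheduler places the pod based on
# requests; the kubelet enforces the limits at runtime.
resources:
  requests:
    cpu: "500m"        # half a CPU core (hypothetical value)
    memory: "256Mi"
  limits:
    cpu: "1"           # one full core
    memory: "512Mi"
```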

As per the Gunicorn documentation:

DO NOT scale the number of workers to the number of clients you expect to have. Gunicorn should only need 4-12 worker processes to handle hundreds or thousands of requests per second. Gunicorn relies on the operating system to provide all of the load balancing when handling requests. Generally we recommend (2 x $num_cores) + 1 as the number of workers to start off with. While not overly scientific, the formula is based on the assumption that for a given core, one worker will be reading or writing from the socket while the other worker is processing a request.
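Gunicorn config files are Python, so that formula is usually written as in the sketch below. One caveat worth hedging on in a Kubernetes context: multiprocessing.cpu_count() reports the node's logical cores, not the pod's CPU limit, so the formula can badly oversubscribe a CPU-limited container.

```python
# gunicorn.conf.py -- a minimal sketch of the (2 x $num_cores) + 1 rule.
import multiprocessing

# NOTE: inside a container this returns the *node's* core count,
# not the pod's CPU limit, so on Kubernetes it is often better to
# hardcode the value (e.g. workers = 1 and scale with pods instead).
workers = multiprocessing.cpu_count() * 2 + 1
worker_class = "eventlet"   # the async worker used in the question
bind = "0.0.0.0:8000"       # hypothetical port
```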

Update

Depending on your approach, you can choose a different solution (Deployment, DaemonSet). All of the above can be achieved in Kubernetes by assigning resources accordingly; see Assigning CPU Resources to Containers and Pods.

  1. Using a Deployment with resources (limits, requests) gives you the possibility of scaling your app into multiple pods on a single node, within your hardware limits, but depending on your app's load it may not be a good enough solution.

CPU requests and limits are associated with Containers, but it is useful to think of a Pod as having a CPU request and limit. The CPU request for a Pod is the sum of the CPU requests for all the Containers in the Pod. Likewise, the CPU limit for a Pod is the sum of the CPU limits for all the Containers in the Pod.

Note:

The CPU resource is measured in CPU units. One CPU, in Kubernetes, is equivalent to, for example, 1 GCP Core.

  2. As mentioned in the post, the second approach (scaling your app across multiple nodes) is also a good choice. In this case you can consider using, for example, a StatefulSet or a Deployment. In addition, on GKE, the cluster autoscaler gives you a more extensible solution: when you create new pods that don't have enough capacity to run inside the cluster, the cluster autoscaler automatically adds resources.

On the other hand, you can consider other solutions like Cerebral, which gives you the possibility to create user-defined policies for increasing or decreasing the size of node pools inside your cluster.

GKE's cluster autoscaler automatically resizes clusters based on the demands of the workloads you want to run. With autoscaling enabled, GKE automatically adds a new node to your cluster if you've created new Pods that don't have enough capacity to run; conversely, if a node in your cluster is underutilized and its Pods can be run on other nodes, GKE can delete the node.

Please keep in mind that the question is very general and there is no single good answer for this topic. You should consider all the pros and cons based on your requirements, load, activity, capacity, costs ...

Hope this helps.

Milden answered 25/6, 2019 at 11:8 Comment(2)
I'm familiar with Gunicorn scaling and with Kubernetes horizontal autoscaling. The question is what happens when both technologies intersect, which none of these documents address.Nasal
The answer was updated with Kubernetes solutions based on the considerations inside the post. Please share your findings.Milden
