I am trying to work out a reliable setup for scaling one of my Kubernetes deployments using an HPA and the cluster autoscaler. I want to minimize the amount of overcommitted resources, but still allow the service to scale up as needed.
I have a deployment that runs a REST API service. Most of the time the service has very low usage (0m-5m CPU), but periodically throughout the day or week it spikes to much higher usage, on the order of 5-10 CPUs (5000m-10000m).
My initial pass at configuring this is:
- Deployment: 1 replica
"resources": {
"requests": {
"cpu": 0.05
},
"limits": {
"cpu": 1.0
}
}
- HPA:
"spec": {
"maxReplicas": 25,
"metrics": [
{
"resource": {
"name": "cpu",
"target": {
"averageValue": 0.75,
"type": "AverageValue"
}
},
"type": "Resource"
}
],
"minReplicas": 1,
...
}
This is running on an AWS EKS cluster with the cluster autoscaler running. All instances have 2 CPUs. The goal is that as CPU usage goes up, the HPA will request a new pod, that pod will be unschedulable, and the cluster autoscaler will then add a new node. As I add load to the service, the CPU usage of the first pod spikes up to approximately 90-95% at peak.
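For reference, here is how I understand the HPA's documented scaling formula playing out during a spike, assuming the 90-95% is measured against the pod's 1-CPU limit (so roughly 900m-950m of actual usage):

    desiredReplicas = ceil(currentReplicas * currentAverageUsage / targetAverageValue)
                    = ceil(1 * ~950m / 750m)
                    = ceil(~1.27)
                    = 2

So the HPA does ask for a second pod, which matches what I see.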
I am running into two related problems:
- Small request size
By using such a small request value (cpu: 0.05), newly requested pods can easily be scheduled on the current node even when it is under high load. Thus the autoscaler never finds a pod that can't be scheduled and never adds a node. I could increase the request size and overcommit, but that means that for the vast majority of the time, when there is no load, I would be reserving resources I don't need.
- Average CPU reduces as more pods are allocated
Because the pods all get scheduled onto the same node, each new pod just shares the node's 2 available CPUs. This in turn reduces the CPU used per pod and keeps the average below the 750m (0.75) target.
(ex: 3 pods sharing 2 CPUs ==> at most ~666m, about 66% of a CPU, average usage per pod; see the numbers worked out just after this list)
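Working that second point through with the same scaling formula (again treating the node's 2 CPUs as fully shared between the pods, which is roughly what I observe):

    averagePerPod   = 2000m / 3 pods ≈ 667m
    desiredReplicas = ceil(3 * 667m / 750m)
                    = ceil(~2.67)
                    = 3

So once the pods saturate the node, the per-pod average sits just under the 750m target and the HPA has no reason to add a fourth replica, even though the node itself is pegged.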
I am looking for guidance here on how I should be thinking about this problem. I think I am missing something simple.
My current thought is that what I am looking for is a way for the pod's CPU request to increase under heavier load and then decrease when the system doesn't need it. That points me toward something like the VPA, but everything I have read says that using an HPA and a VPA on the same resource metric at the same time leads to very bad things.
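For completeness, the kind of VPA object I would be looking at is roughly the following; this is just a sketch of the VPA CRD, the names are placeholders, and it is exactly the HPA-plus-VPA-on-CPU combination the warnings are about:

    {
      "apiVersion": "autoscaling.k8s.io/v1",
      "kind": "VerticalPodAutoscaler",
      "metadata": {
        "name": "my-rest-api-vpa"
      },
      "spec": {
        "targetRef": {
          "apiVersion": "apps/v1",
          "kind": "Deployment",
          "name": "my-rest-api"
        },
        "updatePolicy": {
          "updateMode": "Auto"
        }
      }
    }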
I think increasing the request from 0.05 to something like 0.20 would probably let me handle the scale-up case. But this in turn wastes a lot of resources and could still fail if the scheduler finds space on an existing node. My example is about one service, but there are many more services in the production deployment, and I don't want nodes sitting mostly idle with committed resources but no actual usage.
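Concretely, the bump I am describing would only change the request value, e.g. (just an illustration, not a number I am confident in):

    "resources": {
      "requests": {
        "cpu": 0.20
      },
      "limits": {
        "cpu": 1.0
      }
    }

On a 2-CPU node that still leaves room for several of these pods before anything becomes unschedulable, so it mostly just postpones the same problem.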
What is the best path forward here?