How to use K8S HPA and autoscaler when Pods normally need low CPU but periodically scale
I am trying to determine a reliable setup to use with K8S to scale one of my deployments using an HPA and an autoscaler. I want to minimize the amount of resources overcommitted but allow it to scale up as needed.

I have a deployment that is managing a REST API service. Most of the time the service will have very low usage (0m-5m cpu). But periodically through the day or week it will spike to much higher usage on the order of 5-10 CPUs (5000m-10000m).

My initial pass at configuring this is:

  • Deployment: 1 replica
"resources": {
   "requests": {
     "cpu": 0.05
   },
   "limits": {
      "cpu": 1.0
   }
}
  • HPA:
"spec": {
   "maxReplicas": 25,
   "metrics": [
      {
         "resource": {
         "name": "cpu",
         "target": {
            "averageValue": 0.75,
            "type": "AverageValue"
         }
         },
         "type": "Resource"
      }
   ],
   "minReplicas": 1,
   ...
}
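(For reference, the same CPU quantities written as the millicore strings used elsewhere in this post: 0.05 = "50m", 0.75 = "750m", 1.0 = "1000m".)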

This is running on an AWS EKS cluster with the cluster-autoscaler running. All instances have 2 CPUs. The goal is that as CPU usage goes up, the HPA will create a new pod that cannot be scheduled on the existing nodes, and the cluster-autoscaler will then allocate a new node. As I add load on the service, the CPU usage of the first pod spikes to approximately 90-95%.

I am running into two related problems:

  1. Small request size

By using such a small request value (cpu: 0.05), the newly requested pods can easily be scheduled on the current node even when it is under high load. Thus the autoscaler never finds a pod that can't be scheduled and never allocates a new node. I could increase the request size, but then for the vast majority of the time, when there is no load, I would be committing resources I don't need.
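To make that concrete (rough numbers): a 2-CPU node exposes a bit under 2000m of allocatable CPU after system reservations, so with a 50m request the scheduler can fit on the order of 35-40 replicas on that node purely by request math, regardless of how saturated the node actually is.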

  2. Average CPU reduces as more pods are allocated

Because the pods all get allocated on the same node, once a new pod is allocated it starts sharing the node's available 2 CPUs. This in turn reduces the amount of CPU used by each pod and thus keeps the average value below the 0.75 (75% of a core) target.

(ex: 3 pods, 2 CPUs ==> max 66% Average CPU usage per pod)

I am looking for guidance here on how I should be thinking about this problem. I think I am missing something simple.

My current thought is that what I am looking for is a way for the Pod resource request value to increase under heavier load and then decrease back down when the system doesn't need it. That would point me toward using something like a VPA, but everything I have read says that using HPA and VPA at the same time leads to very bad things.

I think increasing the request from 0.05 to something like 0.20 would probably let me handle the case of scaling up. But this will in turn waste a lot of resources, and could still misbehave if the scheduler finds space on an existing node. My example is about one service, but there are many more services in the production deployment. I don't want to have nodes sitting empty with committed resources but no usage.

What is the best path forward here?

Garnettgarnette answered 30/3, 2021 at 22:19

Sounds like you need a scheduler that takes actual CPU utilization into account. This is not supported by the default scheduler yet.

There seems to be work on this feature: KEP - Trimaran: Real Load Aware Scheduling using TargetLoadPacking plugin. Also see New scheduler priority for real load average and free memory.

In the meantime: if the CPU limit is 1 core and the nodes autoscale under high CPU utilization, it sounds like it should work if the nodes are substantially bigger than the CPU limits of the pods. E.g. try nodes that have 4 cores or more, and possibly a slightly larger CPU request for the pod?
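A rough sketch of what that could look like (the values are illustrative, not tuned for your workload): keep the 1-core limit, raise the request a bit, and run the deployment on a 4-core node group:

"resources": {
   "requests": {
      "cpu": "250m"
   },
   "limits": {
      "cpu": "1"
   }
}

The idea is that each replica now reserves a meaningful share of a node, so a burst of new replicas exhausts the node's allocatable CPU and produces an unschedulable pod, which is what triggers the cluster-autoscaler; the idle cost stays at 250m per replica.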

Pelops answered 30/3, 2021 at 23:40
Thanks for the quick response. I am having trouble thinking through why having 4 cores helps out. I will give it a try today, but could you explain that part a bit further to help me know what I am looking for when testing? – Garnettgarnette
The idea is that node-autoscaling will happen some time before pod-autoscaling needs a new node, and the new pod will hopefully be scheduled onto a node with less allocation. – Pelops
That makes sense. Regarding your comment above, I also found references to this custom scheduler that may help: github.com/IBM/kube-safe-scheduler. I can't find any articles that describe exactly how to use a custom scheduler though, so I think I will stick with the standard scheduler and see what I can do. The custom scheduler you pointed at does seem like the long-term solution. – Garnettgarnette
There is a field in the Pod template, schedulerName:, that you use for pods that should be scheduled with a different scheduler (a minimal example is sketched after these comments). Also see kubernetes.io/docs/tasks/extend-kubernetes/… – Pelops
I saw that part; what I didn't see were any articles showing better schedulers and how to get them installed and running. It is something I will be on the lookout for, though, if this method doesn't work out well. – Garnettgarnette
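For reference (following up on the schedulerName: comment above), a minimal sketch of pointing a Deployment's pods at a non-default scheduler; the scheduler name and image below are placeholders:

"spec": {
   "template": {
      "spec": {
         "schedulerName": "my-custom-scheduler",
         "containers": [
            {
               "name": "api",
               "image": "example/rest-api:latest"
            }
         ]
      }
   }
}

Pods that don't set schedulerName keep using the default kube-scheduler, so a custom scheduler can be trialled on a single Deployment without affecting the rest of the cluster.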
