Kubernetes pod distribution amongst nodes

Is there any way to make Kubernetes distribute pods as much as possible? I have resource requests on all deployments and global requests, as well as an HPA. All nodes are identical.

I just had a situation where my ASG scaled down a node and one service became completely unavailable, because all 4 of its pods were on the node that was scaled down.

I would like to maintain a situation where each deployment must spread its containers across at least 2 nodes.

Apo answered 15/12, 2016 at 8:47 Comment(0)

Here I build on Anirudh's answer by adding example code.

My initial Kubernetes YAML looked like this:

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: say-deployment
spec:
  replicas: 6
  template:
    metadata:
      labels:
        app: say
    spec:
      containers:
      - name: say
        image: gcr.io/hazel-champion-200108/say
        ports:
        - containerPort: 8080
---
kind: Service
apiVersion: v1
metadata:
  name: say-service
spec:
  selector:
    app: say
  ports:
    - protocol: TCP
      port: 8080
  type: LoadBalancer
  externalIPs:
    - 192.168.0.112

At this point, the Kubernetes scheduler somehow decides that all 6 replicas should be deployed on the same node.

Then I added a requiredDuringSchedulingIgnoredDuringExecution pod anti-affinity rule to force the pods to be deployed on different nodes:

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: say-deployment
spec:
  replicas: 6
  template:
    metadata:
      labels:
        app: say
    spec:
      containers:
      - name: say
        image: gcr.io/hazel-champion-200108/say
        ports:
        - containerPort: 8080
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: "app"
                operator: In
                values:
                - say
            topologyKey: "kubernetes.io/hostname"
---
kind: Service
apiVersion: v1
metadata:
  name: say-service
spec:
  selector:
    app: say
  ports:
    - protocol: TCP
      port: 8080
  type: LoadBalancer
  externalIPs:
    - 192.168.0.112

Now all the running pods are on different nodes. Since I have 3 nodes and 6 pods, the remaining 3 pods (6 minus 3) cannot be scheduled and stay Pending. This is because I required it: requiredDuringSchedulingIgnoredDuringExecution.

kubectl get pods -o wide 

NAME                              READY     STATUS    RESTARTS   AGE       IP            NODE
say-deployment-8b46845d8-4zdw2    1/1       Running   0          24s       10.244.2.80   night
say-deployment-8b46845d8-699wg    0/1       Pending   0          24s       <none>        <none>
say-deployment-8b46845d8-7nvqp    1/1       Running   0          24s       10.244.1.72   gray
say-deployment-8b46845d8-bzw48    1/1       Running   0          24s       10.244.0.25   np3
say-deployment-8b46845d8-vwn8g    0/1       Pending   0          24s       <none>        <none>
say-deployment-8b46845d8-ws8lr    0/1       Pending   0          24s       <none>        <none>

Now if I loosen this requirement with preferredDuringSchedulingIgnoredDuringExecution:

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: say-deployment
spec:
  replicas: 6
  template:
    metadata:
      labels:
        app: say
    spec:
      containers:
      - name: say
        image: gcr.io/hazel-champion-200108/say
        ports:
        - containerPort: 8080
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: "app"
                  operator: In
                  values:
                  - say
              topologyKey: "kubernetes.io/hostname"
---
kind: Service
apiVersion: v1
metadata:
  name: say-service
spec:
  selector:
    app: say
  ports:
    - protocol: TCP
      port: 8080
  type: LoadBalancer
  externalIPs:
    - 192.168.0.112

The first 3 pods are deployed on 3 different nodes, just as in the previous case. The remaining 3 (6 pods minus 3 nodes) are deployed on various nodes according to the scheduler's internal considerations.

NAME                              READY     STATUS    RESTARTS   AGE       IP            NODE
say-deployment-57cf5fb49b-26nvl   1/1       Running   0          59s       10.244.2.81   night
say-deployment-57cf5fb49b-2wnsc   1/1       Running   0          59s       10.244.0.27   np3
say-deployment-57cf5fb49b-6v24l   1/1       Running   0          59s       10.244.1.73   gray
say-deployment-57cf5fb49b-cxkbz   1/1       Running   0          59s       10.244.0.26   np3
say-deployment-57cf5fb49b-dxpcf   1/1       Running   0          59s       10.244.1.75   gray
say-deployment-57cf5fb49b-vv98p   1/1       Running   0          59s       10.244.1.74   gray
Toombs answered 18/4, 2018 at 12:47 Comment(3)
What if I only have one node available and the other two are full? Will it deploy all pods on that single node (in which case it is not fulfilling our requirement), or will it spin up new nodes and deploy on different nodes?Miniskirt
@Maxin: Can you please check and answer: #58718526Miniskirt
This answer has become outdated. The right answer for k8s 1.19 is to use topologySpreadConstraints (docs). See https://mcmap.net/q/259173/-how-can-i-distribute-a-deployment-across-nodes Embellish

Sounds like what you want is Inter-Pod Affinity and Pod Anti-affinity.

Inter-pod affinity and anti-affinity were introduced in Kubernetes 1.4. Inter-pod affinity and anti-affinity allow you to constrain which nodes your pod is eligible to schedule on based on labels on pods that are already running on the node rather than based on labels on nodes. The rules are of the form “this pod should (or, in the case of anti-affinity, should not) run in an X if that X is already running one or more pods that meet rule Y.” Y is expressed as a LabelSelector with an associated list of namespaces (or “all” namespaces); unlike nodes, because pods are namespaced (and therefore the labels on pods are implicitly namespaced), a label selector over pod labels must specify which namespaces the selector should apply to. Conceptually X is a topology domain like node, rack, cloud provider zone, cloud provider region, etc. You express it using a topologyKey which is the key for the node label that the system uses to denote such a topology domain, e.g. see the label keys listed above in the section “Interlude: built-in node labels.”

Anti-affinity can be used to ensure that you are spreading your pods across failure domains. You can state these rules as preferences or as hard rules. In the latter case, if the scheduler is unable to satisfy your constraint, the pod will fail to get scheduled.
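
For illustration, a minimal sketch of the preferred (soft) variant, placed under a Deployment's pod template spec (spec.template.spec), might look like this; the app: my-service label is a placeholder for your own pod labels:

# Hypothetical fragment of a pod template spec; labels are placeholders.
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100                           # strongest preference
      podAffinityTerm:
        labelSelector:
          matchLabels:
            app: my-service                 # matches the Deployment's own pods
        topologyKey: kubernetes.io/hostname # one topology domain per node

With the requiredDuringSchedulingIgnoredDuringExecution variant instead, pods that cannot be placed on a distinct node stay Pending, as shown in the answer above.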

Pearsall answered 15/12, 2016 at 17:26 Comment(6)
I guess this is more of a Kubernetes problem. The scheduler should be intelligent enough to spread the pods in a way that provides high availability, even in case a node goes down (provided there is more than one node). Affinity works, but it is more useful when we want stateful containers to be scheduled on a node that has SSD disk space. Just my thoughts, what do you guys think?Huffman
That is not strictly true. A preferred affinity rule can specify relations between pods using labels, irrespective of whether they have attached storage.Pearsall
I don't believe this is the right approach. The fact that I want the pods to be spread doesn't mean that I don't want to have 2 on the same node. This solution is like telling me to create a DaemonSet, and that's not what I am looking for. Or is there a way to make such an intelligent pod affinity that it will decide, when there is enough availability in the deployment, to start placing more of the same pods on the same node?Apo
@Gleeb, the affinity can be preferences or hard rules. If it's preferred, it will fall back to using the same node if it cannot find different nodes to schedule on.Pearsall
Thanks, I will check this out.Apo
@AnirudhRamanathan : Can you please check and see if you can answer this query of mine : #58718526Miniskirt

Is there any way to make kubernetes distribute pods as much as possible?

Yes, use topologyKey; it's more advanced and preferred.

I would like to maintain a situation where each deployment must spread its containers

CO-LOCATING PODS IN THE SAME AVAILABILITY ZONE

If one needs to run the frontend pods in the same zone as the backend pod, all one would need to do is change the topologyKey property to failure-domain.beta.kubernetes.io/zone.

CO-LOCATING PODS IN THE SAME GEOGRAPHICAL REGION

If one needs the pods to be deployed in the same region instead of the same zone, the topologyKey would be set to failure-domain.beta.kubernetes.io/region.

UNDERSTANDING HOW TOPOLOGYKEY WORKS

It is simple. If you want, you can easily use your own topologyKey, such as rack, to have the pods scheduled to the same server rack. The only prerequisite is to add a rack label to your nodes.

For example, if you had 20 nodes, with 10 in each rack, you'd label the first ten as rack=rack1 and the others as rack=rack2. Then, when defining a pod's podAffinity, you'd set the topologyKey to rack.

Working Process

When the Scheduler is deciding where to deploy a pod, it checks the pod's podAffinity config, finds the pods that match the label selector, and looks up the nodes they're running on. Specifically, it looks up the node label whose key matches the topologyKey field specified in podAffinity. Then it selects all the nodes whose label value matches that of the nodes running the pods it found earlier. In figure 16.5, the label selector matched the backend pod, which runs on Node 12. The value of the rack label on that node equals rack2, so when scheduling a frontend pod, the Scheduler will only select among the nodes that have the rack=rack2 label.

[Figure 16.5: the backend pod runs on Node 12, which is labeled rack=rack2; frontend pods are scheduled only onto rack2 nodes.]
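
A minimal sketch of such a rack-based rule might look like the following; the frontend/backend names and labels are placeholders, and nodes are assumed to already carry a rack label as described above:

# Hypothetical example: schedule this frontend pod into the same rack as the
# backend pods (nodes labeled rack=rack1, rack=rack2, ...).
apiVersion: v1
kind: Pod
metadata:
  name: frontend
  labels:
    app: frontend
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: backend    # assumed label on the backend pods
        topologyKey: rack   # custom node label used as the topology domain
  containers:
  - name: main
    image: registry.k8s.io/pause:3.1   # placeholder image

For the spreading case asked about in the question, the same topologyKey would instead be used under podAntiAffinity, as shown in the accepted answer.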

Helio answered 2/6 at 17:49 Comment(0)

Instead of podAntiAffinity, please review Pod Topology Spread Constraints:

You can use topology spread constraints to control how Pods are spread across your cluster among failure-domains such as regions, zones, nodes, and other user-defined topology domains. This can help to achieve high availability as well as efficient resource utilization.

Example:

kind: Pod
apiVersion: v1
metadata:
  name: mypod
  labels:
    foo: bar
spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        foo: bar
  containers:
  - name: pause
    image: registry.k8s.io/pause:3.1

New features in K8s 1.27: More fine-grained pod topology spread policies reached beta.
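
Applied to the original question (spreading a Deployment's pods across nodes), a minimal sketch might look like this; the name, labels, and image are placeholders, and whenUnsatisfiable: ScheduleAnyway keeps the constraint soft so pods still schedule when there are fewer nodes than replicas:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-service
spec:
  replicas: 4
  selector:
    matchLabels:
      app: my-service
  template:
    metadata:
      labels:
        app: my-service
    spec:
      topologySpreadConstraints:
      - maxSkew: 1                          # pod counts per node may differ by at most 1
        topologyKey: kubernetes.io/hostname # treat each node as its own topology domain
        whenUnsatisfiable: ScheduleAnyway   # soft: prefer spreading, never block scheduling
        labelSelector:
          matchLabels:
            app: my-service
      containers:
      - name: my-service
        image: registry.k8s.io/pause:3.1    # placeholder image

Using whenUnsatisfiable: DoNotSchedule instead makes it a hard constraint, analogous to requiredDuringSchedulingIgnoredDuringExecution above.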

Cleaves answered 29/7 at 10:45 Comment(0)
