How can I distribute a deployment across nodes?
I have a Kubernetes deployment that looks something like this (replaced names and other things with '....'):

# Please edit the object below. Lines beginning with a '#' will be ignored,
# and an empty file will abort the edit. If an error occurs while saving this file will be
# reopened with the relevant failures.
#
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "3"
    kubernetes.io/change-cause: kubectl replace deployment ....
      -f - --record
  creationTimestamp: 2016-08-20T03:46:28Z
  generation: 8
  labels:
    app: ....
  name: ....
  namespace: default
  resourceVersion: "369219"
  selfLink: /apis/extensions/v1beta1/namespaces/default/deployments/....
  uid: aceb2a9e-6688-11e6-b5fc-42010af000c1
spec:
  replicas: 2
  selector:
    matchLabels:
      app: ....
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 1
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: ....
    spec:
      containers:
      - image: gcr.io/..../....:0.2.1
        imagePullPolicy: IfNotPresent
        name: ....
        ports:
        - containerPort: 8080
          protocol: TCP
        resources:
          requests:
            cpu: "0"
        terminationMessagePath: /dev/termination-log
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      securityContext: {}
      terminationGracePeriodSeconds: 30
status:
  availableReplicas: 2
  observedGeneration: 8
  replicas: 2
  updatedReplicas: 2

The problem I'm observing is that Kubernetes places both replicas (in the deployment I've asked for two) on the same node. If that node goes down, I lose both containers and the service goes offline.

What I want Kubernetes to do is to ensure that it doesn't double up containers on the same node where the containers are the same type - this only consumes resources and doesn't provide any redundancy. I've looked through the documentation on deployments, replica sets, nodes etc. but I couldn't find any options that would let me tell Kubernetes to do this.

Is there a way to tell Kubernetes how much redundancy across nodes I want for a container?

EDIT: I'm not sure labels will work; labels constrain where a pod will run so that it has access to local resources (SSDs) etc. All I want to do is ensure no downtime if a node goes offline.

Julius asked 23/8, 2016 at 3:56 Comment(0)
164

There is now a proper way of doing this: topology spread constraints. Use kubernetes.io/hostname as the topology key if you just want to spread the pods across all nodes. With two replicas of a pod and two nodes, each node should get one replica.

Example:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-service
  labels:
    app: my-service
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-service
  template:
    metadata:
      labels:
        app: my-service
    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: my-service
      containers:
      - name: pause
        image: k8s.gcr.io/pause:3.1
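
If you would rather have a soft rule, so that extra replicas can still be scheduled when there are more replicas than nodes, the same constraint can be relaxed by setting whenUnsatisfiable to ScheduleAnyway. A minimal sketch of just that stanza, reusing the app: my-service label from above:

      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: ScheduleAnyway  # best effort: spread when possible, but still schedule if it can't
        labelSelector:
          matchLabels:
            app: my-service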
Dumbarton answered 22/11, 2020 at 19:17 Comment(3)
The accepted answer should be changed to this one. And this needs to be upvoted more. Since K8s 1.19, this is the way.Vulgarize
I agree that this should be the accepted answer.Max
This answer should be an accepted answer too!Retarded
22

I think you're looking for the Affinity/Anti-Affinity Selectors.

Affinity is for co-locating pods: for example, I want my website to try to schedule on the same host as my cache. Anti-affinity is the opposite: don't schedule pods together on a host, according to a set of rules.

So for what you're doing, I would take a closer look at these two links: https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#never-co-located-in-the-same-node

https://kubernetes.io/docs/tutorials/stateful-application/zookeeper/#tolerating-node-failure
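
For reference, a rough sketch of a soft anti-affinity stanza for the pod template spec (the app: my-app label is a placeholder for your own pod labels). preferredDuringSchedulingIgnoredDuringExecution asks the scheduler to avoid putting pods with the same label on one node, but still schedules them somewhere if no other node is available:

      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  app: my-app  # placeholder; must match your pod template labels
              topologyKey: kubernetes.io/hostname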

Pissed answered 29/10, 2017 at 17:21 Comment(1)
This is no longer the best answer for this problem. As of Kubernetes 1.19, you can use topology constraints to do this. The accepted answer should be changed to Anton Blomström's answer.Vulgarize
12

If you create a Service for that Deployment before creating the Deployment itself, Kubernetes will spread your pods across nodes. This behavior comes from the scheduler; it is provided on a best-effort basis, provided that you have enough resources available on both nodes.

From the Kubernetes documentation (Managing Resources):

it’s best to specify the service first, since that will ensure the scheduler can spread the pods associated with the service as they are created by the controller(s), such as Deployment.

Also related: Configuration best practices - Service.
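
As a rough sketch of that ordering (names and ports are placeholders), define the Service in its own manifest and apply it before the Deployment:

# service.yaml -- apply this before the Deployment manifest
apiVersion: v1
kind: Service
metadata:
  name: my-service  # placeholder name
spec:
  selector:
    app: my-app     # must match the Deployment's pod template labels
  ports:
  - port: 80
    targetPort: 8080

Then, for example: kubectl apply -f service.yaml followed by kubectl apply -f deployment.yaml.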

Incidence answered 23/8, 2016 at 7:59 Comment(5)
If I have both definitions in a single yaml, this does not work. If I create the service and then the deployment a while later, the scheduler will spread pods across nodes.Millstream
Because your Service comes after your Deployment in the yaml file. kubectl creates them in the order they are defined in the file.Incidence
No, the service comes first; I think when I apply both the svc and the deployment from a single file, it is too fast. I had to put the declarations in 2 different yaml files and apply them with 2 apply commands to distribute my pods.Millstream
This doesn't seem to work: k8s still creates 2 pods on the same node.Bronson
@Bronson it works on a best effort basis. If your other nodes are unsuitable for spreading replicas, those replicas may still be collocated. The only way to avoid that entirely is to use hard anti-affinity, which didn't exist when I wrote my answer in 2016.Incidence
7

I agree with Antoine Cotten's suggestion to use a service for your deployment. A service keeps the application up by creating a new pod if, for some reason, one pod dies on a certain node. However, if you just want to distribute a deployment among all nodes, you can use pod anti-affinity in your pod manifest file. I put an example on my gitlab page, which you can also find in the Kubernetes blog. For your convenience, I'm providing the example here as well.

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: nginx
spec:
  replicas: 3
  template:
    metadata:
      labels:
        app: nginx
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - nginx
            topologyKey: kubernetes.io/hostname
      containers:
      - name: nginx
        image: gcr.io/google_containers/nginx-slim:0.9
        ports:
        - containerPort: 8080

In this example, the Deployment's pods carry the label app: nginx. The podAntiAffinity rule in the pod spec prevents two pods with that label from being scheduled on the same node. You can also use podAffinity if you would like to place multiple Deployments on one node.

Biernat answered 14/7, 2017 at 17:10 Comment(5)
A service does not create or restart a pod; this is done by the ReplicaSet that is created by the Deployment. The service only collects the available endpoints.Tondatone
In the above yaml snippet the indentation of the topologyKey element is off, it should be on the same level as the labelSelector, see example in the docs: kubernetes.io/docs/concepts/configuration/assign-pod-node/…Irina
This is an almost perfect solution, but how do you allow running more pods than you have worker nodes? This way we can only run as many pods as we have nodes, and the rest stay in Pending state.Cryptogam
If my replicas scale up, they will go into Pending state once every node is running one replica of the application. Is there a way to set a soft rule instead?Signac
Won't it double the load on the remaining running pods if half the nodes (e.g. one of two) are down?Miran
1

If a node goes down, any pods running on it would be restarted automatically on another node.

If you start specifying exactly where you want them to run, then you actually lose the capability of Kubernetes to reschedule them on a different node.

The usual practice therefore is to simply let Kubernetes do its thing.

If however you do have valid requirements to run a pod on a specific node, due to requirements for a certain local volume type etc., have a read of the Kubernetes documentation on assigning pods to nodes.
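
As a rough sketch of that case (the disktype: ssd label is a placeholder), a nodeSelector in the pod template spec restricts scheduling to nodes carrying a matching label:

    spec:
      nodeSelector:
        disktype: ssd  # placeholder; add the label with: kubectl label nodes <node-name> disktype=ssd
      containers:
      - name: my-app
        image: my-image:1.0  # placeholder image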

Precession answered 23/8, 2016 at 4:12 Comment(1)
I'm not sure labels will work; labels constrain where a pod will run so that it has access to local resources (SSDs) etc. All I want to do is ensure no downtime if a node goes offline.Julius
0

Maybe a DaemonSet will work better. I'm using DaemonSets with a nodeSelector to run pods on specific nodes and avoid duplication.

http://kubernetes.io/docs/admin/daemons/
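
For completeness, a rough sketch of that approach (names, image, and the node label are placeholders). A DaemonSet runs exactly one pod on every node that matches the nodeSelector, so "scaling" follows the node count rather than a replica count:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: my-service
spec:
  selector:
    matchLabels:
      app: my-service
  template:
    metadata:
      labels:
        app: my-service
    spec:
      nodeSelector:
        role: frontend  # placeholder; only nodes with this label run the pod
      containers:
      - name: my-service
        image: my-image:1.0  # placeholder image
        ports:
        - containerPort: 8080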

Romalda answered 23/8, 2016 at 6:55 Comment(2)
That would kind of work, but make scaling up or down close to impossible.Incidence
It depends on the situation. You can also play with resource limits and force a pod to always be scheduled on another node.Romalda
