Setting up a readiness, liveness or startup probe

Asked 22/1, 2020 at 0:8 Answered 22/1, 2020 at 19:48

Solved docker kubernetes skaffold readinessprobe

I'm having difficulty understanding which would be best for my situation and how to actually implement it.

In a nutshell, the problem is this:

I'm spinning up my DB (Postgres), BE (Django), and FE (React) deployments with Skaffold
About 50% of the time the BE spins up before the DB
One of the first things Django tries to do is connect to the DB
It only tries once (by design and can't be changed), if it can't, it fails and the application is broken

Thus, I need to make sure every single time I spin up my deployments, the DB deployment is running before starting the BE deployment

I came across readiness, liveness, and starup probes. I've read it a couple times and readiness probes sound like what I need: I don't want the BE deployment to start until the DB deployment is ready to accept connections.

I guess I'm not understanding how to set it up. This is what I've tried, but I still run into instances where one is being loaded before another.

postgres.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: postgres-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      component: postgres
  template:
    metadata:
      labels:
        component: postgres
    spec:
      containers:
        - name: postgres
          image: testappcontainers.azurecr.io/postgres
          ports:
            - containerPort: 5432
          env: 
            - name: POSTGRES_DB
              valueFrom:
                secretKeyRef:
                  name: testapp-secrets
                  key: PGDATABASE
            - name: POSTGRES_USER
              valueFrom:
                secretKeyRef:
                  name: testapp-secrets
                  key: PGUSER
            - name: POSTGRES_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: testapp-secrets
                  key: PGPASSWORD
            - name: POSTGRES_INITDB_ARGS
              value: "-A md5"
          volumeMounts:
            - name: postgres-storage
              mountPath: /var/lib/postgresql/data
              subPath: postgres
      volumes:
        - name: postgres-storage
          persistentVolumeClaim:
            claimName: postgres-storage
---
apiVersion: v1
kind: Service
metadata:
  name: postgres-cluster-ip-service
spec:
  type: ClusterIP
  selector:
    component: postgres
  ports:
    - port: 1423
      targetPort: 5432

api.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-deployment
spec:
  replicas: 3
  selector:
    matchLabels:
      component: api
  template:
    metadata:
      labels:
        component: api
    spec:
      containers:
        - name: api
          image: testappcontainers.azurecr.io/testapp-api
          ports:
            - containerPort: 5000
          env:
            - name: PGUSER
              valueFrom:
                secretKeyRef:
                  name: testapp-secrets
                  key: PGUSER
            - name: PGHOST
              value: postgres-cluster-ip-service
            - name: PGPORT
              value: "1423"
            - name: PGDATABASE
              valueFrom:
                secretKeyRef:
                  name: testapp-secrets
                  key: PGDATABASE
            - name: PGPASSWORD
              valueFrom:
                secretKeyRef:
                  name: testapp-secrets
                  key: PGPASSWORD
            - name: SECRET_KEY
              valueFrom:
                secretKeyRef:
                  name: testapp-secrets
                  key: SECRET_KEY
            - name: DEBUG
              valueFrom:
                secretKeyRef:
                  name: testapp-secrets
                  key: DEBUG
          readinessProbe:
            httpGet:
              host: postgres-cluster-ip-service
              port: 1423
            initialDelaySeconds: 10
            periodSeconds: 5
            timeoutSeconds: 2
---
apiVersion: v1
kind: Service
metadata:
  name: api-cluster-ip-service
spec:
  type: ClusterIP
  selector:
    component: api
  ports:
    - port: 5000
      targetPort: 5000

client.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: client-deployment
spec:
  replicas: 3
  selector:
    matchLabels:
      component: client
  template:
    metadata:
      labels:
        component: client
    spec:
      containers:
        - name: client
          image: testappcontainers.azurecr.io/testapp-client
          ports:
            - containerPort: 3000
          readinessProbe:
            httpGet:
              path: api-cluster-ip-service
              port: 5000
            initialDelaySeconds: 10
            periodSeconds: 5
            timeoutSeconds: 2
---
apiVersion: v1
kind: Service
metadata:
  name: client-cluster-ip-service
spec:
  type: ClusterIP
  selector:
    component: client
  ports:
    - port: 3000
      targetPort: 3000

I don't think the ingress.yaml and the skaffold.yaml will be helpful, but let me know if I should add those.

So what am I doing wrong here?

Edit:

So I've tried out a few things based on David Maze's response. This helped me understand what is going on better, but I am still running into issues I'm not quite understanding how to resolve.

The first problem is that even with a default restartPolicy: Always, and even though Django fails, the Pods themselves don't fail. The Pods think they are perfectly healthy even though Django has failed.

The second problem is that apparently the Pods need to be made aware of Django's status. That is the part I'm not quite wrapping my brain around, particularly should probes be checking the status of other deployments or themselves?

Yesterday my thinking was the former, but today I'm thinking it is the latter: the Pod needs to know the program contained in it has failed. However, everything I've tried just results in a failed probe, connection refused, etc.:

# referring to itself
host: /health
port: 5000

host: /healthz
port: 5000

host: /api
port: 5000

host: /
port: 5000

host: /api-cluster-ip-service
port: 5000

host: /api-deployment
port: 5000

# referring to the DB deployment
host: /health
port: 1423 #or 5432

host: /healthz
port: 1423 #or 5432

host: /api
port: 1423 #or 5432

host: /
port: 1423 #or 5432

host: /postgres-cluster-ip-service
port: 1423 #or 5432

host: /postgres-deployment
port: 1423 #or 5432

So apparently I'm setting up the probe wrong, despite it being a "super-easy" implementation (as a few blogs have described it). For example, the /health and /healthz routes: are these built into Kubernetes or do these need to be setup? Rereading the docs to hopefully clarify this.

Graveclothes answered 22/1, 2020 at 0:8 Comment(1)

When the service determines that it can’t contact the database, what does it do? Does your image have something like a tail -f /dev/null “to keep the container alive”? HTTP paths like /healthz are routes your service needs to provide itself, just doing GET / can be enough to get started. – Roturier 22/1, 2020 at 21:41

Actually, think I might have sorted it out.

Part of the problem is that even though restartPolicy: Always is the default, the Pods are not aware the Django has failed so it thinks they are healthy.

My thinking was wrong in that I originally assumed I needed to refer to the DB deployment to see if it had start before starting the API deployment. Instead I needed to check if Django had failed and redeploy it if it had.

Doing the following accomplished this for me:

livenessProbe:
  tcpSocket:
    port: 5000
  initialDelaySeconds: 2
  periodSeconds: 2
readinessProbe:
  tcpSocket:
    port: 5000
  initialDelaySeconds: 2
  periodSeconds: 2

I'm learning Kubernetes so please correct me if there is a better way to do this or if this is just plain wrong. I just know it accomplishes what I want.

Graveclothes answered 22/1, 2020 at 19:48 Comment(0)

You're just not waiting long enough.

The deployment artifacts you're showing here look pretty normal. It's even totally normal for your application to fail fast if it can't reach the database, say because it hasn't started up yet. Every pod has a restart policy, though, which defaults to Always. So, when the pod fails, Kubernetes will restart it; and when it fails again, it will get restarted again; and when it keeps failing, Kubernetes will pause tens of seconds between restarts (the dreaded CrashLoopBackOff state).

Eventually if you're in this wait-and-restart loop, the database will actually come up, and then Kubernetes will restart your application pods, at which point the application will start up normally.

The only thing that I'd change here is that your readiness probes for the two pods should probe the services themselves, not some other service. You probably want the path to be something like / or /healthz or something else that is an actual HTTP request path in the service. That can return 503 Service Unavailable if it detects its dependency isn't available, or you can just crash. Just crashing is fine.

This is a totally normal setup in Kubernetes; there's no way to more directly say that pod A can't start until service B is ready. The flip side of this is that the pattern is actually pretty generic: if your application crashes and restarts whenever it can't reach its database, it doesn't matter if the database is hosted outside the cluster, or if it crashes sometime well after startup time; the same logic will try to restart your application until it works again.

Roturier answered 22/1, 2020 at 0:49 Comment(3)

Fully agree, and I still don't really understand why someone always take care of db deployment, but not take database deployment out of the whole process to make the design simplier. But that's other discussion. – Taco 22/1, 2020 at 1:12

@Taco So instead of automating the whole deployment, so developers just have to run skaffold dev to get up an running, I should have them setup a local dev database, make sure it is running at all times, and then spin up the server and client services with skaffold dev so they can then start developing? Runs contrary to just about every devops person I've talked to, who say you want your developers developing, not setting up and maintaining dev databases. – Graveclothes 22/1, 2020 at 15:33

Hi David, thanks for the response. It helped my understand a few things, but still having a few issues I've added to the bottom of my original question. – Graveclothes 22/1, 2020 at 18:59

Actually, think I might have sorted it out.

Part of the problem is that even though restartPolicy: Always is the default, the Pods are not aware the Django has failed so it thinks they are healthy.

Doing the following accomplished this for me:

livenessProbe:
  tcpSocket:
    port: 5000
  initialDelaySeconds: 2
  periodSeconds: 2
readinessProbe:
  tcpSocket:
    port: 5000
  initialDelaySeconds: 2
  periodSeconds: 2

I'm learning Kubernetes so please correct me if there is a better way to do this or if this is just plain wrong. I just know it accomplishes what I want.

Graveclothes answered 22/1, 2020 at 19:48 Comment(0)

Recommended topics

Hot tags