I'm having difficulty understanding which would be best for my situation and how to actually implement it.
In a nutshell, the problem is this:
- I'm spinning up my DB (Postgres), BE (Django), and FE (React) deployments with Skaffold
- About 50% of the time the BE spins up before the DB
- One of the first things Django tries to do is connect to the DB
- It only tries once (by design, and this can't be changed); if it can't connect, it fails and the application is broken
- Thus, I need to make sure that every single time I spin up my deployments, the DB deployment is running before the BE deployment starts (see the initContainer sketch right after this list)
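For reference, the pattern I keep seeing suggested for exactly this ordering problem is an initContainer on the BE Deployment that blocks until the database answers. This is only a rough sketch of what I understand that to look like, reusing the Service name and port from my manifests below; the image tag and the pg_isready loop are my guesses, not something I have verified:

spec:
  initContainers:
    # Sketch only: initContainers must finish before the app containers start,
    # so the Django container would not be created until Postgres is reachable.
    - name: wait-for-postgres
      image: postgres:13-alpine   # any image that ships pg_isready
      command:
        - sh
        - -c
        - until pg_isready -h postgres-cluster-ip-service -p 1423; do echo waiting for postgres; sleep 2; done
  containers:
    - name: api
      # ... existing api container spec ...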
I came across readiness, liveness, and startup probes. I've read about them a couple of times, and readiness probes sound like what I need: I don't want the BE deployment to start until the DB deployment is ready to accept connections.
I guess I'm not understanding how to set them up, though. This is what I've tried, but I still run into instances where one deployment is loaded before the other.
postgres.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: postgres-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      component: postgres
  template:
    metadata:
      labels:
        component: postgres
    spec:
      containers:
        - name: postgres
          image: testappcontainers.azurecr.io/postgres
          ports:
            - containerPort: 5432
          env:
            - name: POSTGRES_DB
              valueFrom:
                secretKeyRef:
                  name: testapp-secrets
                  key: PGDATABASE
            - name: POSTGRES_USER
              valueFrom:
                secretKeyRef:
                  name: testapp-secrets
                  key: PGUSER
            - name: POSTGRES_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: testapp-secrets
                  key: PGPASSWORD
            - name: POSTGRES_INITDB_ARGS
              value: "-A md5"
          volumeMounts:
            - name: postgres-storage
              mountPath: /var/lib/postgresql/data
              subPath: postgres
      volumes:
        - name: postgres-storage
          persistentVolumeClaim:
            claimName: postgres-storage
---
apiVersion: v1
kind: Service
metadata:
  name: postgres-cluster-ip-service
spec:
  type: ClusterIP
  selector:
    component: postgres
  ports:
    - port: 1423
      targetPort: 5432
api.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-deployment
spec:
  replicas: 3
  selector:
    matchLabels:
      component: api
  template:
    metadata:
      labels:
        component: api
    spec:
      containers:
        - name: api
          image: testappcontainers.azurecr.io/testapp-api
          ports:
            - containerPort: 5000
          env:
            - name: PGUSER
              valueFrom:
                secretKeyRef:
                  name: testapp-secrets
                  key: PGUSER
            - name: PGHOST
              value: postgres-cluster-ip-service
            - name: PGPORT
              value: "1423"
            - name: PGDATABASE
              valueFrom:
                secretKeyRef:
                  name: testapp-secrets
                  key: PGDATABASE
            - name: PGPASSWORD
              valueFrom:
                secretKeyRef:
                  name: testapp-secrets
                  key: PGPASSWORD
            - name: SECRET_KEY
              valueFrom:
                secretKeyRef:
                  name: testapp-secrets
                  key: SECRET_KEY
            - name: DEBUG
              valueFrom:
                secretKeyRef:
                  name: testapp-secrets
                  key: DEBUG
          readinessProbe:
            httpGet:
              host: postgres-cluster-ip-service
              port: 1423
            initialDelaySeconds: 10
            periodSeconds: 5
            timeoutSeconds: 2
---
apiVersion: v1
kind: Service
metadata:
  name: api-cluster-ip-service
spec:
  type: ClusterIP
  selector:
    component: api
  ports:
    - port: 5000
      targetPort: 5000
client.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: client-deployment
spec:
  replicas: 3
  selector:
    matchLabels:
      component: client
  template:
    metadata:
      labels:
        component: client
    spec:
      containers:
        - name: client
          image: testappcontainers.azurecr.io/testapp-client
          ports:
            - containerPort: 3000
          readinessProbe:
            httpGet:
              path: api-cluster-ip-service
              port: 5000
            initialDelaySeconds: 10
            periodSeconds: 5
            timeoutSeconds: 2
---
apiVersion: v1
kind: Service
metadata:
  name: client-cluster-ip-service
spec:
  type: ClusterIP
  selector:
    component: client
  ports:
    - port: 3000
      targetPort: 3000
I don't think the ingress.yaml and the skaffold.yaml will be helpful, but let me know if I should add those.
So what am I doing wrong here?
Edit:
So I've tried out a few things based on David Maze's response. It helped me better understand what is going on, but I'm still running into issues that I don't quite understand how to resolve.
The first problem is that, even with the default restartPolicy: Always, and even though Django fails, the Pods themselves don't fail. The Pods think they are perfectly healthy even though Django has failed.
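If I'm reading the docs right, this first problem is what a livenessProbe is meant to address: the kubelet probes the container itself, and restartPolicy only kicks in once that probe fails. My current guess at what that would look like on the API container, assuming Django really listens on port 5000 when it is healthy (I haven't confirmed this from inside the Pod):

# Guess at a liveness check on the API container itself (not on Postgres):
# if nothing is listening on 5000 because Django has died, the kubelet
# restarts the container according to restartPolicy: Always.
livenessProbe:
  tcpSocket:
    port: 5000
  initialDelaySeconds: 10
  periodSeconds: 5
  timeoutSeconds: 2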
The second problem is that apparently the Pods need to be made aware of Django's status. That is the part I'm not quite wrapping my brain around: should the probes be checking the status of other deployments, or of their own Pod?
Yesterday my thinking was the former, but today I'm thinking it is the latter: the Pod needs to know that the program it contains has failed. However, everything I've tried just results in a failed probe, connection refused, etc.:
# referring to itself
host: /health
port: 5000
host: /healthz
port: 5000
host: /api
port: 5000
host: /
port: 5000
host: /api-cluster-ip-service
port: 5000
host: /api-deployment
port: 5000
# referring to the DB deployment
host: /health
port: 1423 #or 5432
host: /healthz
port: 1423 #or 5432
host: /api
port: 1423 #or 5432
host: /
port: 1423 #or 5432
host: /postgres-cluster-ip-service
port: 1423 #or 5432
host: /postgres-deployment
port: 1423 #or 5432
So apparently I'm setting up the probe wrong, despite it being a "super easy" implementation (as a few blogs have described it). For example, the /health and /healthz routes: are these built into Kubernetes, or do they need to be set up? I'm rereading the docs to hopefully clarify this.
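My best guess at what "checking itself" would look like is below: no host: field at all (so the probe defaults to the Pod's own IP), plus a path the Django app actually serves, which means adding a /healthz view myself since Kubernetes does not provide one. If no such route exists, a plain TCP check avoids needing one. Again, this is just my current understanding, not something that is working yet:

# Option 1: HTTP check against the container itself (host defaults to the Pod IP).
# The /healthz path would have to be a view I add to Django; it is not built into Kubernetes.
readinessProbe:
  httpGet:
    path: /healthz
    port: 5000
  initialDelaySeconds: 10
  periodSeconds: 5
  timeoutSeconds: 2

# Option 2: plain TCP check, no HTTP route needed.
readinessProbe:
  tcpSocket:
    port: 5000
  initialDelaySeconds: 10
  periodSeconds: 5
  timeoutSeconds: 2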
tail -f /dev/null "to keep the container alive"? HTTP paths like /healthz are routes your service needs to provide itself; just doing GET / can be enough to get started. – Roturier