Based on the comment from Suresh Vishnoi:

> it might be helpful awesome-prometheus-alerts.grep.to/rules.html#kubernetes

Yes, this could be very helpful. On that site you can find ready-made alert templates, e.g. for pods that are not healthy:
Pod has been in a non-ready state for longer than 15 minutes.
```yaml
- alert: KubernetesPodNotHealthy
  expr: min_over_time(sum by (namespace, pod) (kube_pod_status_phase{phase=~"Pending|Unknown|Failed"})[15m:1m]) > 0
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Kubernetes Pod not healthy (instance {{ $labels.instance }})
    description: "Pod has been in a non-ready state for longer than 15 minutes.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
Or, for crash-looping pods:
Pod {{ $labels.pod }} is crash looping
```yaml
- alert: KubernetesPodCrashLooping
  expr: increase(kube_pod_container_status_restarts_total[1m]) > 3
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: Kubernetes pod crash looping (instance {{ $labels.instance }})
    description: "Pod {{ $labels.pod }} is crash looping\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
See also this good guide about monitoring a Kubernetes cluster with Prometheus:
> The Kubernetes API and the kube-state-metrics (which natively uses Prometheus metrics) solve part of this problem by exposing Kubernetes internal data, such as the number of desired / running replicas in a deployment, unschedulable nodes, etc.
>
> Prometheus is a good fit for microservices because you just need to expose a metrics port, and don't need to add too much complexity or run additional services. Often, the service itself is already presenting an HTTP interface, and the developer just needs to add an additional path like /metrics.
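For example, the desired / running replica counts mentioned in the quote are exposed by kube-state-metrics as kube_deployment_spec_replicas and kube_deployment_status_replicas_available, so an alert on a mismatch could look roughly like this (a sketch in the same style as the templates above; the 15m duration and severity are just illustrative assumptions):

```yaml
# Sketch: fires when a deployment has had fewer available replicas than desired
# for 15 minutes. Metric names come from kube-state-metrics; the "for" duration
# and severity are assumptions - tune them to your workloads.
- alert: KubernetesDeploymentReplicasMismatch
  expr: kube_deployment_spec_replicas != kube_deployment_status_replicas_available
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: Kubernetes deployment replicas mismatch (instance {{ $labels.instance }})
    description: "Deployment {{ $labels.deployment }} replicas mismatch\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```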
When it comes to unschedulable nodes, you can use the metric kube_node_spec_unschedulable. It is described here or here:

> kube_node_spec_unschedulable - Whether a node can schedule new pods or not.
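An alert built on this metric could look like the following sketch (the alert name, severity and 15m duration are my assumptions, not a template from the site):

```yaml
# Sketch: fires when a node has been marked unschedulable (e.g. cordoned)
# for more than 15 minutes. Name, severity and duration are assumptions.
- alert: KubernetesNodeUnschedulable
  expr: kube_node_spec_unschedulable == 1
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: Kubernetes node unschedulable (instance {{ $labels.instance }})
    description: "Node {{ $labels.node }} is unschedulable\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```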
Look also at this guide.
Basically, you need to find the metric you want to monitor and write an appropriate alerting rule for it in Prometheus. Alternatively, you can use ready-made templates, as I showed at the beginning of the answer.
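For completeness, whichever rules you choose go into a rules file that prometheus.yml references via the rule_files option, with the alerts grouped under groups; a minimal sketch using one of the templates above (the file and group names are just examples):

```yaml
# Example rules file (e.g. kubernetes-alerts.yml), referenced from prometheus.yml
# via rule_files. The group name "kubernetes" is just an example.
groups:
  - name: kubernetes
    rules:
      - alert: KubernetesPodNotHealthy
        expr: min_over_time(sum by (namespace, pod) (kube_pod_status_phase{phase=~"Pending|Unknown|Failed"})[15m:1m]) > 0
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: Kubernetes Pod not healthy (instance {{ $labels.instance }})
```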