Alerts in K8s for Pod failing
I want to create alerts in Grafana for my Kubernetes clusters. I have configured Prometheus, Node Exporter, kube-state-metrics and Alertmanager in my k8s cluster. I want to set up alerting on unschedulable or failed pods, covering:

  1. The cause of unschedulable or failed pods
  2. Generating an alert after a while
  3. Creating another alert to notify us when pods fail.

Can you guide me on how to achieve this?
Noreen answered 16/11, 2021 at 7:20 Comment(2)
Hi Sayali, it might be helpful: awesome-prometheus-alerts.grep.to/rules.html#kubernetes – Pegboard
Where exactly is the problem and what did you try? – Dougherty

Based on the comment from Suresh Vishnoi:

it might be helpful awesome-prometheus-alerts.grep.to/rules.html#kubernetes

Yes, this could be very helpful. On that site you can find templates for failed pods (not healthy):

Pod has been in a non-ready state for longer than 15 minutes.

  - alert: KubernetesPodNotHealthy
    expr: min_over_time(sum by (namespace, pod) (kube_pod_status_phase{phase=~"Pending|Unknown|Failed"})[15m:1m]) > 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Kubernetes Pod not healthy (instance {{ $labels.instance }})
      description: "Pod has been in a non-ready state for longer than 15 minutes.\n  V

or for crash looping:

Pod {{ $labels.pod }} is crash looping

  - alert: KubernetesPodCrashLooping
    expr: increase(kube_pod_container_status_restarts_total[1m]) > 3
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Kubernetes pod crash looping (instance {{ $labels.instance }})
      description: "Pod {{ $labels.pod }} is crash looping\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

See also this good guide about monitoring a Kubernetes cluster with Prometheus:

The Kubernetes API and the kube-state-metrics (which natively uses prometheus metrics) solve part of this problem by exposing Kubernetes internal data, such as the number of desired / running replicas in a deployment, unschedulable nodes, etc.

Prometheus is a good fit for microservices because you just need to expose a metrics port, and don’t need to add too much complexity or run additional services. Often, the service itself is already presenting a HTTP interface, and the developer just needs to add an additional path like /metrics.
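For illustration, a minimal Prometheus scrape_config for a kube-state-metrics service might look like the sketch below; the job name and in-cluster service address are assumptions, so adjust them to your deployment (or rely on Kubernetes service discovery if you use the Prometheus Operator):

  scrape_configs:
    - job_name: kube-state-metrics                  # assumed job name
      static_configs:
        - targets:
            # assumed in-cluster service address; adjust namespace/port to your setup
            - kube-state-metrics.kube-system.svc:8080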

When it comes to unschedulable nodes, you can use the metric kube_node_spec_unschedulable, which the kube-state-metrics documentation describes as: kube_node_spec_unschedulable - Whether a node can schedule new pods or not.
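As a sketch of how this could be used, an alert on that metric might look like the rule below (the alert name, the 15m duration and the severity are my own choices, not taken from the template site):

  - alert: KubernetesNodeUnschedulable
    expr: kube_node_spec_unschedulable > 0
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: Kubernetes node unschedulable (instance {{ $labels.instance }})
      description: "Node {{ $labels.node }} has been unschedulable for more than 15 minutes.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"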

Have a look at this guide as well. Basically, you need to find the metric you want to monitor and write an appropriate alerting rule for it in Prometheus. Alternatively, you can use the templates I showed at the beginning of the answer.
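For completeness, such rules are usually placed in a rule file and referenced from prometheus.yml together with the Alertmanager endpoint; a minimal sketch (the file path and the Alertmanager address are assumptions) could look like this:

  # prometheus.yml (fragment)
  rule_files:
    - /etc/prometheus/rules/kubernetes-alerts.yml   # assumed path to the alert rules above

  alerting:
    alertmanagers:
      - static_configs:
          - targets:
              - alertmanager.monitoring.svc:9093    # assumed Alertmanager service address

Grafana can then use the same Prometheus instance as a data source for dashboards and alerting.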

Lidda answered 16/11, 2021 at 7:20 Comment(1)
Thank you for sharing the answer. – Noreen

The following rule was too noisy:

min_over_time(sum by (namespace, pod) (kube_pod_status_phase{phase=~"Pending|Unknown|Failed"})[15m:1m]) > 0

It triggers an alert when a pod is in a pending state during at least one 1m period within the 15m window, and that can generate many false-positive alerts, especially if you have cron jobs in your cluster or active pod scaling.

I think it is better to send an alert when a pod has been in a not-running state for some period of time (for example, I'd like to be notified when a pod cannot start for 5 minutes or more), and here is a rule that can help:

sum by (namespace, pod) (kube_pod_status_phase{phase=~"Pending|Unknown|Failed"} offset 5m) + on(namespace, pod) sum by (namespace, pod) (kube_pod_status_phase{phase=~"Pending|Unknown|Failed"}) > 1

So we send an alert only if the pod was in a not-running state both now and 5 minutes ago (i.e. for at least 5 minutes), and we ignore cases when pods are starting normally.
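If you prefer to keep the same format as the templates above, this expression can be wrapped in a full alerting rule; the alert name, severity and annotations below are my own choices:

  - alert: KubernetesPodNotRunning5m
    expr: sum by (namespace, pod) (kube_pod_status_phase{phase=~"Pending|Unknown|Failed"} offset 5m) + on(namespace, pod) sum by (namespace, pod) (kube_pod_status_phase{phase=~"Pending|Unknown|Failed"}) > 1
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Pod not running for more than 5 minutes ({{ $labels.namespace }}/{{ $labels.pod }})
      description: "Pod {{ $labels.namespace }}/{{ $labels.pod }} has been in a non-running phase for at least 5 minutes.\n  VALUE = {{ $value }}"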

Alderete answered 10/7 at 9:58 Comment(0)
