How can I alert on container restarts?

Asked 3/1, 2017 at 21:48 Answered 1/4, 2022 at 14:34

I'd like to monitor containers using Prometheus and cAdvisor so that I get an alert when a container restarts. Does anyone have a sample Prometheus alert for this?

May answered 3/1, 2017 at 21:48 Comment(0)

I used the following Prometheus alert rule to detect container restarts within an hour (the window can be adjusted). It may be helpful for you.

Prometheus Alert Rule Sample

ALERT ContainerRestart/PodRestart
IF rate(kube_pod_container_status_restarts[1h]) * 3600 > 1
FOR 5s
LABELS {action_required = "true", severity="critical/warning/info"}
ANNOTATIONS {DESCRIPTION="Pod {{$labels.namespace}}/{{$labels.pod}} restarted more than once during the last hour.",
SUMMARY="Container {{ $labels.container }} in Pod {{$labels.namespace}}/{{$labels.pod}} restarted more than once during the last hour."}
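
The rule above is written in the Prometheus 1.x rule syntax. Since Prometheus 2.0, rule files are YAML, so a rough equivalent might look like the sketch below. It assumes the metric is named kube_pod_container_status_restarts_total (the name mentioned in the comments below and exposed by newer kube-state-metrics versions) and uses increase(), which yields the same number as rate(...[1h]) * 3600. The group name, alert name, and the chosen severity are placeholders, not part of the original answer.

```yaml
groups:
  - name: container-restarts        # hypothetical group name
    rules:
      - alert: ContainerRestart     # hypothetical alert name
        # increase() over 1h is equivalent to rate(...[1h]) * 3600
        expr: increase(kube_pod_container_status_restarts_total[1h]) > 1
        for: 5s
        labels:
          action_required: "true"
          severity: warning         # pick one of critical / warning / info
        annotations:
          description: "Pod {{ $labels.namespace }}/{{ $labels.pod }} restarted more than once during the last hour."
          summary: "Container {{ $labels.container }} in pod {{ $labels.namespace }}/{{ $labels.pod }} restarted more than once during the last hour."
```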

rate()

rate(v range-vector) calculates the per-second average rate of increase of the time series in the range vector. Breaks in monotonicity (such as counter resets due to target restarts) are automatically adjusted for. Also, the calculation extrapolates to the ends of the time range, allowing for missed scrapes or imperfect alignment of scrape cycles with the range's time period. The following example expression returns the per-second rate of HTTP requests as measured over the last 5 minutes, per time series in the range vector:

rate(http_requests_total{job="api-server"}[5m])

rate should only be used with counters. It is best suited for alerting, and for graphing of slow-moving counters.

Note that when combining rate() with an aggregation operator (e.g. sum()) or a function aggregating over time (any function ending in _over_time), always take a rate() first, then aggregate. Otherwise rate() cannot detect counter resets when your target restarts.

kube_pod_container_status_restarts_total

Metric Type: Counter

Labels/Tags: container=container-name, namespace=pod-namespace, pod=pod-name

Description: The number of container restarts per pod

Romola answered 9/7, 2018 at 12:6 Comment(2)

FYI if you are going to do rate(...[1h]) * 3600, you can just do delta(...[1h]) for the same number. – Kisumu 11/8, 2018 at 9:1

Trying the same, but since kube_pod_container_status_restarts does not exist here I'm using kube_pod_container_status_restarts_total. Does that have the same meaning? – Projector 1/11, 2020 at 11:34

The following PromQL query returns the containers that were restarted during the last 10 minutes, together with the number of restarts for each returned container over that period:

(sum(increase(kube_pod_container_status_restarts_total[10m])) by (container)) > 0

The lookbehind window in square brackets (10m in the query above) can be tuned to your needs. See these docs for possible values the lookbehind window accepts.

The query works in the following way:

The kube_pod_container_status_restarts_total metric is exposed by kube-state-metrics, a separate add-on that has to be deployed in the cluster (it is not part of a default Kubernetes installation). See these docs for the exposed pod-level metrics.
The inner increase(kube_pod_container_status_restarts_total[10m]) calculates the number of container restarts during the last 10 minutes. See docs for increase() function.
The outer sum(...) by (container) is used solely for removing all the labels except the container label from the result. See docs for sum().
Then the result is compared to zero with > 0. This filters out containers with zero restarts during the last 10 minutes. See docs for comparison operators.
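
To turn the query above into an alert, it can be wrapped in a rule file. This is only a minimal sketch assuming the Prometheus 2.x YAML rule format; the group name, alert name, and severity label are made up for illustration.

```yaml
groups:
  - name: container-restarts-10m    # hypothetical group name
    rules:
      - alert: ContainerRestarted   # hypothetical alert name
        # Same query as above: restarts per container over the last 10 minutes
        expr: sum(increase(kube_pod_container_status_restarts_total[10m])) by (container) > 0
        labels:
          severity: warning         # assumed value, adjust as needed
        annotations:
          summary: "Container {{ $labels.container }} restarted {{ $value }} times in the last 10 minutes"
```

Note that increase() extrapolates, so {{ $value }} may be a fractional number rather than an exact restart count.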

Groundless answered 1/4, 2022 at 14:34 Comment(0)

If you are running in Kubernetes you can deploy the kube-state-metrics container that publishes the restart metric for pods: https://github.com/kubernetes/kube-state-metrics

Tangential answered 15/1, 2017 at 21:7 Comment(2)

I installed this and it exposes 12k metrics, but none of them are for pod restarts :( – Vinificator 16/11, 2018 at 15:48

kube_pod_container_status_restarts_total should monitor restarts. See github.com/kubernetes/kube-state-metrics/blob/master/docs/… for reference. – Casual 9/7, 2020 at 14:12

I use Compose and Swarm deployments, so the Kubernetes answers are not an option for me. I came up with these rules instead.

- alert: Container (Compose) Too Many Restarts
  expr: count by (instance, name) (count_over_time(container_last_seen{name!="", container_label_restartcount!=""}[15m])) - 1 >= 5
  for: 5m
  annotations:
    summary: "Too many restarts ({{ $value }}) for container \"{{ $labels.name }}\""

- alert: Container (Swarm) Too Many Restarts
  expr: count by (instance, container_label_com_docker_swarm_service_name) (count_over_time(container_last_seen{container_label_com_docker_swarm_service_name!=""}[15m])) - 1 >= 5
  for: 5m
  annotations:
    summary: "Too many restarts ({{ $value }}) for container \"{{ $labels.container_label_com_docker_swarm_service_name }}\""

Basically, both work the same way. There are multiple records for each service, but with different labels.

For Compose, the records are identical except for the container_label_restartcount label:

{instance="instance1",name="service1",container_label_restartcount="1",...}
{instance="instance1",name="service1",container_label_restartcount="2",...}
{instance="instance1",name="service1",container_label_restartcount="3",...}

Swarm looks a bit different, because a new container is created when a service is restarted (e.g. after a failed healthcheck). The name label changes, and container_label_com_docker_swarm_service_name acts as the service name:

{instance="instance1",name="service1.1.<hash1>",container_label_com_docker_swarm_service_name="service1",...}
{instance="instance1",name="service1.1.<hash2>",container_label_com_docker_swarm_service_name="service1",...}
{instance="instance1",name="service1.1.<hash3>",container_label_com_docker_swarm_service_name="service1",...}

So the idea is simply to count the unique records for each instance and name. I personally think that sending an alert for every restart is wrong and not useful, so I chose to alert when there are 5 or more restarts within a 15-minute period. I picked the container_last_seen metric arbitrarily; the choice doesn't really matter, because the counting relies on differences in labels. We just need a persistent metric. Also note the - 1 at the end of the expression. We have to subtract 1 because we are counting unique records, and there is always at least one if your container is running. For example, the three Compose records above give a count of 3, which means 2 restarts.

You may need to adapt this example for Swarm services with multiple replicas, but you get the idea of how to count unique labels.
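
As one possible adaptation, the comments on this answer suggest adding image to the grouping (so that an image update is not counted as a restart) and adding container_label_com_docker_compose_project_working_dir (so that containers with the same name in different Compose projects are counted separately). The following is an untested sketch of the Compose rule with both changes; it assumes cAdvisor exports these labels in your setup.

```yaml
- alert: Container (Compose) Too Many Restarts
  # Same counting idea as the rule above, with a finer grouping:
  # image and the Compose working-dir label are added as suggested in the comments.
  expr: count by (instance, name, image, container_label_com_docker_compose_project_working_dir) (count_over_time(container_last_seen{name!="", container_label_restartcount!=""}[15m])) - 1 >= 5
  for: 5m
  annotations:
    summary: "Too many restarts ({{ $value }}) for container \"{{ $labels.name }}\""
```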

Athodyd answered 7/9, 2020 at 18:56 Comment(5)

Do you have a docker-compose file for this? I want to use the same setup, as I am using Compose. The Kubernetes rule is not helping me. – Vexation 9/10, 2021 at 10:29

What docker-compose are you talking about? I provided prometheus rules for compose/swarm deployments in the answer. Check and adjust them on your prometheus instance. – Athodyd 10/10, 2021 at 2:5

Sorry, what I wanted to say is that cAdvisor does not have a metric that exposes "container_label_restartcount". I checked in Prometheus, and the "container_last_seen" metric doesn't have a "container_label_restartcount" label. Could you please recheck? – Vexation 21/10, 2021 at 16:39

To prevent the alert firing if you are doing an update, you might want to include "image" in the grouping – Pachston 14/9, 2022 at 19:55

Since you could have more than one container with the same "name" on same "instance" inside of different docker-compose groups, it's maybe better to act like this: count by (instance, name, container_label_com_docker_compose_project_working_dir) – Until 13/10, 2023 at 6:39
