Missing labels in prometheus alerts

I'm having issues with Prometheus alerting rules. I have various cAdvisor specific alerts set up, for example:

- alert: ContainerCpuUsage
  expr: (sum(rate(container_cpu_usage_seconds_total[3m])) BY (instance, name) * 100) > 80
  for: 2m
  labels:
    severity: warning
  annotations:
    title: 'Container CPU usage (instance {{ $labels.instance }})'
    description: 'Container CPU usage is above 80%\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}'

When the condition is met, I can see the alert in the "Alerts" tab in Prometheus. However, some labels are missing, which prevents Alertmanager from sending a notification via Slack. Specifically, I attach a custom "env" label to each target:

 {
  "targets": [
   "localhost:8080"
  ],
  "labels": {
   "job": "cadvisor",
   "env": "production",
   "__metrics_path__": "/metrics"
  }
 }

But when an alert based on cAdvisor metrics fires, its only labels are alertname, instance and severity: no job label, no env label. All the alerts from other exporters (e.g. node-exporter) work just fine, and the label is present.

Fluoroscopy answered 26/4, 2021 at 20:46 Comment(0)

This is caused by the sum aggregation in your query: it takes all the matching time series and adds them together, grouping by (instance, name). If you run the same query in Prometheus, you will see that sum keeps only the grouping labels:

{instance="foo", name="bar"}    135.38819037447163

Other aggregation operators such as avg, max and min work the same way. To bring the label back, simply add env (and job, if you need it) to the grouping list: by (instance, name, env).
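For illustration, here is roughly how the rule from the question could look with the extra grouping labels. This is a sketch, not a drop-in rule: the group name `cadvisor-alerts` is made up, the `name!=""` filter is taken from the comment below, and adding `job` alongside `env` assumes you want both labels on the alert.

```yaml
groups:
  - name: cadvisor-alerts   # hypothetical group name
    rules:
      - alert: ContainerCpuUsage
        # env and job are listed in the grouping clause so they survive the
        # aggregation and end up on the alert that is sent to Alertmanager.
        expr: (sum by (instance, name, job, env) (rate(container_cpu_usage_seconds_total{name!=""}[3m])) * 100) > 80
        for: 2m
        labels:
          severity: warning
        annotations:
          title: 'Container CPU usage (instance {{ $labels.instance }})'
          description: 'Container CPU usage is above 80% (env {{ $labels.env }}), VALUE = {{ $value }}'
```

PromQL also accepts a `without (...)` clause, which drops only the labels you list and keeps all the others. That can be handy when you would rather name the labels to discard than the ones to keep.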

Subsoil answered 26/4, 2021 at 21:32 Comment(4)
Thanks! I've modified my query to this: (sum(rate(container_cpu_usage_seconds_total{name!=""}[3m])) BY (instance, name,env) * 100) > 80 and it looks like it's working fine. Is this query okay? To be honest, I do not fully understand this: "But this way you'll get CPU utilisation per instance per name per environment." - why is that an issue? - Fluoroscopy
Suppose you have a container with env=prod and another one with env=dev, both on a single machine (instance). By running the query you'll get a distinct CPU utilisation for env=dev and env=prod. Since you made it so that only env=prod can trigger an alert, you won't get notified if env=dev takes all the CPU resources on the machine. In other words, machine CPU utilisation will be split between the various env label values. Whether this is a problem depends on how things run in your environment: if there can be no env other than prod on production machines, then this is okay. - Subsoil
Oh, and one more thing @anemyte: this env label is attached to the specific target (which is cAdvisor) and not to the containers themselves. It would become a problem if I ran two cAdvisor containers with different env label values. At least that's how I understand it. - Fluoroscopy
@Fluoroscopy if it's explicitly defined in the job configuration for production instances, then I suppose it should be fine. - Subsoil
