How to silence Prometheus Alertmanager using config files?

I'm using the official stable/prometheus-operator chart to deploy Prometheus with Helm.

It's working well so far, except for the annoying CPUThrottlingHigh alert that is firing for many pods (including Prometheus' own config-reloader containers). This alert is currently under discussion, and I want to silence its notifications for now.

The Alertmanager has a silence feature, but it is web-based:

Silences are a straightforward way to simply mute alerts for a given time. Silences are configured in the web interface of the Alertmanager.

Is there a way to mute notifications from CPUThrottlingHigh using a config file?

Accrete asked 21/2, 2019 at 11:45 Comment(4)
#53277694Orderly
@Orderly thanks, I read about the meaning of CFS and the throttle metric, but the alert itself and its threshold are still controversial and opinions diverge... For now, I just want to silence it without depending on the Alertmanager web interface.Accrete
delete the rule from the prometheus configOrderly
@Orderly The prometheus-operator chart imports the k8s rules/alerts from kubernetes-mixin. There is no suitable way to disable only the CPUThrottlingHigh rule; it’s all or nothing (via defaultRules.rules.k8s helm config parameter)Accrete
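
For reference, a sketch of the chart-level switch mentioned in the comment above (the defaultRules.rules.k8s key comes from the comment; note that flipping it disables the whole kubernetes-mixin k8s rule group, not just CPUThrottlingHigh):

# values.yaml sketch for the stable/prometheus-operator chart
defaultRules:
  rules:
    k8s: false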

Well, I managed to make it work by configuring a hackish inhibit_rule:

inhibit_rules:
- target_match:
    alertname: 'CPUThrottlingHigh'
  source_match:
    alertname: 'DeadMansSwitch'
  equal: ['prometheus']

DeadMansSwitch is, by design, an "always firing" alert shipped with prometheus-operator, and prometheus is a label common to all alerts, so CPUThrottlingHigh ends up inhibited forever. It stinks, but it works.

Pros:

  • This can be done via the config file (using the alertmanager.config helm parameter).
  • The CPUThrottlingHigh alert is still present on Prometheus for analysis.
  • The CPUThrottlingHigh alert only shows up in the Alertmanager UI if the "Inhibited" box is checked.
  • No annoying notifications on my receivers.

Cons:

  • Any change to the DeadMansSwitch alert or to the prometheus label will break this (which only means the alerts start firing again).

Update: My Cons became real...

The DeadMansSwitch alertname just changed in stable/prometheus-operator 4.0.0. If you are using this version (or above), the new alertname is Watchdog.
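
For chart 4.0.0 or later, the same hack presumably only needs the source alertname swapped; a sketch (not verified against every chart version):

inhibit_rules:
- target_match:
    alertname: 'CPUThrottlingHigh'
  source_match:
    alertname: 'Watchdog'
  equal: ['prometheus']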

Accrete answered 21/2, 2019 at 18:37 Comment(1)
To circumvent the changing alertname (Watchdog), you could add your own always-firing alerting rule with expression vector(1) and use that in the inhibit configuration.Knipe
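
A sketch of such an always-firing rule (names are illustrative; with prometheus-operator this would typically be added as an extra PrometheusRule resource, and the label referenced in the inhibit rule's equal list must be present on both alerts):

groups:
- name: inhibit-helpers
  rules:
  - alert: AlwaysFiring
    # vector(1) always returns a value, so this alert never resolves
    expr: vector(1)
    labels:
      severity: none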

One option is to route alerts you want silenced to a "null" receiver. In alertmanager.yaml:

route:
  # Other settings...
  group_wait: 0s
  group_interval: 1m
  repeat_interval: 1h

  # Default receiver.
  receiver: "null"

  routes:
  # continue defaults to false, so the first match will end routing.
  - match:
      # This was previously named DeadMansSwitch
      alertname: Watchdog
    receiver: "null"
  - match:
      alertname: CPUThrottlingHigh
    receiver: "null"
  - receiver: "regular_alert_receiver"

receivers:
  - name: "null"
  - name: regular_alert_receiver
    <snip>
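
To sanity-check a routing tree like this before deploying it, recent amtool releases can validate the file and show where a given alert would end up; a rough sketch, assuming the config is saved locally as alertmanager.yaml:

# validate the config, then ask which receiver CPUThrottlingHigh would hit
amtool check-config alertmanager.yaml
amtool config routes test --config.file=alertmanager.yaml alertname=CPUThrottlingHigh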
Hotfoot answered 5/3, 2019 at 6:35 Comment(0)

I doubt there exists a way to silence alerts via configuration (other than routing said alerts to a /dev/null receiver, i.e. one with no email or any other notification mechanism configured, but the alert would still show up in the Alertmanager UI).

You can apparently use the command line tool amtool that comes with alertmanager to add a silence (although I can't see a way to set an expiration time for the silence).
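
For what it's worth, a sketch of what that could look like (flag names assumed from recent amtool releases, which do appear to accept a duration):

# sketch only: silence CPUThrottlingHigh for 24h
amtool silence add alertname=CPUThrottlingHigh \
  --alertmanager.url=http://localhost:9093 \
  --comment="muting until the upstream discussion settles" \
  --duration=24h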

Or you can use the API directly (even though it is not documented and in theory it may change). According to this prometheus-users thread this should work:

curl https://alertmanager/api/v1/silences -d '{
      "matchers": [
        {
          "name": "alername1",
          "value": ".*",
          "isRegex": true
        }
      ],
      "startsAt": "2018-10-25T22:12:33.533330795Z",
      "endsAt": "2018-10-25T23:11:44.603Z",
      "createdBy": "api",
      "comment": "Silence",
      "status": {
        "state": "active"
      }
}'
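
The same undocumented v1 API can also list and expire silences; roughly (the silence id is whatever the create call returns):

# list current silences
curl https://alertmanager/api/v1/silences
# expire a silence by id
curl -X DELETE https://alertmanager/api/v1/silence/<silence-id>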
Hierarchy answered 21/2, 2019 at 15:17 Comment(1)
thanks for the tips. Routing to /dev/null doesn't work for me because I join all firing alerts together to receive them in a single Slack message (like this). I created a hackish inhibit_rule to manage it via config file. Please read my answer and give me your thoughts if you can :)Accrete

You can silence it by sending your alerts through Robusta. (Disclaimer: I wrote Robusta.)

Here is an example:

- triggers:
  - on_prometheus_alert: {}
  actions:
  - name_silencer:
      names: ["Watchdog", "CPUThrottlingHigh"]

However, this is probably not what you want to do!

Some CPUThrottlingHigh alerts are spammy and can't be fixed, like the one for metrics-server on GKE.

In general, though, the alert is meaningful and can indicate a real problem. Typically the best practice is to change or remove the pod's CPU limit.
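
For illustration, a container resources sketch following that advice (values are made up): keep a CPU request so scheduling stays fair, but drop the CPU limit so the container is never throttled by the CFS quota.

# CPU request kept, CPU limit removed; memory limit left as an example
resources:
  requests:
    cpu: 100m
    memory: 128Mi
  limits:
    memory: 128Mi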

I've spent more hours of my life than I care to admit looking at CPUThrottlingHigh, as I wrote an automated playbook for Robusta that analyzes each CPUThrottlingHigh alert and recommends the best practice.

Rimbaud answered 28/12, 2021 at 11:9 Comment(0)
