Find exact CPU percentage from the metrics exported by prometheus-node-exporter

I use the node_cpu_seconds_total metric for this.

Basically, I want to subtract the mode="idle" time from the total CPU time, take the average rate of the result, and convert that to a percentage.

I tried something like:

100 - (avg(rate(node_cpu_seconds_total{instance="ip-X-X-X-X.eu-west-1.compute.internal:9100",job="rabbitmq-prod-node-exporter",replica="prometheus-prod"} - node_cpu_seconds_total{instance="ip-X-X-X-X.eu-west-1.compute.internal:9100",mode="idle",job="rabbitmq-prod-node-exporter",replica="prometheus-aws-prod"}))[1m] * 100)

But this does not seem right, and it also produces a parse error:

Error executing query: parse error at char 177: range specification must be preceded by a metric selector, but follows a *promql.AggregateExpr instead
Smaltite answered 15/2, 2022 at 12:24 Comment(1)
I tried it, but it ends up with a "no data" error. - Smaltite

To fix your PromQL, change it to the following:

100 * (sum(rate(node_cpu_seconds_total{instance="INSTANCE",job="JOB",replica="REPLICA"}[1m])) - sum(rate(node_cpu_seconds_total{instance="INSTANCE",mode="idle",job="JOB",replica="REPLICA"}[1m]))) / sum(rate(node_cpu_seconds_total{instance="INSTANCE",job="JOB",replica="REPLICA"}[1m]))

But it's better to use irate instead of rate and to query only the idle mode, which gives the following simpler PromQL:

100 - 100 * (avg(irate(node_cpu_seconds_total{instance="INSTANCE",job="JOB",replica="REPLICA",mode="idle"}[1m])))
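The arithmetic behind this query can be sketched in Python. This is a simplified model, not how Prometheus is implemented: the sample values are invented, and the helpers simple_rate and simple_irate are hypothetical names that ignore counter resets and the extrapolation Prometheus applies in rate(). The point is only that rate looks at the whole window while irate looks at the last two samples, so the two can give very different percentages when load changes at the end of the window.

```python
from statistics import mean

# Invented idle-counter samples per core: (unix_time_seconds, idle_seconds_total).
# Both cores are mostly idle until t=45 and become busy in the last 15 seconds.
samples = {
    "cpu0": [(0, 100.0), (15, 112.0), (30, 124.0), (45, 136.0), (60, 139.0)],
    "cpu1": [(0, 200.0), (15, 213.5), (30, 227.0), (45, 240.5), (60, 245.0)],
}

def simple_rate(points):
    """Per-second increase between the first and last sample in the window."""
    (t0, v0), (t1, v1) = points[0], points[-1]
    return (v1 - v0) / (t1 - t0)

def simple_irate(points):
    """Per-second increase between the last two samples in the window."""
    (t0, v0), (t1, v1) = points[-2], points[-1]
    return (v1 - v0) / (t1 - t0)

def busy_percent(rate_fn):
    """Mirror of 100 - 100 * avg(<rate_fn>(idle counter)) across cores."""
    return 100 - 100 * mean(rate_fn(points) for points in samples.values())

rate_pct = busy_percent(simple_rate)    # averages over the whole minute
irate_pct = busy_percent(simple_irate)  # reflects only the last 15 seconds
print(round(rate_pct, 2), round(irate_pct, 2))
```

With these invented samples the window-wide figure is about 30% busy, while the last-two-samples figure is about 75% busy, which is why the choice between the two functions matters for graphs and alerts.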
Alphonsealphonsine answered 15/2, 2022 at 19:27 Comment(3)
Thanks. For the second one with irate, are you taking the average only for the CPU with mode="idle"? - Smaltite
Yes, it's enough. Note that the final result is 100% minus idle, so it's the used time. - Pyromagnetic
I'd avoid using irate instead of rate - see valyala.medium.com/… - Fini

The following PromQL query returns the average number of CPU cores used during the last 5 minutes for each host (aka instance) running node_exporter:

sum(1 - rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance)

This query works in the following way:

  1. rate(node_cpu_seconds_total{mode="idle"}[5m]) calculates the average per-second rate of increase for every time series matching node_cpu_seconds_total{mode="idle"}. Because the counter is measured in seconds, this is the average fraction of time each CPU core spent idle during the last 5 minutes (the 5m lookbehind window in square brackets). See rate() function docs.
  2. 1 - rate(...) subtracts the per-core idle fraction from 1, so the result is the average per-core busy fraction over the last 5 minutes.
  3. sum(...) by (instance) adds up the per-core busy fractions for each instance (aka host running node_exporter), which gives the number of busy cores on that host. See sum() function docs.
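The three steps above can be sketched in Python. The idle rates below are invented values standing in for what step 1 might return, and the function names are hypothetical. The second helper is an addition to the answer: it divides busy cores by the core count to get a percentage, which in PromQL would roughly correspond to using avg instead of sum and multiplying by 100.

```python
from collections import defaultdict

# Invented output of step 1: idle fraction per (instance, cpu) over the window.
idle_rate = {
    ("host-a:9100", "0"): 0.90,
    ("host-a:9100", "1"): 0.60,
    ("host-b:9100", "0"): 0.25,
}

def busy_cores_by_instance(idle_rates):
    """Steps 2 and 3: compute 1 - idle per core, then sum per instance."""
    totals = defaultdict(float)
    for (instance, _cpu), idle in idle_rates.items():
        totals[instance] += 1 - idle
    return dict(totals)

def busy_percent_by_instance(idle_rates):
    """Busy cores divided by the number of cores, as a percentage."""
    cores = defaultdict(int)
    for instance, _cpu in idle_rates:
        cores[instance] += 1
    busy = busy_cores_by_instance(idle_rates)
    return {instance: 100 * busy[instance] / cores[instance] for instance in busy}

busy = busy_cores_by_instance(idle_rate)
percent = busy_percent_by_instance(idle_rate)
print(busy, percent)
```

Here host-a uses about half a core out of two (about 25%), and host-b uses about three quarters of its single core (about 75%).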

The node_cpu_seconds_total metric may carry additional labels that you want to keep in the result, for example env, datacenter, or tenant. In that case, replace sum(...) by (instance) with sum(...) without (cpu, mode) to keep all these labels:

sum(1 - rate(node_cpu_seconds_total{mode="idle"}[5m])) without (cpu, mode)
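As a rough illustration of what without (cpu, mode) does to the grouping, the Python sketch below sums values over every label set with the cpu and mode labels removed. The series, the env label, and the sum_without helper are all invented for the example.

```python
from collections import defaultdict

# Invented per-core busy fractions, i.e. the result of 1 - rate(...[5m]).
# The env label is an assumed extra label attached by the scrape configuration.
series = [
    ({"instance": "host-a:9100", "env": "prod", "cpu": "0", "mode": "idle"}, 0.10),
    ({"instance": "host-a:9100", "env": "prod", "cpu": "1", "mode": "idle"}, 0.40),
    ({"instance": "host-b:9100", "env": "dev", "cpu": "0", "mode": "idle"}, 0.75),
]

def sum_without(series, dropped):
    """Sum values grouped by all labels except the ones listed in dropped."""
    totals = defaultdict(float)
    for labels, value in series:
        key = tuple(sorted((k, v) for k, v in labels.items() if k not in dropped))
        totals[key] += value
    return dict(totals)

result = sum_without(series, {"cpu", "mode"})
for key, value in result.items():
    print(dict(key), round(value, 2))
```

Each output group keeps both instance and env, whereas grouping by instance alone would drop env from the result.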
Fini answered 22/12, 2022 at 5:56 Comment(0)
