CPU usage for each node in Prometheus

Asked 12/2, 2021 at 10:5 Answered 19/10, 2022 at 16:38

Ideally, I need to find the CPU usage of pods on each node as a percentage. As a first step, I tried to find the CPU usage of each node. The query I wrote returns more than 100% (it can be 150% - 200%), even though I account for multiple CPUs by taking avg. Could you please help me understand what is wrong with the query below?

(1 - avg(irate(node_cpu_seconds_total{mode="idle"}[1m])) by (instance)) * 100 / scalar(sum(machine_cpu_cores))

After reading several books and solutions, I also found a query that works only on some nodes (container_spec_cpu_quota isn't available for certain instances on AWS ECS):

avg(rate(container_cpu_usage_seconds_total{name!~".*prometheus.*", image!="", instance=""}[1m])) by (pod) / scalar(sum(container_spec_cpu_quota{name!~".*prometheus.*", image!="", instance=""} / container_spec_cpu_period{name!~".*prometheus.*", image!="", instance=""}))

Symbology answered 12/2, 2021 at 10:5 Comment(1)

What is the value of "scalar(sum(machine_cpu_cores))"? How is "machine_cpu_cores" calculated? – Latin 12/2, 2021 at 21:54

The following query returns the average CPU usage as a percentage in the range [0 ... 100]% for each node over the last 5 minutes:

100 * avg(1 - rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance)

This query assumes that node_exporter runs on every monitored host and that Prometheus is properly configured to scrape all of these node exporters.

The query works in the following way:

The rate(node_cpu_seconds_total{mode="idle"}[5m]) calculates the per-second increase rate of the idle CPU time counter for each CPU core over the last 5 minutes (see [5m] in square brackets). This is basically the average number of seconds the given CPU core was idle per second during the last 5 minutes. I.e. the value is in the range [0 .. 1], where 0 means that the CPU core was 100% busy during the last 5 minutes, while 1 means that the CPU core was 100% idle during the last 5 minutes.
The 1 - rate(...) calculates CPU usage for each CPU core on each host.
The avg(...) by (instance) calculates the average CPU usage for each instance (aka host in the Prometheus ecosystem).
The 100 * ... multiplies the average CPU usage of each host by 100 in order to get a percentage in the range [0 ... 100]%.
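
As a rough illustration of the steps above (not part of the query itself), the same arithmetic can be sketched in Python. The counter samples, core names and the 300-second window below are invented for the example, and the helper names are hypothetical. Real Prometheus rate() also handles counter resets and extrapolates to the window edges, which this sketch ignores.

```python
# Hypothetical idle-counter samples (in seconds) taken 300 s apart for one
# host with two CPU cores. All values are invented for illustration.
WINDOW_SECONDS = 300.0

idle_counters = {
    "cpu0": (1000.0, 1240.0),  # idle for 240 s of the 300 s window
    "cpu1": (2000.0, 2060.0),  # idle for 60 s of the 300 s window
}


def idle_rate(first, last, window):
    """Per-second increase of the idle counter (simplified rate(...[5m]))."""
    return (last - first) / window


def cpu_usage_percent(counters, window):
    """Mimics 100 * avg(1 - rate(idle)) across the cores of one instance."""
    per_core_usage = [
        1.0 - idle_rate(first, last, window)
        for first, last in counters.values()
    ]
    return 100.0 * sum(per_core_usage) / len(per_core_usage)


usage = cpu_usage_percent(idle_counters, WINDOW_SECONDS)
print(f"{usage:.1f}%")
```

Here cpu0 is 20% busy and cpu1 is 80% busy, so the host-level average comes out at 50%. Because each per-core value stays within [0 .. 1], the average cannot exceed 100%.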

Modern hosts usually have more than one CPU core. Sometimes the load among CPU cores may be uneven. For example, if an app, which can utilize only a single CPU core, runs on a host with 2 CPU cores, then CPU usage for this host will never exceed 50%, since the second CPU core is always idle, while the app cannot scale more. In these cases it may be useful to monitor the maximum CPU usage among available CPU cores per host:

100 * max(1 - rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance)

This query may help you identify hosts with uneven CPU load, where some apps cannot scale to more CPU cores.
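
To make the difference between avg and max concrete, here is a small Python sketch of the 2-core example above. The per-core busy fractions are assumed values standing in for 1 - rate(...), not data from a real host.

```python
# Hypothetical per-core busy fractions (1 - idle rate) over the window:
# a single-threaded app saturates cpu0 while cpu1 stays idle.
per_core_usage = {"cpu0": 1.0, "cpu1": 0.0}

# avg(...) by (instance): the idle core hides the saturated one.
avg_percent = 100.0 * sum(per_core_usage.values()) / len(per_core_usage)

# max(...) by (instance): reports the busiest core on the host.
max_percent = 100.0 * max(per_core_usage.values())

print(avg_percent, max_percent)
```

The average reports 50% while the maximum reports 100%, so only the max-based value shows that one core is already saturated.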

Reiff answered 18/2, 2021 at 16:4 Comment(2)

This rate is actually the average of the user, system etc. CPU usage rather than the sum of all the non-idle CPU usage. – Censurable 26/7, 2023 at 9:30

@VictorWong, thanks for the pointer! I updated the answer to make it more useful – Reiff 26/7, 2023 at 20:53

The following query returns the CPU usage of each node:

100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100)
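
Since avg is linear, 100 - avg(rate) * 100 gives the same number as 100 * avg(1 - rate); only the rate window differs between such queries. A quick Python check with made-up per-core idle rates (assumed values, not measurements):

```python
import math

# Hypothetical per-core idle rates for one instance (invented values).
idle_rates = [0.9, 0.4, 0.75, 0.25]


def avg(values):
    """Arithmetic mean, standing in for PromQL avg by (instance)."""
    return sum(values) / len(values)


# Form used in this answer: 100 - avg(idle) * 100
form_a = 100 - avg(idle_rates) * 100

# Alternative form: 100 * avg(1 - idle)
form_b = 100 * avg([1 - r for r in idle_rates])

print(math.isclose(form_a, form_b))
```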

Lubricate answered 19/10, 2022 at 16:38 Comment(0)
