How to get overall uptime of a server with prometheus and node_exporter

I'm looking for a query to get the average uptime of the server on which prometheus runs over the last week. It should be about 15h/week, so about 8-10 %.

I'm using Prometheus 2.5.0 with node_exporter on CentOS 7.6.1810. My most promising experiments would be:

1. avg_over_time(up{job="prometheus"}[7d])

This is what I found when looking for ways to get average uptimes, but it gives me exactly 1. (My guess is that it ignores the times in which no scrapes happened?)

2. sum_over_time(up{job="prometheus"}[7d]) * 15 / 604800

This technically works, but it depends on the scrape interval, which is 15s in my case. I can't seem to find a way to get that interval from Prometheus's config, so I have to hardcode it into the query.
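To make the difference between the two attempts concrete, here is a minimal Python sketch (not PromQL) under assumed numbers: a 15 s scrape interval, a server that is up for 15 hours of the 7-day window, and no samples at all while it is down. The function names are hypothetical and only mimic what the two queries compute.

```python
SCRAPE_INTERVAL = 15          # seconds, assumed (matches the question)
WINDOW = 7 * 24 * 3600        # 604800 seconds in 7 days
UPTIME = 15 * 3600            # assumed: server up for 15 h in the window

# While the server is up, Prometheus scrapes itself and records up == 1.
# While it is down, nothing is recorded, so the series has gaps, not zeros.
samples = [1] * (UPTIME // SCRAPE_INTERVAL)


def avg_over_existing(values):
    """Mimics avg_over_time: averages only the samples that exist."""
    return sum(values) / len(values)


def uptime_fraction_hardcoded(values, interval, window):
    """Mimics sum_over_time(...) * interval / window."""
    return sum(values) * interval / window


print(avg_over_existing(samples))
print(round(uptime_fraction_hardcoded(samples, SCRAPE_INTERVAL, WINDOW), 4))
```

With these assumptions the first function can only ever see ones, which would explain the result of exactly 1, while the second lands at 15/168 of the week, inside the expected 8-10 % range.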

I've also tried to find ways to get all start and end times of a job, but to no avail thus far.

Salter answered 24/9, 2019 at 12:19 Comment(2)
Are you running the Prometheus server on the same node? The up metric indicates whether the scrape was successful, so if the monitoring server is down and not scraping, you won't get any 0s for up. - Ortrud
Yes, the server is basically supposed to check its own uptime. - Salter

Here you go. Don't ask. (o:

avg_over_time(
  (
    sum without() (up{job="prometheus"})
      or
    (0 * sum_over_time(up{job="prometheus"}[7d]))
  )[7d:5m]
)

To explain that bit by bit:

  1. sum without() (up{job="prometheus"}): take the up metric (the sum without() part is there to get rid of the metric name while keeping all other labels);
  2. 0 * sum_over_time(up{job="prometheus"}[7d]): produces a zero-valued vector for each of the up{job="prometheus"} label combinations seen over the past week (e.g. in case you have multiple Prometheus instances);
  3. or the two together, so you get the actual value where available, zero where missing;
  4. [7d:5m]: PromQL subquery, produces a range vector spanning 7 days, with 5 minute resolution based on the expression preceding it;
  5. avg_over_time: takes an average over time of the up metric with zeroes filled in as defaults, where missing.
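The steps above can be sketched numerically. This is a simplified Python model, not what Prometheus executes: it assumes a hypothetical uptime pattern (up for the first 15 hours of the window), treats a sample as either present or missing at each 5-minute step, and ignores Prometheus's lookback and staleness handling. All function names are made up for illustration.

```python
STEP = 5 * 60                 # subquery resolution: 5 minutes
WINDOW = 7 * 24 * 3600        # subquery range: 7 days


def is_up(t):
    """Hypothetical uptime pattern: up for the first 15 h of the window."""
    return t < 15 * 3600


def filled_value(t):
    """Models `up or 0 * sum_over_time(...)`: real value if present, else 0."""
    up_sample = 1 if is_up(t) else None   # None models a missing sample
    return up_sample if up_sample is not None else 0


def subquery_average():
    """Models avg_over_time over the evaluation steps of the [7d:5m] subquery."""
    values = [filled_value(t) for t in range(0, WINDOW, STEP)]
    return sum(values) / len(values)


print(round(subquery_average(), 4))
```

Because the zeros are filled in at evenly spaced evaluation steps, the average no longer depends on the scrape interval, only on the subquery resolution.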

You may also want to tack on an and sum_over_time(up{job="prometheus"}[7d]) to the end of that expression, to only get a result for label combinations that existed at some point over the previous 7 days. Otherwise, because of the combination of the 7-day range and the 7-day subquery, you'll get results for all combinations seen over the previous 14 days.

It is not an efficient query by any stretch of the imagination, but it does not require you to hardcode your scrape interval into the query. As requested. (o:

Casual answered 24/9, 2019 at 14:17 Comment(1)
Thanks. I had to upgrade to Prometheus 2.12, but this does the job. - Salter

There are two useful metrics, node_time_seconds and node_boot_time_seconds. You can get the server uptime as follows:

node_time_seconds - node_boot_time_seconds

source: https://github.com/prometheus/node_exporter/issues/1895

BUT these two are gauges, not counters: for example, rebooting the server changes node_boot_time_seconds to the new boot time, so the difference drops back to near zero. I was able to work around this with the increase function, which treats each drop as a counter reset, so the difference behaves like a counter. For example, the overall uptime of my server over 1 hour:

increase((node_time_seconds - node_boot_time_seconds{instance="gateway01"})[1h:1m])
# result ==>
{address="192.168.1.45:9100", instance="gateway01", job="node_exporter"}
3504.9516251127598

To get the overall uptime of the server over one week, as a fraction of that week, I think this would work:

increase((node_time_seconds - node_boot_time_seconds{instance="gateway01"})[7d:1m]) / 24 / 3600 / 7
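As a rough illustration of why increase can work on this gauge, here is a small Python sketch of reset-aware summation. It is a simplification, not Prometheus's implementation: it ignores the extrapolation that increase applies at the window boundaries, and the sample values are invented.

```python
def increase_with_resets(values):
    """Sum of increases, treating any drop as a counter reset to zero.

    Simplified: Prometheus's increase() also extrapolates to the window
    boundaries, which is ignored here.
    """
    total = 0.0
    for prev, cur in zip(values, values[1:]):
        if cur >= prev:
            total += cur - prev   # normal growth between two samples
        else:
            total += cur          # drop: assume the value restarted from 0
    return total


# Hypothetical window sampled every 60 s: the machine has already been up
# for 1000 s, runs 20 more minutes, is down for a while (no samples), then
# reboots and is first sampled 30 s after boot.
before = [1000 + 60 * i for i in range(21)]   # uptime 1000 .. 2200
after = [30 + 60 * i for i in range(30)]      # uptime 30 .. 1770 after reboot
samples = before + after

print(increase_with_resets(samples))
```

In this made-up case the result is 1200 s before the reboot plus 1770 s after it. One caveat under the same assumptions: a reboot is only noticed if the first uptime value after it is lower than the last value before it.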
Felony answered 17/2, 2023 at 8:7 Comment(0)
