Calculating average time a value was set to 0 before transitioning to 1

I have set up Prometheus monitoring and I'm generating an 'uptime' report based on a criterion such as 'error rate < x%'. The corresponding PromQL is

( 
  sum(increase(errors[5m]))
  / sum(increase(requests[5m]))
) <= bool 0.1

This gets displayed in a single-stat panel in Grafana.

What I want to achieve now is an average of how long it took to recover from a 'downtime' state. Graphically, I need the average duration of the intervals marked 1 and 2 below.

Uptime graph

How can I calculate this measure in Prometheus?


Update: I am not looking for the fraction of time the stat was 0, but for the average length of the uninterrupted intervals during which the stat was 0.

As an example, consider the following time series (assume the value is sampled once per minute):

1 1 1 0 0 1 1 1 1 1 0 0 0 1 

We basically have two "down" intervals: 0 0 and 0 0 0. Durations are by definition 2 minutes and 3 minutes, therefore the mean time to recovery is (2+3)/2 = 2.5.
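
To pin down the measure being asked for, here is a minimal Python sketch that computes it offline from a list of samples. The function names and the `interval_minutes` parameter are made up for illustration; this is not a PromQL solution, only a reference for what the result should be.

```python
from itertools import groupby


def down_interval_lengths(samples):
    """Return the length of every uninterrupted run of zeros."""
    return [sum(1 for _ in run) for value, run in groupby(samples) if value == 0]


def mean_time_to_recovery(samples, interval_minutes=1.0):
    """Average duration of the down intervals, or None if there are none."""
    lengths = down_interval_lengths(samples)
    if not lengths:
        return None
    return interval_minutes * sum(lengths) / len(lengths)


samples = [1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1]
print(down_interval_lengths(samples))  # [2, 3]
print(mean_time_to_recovery(samples))  # 2.5
```

Note that a run of zeros at the very end of the list (a downtime that has not recovered yet) is counted as an interval too in this sketch.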

My understanding, based on reading the documentation and on experimentation, is that avg_over_time will calculate an arithmetic mean, e.g. sum(up)/count(up) = 9/14 ≈ 0.64

I need to calculate the first measure, not the second.

Vidicon answered 30/11, 2018 at 14:38 Comment(3)
If datapoints are coming at a regular and known interval, you can count the number of zeros and compute the duration. Not elegant, but it may work.Leena
@YuriLachin - and how would I do that? Sorry, it may seem obvious, but I need the uninterrupted counts, so in the graph above not count(1+2) but count(1), count(2) .Vidicon
I'm not familiar with PromQL, sorry.Leena

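TL;DR

You need to turn the expression into a 0-or-1 time series with a recording rule. Define the rule in a rules file, and add the path of that file to rule_files in your prometheus.yml. In Prometheus 2.x the rules file is YAML, with the metric name under record and the expression under expr; the line below is shorthand for that.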

my_metric_below_threshold = (sum(increase(errors[5m])) / sum(increase(requests[5m]))) <= bool 0.1

And then you can do avg_over_time(my_metric_below_threshold[5m])

The full details:

Basically, what you need is avg_over_time over values that are 0 or 1. However, the result of an expression with the bool modifier is an instant vector, while avg_over_time expects a range vector as its argument. The difference between the two is:

Instant vector - a set of time series containing a single sample for each time series, all sharing the same timestamp

Range vector - a set of time series containing a range of data points over time for each time series
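
To make the difference concrete, this is roughly how the two types look when they come back from the Prometheus HTTP API, where an instant vector has resultType "vector" (one `value` pair per series) and a range vector has resultType "matrix" (a list of `values` per series). The timestamps, sample values and the helper name `avg_over_range` below are made up for illustration.

```python
# Illustrative payloads only: timestamps and values are invented.
instant_vector = [
    {
        "metric": {"__name__": "my_metric_below_threshold"},
        "value": [1543588680, "1"],  # a single [timestamp, value] pair
    },
]

range_vector = [
    {
        "metric": {"__name__": "my_metric_below_threshold"},
        "values": [  # several [timestamp, value] pairs over time
            [1543588560, "1"],
            [1543588620, "0"],
            [1543588680, "1"],
        ],
    },
]


def avg_over_range(series):
    """Arithmetic mean of one range-vector series, like avg_over_time."""
    values = [float(v) for _, v in series["values"]]
    return sum(values) / len(values)


print(avg_over_range(range_vector[0]))
```

A function like this only makes sense on the second shape, which is why avg_over_time cannot be applied directly to the result of the bool comparison.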

The solution for this is to use recording rules. You can read the discussion about this on the Prometheus GitHub, in this Stack Overflow question, and in this explanation: https://www.robustperception.io/composing-range-vector-functions-in-promql.

There are two general types of functions in PromQL that take timeseries as input, those that take a vector and return a vector (e.g. abs, ceil, hour, label_replace), and those that take a range vector and return a vector (e.g. rate, deriv, predict_linear, *_over_time).

There are no functions that take a range vector and return a range vector, nor is there a way to do any form of subquery. Even with support for subqueries, you wouldn't want to use them regularly as they'd be expensive. So what to do instead?

The answer is to use a recording rule for the inner function, and then you can use the outer function on the time series it creates.

So, as I explained above and as the quotes above (taken from a core Prometheus developer) confirm, you should be able to get what you need.


Added after question edit:

Doing this is not straightforward, since you need a "memory" of the last samples. However, it can be done using the Textfile Collector and the Prometheus HTTP API.

  1. Define my_metric_below_threshold using a recording rule as described above.

  2. Install Node exporter with Textfile Collector.

    The textfile collector is similar to the Pushgateway, in that it allows exporting of statistics from batch jobs. It can also be used to export static metrics, such as what role a machine has. The Pushgateway should be used for service-level metrics. The textfile module is for metrics that are tied to a machine. To use it, set the --collector.textfile.directory flag on the Node exporter. The collector will parse all files in that directory matching the glob *.prom using the text format.

  3. Write a script (e.g. successive_zeros.py) in Python or Bash, which can run anywhere, to query this metric through the Prometheus HTTP API endpoint GET /api/v1/query.

  4. Keep the count of successive zeros between runs, and increment or reset it on each run. An environment variable set inside the script does not survive into the next run, so the count has to be persisted somewhere else, for example in a small state file.

  5. Write the result in the format described in the Textfile Collector documentation. Then you have your successive_zeros_metrics in Prometheus.

  6. Do avg_over_time() over successive_zeros_metrics

This is a rough sketch of the concept I am talking about:

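#!/usr/bin/env python3

# Run once per minute as the node-exporter user, like so:
# * * * * * node-exporter /path/to/runner successive_zeros.py
# (the runner is assumed to write stdout to a *.prom file in the textfile directory)

import pathlib
import requests

# Placeholders: adjust the Prometheus address and the state file location.
PROMETHEUS_URL = 'http://prometheus:9090/api/v1/query'
STATE_FILE = pathlib.Path('/var/tmp/successive_zeros.state')

r = requests.get(PROMETHEUS_URL, params={'query': 'my_metric_below_threshold'}, timeout=10)
r.raise_for_status()
result = r.json()['data']['result']
if not result:
    raise SystemExit('query returned no samples')

# Each entry looks like {'metric': {...}, 'value': [timestamp, '0' or '1']}.
is_up = float(result[0]['value'][1])

# Environment variables do not persist between runs, so keep the counter in a file.
successive_zeros = int(STATE_FILE.read_text()) if STATE_FILE.exists() else 0

if is_up == 0:
    successive_zeros += 1  # still down: extend the current run of zeros
else:
    successive_zeros = 0   # recovered: start counting from scratch

STATE_FILE.write_text(str(successive_zeros))
print('successive_zeros_metrics %d' % successive_zeros)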
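
Step 5 (writing the result for the Textfile Collector) could look roughly like the following. This is a sketch under assumptions: the function name `write_prom_file` and the one-file-per-metric layout are made up here, and the directory is whatever you passed to --collector.textfile.directory. The file is written under a temporary name first and then renamed, so the collector never picks up a half-written file.

```python
import os
import tempfile


def write_prom_file(directory, metric_name, value):
    """Atomically write a single gauge sample in the Prometheus text format."""
    body = f"# TYPE {metric_name} gauge\n{metric_name} {value}\n"
    final_path = os.path.join(directory, f"{metric_name}.prom")

    # The temporary name does not match the *.prom glob, so it is ignored
    # by the collector until the rename below.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        handle.write(body)

    # mkstemp creates the file readable by its owner only; relax that so a
    # node exporter running as another user can read it.
    os.chmod(tmp_path, 0o644)
    os.replace(tmp_path, final_path)
    return final_path
```

For example, `write_prom_file("/path/to/textfile/dir", "successive_zeros_metrics", 3)` would leave a file named successive_zeros_metrics.prom in that directory (the path here is a placeholder).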
Fiddle answered 5/12, 2018 at 20:49 Comment(8)
Thanks for taking the time to write this comprehensive reply. I don't think avg_over_time is what I'm looking for, please see the update I made.Vidicon
@Robert Munteanu OK, that is not that simple to do. However, I have an idea for you. First, define a new metric in a recording rule as I explained. Then you can use the textfile collector from a shell/Python script (example1, example2, and there are more). You can query Prometheus using the HTTP API and count the successive zeros.Fiddle
Then report this number to Prometheus using the textfile collector. Eventually you would have a new metric reporting the number of successive zeros, and you can do avg_over_time() over it. Is this clear? Do you need me to elaborate?Fiddle
Thanks, I need to try this myself and will reply once I know if this works for me or not.Vidicon
I think running an out-of-process exporter is the way forward, so the design direction is the one you proposed. About the successive_zeros metric, I don't think it will work. The reason is that I will reset the metric to 0 and that value will skew the value of avg_over_time. The only way I can see this now is calculating instant values for mttr_1d, mttr_7d, etc. But this is close enough for me so thank you for your effort - I will accept the answer.Vidicon
Any updates on this? I'm in the same boat. What did you end up with? @RobertMunteanuInglebert
@LeoUfimtsev - I did not pursue this any further. I think an out-of-process exporter will help here, even though from an architectural point of view it does not look optimal.Vidicon
Thank you for your reply SirInglebert

The following query should return the average duration for which the m value was 0 before transitioning to 1, over the last 7 days:

(count_over_time((m == 0)[7d:1m]) * 60) / resets((m !=bool 0)[7d:1m])

The query assumes that the interval between samples (aka scrape_interval) equals one minute (see 1m in square brackets). It uses a Prometheus subquery alongside the following functions:

  • count_over_time - it returns the number of samples in m with zero values. This number is multiplied by the number of seconds in one minute - 60. The result is the total duration when m was 0 over the last 7 days.
  • resets - it returns the number of times m !=bool 0 was reset from 1 to 0. This roughly matches the number of spans with zeroes for m over the last 7 days.

The m !=bool 0 expression uses the bool modifier for the != operation.
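
The arithmetic behind the query can be checked offline with a small Python simulation. This is only a sketch of what count_over_time and resets compute on a list of samples; the function names and the `step_seconds` parameter are made up for illustration.

```python
def count_zero_samples(samples):
    """What count_over_time(m == 0) sees: the number of samples equal to 0."""
    return sum(1 for value in samples if value == 0)


def resets(samples):
    """Number of decreases between consecutive samples, as resets() counts them."""
    return sum(1 for prev, cur in zip(samples, samples[1:]) if cur < prev)


def estimated_mean_downtime_seconds(samples, step_seconds=60):
    """Total time at 0 divided by the number of 1 -> 0 transitions."""
    transitions = resets(samples)
    if transitions == 0:
        return None  # PromQL would yield NaN or Inf here instead
    return count_zero_samples(samples) * step_seconds / transitions


# Series from the question: down intervals of 2 and 3 minutes.
question_series = [1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1]
print(estimated_mean_downtime_seconds(question_series))  # 150.0

# A window that starts inside a downtime: the first interval has no 1 -> 0 drop.
print(estimated_mean_downtime_seconds([0, 0, 1, 0]))  # 180.0
```

The first result is 150 seconds, i.e. the expected 2.5 minutes. The second shows why the match is only rough: the true average for that window is 90 seconds (intervals of 2 minutes and 1 minute), but the downtime at the start of the window is not counted by resets.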

Now it's time to expand m into (sum(increase(errors[5m])) / sum(increase(requests[5m]))) <= bool 0.1:

(count_over_time((
  ((sum(increase(errors[5m])) / sum(increase(requests[5m]))) <= bool 0.1) == 0
)[7d:1m]) * 60)
  /
resets((
  ((sum(increase(errors[5m])) / sum(increase(requests[5m]))) <= bool 0.1) !=bool 0
)[7d:1m])
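
If the expression, the window or the step need to change, the query text can be generated instead of edited by hand. The helper below is a hypothetical convenience (the names `build_mttr_query` and `step_to_seconds` are not part of any Prometheus client); it only does string formatting and keeps the multiplier in sync with the subquery step.

```python
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def step_to_seconds(step):
    """Convert a simple single-unit duration such as '1m' or '30s' to seconds."""
    return int(step[:-1]) * UNIT_SECONDS[step[-1]]


def build_mttr_query(expr, window="7d", step="1m"):
    """Build the count_over_time / resets query for a 0-or-1 expression."""
    seconds = step_to_seconds(step)
    # The expression is wrapped in parentheses so that operator precedence
    # inside it cannot interact with the == and != comparisons.
    return (
        f"(count_over_time((({expr}) == 0)[{window}:{step}]) * {seconds})"
        f" / resets((({expr}) !=bool 0)[{window}:{step}])"
    )


print(build_mttr_query("m"))
print(build_mttr_query("my_metric_below_threshold", window="1d", step="30s"))
```

The step should match the actual interval between samples, as the query above assumes.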

P.S. This monstrous query can be simplified somewhat by using WITH templates from VictoriaMetrics:

with (
  m = (sum(increase(errors)) / sum(increase(requests))) <= bool 0.1
)
(count_over_time((m == 0)[7d:1m]) * 1m) / resets((m !=bool 0)[7d:1m])
Ricoriki answered 28/3, 2022 at 15:46 Comment(0)
