How to do percentiles on custom metrics in Azure AppInsights?

I've used Prometheus to store performance metrics and query the results as percentiles (e.g., 95th-percentile response time). I used prometheus-net to emit them.

What is the equivalent in Azure AppInsights?

I see there are percentile functions in AppInsights/Kusto but when I use GetMetric("blah").TrackValue(42) it stores Count, Min, Max, Sum, and StdDev, which isn't the histogram bucketing approach I'm used to in Prometheus.

for(int i=0; i < 500; i++) {
  //Write some metrics
  telemetryClient.GetMetric("blah").TrackValue(42); //real data isn't constant
}

customMetrics
| where name == "blah" 
//| summarize avg(value), percentiles(value, 50, 95)  by bin(timestamp, 2m)

Here is some data I logged with randomized values. The value column holds the sum rather than the individual measurements, so I don't see how I can properly compute percentiles on this data.

Apache answered 26/9, 2019 at 20:25 Comment(0)

Individual values are not stored when the GetMetric().TrackValue() API is used with the default aggregations. Instead, one aggregated value is produced per minute and sent to AI with its sum/count/min/max/... distribution. Therefore, it's not possible to plot percentiles of the original data points in Analytics later on.
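
To illustrate why the aggregate fields are not enough, here is a minimal Python sketch (not Application Insights code). The two data sets are made up for illustration, and the use of a population standard deviation is an assumption; which variant the SDK reports does not change the point.

```python
from statistics import pstdev

def summarize(values):
    """Reduce raw values to the kind of aggregate fields described above."""
    return {
        "count": len(values),
        "sum": sum(values),
        "min": min(values),
        "max": max(values),
        # Population standard deviation (an assumption for this sketch).
        "stddev": round(pstdev(values), 9),
    }

def percentile(values, p):
    """Nearest-rank percentile for an integer p between 1 and 100."""
    ordered = sorted(values)
    rank = max(1, (p * len(ordered) + 99) // 100)  # integer ceil(p * n / 100)
    return ordered[rank - 1]

# Hypothetical measurements: the second list mirrors the first around its mean (5).
fast_mostly = [0, 2, 2, 2, 9, 10, 10]
slow_mostly = [10 - v for v in fast_mostly]

# Identical aggregates...
print(summarize(fast_mostly) == summarize(slow_mostly))
# ...but very different medians.
print(percentile(fast_mostly, 50), percentile(slow_mostly, 50))
```

Both lists reduce to the same count, sum, min, max, and standard deviation, yet their medians differ, so no query over the aggregates alone can tell them apart.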

Only a few aggregations are currently available for the GetMetric().TrackValue() API, and histogram / tdigest is not one of them. You can submit a feature request (or a contribution) on the AI SDK GitHub repository.

The workaround for the time being is to use an older API that submits point-in-time metrics without aggregation by default: TrackMetric(), or a series of measurements in TrackEvent(). This increases the number of telemetry items sent (each metric is sent separately, without the 1-minute aggregation of values), but it gives you each individual value, so you can perform percentile aggregation in Analytics if necessary.

Sadyesaechao answered 26/9, 2019 at 23:12 Comment(4)
Thanks for the details, that answered my question. It's really unfortunate, though, because it severely limits the usefulness of these metrics. I'm surprised an offering like AppInsights doesn't cover what I'd consider a core metrics scenario. I'll log an enhancement, but in the interim I expect we'll have to move to a non-Azure solution because of this miss. Thanks for the quick response! - Apache
Follow-up question @dmitry-matveev: is this just an AppInsights SDK shortcoming? I'm wondering whether LogAnalytics and the percentile function support the necessary data structures/logic, or whether there are issues across the stack that would need to be addressed. - Apache
You can always use the percentile function on non-aggregated data in the Analytics portal. You can send non-aggregated data to Application Insights or to Log Analytics, and querying it with the percentile function will yield the correct value in either case, so this is not restricted to Log Analytics. However, as you correctly pointed out, the default experience in AI sends pre-aggregated metrics, so you'd need to switch to the non-default TrackMetric API instead. - Sadyesaechao
Non-aggregated data just won't scale well (unless I'm missing something), since we're planning to process billions of events a day. My previous question was asking whether LogAnalytics does the necessary bucketing (like Prometheus does, or something equivalent) so that you can get percentiles from aggregate metrics. - Apache
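
For readers unfamiliar with the bucketing approach mentioned in the last comment, here is a rough Python sketch of the general idea: keep per-bucket counts, merge them by addition, and interpolate inside the bucket that contains the target rank. It is a simplified illustration in the spirit of Prometheus histograms, not Application Insights or Log Analytics functionality; the bucket bounds and sample values are hypothetical.

```python
from bisect import bisect_left

# Hypothetical bucket upper bounds in ms; the last bucket is open-ended.
BOUNDS = [10, 25, 50, 100, 250, 500, 1000, float("inf")]

def to_histogram(values):
    """Count how many values fall into each bucket (value <= upper bound)."""
    counts = [0] * len(BOUNDS)
    for v in values:
        counts[bisect_left(BOUNDS, v)] += 1
    return counts

def merge(a, b):
    """Histograms from different instances or minutes combine by adding counts."""
    return [x + y for x, y in zip(a, b)]

def estimate_quantile(counts, q):
    """Estimate the q-quantile (0 < q <= 1) by linear interpolation in a bucket."""
    total = sum(counts)
    if total == 0:
        return None
    target = q * total
    cumulative = 0
    for i, c in enumerate(counts):
        if c and cumulative + c >= target:
            lower = BOUNDS[i - 1] if i > 0 else 0.0
            upper = BOUNDS[i]
            if upper == float("inf"):
                return lower  # cannot interpolate inside the open-ended bucket
            return lower + (upper - lower) * (target - cumulative) / c
        cumulative += c
    return None

# Two hypothetical instances reporting only bucket counts, never raw values.
instance_a = [5] * 50 + [40] * 40 + [200] * 10
instance_b = [5] * 80 + [700] * 20
merged = merge(to_histogram(instance_a), to_histogram(instance_b))

# The estimate lands inside the (500, 1000] bucket, where the true p95 (700) lies.
print(estimate_quantile(merged, 0.95))
```

The payload per instance stays a fixed handful of counters regardless of event volume, and the accuracy of the estimate is bounded by the bucket width.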

AFAIK, this is a common statistics problem: percentile values can be derived from the mean and standard deviation only if the data follows a normal distribution.

Also, calculating percentiles is more expensive than calculating sum, count, min, max, and standard deviation, all of which can be computed in a running fashion. I'm guessing that's why Application Insights does this.

Here is the formula:

Percentile Value = μ + zσ

where

μ: Mean
z: z-score from z table that corresponds to percentile value
σ: Standard deviation

Ref: https://www.statology.org/calculate-percentile-from-mean-standard-deviation/

The z-score value for P95 is 1.645, and for P99 is 2.326.

Ref: https://www.mymathtables.com/statistic/z-score-percentile-normal-distribution.html
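
As a quick check of the constants above, the z-scores can be reproduced with Python's standard library instead of a z table. This is only a sketch: the mean of 120 ms and standard deviation of 30 ms are made-up numbers for illustration.

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1

# The z-score for a percentile is the inverse CDF evaluated at that probability.
z95 = std_normal.inv_cdf(0.95)
z99 = std_normal.inv_cdf(0.99)

def normal_percentile(mean, stddev, z):
    """Percentile value = mean + z * stddev (valid only for normal data)."""
    return mean + z * stddev

# Hypothetical one-minute aggregate: mean 120 ms, standard deviation 30 ms.
p95_estimate = normal_percentile(120.0, 30.0, z95)

print(round(z95, 3), round(z99, 3), round(p95_estimate, 1))
```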

Here is the Kusto query. Note that I use a percentile() aggregation in summarize, but you could choose min(), max(), or avg() depending on your needs (relevant for bin intervals larger than 1m).

customMetrics
| where name == "<METRIC_NAME>"
| extend mean = value / valueCount
| extend p95_zscore = 1.645
| extend p95calc = mean + (p95_zscore * valueStdDev)
| extend p99_zscore = 2.326
| extend p99calc = mean + (p99_zscore * valueStdDev)
| summarize
    avg = sum(value) / sum(valueCount),
    p95 = percentile(p95calc, 95),
    p99 = percentile(p99calc, 99)
    by ts = bin(timestamp, 1m)
| render timechart

Update 1: To figure out whether the metric follows a normal distribution, take a few sample 1m intervals with all data points and plot them. In my case the distribution was not normal, so this approach is useless for percentiles on my metric. I wish AppInsights also provided pre-aggregated P95 and P99 values. I guess I'll have to hand-roll my own implementation.
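
To show how far off the estimate can be on skewed data, here is a small Python sketch. It assumes an exponential shape as a stand-in for right-skewed latencies (a hypothetical choice, not measured data) and compares the actual percentiles with the mean + z * stddev estimate.

```python
import math
from statistics import NormalDist, fmean, pstdev

def percentile(values, p):
    """Nearest-rank percentile for an integer p between 1 and 100."""
    ordered = sorted(values)
    rank = max(1, (p * len(ordered) + 99) // 100)  # integer ceil(p * n / 100)
    return ordered[rank - 1]

def normal_estimate(values, p):
    """Estimate the p-th percentile as mean + z * stddev (assumes normality)."""
    z = NormalDist().inv_cdf(p / 100)
    return fmean(values) + z * pstdev(values)

# Deterministic stand-in for skewed data: evenly spaced quantiles of an
# exponential distribution with rate 1 (an assumption, not real telemetry).
n = 1000
samples = [-math.log(1 - (i + 0.5) / n) for i in range(n)]

actual_p95, estimated_p95 = percentile(samples, 95), normal_estimate(samples, 95)
actual_p99, estimated_p99 = percentile(samples, 99), normal_estimate(samples, 99)

print("p95 actual vs estimate:", round(actual_p95, 2), round(estimated_p95, 2))
print("p99 actual vs estimate:", round(actual_p99, 2), round(estimated_p99, 2))
```

On this shape the normal-based estimate undershoots both percentiles, and the gap widens toward the tail, which is the region percentiles are usually meant to capture.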

PS: I'm not a stats person.

Trihedron answered 29/11, 2022 at 6:39 Comment(0)
