Set Stackdriver alerts for specific error messages

I cannot find a clean way to set up Stackdriver alert notifications for errors in Cloud Functions.

I am using a Cloud Function to process data into Cloud Datastore. There are two types of errors that I want to be alerted on:

  1. Technical exceptions which might cause the function to 'crash'
  2. Custom errors that we are logging from the Cloud Function (a sketch of how such errors might be logged follows this list)
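
For illustration, a custom error of this kind might be written from a Python Cloud Function roughly as below. This is only a sketch: the entry point, marker string and Datastore helper are made-up placeholders; the important part is logging a fixed, searchable marker for the metric filter to match.

    import logging

    # Hypothetical marker; the log-based metric filter matches on this fixed string.
    CUSTOM_ERROR_MARKER = "DATASTORE_PROCESSING_ERROR"

    def save_to_datastore(event):
        """Placeholder for the real Cloud Datastore write."""
        raise NotImplementedError

    def process_event(event, context):
        """Hypothetical background-function entry point."""
        try:
            save_to_datastore(event)
        except Exception as exc:
            # The marker makes the entry easy to match in a log filter. Depending on
            # the runtime, google-cloud-logging's setup_logging() may be needed for
            # the ERROR severity to be preserved in Stackdriver Logging.
            logging.error("%s: failed to process event %s: %s",
                          CUSTOM_ERROR_MARKER, context.event_id, exc)
            raise  # re-raising also covers the 'crash' case (point 1)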

I have done the following:

  • Created a log-based metric searching for specific error messages (although this will not work for the 'crash' case, as the error message can be different each time); a sketch of how such a metric could be created programmatically follows this list
  • Created an alert for this metric in Stackdriver Monitoring, with the parameters shown in the code section below
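
For reference, a log-based metric like this can also be created with the google-cloud-logging Python client. This is only a sketch; the project ID, metric name and filter string are placeholders for whatever you actually match on.

    from google.cloud import logging as cloud_logging

    client = cloud_logging.Client(project="my-project")  # placeholder project ID

    # Counts every Cloud Function log entry that contains the custom error marker.
    metric = client.metric(
        "custom-datastore-errors",  # placeholder metric name
        filter_=(
            'resource.type="cloud_function" '
            'AND severity>=ERROR '
            'AND textPayload:"DATASTORE_PROCESSING_ERROR"'
        ),
        description="Custom errors logged by the data-processing function",
    )
    metric.create()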

This was done as per the answer to the question "How to create alert per error in Stackdriver".

For the first trigger of the condition I receive an email. However, on subsequent triggers, say on the next day, I don't. Also, the incident stays in the 'opened' state.

Resource type: cloud function
Metric: from point 2 above
Aggregation: Aligner: count, Reducer: none, Alignment period: 1m
Configuration: Condition triggers if: Any time series violates,
               Condition: is above, Threshold: 0.001, For: 1 min
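
For completeness, roughly the same policy expressed with the google-cloud-monitoring Python client could look like the sketch below. The project ID, display names and metric name are placeholders, and notification channels are omitted.

    from google.cloud import monitoring_v3
    from google.protobuf import duration_pb2

    client = monitoring_v3.AlertPolicyServiceClient()
    project_name = "projects/my-project"  # placeholder project ID

    condition = monitoring_v3.AlertPolicy.Condition(
        display_name="Custom errors above threshold",
        condition_threshold=monitoring_v3.AlertPolicy.Condition.MetricThreshold(
            # Log-based metrics show up in Monitoring under the user/ prefix.
            filter=(
                'metric.type="logging.googleapis.com/user/custom-datastore-errors" '
                'AND resource.type="cloud_function"'
            ),
            comparison=monitoring_v3.ComparisonType.COMPARISON_GT,
            threshold_value=0.001,                       # "is above 0.001"
            duration=duration_pb2.Duration(seconds=60),  # "For: 1 min"
            aggregations=[monitoring_v3.Aggregation(
                alignment_period=duration_pb2.Duration(seconds=60),  # "Alignment period: 1m"
                per_series_aligner=monitoring_v3.Aggregation.Aligner.ALIGN_COUNT,
            )],
        ),
    )

    policy = monitoring_v3.AlertPolicy(
        display_name="Cloud Function custom errors",
        combiner=monitoring_v3.AlertPolicy.ConditionCombinerType.OR,
        conditions=[condition],
        # notification_channels=[...] would be attached here so that e-mails are sent.
    )

    client.create_alert_policy(name=project_name, alert_policy=policy)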

So I have three questions:

  1. Is this the right way to satisfy my requirement of creating these alerts?

  2. How can I still receive alert notifications for subsequent errors?

  3. How can the incident be set to 'resolved', either automatically or manually?

Fda answered 5/2, 2019 at 11:52 Comment(0)

Normally, alerts resolve themselves once the alerting policy stops firing. The problem you're having with your alerts not resolving is because your metric only writes non-zero points - if there are no errors, it doesn't write zero. That means that the policy never gets an unambiguous signal that everything is fine, so the alerts just sit there (they'll automatically close after 7 days, but I imagine that's not all that useful for you).

This is a common problem and it's a tricky one to solve. One possibility is to write your policy as a ratio of errors to something non-zero, like request count. As long as the request count is non-zero, the ratio will compute zero if there are no errors, and so an alert on the ratio will automatically resolve. You need to be a bit careful about rounding errors, though - if your request count is high enough, you might potentially miss a single error because the ratio could round to zero.
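
As a rough sketch of the idea, the Monitoring API lets a threshold condition divide one metric by another via a denominator filter, so an "error ratio" condition could look something like the following; the metric names here are only examples.

    from google.cloud import monitoring_v3
    from google.protobuf import duration_pb2

    # Ratio condition: errors / requests. While requests keep flowing, the ratio is
    # computed every period (as zero when there are no errors), so the policy gets
    # the "everything is fine" signal it needs to auto-resolve incidents.
    ratio_condition = monitoring_v3.AlertPolicy.Condition(
        display_name="Error ratio above zero",
        condition_threshold=monitoring_v3.AlertPolicy.Condition.MetricThreshold(
            filter='metric.type="logging.googleapis.com/user/custom-datastore-errors"',
            denominator_filter=(
                'metric.type="cloudfunctions.googleapis.com/function/execution_count"'
            ),
            comparison=monitoring_v3.ComparisonType.COMPARISON_GT,
            threshold_value=0,
            duration=duration_pb2.Duration(seconds=60),
            aggregations=[monitoring_v3.Aggregation(
                alignment_period=duration_pb2.Duration(seconds=60),
                per_series_aligner=monitoring_v3.Aggregation.Aligner.ALIGN_RATE,
            )],
            denominator_aggregations=[monitoring_v3.Aggregation(
                alignment_period=duration_pb2.Duration(seconds=60),
                per_series_aligner=monitoring_v3.Aggregation.Aligner.ALIGN_RATE,
            )],
        ),
    )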

Aaron Sher, Stackdriver engineer

Prod answered 11/2, 2019 at 19:19 Comment(2)
For now we will have the support team manually go and acknowledge the error in Stackdriver. – Fda
With the tools available it is tricky to solve, but realistically it seems very easy for Google to solve - just allow a configurable timeout for logging incidents instead of the fixed 7 days?! – Enwrap

I was having a similar problem and managed to at least get a mail every time. The "trick" seems to be to use sum instead of count, in combination with "for: most recent value" - see the screenshot below.

This causes Stackdriver to send a mail every time a matching log entry is found and to close the incident a minute later.

[Screenshot: alert condition configured with Aggregator: sum and Condition triggers if: most recent value]
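
For anyone mapping this to the API, "most recent value" appears to correspond to a duration of 0 seconds, so the condition from the screenshot would translate roughly into the sketch below. Only the condition is shown, and the metric name is a placeholder.

    from google.cloud import monitoring_v3
    from google.protobuf import duration_pb2

    condition = monitoring_v3.AlertPolicy.Condition(
        display_name="Any matching log entry",
        condition_threshold=monitoring_v3.AlertPolicy.Condition.MetricThreshold(
            filter='metric.type="logging.googleapis.com/user/custom-datastore-errors"',
            comparison=monitoring_v3.ComparisonType.COMPARISON_GT,
            threshold_value=0,
            duration=duration_pb2.Duration(seconds=0),  # "most recent value"
            aggregations=[monitoring_v3.Aggregation(
                alignment_period=duration_pb2.Duration(seconds=60),
                per_series_aligner=monitoring_v3.Aggregation.Aligner.ALIGN_SUM,
            )],
        ),
    )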

Hunfredo answered 14/11, 2019 at 8:56 Comment(2)
Thanks! This helped me. I had to set Aggregator to sum, but also, under Advanced Aggregation, set Aligner to sum and Secondary Aggregator to sum. Not sure whether other configurations also solve the above issue, but this worked for me. – Manned
I tried to summarize in this blog how I made it work for my requirement: medium.com/@prasadsawant1107/… – Fda

We got around this issue by adding the insertId as a label on the log-based metric we create for every log record we get from the pods running our services. [Screenshot: log-based metric definition with a record_id label extracted from insertId]

In the alerting policy, this label helped with two things:

  1. We grouped by it (named record_id), which made each incident unique, so it got reported without waiting for other incidents to be resolved, and at the same time it got resolved instantly. [Screenshot: alerting policy grouping by the record_id label]
  2. We used it in the documentation section of the notification to include a direct link to the issue (the log record) itself, which was a nice and essential feature to have: https://console.cloud.google.com/logs/viewer?project=MY_PROJECT&advancedFilter=insertId%3D%22${metric.label.record_id}%22

As @Aaron Sher mentioned in his answer, it is a tricky problem. We might have done something not recommended or not efficient, but it works fine, and of course we are open to recommendations for improvement.
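
In case it helps, a rough sketch of how a metric definition like this could be created through the Logging API's Python client is below. The metric name and filter are placeholders; the label extractor copies insertId into a record_id label so the alerting policy can group by it.

    from google.api import label_pb2, metric_pb2
    from google.cloud import logging_v2

    client = logging_v2.services.metrics_service_v2.MetricsServiceV2Client()

    metric = logging_v2.types.LogMetric(
        name="service-error-records",  # placeholder metric name
        description="One point per error log record, labelled with its insertId",
        filter='resource.type="k8s_container" AND severity>=ERROR',  # placeholder filter
        metric_descriptor=metric_pb2.MetricDescriptor(
            metric_kind=metric_pb2.MetricDescriptor.MetricKind.DELTA,
            value_type=metric_pb2.MetricDescriptor.ValueType.INT64,
            labels=[label_pb2.LabelDescriptor(
                key="record_id",
                value_type=label_pb2.LabelDescriptor.ValueType.STRING,
            )],
        ),
        # Copies each entry's insertId into the record_id label used for grouping
        # and for the ${metric.label.record_id} link in the notification text.
        label_extractors={"record_id": "EXTRACT(insertId)"},
    )

    client.create_log_metric(parent="projects/my-project", metric=metric)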

Xenophobe answered 19/8, 2020 at 13:56 Comment(0)
