How to stop exception alerts from going berserk

Asked 28/10, 2010 at 15:25 Answered 15/7, 2017 at 7:44

Solved .net design-patterns exception error-handling alerts

Let's say you have a .NET system that needs to send out email notifications to a system administrator when there's an error. Example:

try
{
    //do something mission critical 
}
catch(Exception ex)
{
    //send ex to the system administrator
    //give the customer a user-friendly explanation
}

This block of code gets called hundreds of times a second by different users.

Now let's say an underlying API/service/database goes down. This code is going to fail many, many times. The poor administrator is going to wake up to a few million e-mails in their inbox and the developer is going to get a rude phone call, not that such an incident (cough) necessarily occurred this morning.

It's pretty clear that this is not a design that scales well.

The first few solutions that come to mind are all flawed in some way:

Log errors to the database, then expose high error counts through an HTTP Health Check to an external monitoring service such as Pingdom. (My favourite candidate so far. But what if the database goes down?)
Have a static cache that keeps track of recent exceptions, and have the alert system always check it for duplicates first. (Seems unnecessarily complex, and a lot of error messages differ very slightly, e.g. if there is a timestamp in the error, the duplicate check is useless.)
Programmatically take our system offline after certain errors or based on constant monitoring of critical dependencies (Risky! What if there's a transient false positive?)
Just not alert on those errors, and rely on a different part of the system to monitor and report on the dependencies. (Doesn't cater for the 'unexpected' errors that we haven't anticipated.)

This seems like a problem that has to have been solved, and that we're going about it in a silly way. Suggestions appreciated, even if they involve a completely different exception management strategy!

Jericajericho answered 28/10, 2010 at 15:25 Comment(1)

You said you like the idea of logging to the database. I agree that it is the best option for getting the control you want. You said, however, that the database might be down, and that's why you are a little afraid of this option. I say you have to live with this constraint. If the database is down, you probably have more problems than just this part of your system (exception handling) going down. It's impossible to foresee everything that might happen and handle it all. Otherwise we would have to build systems that can automatically handle a power outage, for example. – Insomniac 16/11, 2010 at 1:6

The simplest solution that springs to mind is to assign this exception block an ID number (like, 1) and log the time of the last notification to the administrator. If the elapsed time since the last notification is not large enough (say, an hour), don't notify the admin again.

If this piece of code typically generates more than one kind of exception, you may also want to log the class of the exception; if the elapsed time since the last notification for the same exception class is not large enough, don't notify the admin again.
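A minimal sketch of that idea, assuming an in-memory store is acceptable (the state is lost on restart and is not shared between servers). The class name AlertThrottle, the method ShouldNotify and the injectable clock are all made up for illustration; they are not from any library.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: suppresses repeat alerts for the same
// (block ID, exception class) pair until a minimum interval has passed.
public sealed class AlertThrottle
{
    private readonly TimeSpan _minInterval;
    private readonly Func<DateTime> _clock;
    private readonly object _gate = new object();
    private readonly Dictionary<(int BlockId, Type ExceptionType), DateTime> _lastSent
        = new Dictionary<(int BlockId, Type ExceptionType), DateTime>();

    // clock is an assumption added for testability; defaults to UTC wall time.
    public AlertThrottle(TimeSpan minInterval, Func<DateTime> clock = null)
    {
        _minInterval = minInterval;
        _clock = clock ?? (() => DateTime.UtcNow);
    }

    // Returns true if the caller should notify the administrator now.
    public bool ShouldNotify(int blockId, Exception ex)
    {
        var key = (blockId, ex.GetType());
        var now = _clock();
        lock (_gate)
        {
            // Too soon since the last alert for this key: stay quiet.
            if (_lastSent.TryGetValue(key, out var last) && now - last < _minInterval)
                return false;

            _lastSent[key] = now;
            return true;
        }
    }
}
```

In the catch block you would then only send the e-mail when something like throttle.ShouldNotify(1, ex) returns true, and still show the customer the friendly message every time.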

Collect answered 28/10, 2010 at 15:39 Comment(2)

I agree with this solution: Categorize the exception somehow, put your own exception handling module in between and use log4net or EntLib for the handling below your own module (lots of features out-of-the-box) – Cristincristina 10/11, 2010 at 7:39

I use an aggregator helper with an LRU map; the key is the exception's Class so that I can filter duplicate exceptions. Every X many instances (or per unit time) I emit the exception with a "20 more like this". – Dig 15/11, 2010 at 15:29
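Something along these lines, as a rough sketch only: it uses a plain dictionary instead of an LRU map (so entries are never evicted), keys on the exception class, and emits on every Nth instance rather than per unit of time. ExceptionAggregator, Report and the emit callback are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical aggregator: emits the first occurrence of each exception
// class immediately, then one summary per batchSize further occurrences.
public sealed class ExceptionAggregator
{
    private readonly int _batchSize;
    private readonly Action<string> _emit;
    private readonly object _gate = new object();
    // Occurrences per exception class since the last emitted alert.
    private readonly Dictionary<Type, int> _suppressed = new Dictionary<Type, int>();

    public ExceptionAggregator(int batchSize, Action<string> emit)
    {
        if (batchSize < 1) throw new ArgumentOutOfRangeException(nameof(batchSize));
        _batchSize = batchSize;
        _emit = emit ?? throw new ArgumentNullException(nameof(emit));
    }

    public void Report(Exception ex)
    {
        string message = null;
        lock (_gate)
        {
            var type = ex.GetType();
            if (!_suppressed.TryGetValue(type, out var count))
            {
                // First sighting of this class: alert right away.
                _suppressed[type] = 0;
                message = $"{type.Name}: {ex.Message}";
            }
            else if (count + 1 >= _batchSize)
            {
                // Batch is full: alert once and reset the counter.
                _suppressed[type] = 0;
                message = $"{type.Name}: {ex.Message} ({count + 1} more like this since the last alert)";
            }
            else
            {
                _suppressed[type] = count + 1;
            }
        }
        // Emit outside the lock so a slow mail server does not block callers.
        if (message != null) _emit(message);
    }
}
```

With batchSize set to 20 the admin sees the first exception of each class, then one "20 more like this" mail for every 20 that follow.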

Check for similarities (timestamps can be masked using wildcards, ??:?? for example) and, for an initial period of time, let all the alerts be sent to you. Then check which ones occurred the most.

Say, there are 1000 exceptions of type A, 964 of type B, 120 of C and 7 of Types D - H.

That means: send an email to the sysadmin for every 100th exception of type A and B, every 10th of type C, and every other exception as it occurs.

Pro:
+ Accurate
+ Prevents System-Spam
+ Not much code to implement

Con:
- Needs time to develop a reliable statistic
- Important exceptions could be ignored accidentally
- Relies on humans, which will probably always fail
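The "every Nth exception" rule could be sketched roughly like this. It assumes exceptions are grouped by their .NET type and that the per-type rates come from the statistics gathered beforehand; SampledAlerter and its members are invented names, not an existing API.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sampler: notifies on every Nth exception of a configured
// type, and on every exception of any type that has no configured rate.
public sealed class SampledAlerter
{
    private readonly Dictionary<Type, int> _everyNth;
    private readonly Dictionary<Type, long> _seen = new Dictionary<Type, long>();
    private readonly object _gate = new object();

    public SampledAlerter(IDictionary<Type, int> everyNth)
    {
        _everyNth = new Dictionary<Type, int>(everyNth);
    }

    // Returns true if this occurrence should be mailed to the sysadmin.
    public bool ShouldNotify(Exception ex)
    {
        var type = ex.GetType();
        lock (_gate)
        {
            _seen.TryGetValue(type, out var count); // count stays 0 if unseen
            count++;
            _seen[type] = count;

            // Unconfigured (rare) types fall back to "alert every time".
            var n = _everyNth.TryGetValue(type, out var configured)
                ? Math.Max(1, configured)
                : 1;
            return count % n == 0;
        }
    }
}
```

Note that taken literally, "every 100th" means the first 99 exceptions of a frequent type produce no mail at all, which is one way the "important exceptions could be ignored" con shows up.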

Souza answered 16/11, 2010 at 2:57 Comment(0)

I've built monitoring apps that email admins before, and I'll sheepishly admit that I've been in your situation. The solution is to rate-limit your emails. Save the time of the last email sent somewhere, and build in a check to see if a minimum amount of time has passed since the last email before sending one (say, 10 minutes, or longer, up to you). That way the maximum number of emails your poor admin will get is <time issue has been going on> / <period>. In my previous sysadmin job this balanced our need to know that an issue was still going on against the need to have an inbox not bursting with 1000 emails an hour.

Carvalho answered 28/10, 2010 at 15:41 Comment(0)

We have something similar in one of our remote apps. It emails an intermediary mailbox with all the exceptions, and a script runs every hour that scans the mails and creates a summary email, which it fires off to our team mailbox (max 24 mails a day). It also saves the rest of the data to a local DB for future reference.

It's not bulletproof, but it's fairly quick and easy to set up.

Unknowable answered 12/11, 2010 at 0:35 Comment(0)

I know this has already been answered, but I feel it helpful to post this still.

Microsoft has been adding a wealth of information about cloud design patterns and architecture, ranging from things like microservices and service buses with message queues, to tinier details. It's all on the Microsoft Docs website, filed under Azure Architecture. The specific pattern that deals with this sort of problem is the Circuit Breaker pattern.

Use of this pattern does not completely solve the issue; there is still the problem of "how do we decide it's time to notify the operations folks?" One possible solution is to let the circuit breaker trip, and increment an internal counter to create a unique identifier for the trip (or something similar). Then, subsequent notifications could use this identifier. This is just an example; there are probably other ways you could reasonably accomplish this. The point is that I'd use the circuit breaker to handle the decision logic, by placing one wherever you need its services, and just chain something onto it to provide the notification behaviour you're describing. At the very least, though, you can avoid sending a deluge of e-mails.
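To make the trip-counter idea concrete, here is a deliberately small sketch. It is not the implementation from the Microsoft article and not a library API: NotifyingCircuitBreaker, TripId and the onTrip callback are hypothetical, the half-open state is simplified (after the open period, calls are allowed again and failures are counted from zero), and a rejected call throws InvalidOperationException purely as a placeholder.

```csharp
using System;

// Hypothetical minimal circuit breaker that notifies once per trip.
public sealed class NotifyingCircuitBreaker
{
    private readonly int _failureThreshold;
    private readonly TimeSpan _openDuration;
    private readonly Action<int, Exception> _onTrip;
    private readonly Func<DateTime> _clock;
    private readonly object _gate = new object();
    private int _consecutiveFailures;
    private DateTime? _openedAt;

    // Identifier of the most recent trip; 0 means it has never tripped.
    public int TripId { get; private set; }

    public NotifyingCircuitBreaker(int failureThreshold, TimeSpan openDuration,
        Action<int, Exception> onTrip, Func<DateTime> clock = null)
    {
        _failureThreshold = failureThreshold;
        _openDuration = openDuration;
        _onTrip = onTrip ?? throw new ArgumentNullException(nameof(onTrip));
        _clock = clock ?? (() => DateTime.UtcNow);
    }

    private bool IsOpen(DateTime now) =>
        _openedAt.HasValue && now - _openedAt.Value < _openDuration;

    public T Execute<T>(Func<T> action)
    {
        lock (_gate)
        {
            // While open, fail fast without touching the dependency.
            if (IsOpen(_clock()))
                throw new InvalidOperationException($"Circuit open (trip #{TripId}).");
        }

        try
        {
            var result = action();
            lock (_gate) { _consecutiveFailures = 0; _openedAt = null; }
            return result;
        }
        catch (Exception ex)
        {
            int? tripped = null;
            lock (_gate)
            {
                var now = _clock();
                // Ignore failures of calls that were already in flight when
                // another call tripped the breaker, so one outage = one alert.
                if (!IsOpen(now) && ++_consecutiveFailures >= _failureThreshold)
                {
                    _openedAt = now;
                    _consecutiveFailures = 0;
                    tripped = ++TripId;
                }
            }
            // One notification per trip, carrying the trip identifier.
            if (tripped.HasValue) _onTrip(tripped.Value, ex);
            throw;
        }
    }
}
```

The onTrip callback is where the e-mail would go; everything that fails while the circuit is open is rejected without calling the dependency and without generating another alert.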

Trumpery answered 15/7, 2017 at 7:44 Comment(0)
