Goal

Aiming to have a CloudWatch Alert triggered when a message from an SQS queue to a lambda function exceeds the maximum retries.

Problem

I presumed that this would be easy and the NumberOfMessagesReceived metric would reflect this. Those familiar with this will know that this is not the case.

Solutions

The 'Limbo' Solution

My quick and easy solution for this problem was to introduce a "Limbo" queue, which acts as the first DLQ and within seconds pushes the message to the final/actual DLQ. In the metrics this results in a spike in the "Limbo" queue's visible messages metric. So with an alert threshold of "> 0", an alert can be issued every time that queue receives a message.

However, the powers above me are not happy with adding a "Limbo" queue every time we want this functionality.

As far as I have been able to figure out there are some alternative methods but these seem worse than the Limbo Solution.

New Lambda Function

The first is to have a new lambda function that uses an SQS DLQ as its event source and generates the alert.
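A minimal sketch of what such a function could look like, assuming the DLQ is configured as the Lambda's event source. The alert payload shape and the `publish` parameter are hypothetical; in a real deployment `publish` might wrap an SNS publish call or a call into the centralised alerting system.

```python
import json


def build_alert(record):
    """Turn one SQS event record into an alert payload (hypothetical shape)."""
    return {
        "messageId": record["messageId"],
        "queueArn": record["eventSourceARN"],
        "body": record["body"],
    }


def handler(event, context, publish=print):
    """Emit one alert per message that arrived on the DLQ.

    `publish` is an assumption made for illustration: it defaults to print
    so the sketch runs without any AWS dependencies.
    """
    alerts = [build_alert(record) for record in event.get("Records", [])]
    for alert in alerts:
        publish(json.dumps(alert))
    return {"alerted": len(alerts)}
```

Note that once the function returns successfully the messages are deleted from the DLQ, so anything that should be kept for later inspection would have to be stored elsewhere.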

Lambda Runtime Interception

The second is to have logic inside the existing lambdas (that process SQS messages) read the number of times a message has been retried and, on the final attempt, generate the alert. This kind of removes the advantage of using a queue and a re-drive policy in the first place, and is an over-engineered solution.
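For illustration, a rough sketch of that check. SQS event records carry an `ApproximateReceiveCount` attribute (as a string); `MAX_RECEIVE_COUNT`, `process` and `alert` are hypothetical names, and the constant is assumed to match the `maxReceiveCount` in the queue's re-drive policy.

```python
MAX_RECEIVE_COUNT = 3  # assumption: must match maxReceiveCount in the re-drive policy


def is_final_attempt(record, max_receive_count=MAX_RECEIVE_COUNT):
    """True when this delivery is the last one before SQS moves the message to the DLQ."""
    receive_count = int(record["attributes"]["ApproximateReceiveCount"])
    return receive_count >= max_receive_count


def handler(event, context, process=lambda body: None, alert=print):
    """Process each message; alert only if the final attempt fails.

    `process` (a no-op placeholder here) and `alert` are hypothetical hooks.
    """
    for record in event["Records"]:
        try:
            process(record["body"])
        except Exception:
            if is_final_attempt(record):
                alert(f"message {record['messageId']} failed on its final attempt")
            raise  # re-raise so SQS still retries or dead-letters the message
```

The drawback described above is visible here: the retry limit now lives in two places, the re-drive policy and the function code.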

Metric Maths

The last alternative I can think of is to use some Metric Maths to look at the DLQ and calculate whether there has been an increase in the last X minutes.

These all seem like strange and overly complex solutions to what (I am convinced) must have a simple implementation. How do I create an alert every time a DLQ receives a message?

Ales answered 20/3, 2020 at 8:29 Comment(8)

Interesting question. +1. I think lambda would be the easiest and most obvious way of doing this. But I'm interested to see if there are some other solutions, probably more complicated, that would achieve the same outcome in a cheaper and easier to manage way than using lambda. – Unship 20/3, 2020 at 9:3

Why don't you use ApproximateNumberOfMessagesVisible for the metric in the DLQ? – Bucephalus 20/3, 2020 at 9:47

@GustavoTavares The issue is that CloudWatch alarms only perform actions upon a state change. With ApproximateNumberOfMessagesVisible, if an alarm is set up to trigger when the threshold passes 1, then something needs to remove the message from the queue to bring the metric back below the threshold so the alarm can go back into the OK state. If there is a way to use metric maths to see whether there has been an increase in ApproximateNumberOfMessagesVisible in the last 5 minutes, then that would be a suitable solution. – Ales 20/3, 2020 at 10:53

I'm not sure that I understood your problem... You are not worried about your first case? I mean: ApproximateNumberOfMessagesVisible on DLQ > 1. Are you worried that they increase over time, is that it? – Bucephalus 20/3, 2020 at 10:59

@GustavoTavares I added an image to the main body that might help clarify. The top line (orange) is the ApproximateNumberOfMessagesVisible value for a DLQ. When a message is added to the queue it goes up, and remains up. The second line is ApproximateNumberOfMessagesSent for my "Limbo" queue. As you can see it spikes when a message is sent to it, because it sends it to another queue almost straight away. This spike allows a CloudWatch alarm to change state when the threshold is crossed. Which means every time there is a spike an alert can be triggered. – Ales 20/3, 2020 at 12:15

What I didn't understand is why you can't create a CloudWatch Alarm Rule that changes state when your DLQ has ApproximateNumberOfMessagesVisible > 0? I know that for the DLQ the NumberOfMessagesReceived doesn't work! But if you create the rule with the number of visible messages it will alarm until you remove them... Do you want something different? – Bucephalus 20/3, 2020 at 13:56

@GustavoTavares It will alarm once and remain in the alarm state. I think the context that I failed to mention is why we need the Alarm. We're trying to feed a message into a centralised system (that covers the entire business) so the support chain is notified of an issue. In theory we should be happy with just one alert, and they can purge the DLQ once the issue has been resolved. In practice support is much more likely to make the correct choice if they see the alert has been triggered 200 times rather than once. I'm also not keen on relying on support to remember to purge the queue. – Ales 20/3, 2020 at 14:13

Let us continue this discussion in chat. – Ales 20/3, 2020 at 14:37

I came across this same issue and had success implementing it using Metric Math. CloudWatch has a RATE() function which:

"Returns the rate of change of the metric per second. This is calculated as the difference between the latest data point value and the previous data point value, divided by the time difference in seconds between the two values."

https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/using-metric-math.html

So I created an alarm which looks at the rate of change of the ApproximateNumberOfMessagesVisible metric on the dead-letter queue. It goes into alarm when the rate of change is greater than 0. Here is a CloudFormation template example for the alarm:

DeadletterAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties: 
    AlarmName: "DEADLETTER_ALARM"
    ComparisonOperator: GreaterThanThreshold
    EvaluationPeriods: 1
    TreatMissingData: missing
    Threshold: '0'      
    Metrics: 
      - Id: r1
        Expression: RATE(FILL(m1, 0))
        ReturnData: true
      - Id: m1          
        Label: VisibleAverage
        ReturnData: false
        MetricStat:
          Stat: Average
          Period: '300'
          Metric:
            MetricName: ApproximateNumberOfMessagesVisible
            Namespace: AWS/SQS
            Dimensions:
              - Name: QueueName
                Value: "Deadletter_queue_name"
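To see why an alarm on RATE(FILL(m1, 0)) changes state on each increase rather than staying in alarm, here is a small local simulation of the quoted definition. It is only a sketch: `fill` and `rate` are hypothetical helpers that mimic the described behaviour, not CloudWatch APIs, and the datapoints are made up.

```python
def fill(values, default=0.0):
    """Replace missing datapoints (None) with a default, like FILL(m1, 0)."""
    return [default if v is None else v for v in values]


def rate(values, period_seconds):
    """Per-second change between consecutive datapoints.

    The first datapoint has no predecessor, so the result is one element shorter.
    """
    return [(cur - prev) / period_seconds for prev, cur in zip(values, values[1:])]


# Made-up ApproximateNumberOfMessagesVisible datapoints, one per 300 s period.
visible = [0, 0, 1, 1, None, 3, 2]

rates = rate(fill(visible), 300)
breaching = [r > 0 for r in rates]  # alarm condition: rate greater than 0
```

In this sketch the condition is only met in the periods where the count rises and is not met while the count stays flat, so the alarm can return to OK without anyone purging the queue. Note also that a missing datapoint filled with 0 and followed by a non-zero value looks like an increase.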

Trifurcate answered 28/9, 2020 at 23:22 Comment(0)

One other way to accomplish this is to alarm on ApproximateNumberOfMessagesDelayed. Then you just need to set a delay on your DLQ. So it could look something like this:

MyDLQAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmName: MyDLQAlarm
    AlarmDescription: "Alarm when we have 1 or more failed messages in 10 minutes for MyQueue."
    Namespace: "AWS/SQS"
    MetricName: "ApproximateNumberOfMessagesDelayed"
    Dimensions:
      - Name: "QueueName"
        Value:
          # Point at the DLQ (MyQueueDLQ), since that is the queue with the delay set
          Fn::GetAtt:
            - "MyQueueDLQ"
            - "QueueName"
    Statistic: "Sum"
    Period: 300
    DatapointsToAlarm: 1
    EvaluationPeriods: 2
    Threshold: 1
    ComparisonOperator: "GreaterThanOrEqualToThreshold"
    AlarmActions:
      - Ref: "SNSTopic"

Then your DLQ can look like:

MyQueueDLQ:
  Type: AWS::SQS::Queue
  Properties:
    QueueName: MyQueueDLQ
    MessageRetentionPeriod: 1209600
    DelaySeconds: 60

Koerner answered 5/12, 2020 at 22:34 Comment(0)