Configure an SQS dead-letter queue to raise a CloudWatch alarm on receiving a message

I am working with a dead-letter queue (DLQ) in Amazon SQS. Whenever the queue receives a new message, I want it to raise a CloudWatch alarm. I configured an alarm on the queue's number_of_messages_sent metric, but this metric doesn't work as expected for dead-letter queues, as mentioned in the Amazon SQS Dead-Letter Queues - Amazon Simple Queue Service documentation.

Some suggestions were to use number_of_messages_visible instead, but I am not sure how to configure this in an alarm. If I alarm when the value of this metric is > 0, that is not the same as getting a new message in the queue: if an old message is still there, the metric value will always be > 0. I could use some kind of math expression to get the delta in this metric over a defined period (say, a minute), but I am looking for a better solution.

Unmanned answered 13/2, 2020 at 15:26 Comment(4)
What is the source of the DLQ? In other words, what is failing that results in something ending up in the DLQ? Is it a Lambda? An SNS delivery? - Sandblast
I have a Java application that continuously polls data and processes it. If an exception is raised during processing, the message is added to the DLQ. The code that adds the message to the DLQ is also in my application. - Unmanned
So you are "manually" adding things to your DLQ? It's not an automated DLQ, like on a Lambda? - Sandblast
Consider simply having an alarm that is in alarm while messages are in your DLQ, rather than when they are received: simply alarm on ApproximateNumberOfMessagesVisible. From an operational perspective, you have a problem as long as messages are in your DLQ; the alarm should only move from ALARM to OK once the DLQ is empty and you've dealt with all the DLQ messages. This is especially true because you have limited time to deal with DLQ messages: the maximum retention period for a queue is 14 days. - Baese

I used the metric math function RATE to trigger an alarm whenever a message arrives in the dead-letter queue.

Select two metrics, ApproximateNumberOfMessagesVisible and ApproximateNumberOfMessagesNotVisible, for your dead-letter queue.

Configure the metric expression as RATE(m1+m2), set the threshold to 0, and select GreaterThanThreshold as the comparison operator.

m1+m2 is the total number of messages in the queue at a given time. Whenever a new message arrives in the queue, the rate of this expression rises above zero. That's how it works.
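
To illustrate why the expression only breaches the threshold when the total grows, here is a minimal sketch that approximates RATE locally in plain Python. It assumes RATE is the per-second change between consecutive datapoints; the sample values and the helper name `rate` are made up for illustration and are not part of any AWS SDK.

```python
def rate(datapoints):
    """Approximate CloudWatch RATE(): per-second change between consecutive datapoints.

    `datapoints` is a list of (timestamp_seconds, value) pairs in time order.
    """
    rates = []
    for (t_prev, v_prev), (t_cur, v_cur) in zip(datapoints, datapoints[1:]):
        rates.append((t_cur, (v_cur - v_prev) / (t_cur - t_prev)))
    return rates


# Hypothetical one-minute samples of m1 + m2 (visible + not visible) for a DLQ.
total_messages = [(0, 0), (60, 0), (120, 1), (180, 1), (240, 3), (300, 2)]

# The alarm condition from the answer: RATE(m1+m2) > 0.
breaching = [t for t, r in rate(total_messages) if r > 0]
print(breaching)
```

In this sketch the condition holds only at the datapoints where the total increased. A steady backlog gives a rate of 0 and a message being removed gives a negative rate, so neither breaches a GreaterThanThreshold comparison against 0.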

Souvaine answered 12/11, 2020 at 10:42 Comment(1)
This was the easiest and best solution for me. However, I didn't need the 'ApproximateNumberOfMessagesNotVisible' metric. Just m1='ApproximateNumberOfMessagesVisible' and RATE(m1) worked like a charm. - Okhotsk

I struggled with the same problem, and the answer for me was originally to use NumberOfMessagesSent instead. Then I could set my criteria for new messages that came in during my configured period of time. Here is what worked for me in CloudFormation.

Note that the alarm does not fire again for individual messages while it stays in the ALARM state because of constant failures. You can set up another alarm to catch those, for example one that alarms when 100 errors occur in 1 hour, using the same method.

Updated: Because the NumberOfMessagesReceived and NumberOfMessagesSent metrics depend on how the message is queued, I devised a new solution for our needs: it uses the ApproximateNumberOfMessagesDelayed metric after adding a delay to the DLQ settings. If you are adding the messages to the queue manually, NumberOfMessagesReceived will work. Otherwise, use ApproximateNumberOfMessagesDelayed after setting up a delay.

MyDeadLetterQueue:
    Type: AWS::SQS::Queue
    Properties:
      MessageRetentionPeriod: 1209600  # 14 days
      DelaySeconds: 60 #for alarms

DLQthresholdAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmDescription: "Alarm dlq messages when we have 1 or more failed messages in 10 minutes"
      Namespace: "AWS/SQS"
      MetricName: "ApproximateNumberOfMessagesDelayed"
      Dimensions:
        - Name: "QueueName"
          Value:
            Fn::GetAtt:
              - "MyDeadLetterQueue"
              - "QueueName"
      Statistic: "Sum"
      Period: 300  
      DatapointsToAlarm: 1 
      EvaluationPeriods: 2       
      Threshold: 1
      ComparisonOperator: "GreaterThanOrEqualToThreshold"
      AlarmActions:
        - !Ref MyAlarmTopic
Krutz answered 29/5, 2020 at 23:57 Comment(4)
Using ApproximateNumberOfMessagesDelayed didn't actually work for me. I configured my SQS queue to send messages to my DLQ after a failed delivery. Then I looked at the graph in the CloudWatch dashboard, used NumberOfMessagesReceived to trigger alarms, and got it to work. Can you update your answer? - Benue
There are a lot of variables, including how the message is queued. In our case we had queuing happening from error handling in a state machine as well as through Lambdas. I stand by my update to my answer, because it handles more scenarios that will cause your approach to fail. The delay must be configured properly on the queue to use ApproximateNumberOfMessagesDelayed. - Krutz
@Krutz Here, will AlarmActions send an event to SNS, which in turn sends an email? - Breechblock
@Karthik, yes, because our subscription to the topic is configured to send an email. - Krutz

We had the same issue and solved it by using two metrics and creating a math expression.

    ConsentQueue:
        Type: AWS::SQS::Queue
        Properties:
            QueueName: "queue"
            RedrivePolicy:
                deadLetterTargetArn:
                    Fn::GetAtt:
                        - "DLQ"
                        - "Arn"
                maxReceiveCount: 3 # after 3 tries the event will go to DLQ
            VisibilityTimeout: 65
    DLQ:
        Type: AWS::SQS::Queue
        Properties:
            QueueName: "DLQ"

    DLQAlarm:
        Type: AWS::CloudWatch::Alarm
        Properties:
            AlarmDescription: "SQS failed"
            AlarmName: "SQSAlarm"
            Metrics:
            - Expression: "m2-m1"
              Id: "e1"
              Label: "ChangeInAmountVisible"
              ReturnData: true
            - Id: "m1"
              Label: "MessagesVisibleMin"
              MetricStat:
                  Metric:
                      Dimensions:
                      - Name: QueueName
                        Value: !GetAtt DLQ.QueueName
                      MetricName: ApproximateNumberOfMessagesVisible
                      Namespace: "AWS/SQS"
                  Period: 300 # evaluate minimum over a period of 5 min
                  Stat: Minimum
                  Unit: Count
              ReturnData: false
            - Id: "m2"
              Label: "MessagesVisibleMax"
              MetricStat:
                  Metric:
                      Dimensions:
                      - Name: QueueName
                        Value: !GetAtt DLQ.QueueName
                      MetricName: ApproximateNumberOfMessagesVisible
                      Namespace: "AWS/SQS"
                  Period: 300 # evaluate maximum over period of 5 min
                  Stat: Maximum
                  Unit: Count
              ReturnData: false
            ComparisonOperator: GreaterThanOrEqualToThreshold
            Threshold: 1
            DatapointsToAlarm: 1
            EvaluationPeriods: 1

The period is important, so that the minimum and maximum are evaluated over a longer period. [Figure: AWS Math Expression Graph]
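
As a rough illustration of what the m2-m1 expression evaluates, the sketch below groups samples into windows and compares max - min against the threshold of 1. It is a local simulation in plain Python, not CloudWatch itself; the per-minute sample values and the helper name `max_minus_min` are hypothetical.

```python
def max_minus_min(samples, period):
    """Split per-minute samples into windows of `period` samples; return max - min per window."""
    windows = [samples[i:i + period] for i in range(0, len(samples), period)]
    return [max(w) - min(w) for w in windows]


# Hypothetical per-minute ApproximateNumberOfMessagesVisible values over 15 minutes.
visible = [0, 0, 0, 0, 0,
           0, 1, 1, 1, 1,
           1, 1, 1, 1, 1]

deltas = max_minus_min(visible, period=5)  # one value per 5-minute window
in_alarm = [d >= 1 for d in deltas]        # GreaterThanOrEqualToThreshold, Threshold: 1
print(deltas, in_alarm)
```

Only the window in which the count changed breaches the threshold, and an unchanged backlog in the last window does not. Because max - min is never negative, a window in which messages are removed would also reach the threshold in this sketch.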

Flied answered 22/12, 2020 at 7:21 Comment(0)

A working Terraform example of the RATE(m1+m2) approach mentioned above:

resource "aws_cloudwatch_metric_alarm" "dlq_alarm" {
  alarm_name                = "alarm_name"
  comparison_operator       = "GreaterThanThreshold"
  evaluation_periods        = 1
  threshold                 = 0
  alarm_description         = "desc"
  insufficient_data_actions = []
  alarm_actions             = [aws_sns_topic.sns.arn]

  metric_query {
    id          = "e1"
    expression  = "RATE(m2+m1)"
    label       = "Error Rate"
    return_data = true
  }

  metric_query {
    id = "m1"

    metric {
      metric_name = "ApproximateNumberOfMessagesVisible"
      namespace   = "AWS/SQS"
      period      = 60
      stat        = "Sum"
      unit        = "Count"

      dimensions = {
        QueueName = aws_sqs_queue.sqs-dlq.name
      }
    }
  }

  metric_query {
    id = "m2"

    metric {
      metric_name = "ApproximateNumberOfMessagesNotVisible"
      namespace   = "AWS/SQS"
      period      = 60
      stat        = "Sum"
      unit        = "Count"

      dimensions = {
        QueueName = aws_sqs_queue.sqs-dlq.name
      }
    }
  }
}
Expiatory answered 3/3, 2022 at 17:27 Comment(0)

It is difficult to achieve exactly what the question asks. If the purpose of the CloudWatch alarm is to send an email or notify users when a message arrives in the DLQ, you can do something similar with SQS, SNS, and Lambda. Whenever you receive an email, you can then check in CloudWatch how the number of DLQ messages grows over time.

  1. Create an SQS DLQ for an existing queue.
  2. Create an SNS topic and add an email subscription to it.
  3. Create a small Lambda function that listens to the SQS queue for incoming messages and publishes each new message to SNS. Because the SNS topic has an email subscription, you get an email whenever a new message arrives in the queue. The trigger for the Lambda function is the SQS queue, with a batch size of 1.
#!/usr/bin/python3
import boto3

# Placeholder: replace with the ARN of your SNS topic.
TOPIC_ARN = "YOUR_TOPIC_ARN"

# Create the SNS client once so it is reused across invocations.
sns = boto3.client('sns')


def lambda_handler(event, context):
    # Forward the body of every SQS record in the batch to SNS.
    for record in event['Records']:
        send_request(record["body"])


def send_request(body):
    # Publish the message to the specified SNS topic
    response = sns.publish(
        TopicArn=TOPIC_ARN,
        Message=body,
    )

    # Print out the response
    print(response)
Deflexed answered 30/5, 2020 at 7:58 Comment(0)

I've encountered the same issue with CloudWatch alarms not firing when queue entries automatically flow into a DLQ, and I believe I have come up with a solution.

You need to set up the following:

  • Choose a time period; I used 5 minutes.
  • Add a metric via the SQS collection for the DLQ you need, and select "ApproximateNumberOfMessagesVisible". Set the statistic to Maximum.
  • Duplicate the above metric, and set the statistic to Minimum.
  • Add a new empty math expression with the details: (the ID of the maximum metric) - (the ID of the minimum metric).
  • Make sure you tick only the new expression you created above, then click "Select Metric".

This should now periodically check the difference in the number of entries in the DLQ, regardless of how they got there, so we can get past problematic metrics like NumberOfMessagesSent or NumberOfMessagesReceived.

UPDATE: I just realised that this is the exact solution that Lucasz mentioned above, so consider this a confirmation that it works :)

Bulger answered 21/1, 2021 at 2:11 Comment(0)

You can create a Lambda function with your DLQ as its event source. From the Lambda function, you can post custom metric data to CloudWatch. The alarm is triggered when your data meets its conditions.

Use this reference to configure your Lambda function so that it is triggered when a message is sent to your DLQ: Using AWS Lambda with Amazon SQS - AWS Lambda

Here is a nice explanation with code that shows how to post custom metrics from Lambda to CloudWatch: Sending CloudWatch Custom Metrics From Lambda With Code Examples

Once the metrics are posted, the CloudWatch alarm triggers when the posted values match its conditions.
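
A minimal sketch of such a handler in Python might look like the following. The namespace `MyApp/DLQ`, the metric name `DLQMessagesReceived`, and the optional `cloudwatch` parameter (added so the handler can be exercised without AWS) are assumptions for illustration; only the boto3 `put_metric_data` call is the real API.

```python
def lambda_handler(event, context, cloudwatch=None):
    """Publish one custom metric datapoint counting the DLQ messages in this batch."""
    if cloudwatch is None:
        import boto3  # imported lazily so the handler can be tried locally with a stub client
        cloudwatch = boto3.client("cloudwatch")

    # An SQS event source delivers the messages under the "Records" key.
    count = len(event.get("Records", []))

    cloudwatch.put_metric_data(
        Namespace="MyApp/DLQ",  # hypothetical namespace
        MetricData=[
            {
                "MetricName": "DLQMessagesReceived",  # hypothetical metric name
                "Value": count,
                "Unit": "Count",
            }
        ],
    )
    return count
```

An alarm on the Sum of this custom metric with a threshold of 1 would then fire for each period in which the function ran. Keep in mind that messages a Lambda event source processes successfully are removed from the queue, so the handler would need to store them somewhere if they must be kept.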

Patman answered 13/2, 2020 at 20:35 Comment(3)
The downside to this approach is that the messages in the DLQ are effectively lost. You'd want to store them somewhere, which creates another potential for failure, though a lower one, for sure. - Sandblast
Actually, I was looking for a solution that uses some metric of the DLQ itself to trigger the alarm. - Unmanned
Perhaps saving to a CloudWatch log group would suffice. - Shaunteshave
