How to handle Dead Letter Queues in Amazon SQS?

I am using an event-driven architecture for one of my projects. Amazon Simple Queue Service (SQS) supports failure handling through dead-letter queues (DLQs).

If a message is not handled successfully, the consumer never reaches the point where it deletes the message from the queue, so the message is delivered again. A one-time failure is handled gracefully this way. If the message itself is erroneous, however, it keeps failing and eventually makes its way into the DLQ.

My question is: what should happen to the messages in a DLQ afterwards? There are thousands of them stuck in the DLQ. How are they supposed to be handled?

I would love to hear some real-life examples and the engineering processes that organizations have in place for this.

Oleoresin answered 23/9, 2019 at 16:13 Comment(0)

"It depends!"

Messages are sent to the Dead Letter Queue because something didn't happen as expected. It might be due to a data problem, a timeout or a coding error.

You should:

  • Start examining the messages that went to the Dead Letter Queue
  • Try to reprocess the messages to determine the underlying cause of the failure (but sometimes it is a random failure that you cannot reproduce)
  • Once a cause is found, update the system to handle that particular case, then move on to the next cause

Common causes can be database locks, network errors, programming errors and corrupt data.
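A first pass at "examining the messages" could look like the sketch below. It reads a sample from the DLQ without deleting anything and groups the bodies by a rough shape signature, so the most common failure pattern stands out. The function names, the signature scheme and the assumption that bodies are usually JSON are all illustrative and not from the original answer. `sqs` is expected to be a boto3 SQS client (for example `boto3.client("sqs")`); it is passed in so the code can be tried against a stub.

```python
import json
from collections import Counter


def sample_dlq(sqs, dlq_url, max_messages=100):
    """Read up to max_messages from the DLQ without deleting them."""
    seen = {}
    while len(seen) < max_messages:
        resp = sqs.receive_message(
            QueueUrl=dlq_url,
            MaxNumberOfMessages=10,
            WaitTimeSeconds=1,
            # Hide sampled messages briefly so the next call returns new ones.
            VisibilityTimeout=60,
            AttributeNames=["ApproximateReceiveCount", "SentTimestamp"],
        )
        batch = resp.get("Messages", [])
        before = len(seen)
        for msg in batch:
            seen[msg["MessageId"]] = msg
        if len(seen) == before:
            break  # queue drained, or only already-seen messages came back
    return list(seen.values())[:max_messages]


def signature(body):
    """Hypothetical grouping key: the rough shape of the message body."""
    try:
        payload = json.loads(body)
    except ValueError:
        return "not-json"
    if isinstance(payload, dict):
        return "keys:" + ",".join(sorted(payload))
    return "json-" + type(payload).__name__


def summarize(messages):
    """Count sampled messages per signature."""
    return Counter(signature(m["Body"]) for m in messages)
```

Calling `summarize(sample_dlq(sqs, dlq_url)).most_common(5)` gives a quick view of which kinds of message dominate the queue. The sampled messages become visible again once the visibility timeout expires.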

It's probably a good idea to set up some sort of monitoring on the Dead Letter Queue so that somebody investigates as soon as messages start arriving, rather than letting them accumulate into the thousands.
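One way to get that kind of monitoring is a CloudWatch alarm on the DLQ's `ApproximateNumberOfMessagesVisible` metric. The sketch below is only an illustration: the alarm name, the five-minute period and the SNS topic used for notification are assumptions, and `cloudwatch` is expected to be a boto3 CloudWatch client passed in by the caller.

```python
def create_dlq_alarm(cloudwatch, dlq_name, sns_topic_arn, threshold=0):
    """Raise an alarm when the DLQ holds more than `threshold` messages."""
    alarm = {
        "AlarmName": f"{dlq_name}-not-empty",  # hypothetical naming scheme
        "Namespace": "AWS/SQS",
        "MetricName": "ApproximateNumberOfMessagesVisible",
        "Dimensions": [{"Name": "QueueName", "Value": dlq_name}],
        "Statistic": "Maximum",
        "Period": 300,  # evaluate over five-minute windows (assumed)
        "EvaluationPeriods": 1,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
        "AlarmActions": [sns_topic_arn],  # notify whoever is on call
    }
    cloudwatch.put_metric_alarm(**alarm)
    return alarm
```

With the default threshold of 0 the alarm fires on the first dead-lettered message, which suits queues where any failure deserves a look. A noisier system may want a higher threshold.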

Reptilian answered 23/9, 2019 at 18:13 Comment(0)

The messages moved to the DLQ are, as you said, considered erroneous.

If the messages failed because of a bug in the code or a similar defect, you should redrive them from the DLQ back to the source queue once you have fixed the bug, so that they get another chance to be processed.
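A redrive can be as simple as the loop sketched here: receive from the DLQ, send to the source queue, then delete from the DLQ. This is a minimal illustration that assumes standard (non-FIFO) queues and a boto3 SQS client passed in as `sqs`; the function name and the `max_messages` cap are made up for the example. SQS also offers a built-in DLQ redrive (in the console and through the `StartMessageMoveTask` API), which may be preferable to hand-written code.

```python
def redrive(sqs, dlq_url, source_url, max_messages=1000):
    """Move up to max_messages from the DLQ back to the source queue."""
    moved = 0
    while moved < max_messages:
        resp = sqs.receive_message(
            QueueUrl=dlq_url,
            MaxNumberOfMessages=min(10, max_messages - moved),
            WaitTimeSeconds=1,
            MessageAttributeNames=["All"],
        )
        batch = resp.get("Messages", [])
        if not batch:
            break  # DLQ is empty (or nothing is visible right now)
        for msg in batch:
            send_args = {"QueueUrl": source_url, "MessageBody": msg["Body"]}
            if msg.get("MessageAttributes"):
                send_args["MessageAttributes"] = msg["MessageAttributes"]
            sqs.send_message(**send_args)
            # Delete only after the send succeeded, so a crash cannot lose data.
            sqs.delete_message(
                QueueUrl=dlq_url, ReceiptHandle=msg["ReceiptHandle"]
            )
            moved += 1
    return moved
```

Because the delete follows the send, a failure between the two can duplicate a message but not drop it, so the consumer should tolerate duplicates.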

It is very unlikely that messages with only temporary problems end up in the DLQ if you have configured maxReceiveCount as 3 or more on your source queue. Transient failures are mostly absorbed by this retry configuration.
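For reference, maxReceiveCount is part of the source queue's `RedrivePolicy` attribute. A minimal sketch of setting it, assuming a boto3 SQS client passed in as `sqs` and a DLQ that already exists (the function name is illustrative):

```python
import json


def attach_dlq(sqs, source_url, dlq_arn, max_receive_count=3):
    """Send messages to the DLQ after max_receive_count failed receives."""
    policy = {
        "deadLetterTargetArn": dlq_arn,
        "maxReceiveCount": str(max_receive_count),
    }
    sqs.set_queue_attributes(
        QueueUrl=source_url,
        # RedrivePolicy is passed as a JSON-encoded string.
        Attributes={"RedrivePolicy": json.dumps(policy)},
    )
    return policy
```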

In the end, a DLQ is just an ordinary SQS queue, which retains messages for at most 14 days (that is the configurable maximum; the default retention period is 4 days). Even if thousands of messages pile up there, they will eventually expire. So there are two options:

  • The messages in the DLQ are "really" erroneous. Look at the metrics, the messages and the logs to identify the root cause. If there is no bug to fix, the DLQ only holds data you do not need, so there is nothing wrong with losing it when the retention period ends. If there is a bug, fix it and simply redrive the messages from the DLQ to the source queue.
  • You don't want to dig through the messages to find out why they failed, and you only want to persist the message data for historical reasons (god knows why). You can create a Lambda function that polls the messages and persists them in a target database of your choice.
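For the second option, the Lambda function might look roughly like the sketch below, assuming the DLQ is configured as the function's SQS event source. The row layout, the helper names and the `save_row` callback (which stands in for whatever database client you use) are hypothetical.

```python
def to_archive_row(record):
    """Flatten one SQS event record into a row for long-term storage."""
    attrs = record.get("attributes", {})
    return {
        "message_id": record["messageId"],
        "body": record["body"],
        "receive_count": int(attrs.get("ApproximateReceiveCount", 0)),
        "sent_timestamp_ms": int(attrs.get("SentTimestamp", 0)),
    }


def make_handler(save_row):
    """Build a Lambda handler that archives every record via save_row."""
    def handler(event, context):
        rows = [to_archive_row(r) for r in event.get("Records", [])]
        for row in rows:
            save_row(row)  # hypothetical: e.g. an insert into your database
        return {"archived": len(rows)}
    return handler
```

With an SQS event source mapping, Lambda deletes the batch from the queue when the handler returns without raising, so the DLQ drains as messages are archived. If `save_row` raises, the batch becomes visible again and is retried.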
Zarzuela answered 7/2, 2022 at 19:20 Comment(0)
