Amazon AWS, messages from SQS queue delivered multiple times

I have a worker running on Elastic Beanstalk which accepts POST requests via messages from a queue. Each message triggers a long operation which takes several minutes (sometimes even hours), and it is crucial that this operation is executed only once.

The problem is that when I log in to the worker console to see the process, the message seems to be delivered each minute over and over again (the method triggered by receiving the requests gets called each minute). How can I get rid of this behavior?

I read the documentation and set the visibility timeout to the maximum value (12 hours) for both the service queue and the dead letter queue. However, this does not help at all.

When I send the message, it is displayed as "in flight" (which is the expected behavior, I think, since the queue waits to receive a delete request or some kind of answer, which is only provided at the end of the long operation).

Could someone give me a hint about what is going on in this scenario? I probably missed some important detail in the configuration...

EDIT: it seems that the message is redelivered every minute for as long as it is "in flight". Once I finish the process, the message finally disappears.

Constipation answered 17/6, 2015 at 7:27 Comment(3)
If it is "crucial" not to work a job more than once, you need to keep external track of what jobs have been worked, because duplicate deliveries are very unlikely, yet possible. What you describe, however, sounds like something else. I can't find it in the docs, but iirc, changing the queue's visibility timeout only affects messages received after the change. Did you consider that possibility? - Colored
Read about Visibility Timeout at docs.aws.amazon.com/AWSSimpleQueueService/latest/… - Jaine
If you set the visibility timeout to 12 hours, the message will be delivered only once per 12 hours. I suspect you set the visibility timeout only in the worker configuration and not in the actual queue configuration. - Dysthymia
R
9

There's an extra layer of complexity here because you're not polling the SQS queue directly; there's a worker process deployed by Elastic Beanstalk called sqsd that's polling the queue on your behalf, POSTing any messages it gets to your application, and deleting them from the queue when you respond with a 200.

The VisibilityTimeout setting on the queue controls how long the queue waits after delivering a message to the consumer (in this case, sqsd) before it assumes something has gone wrong and re-delivers the message to someone else. sqsd has a similar concept (called "InactivityTimeout") that controls how long it waits after POSTing to your application before it assumes something has gone wrong and retries. You'll need to configure this to also be high enough that sqsd doesn't re-send the request to your application before you finish processing it. I've seen reports of another "ProxyTimeout" setting that might need to be adjusted as well.
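As a rough illustration of adjusting both timeouts together, the sketch below builds option settings for the worker daemon. It assumes the `aws:elasticbeanstalk:sqsd` namespace with its `InactivityTimeout` and `VisibilityTimeout` options; the function name, the sanity check, and the environment name in the comment are hypothetical and not part of the original answer.

```python
SQSD_NAMESPACE = "aws:elasticbeanstalk:sqsd"


def sqsd_timeout_settings(inactivity_s: int, visibility_s: int) -> list:
    """Build Elastic Beanstalk option settings for the sqsd timeouts.

    The check below is this sketch's own sanity rule (an assumption):
    sqsd should not wait on the application longer than SQS keeps the
    message hidden, otherwise SQS may hand the message out again.
    """
    if inactivity_s > visibility_s:
        raise ValueError("InactivityTimeout should not exceed VisibilityTimeout")
    return [
        {"Namespace": SQSD_NAMESPACE,
         "OptionName": "InactivityTimeout",
         "Value": str(inactivity_s)},
        {"Namespace": SQSD_NAMESPACE,
         "OptionName": "VisibilityTimeout",
         "Value": str(visibility_s)},
    ]


if __name__ == "__main__":
    settings = sqsd_timeout_settings(3 * 3600, 4 * 3600)
    for option in settings:
        print(option)
    # Applying the settings needs boto3 and AWS credentials, for example:
    #   import boto3
    #   boto3.client("elasticbeanstalk").update_environment(
    #       EnvironmentName="my-worker-env",  # placeholder name
    #       OptionSettings=settings)
```

The same two options can also be set in the worker configuration of the Elastic Beanstalk console; the allowed value ranges should be checked against the current documentation.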

More generally, keep in mind that exactly-once delivery isn't physically possible to guarantee in a distributed system - even if you get all the timeouts right so it works correctly most of the time, there's always the possibility that you'll crash after completing the operation but before you can tell SQS about it, and the message will be re-delivered to someone else. The closest you can get is to make sure that if a message gets delivered twice, that the result is exactly the same - for example, by having your processing logic check whether the thing it's about to do has already been done, and if so just immediately returning a 200.
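The "check whether it has already been done" idea could look roughly like the sketch below. It assumes sqsd passes the SQS message ID in the `X-Aws-Sqsd-Msgid` request header and uses a local SQLite table as the record of finished jobs; `open_store`, `handle_post`, and the table layout are hypothetical names chosen for illustration. A real worker with several instances would need a shared store instead.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the table that records finished message IDs."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS done (msg_id TEXT PRIMARY KEY)")
    return conn


def handle_post(conn, headers, body, do_work):
    """Return the HTTP status the worker should answer sqsd with."""
    msg_id = headers.get("X-Aws-Sqsd-Msgid")
    if msg_id is None:
        return 400  # not a request coming from sqsd
    row = conn.execute(
        "SELECT 1 FROM done WHERE msg_id = ?", (msg_id,)
    ).fetchone()
    if row is not None:
        return 200  # already processed: acknowledge without redoing the work
    do_work(body)
    # Record completion only after the work succeeded.
    conn.execute("INSERT OR IGNORE INTO done (msg_id) VALUES (?)", (msg_id,))
    conn.commit()
    return 200


if __name__ == "__main__":
    store = open_store()
    hdrs = {"X-Aws-Sqsd-Msgid": "example-id"}
    print(handle_post(store, hdrs, "payload", print))  # does the work
    print(handle_post(store, hdrs, "payload", print))  # skips the work
```

This only protects against a redelivery that arrives after the job has finished. It does not stop a second copy that arrives while the first is still running; that case still depends on getting the timeouts right.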

Ramon answered 1/7, 2015 at 5:45 Comment(0)
G
13

It seems like you forgot to delete the message after processing it.

After you dequeue a message, it is necessary to delete it. If you don't delete it explicitly, SQS assumes that you dequeued the message and failed to process it, so it will appear on the queue again.

There are two timeout parameters that you can set in SQS, and both are important:

  1. WaitTimeSeconds

  2. VisibilityTimeout

1) WaitTimeSeconds = 10 means that your call to SQS returns immediately if there are messages in the queue, BUT if there are no messages in the queue, your call blocks until a message arrives, for a maximum of 10 seconds.

2) Once you have dequeued a message, VisibilityTimeout = 60 states that you have 60 seconds to process that message; otherwise it will appear again in the queue. If you finish processing the message within those 60 seconds, you MUST send a deleteMessage request. If you fail to send that deleteMessage request within 60 seconds, the message will reappear in the queue.

If you send the deleteMessage request only after the 60 seconds have passed, the message may already have become visible again and been delivered to another consumer.

You have to write your code in a way that if your process fails, it will naturally fail to send the deleteMessage request, so that the message will naturally appear again in SQS.
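A minimal sketch of that receive, process, delete pattern, assuming a boto3-style SQS client is passed in as `sqs` (the function name `poll_once` and the `process` callback are hypothetical):

```python
def poll_once(sqs, queue_url, process):
    """Receive at most one message, process it, delete it on success.

    Returns the number of messages that were processed and deleted.
    """
    resp = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=1,
        WaitTimeSeconds=10,     # long poll: block up to 10 s if the queue is empty
        VisibilityTimeout=60,   # message stays hidden for 60 s after this receive
    )
    handled = 0
    for msg in resp.get("Messages", []):
        try:
            process(msg["Body"])
        except Exception:
            # No delete on failure: the message becomes visible again
            # once the visibility timeout expires and will be retried.
            continue
        sqs.delete_message(
            QueueUrl=queue_url,
            ReceiptHandle=msg["ReceiptHandle"],
        )
        handled += 1
    return handled


# Usage with real AWS credentials (the queue URL is a placeholder):
#   import boto3
#   poll_once(boto3.client("sqs"), "https://sqs.../my-queue", print)
```

Because the delete call comes after `process` returns, a crash or exception during processing leaves the message on the queue without any extra code.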

You can find detailed info about 1) and 2) here:

http://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/MessageLifecycle.html

http://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-long-polling.html

http://boto.readthedocs.org/en/latest/ref/sqs.html#boto.sqs.queue.Queue.get_messages

Gavra answered 17/6, 2015 at 17:32 Comment(2)
That is correct. However, if I delete the message immediately after receiving it and the process then fails, the processing will not be triggered again since the message is no longer in the queue. Ideally, I would like to dequeue the message, process it, and only when I am finished inform the queue that I was successful and delete it. And I would like the queue to stop redelivering the message over and over again during the processing. The queue seems to be ignoring the visibility timeout completely... - Constipation
Then you have to increase the visibility timeout. I've edited my answer to include more information. - Gavra
T
1

With SQS you have to call the delete API yourself to remove the message from the queue. Setting a high timeout value only ensures that no other poller will receive the same message for that amount of time.

You have two options:

  1. Delete the message as soon as you read it, and then start the downstream process.

  2. Read the message, set its visibility timeout to the expected duration of your process, and delete the message as the last step of that process.

Truman answered 18/6, 2015 at 1:41 Comment(3)
Actually, what I discovered is that by sending a 200 OK status, the message gets deleted from the queue. But what happens if I do not want to do this? I want to keep the message in the queue until the whole process is finished and respond only after that. That way, I can ensure that it will be processed eventually even if the instance fails. The SQS queue, however, ignores the visibility timeout completely and redelivers the message over and over again, even if I respond with some other status (like 202). Another thing is that I do not know how long the process will take. - Constipation
We regularly use SQS in our application and we have never seen it ignore the visibility timeout completely. Can you please share your code? - Truman
Also, if you do not know how long the process will take to complete, then you need to have that process keep increasing the visibility timeout of the message (using the message id and receipt handle) until the process is completed. - Truman
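The "keep increasing the visibility timeout" suggestion might be sketched as a background heartbeat, as below. This assumes you poll SQS yourself with a boto3-style client and therefore hold the receipt handle; behind Elastic Beanstalk's sqsd the application may not have it. The function name and the default intervals are hypothetical.

```python
import threading


def process_with_heartbeat(sqs, queue_url, receipt_handle, work,
                           extend_by_s=300, interval_s=120):
    """Run work() while periodically renewing the visibility timeout.

    The message is deleted only if work() returns without raising.
    """
    stop = threading.Event()

    def heartbeat():
        # Event.wait returns False on timeout (renew the timeout) and
        # True once stop is set (exit the loop).
        while not stop.wait(interval_s):
            sqs.change_message_visibility(
                QueueUrl=queue_url,
                ReceiptHandle=receipt_handle,
                VisibilityTimeout=extend_by_s,  # counted from now, not added
            )

    thread = threading.Thread(target=heartbeat, daemon=True)
    thread.start()
    try:
        result = work()
    finally:
        stop.set()
        thread.join()
    # Reached only on success; on failure the message reappears after
    # the last renewed timeout expires.
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=receipt_handle)
    return result
```

`interval_s` should stay well below `extend_by_s` so that a renewal always lands before the current timeout runs out. Note that SQS caps the total visibility timeout of a message at 12 hours from the time it was received, so renewals cannot keep a message hidden indefinitely.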
