Message is received from Google Pub/Sub subscription again and again after acknowledge[Heisenbug]
Asked Answered
P

2

8

I would like to notice that the scenarion I will describe happen rare enough and in most cases everything works as expected.

I have 1 topic and 1 subscription on Pub/Sub side.

My java application listens for subscription, does some processing and sends acknowledge back. Because of fact that google Pub/Sub guarantees at least once delivery, we do message deduplication on our side based on objectGeneration header and 'objectId' header.

Sometimes we see that message that was acknowldged is accepted by our application again and again and it is unexpected behaviour.

Log example:

//first
2019-12-17 20:51:57.375 INFO 1 --- [sub-subscriber3] bucketNotificationFlow : Received new message from pub-sub: GenericMessage [payload={....}, headers={.....objectGeneration=1576615916875106, eventTime=2019-12-17T20:51:56.874940Z, objectId=Small_files_bunch/100_12_1.csv, ....
....
2019-12-17 20:51:57.698 INFO 1 --- [sub-subscriber3] .i.g.PubSubMessageAcknowledgementHandler : Acknowledged message - 1576615916875106
...
//duplicate 1
2019-12-17 20:51:59.663 INFO 1 --- [sub-subscriber4] bucketNotificationFlow : Received new message from pub-sub: GenericMessage [payload={...}, headers={ objectGeneration=1576615916875106, eventTime=2019-12-17T20:51:56.874940Z, objectId=Small_files_bunch/100_12_1.csv", ....
...
2019-12-17 20:51:59.704 INFO 1 --- [sub-subscriber4] c.b.m.i.DiscardedMessagesHandler : Duplicate message received GenericMessage [ headers={idempotent.keys=[objectGeneration.1576615916875106, objectId.Small_files_bunch/100_12_1.csv], ...
....
//duplicate 2
2019-12-17 22:52:02.239 INFO 1 --- [sub-subscriber1] bucketNotificationFlow : Received new message from pub-sub: GenericMessage [payload={...}, headers={objectGeneration=1576615916875106, eventTime=2019-12-17T20:51:56.874940Z, objectId=Small_files_bunch/100_12_1.csv, ...
...
2019-12-17 22:52:02.339 INFO 1 --- [sub-subscriber1] c.b.m.i.DiscardedMessagesHandler : Duplicate message received GenericMessage [ headers={idempotent.keys=[objectGeneration.1576615916875106, objectId.Small_files_bunch/100_12_1.csv], ...

// and so on each 2 hours

Code for acknowledgement:

var generation = message.getHeaders().get("objectGeneration");
pubSubMessage = message.getHeaders().get(GcpPubSubHeaders.ORIGINAL_MESSAGE, BasicAcknowledgeablePubsubMessage.class)
pubSubMessage.ack().addCallback(
        v -> {
            removeFromIdempotentStore(targetMessage, false);
            log.info("Acknowledged message - {}", generation); //from logs we see that this line was invoked
        },
        e -> {
            removeFromIdempotentStore(targetMessage, false);
            log.error("Failed to acknowledge message - {}", generation, e);
        }
);

GCP subscription page contains following diagram:

enter image description here

StackDriver acknowledge diagram: enter image description here

Any ideas what is going on, how to troubleshoot it and fix it ?

Propaganda answered 18/12, 2019 at 11:41 Comment(0)
S
1

Try checking Stackdriver to see if you are missing acknowledgement deadlines.

The two hour wait time between duplicates is very interesting. Have you tried expanding your message deadline before? (Info on this is at the above link.)

Ski answered 21/12, 2019 at 16:12 Comment(15)
Why 2 hours ? because I set max-ack-extension-period: 7200Propaganda
Acknowledgement deadline is 600 SecondsPropaganda
I found a thread on the GitHub Google Cloud Client Libraries for Go describing your issue. I wasn't able to find the same on the Java client library. Try reducing your message deadline to less than 600.Ski
Initially message deadline was 10 sec(default value) and it was attempt to fix the issue when I set 600 sec, so message deadline doesn't matter for that issuePropaganda
Did you check your Stackdriver logs to see if you are missing acknowledgement deadlines?Ski
dl3.joxi.net/drive/2019/12/27/0005/3037/338909/09/…Propaganda
looks like everuthing is blue(success) on the diagram(added diagram to the topic too)Propaganda
More detailed picture where you can see that I use correct filters: dl3.joxi.net/drive/2019/12/27/0005/3037/338909/09/…Propaganda
Do you have any ideas?Propaganda
I spent most of my day yesterday trying to reproduce this issue and I was unable to get the intended results. Here is my main idea as of now: Judging by your "oldest unacked message age" metric, one of your messages is not being acknowledged for 2 hours. This we already know. Because your stackdriver logs only show successes, this leads me to believe that the original acknowledgement is not being sent back to pubsub.Ski
And I am aware that your logs show that the message was acknowledged. And I know this is extremely unlikely, but I have been unable find any other cases similar to this where this solution was not the answer. You mentioned that this behavior only happens sometimes. Have you noticed any patterns? Like when a specific function is called?Ski
Let me share with you a bit more context. I have bucket which generates notification and sends it to the topic when I put file into it. Application listens corresponding subscription, processes file and the sends acknowledge to the Pub/Sub. This issue reproduces when I send thousands of files of different sizes(most of files have size 3 kb but 12 files has size 190MB). My application is powerful enough to make sure that time between notification generation and acknowledge much less than 2 hours(even for huge files)Propaganda
Okay, it sounds like we both agree on the issue not being with job processing and timeouts. And now I know that this only occurs when a large number (thousands) of files are sent to your bucket. With this in mind, the first duplicate message you receive in your logs is very close to when the first message is received. Can you try acknowledging the first message when it is received, and then not processing the message?Ski
You mean to acknowledge first duplicated message without processing? I've already implemented this workaround and it worksPropaganda
Okay so you have already implemented this. It seems the best course of action here would be to create an issue with Google Cloud Platform.Ski
L
0

See more info here: How to cleanup the JdbcMetadataStore?

According our conclusion, it would be better do not remove entries from metadata store table immediately after processing. Some external job should do the trick from time to time and only for those entries which are old enough for remove and we are definitely sure that Pub/Sub won't redeliver to us the same message anymore.

Lanham answered 26/12, 2019 at 17:23 Comment(2)
yes, it is my current state of things. But I thunk that it is a workaround. I suppose I configured google pub/sub or spring-cloud-gcp in a wrong way. Highly unlikely but it might be a bug in the library or pub/sub. Redelivery is ok and it is expected anomaly. But in our case successful callback is invoked but message was not removed from subscription and we get it again and again. So we can't just ignore duplicated we have to do repeated acknowledge to remove message from subscriptionPropaganda
again and again means endlessPropaganda

© 2022 - 2024 — McMap. All rights reserved.