Azure Function Event Hub Trigger reliability

I'm a bit confused regarding the EventHubTrigger for Azure functions.

I've got an IoT Hub, and am using its eventhub-compatible endpoint to trigger an Azure function that is going to process and store the received data.

However, if my function fails (i.e. throws an exception), the message or messages being processed during that call are lost. I would expect the Azure Functions runtime to process them again later, because the EventHubTrigger keeps checkpoints in the Function App's storage account to track where in the event stream it has to continue.

The documentation of the EventHubTrigger even states that:

If all function executions succeed without errors, checkpoints are added to the associated storage account

But still, even when I deliberately throw exceptions in my function, the checkpoints will get updated and the messages will not get received again.

Is my understanding of the EventHubTrigger's documentation wrong, or is the EventHubTrigger's implementation (or its documentation) wrong?

Eula answered 8/3, 2018 at 6:31 Comment(0)

This piece of documentation is indeed confusing. I guess they mean errors of the Function App host itself, not of your code. An exception inside a function execution doesn't stop processing or checkpointing.

The fact is that Event Hubs are not designed for individual message retries. The processor works in batches, and it can either mark the whole batch as processed (i.e. create a checkpoint after it), or retry the whole batch (e.g. if the process crashed).
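To make the batch semantics concrete, here is a minimal toy model in plain Python. It is not the Functions runtime or the Event Hubs SDK; `run_processor`, `flaky_handler` and the integer offset are hypothetical names made up for illustration. It only mimics the behavior described above: a handler exception is logged, and the checkpoint still moves past the whole batch.

```python
from typing import Callable, List


def run_processor(events: List[str], batch_size: int,
                  handler: Callable[[List[str]], None]) -> int:
    """Toy batch processor; returns the final checkpoint offset."""
    checkpoint = 0  # offset of the next unread event
    while checkpoint < len(events):
        batch = events[checkpoint:checkpoint + batch_size]
        try:
            handler(batch)
        except Exception as exc:
            # Assumption: user-code errors are only logged, as described above.
            print(f"handler failed on {batch}: {exc}")
        # The checkpoint covers the whole batch, failed or not.
        checkpoint += len(batch)
    return checkpoint


def flaky_handler(batch: List[str]) -> None:
    # Hypothetical user code that cannot handle the event "bad".
    if "bad" in batch:
        raise ValueError("cannot process 'bad'")


final_offset = run_processor(["a", "bad", "c", "d"], 2, flaky_handler)
print(final_offset)  # 4: the failed batch was checkpointed too
```

In this model the event "bad" is never seen again, which matches what the question observes.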

See this forum question and answer.

If you still need to re-process failed events from Event Hub (and errors don't happen too often), you could implement such a mechanism yourself, e.g.:

  1. Add an output Queue binding to your Azure Function.
  2. Add a try-catch around the processing code.
  3. If an exception is thrown, add the problematic event to the Queue.
  4. Have another Function with Queue trigger to process those events.
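The steps above can be sketched roughly as follows. This is a plain-Python illustration, not real Azure Functions bindings: the `deque` stands in for the output Queue binding, and `process_batch`, `store_reading` and the event shape are assumptions made up for the example.

```python
import json
from collections import deque
from typing import Callable, Deque, Iterable


def process_batch(events: Iterable[dict],
                  process: Callable[[dict], None],
                  failed_queue: Deque[str]) -> int:
    """Process each event; park failures on a queue. Returns the success count."""
    succeeded = 0
    for event in events:
        try:
            process(event)
            succeeded += 1
        except Exception as exc:
            # Steps 2-3: catch the failure and enqueue the event with its error.
            failed_queue.append(json.dumps({"event": event, "error": str(exc)}))
    return succeeded


def store_reading(event: dict) -> None:
    # Hypothetical processing step: reject readings without a temperature.
    if "temperature" not in event:
        raise ValueError("missing temperature")


failed: Deque[str] = deque()  # stand-in for the output Queue binding (step 1)
batch = [{"device": "d1", "temperature": 21.5}, {"device": "d2"}]
ok_count = process_batch(batch, store_reading, failed)

# Step 4: a second, queue-triggered consumer would pick these up later.
retry_items = [json.loads(message) for message in failed]
```

Storing the error text next to the event is a design choice for this sketch; it gives the second function enough context to decide whether to retry or discard.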

Note that the downside of this is that you will lose the ordering guarantee provided by Event Hubs (since the Queue message will be processed later than its neighbors).

Moy answered 8/3, 2018 at 7:15 Comment(3)
Thanks for your quick answer! I feared this would be the answer :-) In my case, losing the order shouldn't be a problem; I was thinking along those lines (a poison queue for failed messages) anyway... What I don't get is the following: as far as I understand the checkpoint mechanism, it is designed the way it is so that you can implement retry-like functionality on the consumer side, because the Event Hub doesn't offer this on its own (as opposed to queues). The fact that the EventHubTrigger treats its own failures differently from user-code failures seems like a bug (or a missing feature) to me... – Eula
@Eula I believe the Event Hub trigger is based on EventProcessorHost, which is the recommended processing library for Event Hubs, so it inherits most of its properties. Its own failures are of a different nature: they are fundamental problems like crashes or network issues. – Moy
We are now probably going to implement a pattern as described here: hackernoon.com/…, which is basically the same as what you described. However, I'm still not happy with that: Microsoft states that Event Hubs and the Event Processor Host are designed to implement "at least once delivery at scale", yet even with the fallback to a poison queue we cannot guarantee at-least-once delivery 100%. The issue is that even if we do not set a checkpoint, the next batch of messages might succeed and set the checkpoint. – Eula

A quick fix: a retry policy will not help if the downstream system is down for a few hours. Instead, you can call Process.GetCurrentProcess().Kill(); in the exception handler, which stops the checkpoint from moving forward. I have tested this with a Consumption-plan function app. You will not see anything in the logs, so I send an email to notify that something went wrong and that the function instance was killed to avoid data loss. Hope this helps. I plan to write a blog post about this and about the other part of the workflow, where a Logic App stops the function in case of continuous failure of the downstream system.
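The idea behind stopping the instance can be illustrated with a small toy model in plain Python. This is only a sketch under assumptions, not the Functions runtime: the `store` dict stands in for the checkpoint in the storage account, and `ProcessorStopped`, `run` and `send` are hypothetical names. Aborting before the checkpoint is written means a restarted processor reads the same batch again.

```python
from typing import Callable, Dict, List


class ProcessorStopped(Exception):
    """Stand-in for the process being killed before the checkpoint is written."""


def run(events: List[str], store: Dict[str, int], batch_size: int,
        handler: Callable[[List[str]], None]) -> None:
    while store["offset"] < len(events):
        start = store["offset"]
        batch = events[start:start + batch_size]
        try:
            handler(batch)
        except Exception as exc:
            # Abort instead of continuing, so the checkpoint stays where it was.
            raise ProcessorStopped(str(exc)) from exc
        store["offset"] = start + len(batch)  # checkpoint only after success


events = ["e1", "e2", "e3"]
store = {"offset": 0}          # stand-in for the persisted checkpoint
downstream = {"up": False}     # hypothetical downstream system state
delivered: List[str] = []


def send(batch: List[str]) -> None:
    if not downstream["up"]:
        raise ConnectionError("downstream unavailable")
    delivered.extend(batch)


try:
    run(events, store, 2, send)
except ProcessorStopped:
    pass  # the "instance" died; the stored offset has not moved

offset_after_crash = store["offset"]
downstream["up"] = True
run(events, store, 2, send)  # a restart resumes from the stored checkpoint
```

Whether killing a real function instance reliably prevents the checkpoint is the claim of this answer, not something the sketch proves.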

Nez answered 27/12, 2018 at 6:12 Comment(0)
