How to recover from missed integration or notification events in an event-driven architecture?

The situation is as follows. There are three services: one service is event sourced and publishes integration or notification events (outbox pattern) to the other two services (subscribers) using an event bus (such as Azure Service Bus or ActiveMQ).

[Diagram: pub/sub outbox pattern on an event-sourced microservice]

This design is inspired by .NET microservices - Architecture e-book - Subscribing to events.
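
For concreteness, here is roughly how I picture the outbox write (a minimal Python/SQLite sketch with hypothetical `orders`/`outbox` tables, not my actual code): the state change and the event are persisted in one local transaction, and a separate publisher later relays pending outbox rows to the bus.

```python
# Minimal outbox-write sketch (hypothetical table and column names).
# The point is that the domain change and the event are written in the
# SAME local transaction, so neither can exist without the other.
import json
import sqlite3
import uuid

def save_order_and_outbox_event(conn: sqlite3.Connection, order: dict) -> None:
    with conn:  # one transaction for both inserts
        conn.execute(
            "INSERT INTO orders (id, status) VALUES (?, ?)",
            (order["id"], order["status"]),
        )
        conn.execute(
            "INSERT INTO outbox (event_id, event_type, payload, dispatched) "
            "VALUES (?, ?, ?, 0)",
            (str(uuid.uuid4()), "OrderPlaced", json.dumps(order)),
        )
```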

I'm wondering what should happen if one of these events cannot be delivered due to an error, or if event handling simply wasn't implemented correctly.

  • Should I trust my message bus in case of an application error?
    • Is this a use case for dead-letter queues?
  • On republishing events, should all messages be republished to all topics or would it be possible to only republish a subset?
    • Should the service republishing events be able to access publisher and subscriber databases to know the message offset?
    • Or should the subscribing microservices be able to read the outbox?
Synonymize answered 23/12, 2020 at 13:27 Comment(1)
In addition to the linked eBook (which is excellent), take a look at "Enterprise Integration Patterns" by Gregor Hohpe and Bobby Woolf (ISBN 978-0321200686) to learn about the patterns in use. A subset of the content is available at enterpriseintegrationpatterns.com/patterns/messaging Scaife

Should I trust my message bus in case of an application error?

Yes.

(Edit: After reading this answer, read @StuartLC's answer for more info)

The system you described is an eventually consistent one. It works under the assumption that if each component does its job, all components will eventually converge on a consistent state.

The Outbox's job is to ensure that any event persisted by the Event Source Microservice is durably and reliably delivered to the message bus (via the Event Publisher). Once that happens, the Event Source and the Event Publisher are done--they can assume that the event will eventually be delivered to all subscribers. It is then the message bus's job to ensure that that happens.
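
To make this concrete, a sketch of such a relay loop (the `outbox` table layout and the `bus.publish()` call are hypothetical stand-ins for whatever store and client library you use):

```python
# Outbox relay sketch: publish pending rows, then mark them dispatched only
# after the bus confirms acceptance. A crash between publish and update means
# the event is sent again, so this gives at-least-once publishing and
# subscribers must tolerate duplicates.
import time

def relay_outbox(conn, bus, poll_interval_s: float = 1.0) -> None:
    while True:
        rows = conn.execute(
            "SELECT event_id, event_type, payload FROM outbox "
            "WHERE dispatched = 0 ORDER BY rowid"
        ).fetchall()
        for event_id, event_type, payload in rows:
            bus.publish(topic=event_type, key=event_id, body=payload)  # hypothetical client call
            with conn:
                conn.execute(
                    "UPDATE outbox SET dispatched = 1 WHERE event_id = ?",
                    (event_id,),
                )
        time.sleep(poll_interval_s)
```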

The message bus and its subscriptions can be configured for either "at least once" or "at most once" delivery. (Note that "exactly once" delivery is generally not guaranteeable, so an application should be resilient against either duplicate or missed messages, depending on the subscription type).

An "at least once" (called "Peek Lock" by Azure Service Bus) subscription will hold on to the message until the subscriber gives confirmation that it was handled. If the subscriber gives confirmation, the message bus's job is done. If the subscriber responds with an error code or doesn't respond in a timely manner, the message bus may retry delivery. If delivery fails multiple times, the message may be sent to a poison message or dead-letter queue. Either way, the message bus holds on to the message until it gets confirmation that it was received.

On republishing events, should all messages be republished to all topics or would it be possible to only republish a subset?

I can't speak for all messaging systems, but I would expect a message bus to only republish to the subset of subscriptions that failed. Regardless, all subscribers should be prepared to handle duplicate and out-of-order messages.
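
A common way to be prepared for duplicates is an idempotent consumer that remembers which messages it has already processed. A minimal sketch, assuming every message carries a unique ID (in production the "seen" set would live in the subscriber's own database, ideally updated in the same transaction as the side effects):

```python
# Idempotent-consumer sketch: skip messages whose ID has been seen before.
processed_ids: set[str] = set()

def apply_business_logic(payload: dict) -> None:
    """Hypothetical placeholder for the subscriber's real side effects."""
    ...

def handle_once(message_id: str, payload: dict) -> None:
    if message_id in processed_ids:
        return  # duplicate delivery: safe to ignore
    apply_business_logic(payload)
    processed_ids.add(message_id)
```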

Should the service republishing events be able to access publisher and subscriber databases to know the message offset?

I'm not sure I understand what you mean by "know the message offset", but as a general guideline, microservices should not share databases. A shared database schema is a contract. Once the contract is established, it is difficult to change unless you have total control over all of its consumers (both their code and deployments). It's generally better to share data through application APIs to allow more flexibility.

Or should the subscribing microservices be able to read the outbox?

The point of the message bus is to decouple the message subscribers from the message publisher. Making the subscribers explicitly aware of the publisher defeats that purpose, and will likely be difficult to maintain as the number of publishers and subscribers grows. Instead, rely on a dedicated monitoring service and/or the monitoring capabilities of the message bus to track delivery failures.

Scaife answered 27/12, 2020 at 4:45 Comment(4)
Thank you, I understand now that I shouldn't focus on republishing a subset of events, because with "at least once" delivery my message bus should ensure no messages are lost, even after multiple failed deliveries. In case of a complete rebuild of one of the subscribers, or if a new subscriber is added, should I just republish all integration events? Subscribers would have to check for already-received messages, or the message handling would have to be idempotent. Sounds heavy, but maybe that's bias on my part. And of course those subscribers can scale out at that moment.Synonymize
You're right, that sounds heavy. The pub/sub system is for what's happening in the moment, not what happened in the past. Look for a way for the subscriber to pull the info they need to backfill rather than requiring the publisher to push it. The subscriber could use an API to request current state or history from the system of record. That API could be routed through the message bus to keep the publisher and subscriber decoupled. If enough data is needed, you could also seed the subscriber with a snapshot of an existing data store. Sharing a snapshot is safer than sharing a live database.Scaife
So you'd recommend not resending messages, but pulling all entities on a complete rebuild. Iterating over all aggregate IDs could be achieved with, for example, a gRPC stream, right? Since I will be using Orleans grains, the current state can be requested by using an Orleans client for each aggregate ID. Or do you know of any rebuilding examples?Synonymize
I can't give implementation specifics (I don't have experience with gRPC or Orleans, and I don't know any details of your system), but I think you're on the right track. Think about whether you really need to pull all entities, though. It might (or might not) be better to pull them just-in-time as needed.Scaife

Just to add to @xander's excellent answer, I believe that you may be using an inappropriate technology for your event bus. You should find that Azure Event Hubs or Apache Kafka are better candidates for event publish / subscribe architectures. Benefits of a dedicated Event Bus technology over the older Service Bus approaches include:

  • There is only ever one copy of each event message (whereas Azure Service Bus or RabbitMQ make deep copies of each message for each subscriber)
  • Messages are not deleted after consumption by any one subscriber. Instead, messages are left on the topic for a defined period of time (which can be indefinite, in Kafka's case).
  • Each subscriber (consumer group) is able to track its committed offset. This allows a subscriber to reconnect and rewind if it has lost messages, independently of the publisher and other subscribers (i.e. isolated); see the sketch after this list.
  • New consumers can subscribe AFTER messages have been published, and will still be able to receive ALL messages available (i.e. rewind to the start of available events)
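
To illustrate the consumer-group behaviour, a sketch using the kafka-python client (topic, group, and broker names are hypothetical): a consumer group created after the events were published can still start from the oldest retained event.

```python
# Each consumer group tracks its own committed offset, so a late subscriber
# can read the full retained history independently of other groups.
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "orders",
    bootstrap_servers="localhost:9092",
    group_id="late-subscriber",
    auto_offset_reset="earliest",   # no committed offset yet: start at the beginning
    enable_auto_commit=True,
)
for record in consumer:
    print(record.offset, record.value)
```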

With this in mind, to answer your questions:

Should I trust my message bus in case of an application error?

Yes, for the reasons xander provided. Once the publisher has confirmation that the event bus has accepted the event, the publisher's job is done and it should never send the same event again.

Nitpicky, but since you are in a publish subscribe architecture (i.e. 0..N subscribers), you should refer to the bus as an event bus (not a message bus), irrespective of the technology used.

Is this a usecase for dead letter queues?

Dead-letter queues are more usually an artifact of point-to-point queue or service bus delivery architectures, i.e. where there is a command message intended (transactionally) for a single recipient, or possibly a finite number of recipients. In a pub-sub event bus topology, it would be unfair to the publisher to expect it to monitor the delivery of all subscribers.

Instead, the subscriber should take on responsibility for resilient delivery. In technologies like Azure Event Hubs and Apache Kafka, events are uniquely numbered per consumer group, so the subscriber can be alerted to a missed message through monitoring of message offsets.
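
For example, a subscriber (or a monitoring job) can compare its consumer group's committed offset with the end of the partition to detect unprocessed events. A sketch with kafka-python, assuming a single partition and hypothetical topic/group names:

```python
# Lag check: committed offset of this consumer group vs. the partition's end offset.
from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(
    bootstrap_servers="localhost:9092",
    group_id="billing",
    enable_auto_commit=False,
)
tp = TopicPartition("orders", 0)
committed = consumer.committed(tp) or 0      # how far this group has got
end = consumer.end_offsets([tp])[tp]         # latest offset on the partition
lag = end - committed
if lag > 0:
    print(f"{lag} event(s) not yet processed by this consumer group")
```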

On republishing events, should all messages be republished to all topics or would it be possible to only republish a subset?

No, an event publisher should never republish an event, as this will corrupt the chain of events to all observer subscribers. Remember that there may be N subscribers to each published event, some of which may be external to your organisation / outside of your control. Events should be regarded as 'facts' which have happened at a point in time. The event publisher shouldn't care whether there are zero or 100 subscribers to an event. It is up to each subscriber to decide how the event message should be interpreted.

e.g. Different types of subscribers could do any of the following with an event:

  • Simply log the event for analytics purposes
  • Translate the event into a command (or Actor Model message) and be handled as a transaction specific to the subscriber
  • Pass the event into a Rules engine to reason over the wider stream of events, e.g. trigger counter-fraud actions if a specific customer is performing an unusually large number of transactions
  • etc.

So you can see that republishing events for the benefit of one flaky subscriber would corrupt the data flow for other subscribers.

Should the service republishing events be able to access publisher and subscriber databases to know the message offset?

As xander said, systems and microservices shouldn't share databases. However, systems can expose APIs (RESTful, gRPC, etc.).

The Event Bus itself should track which subscriber has read up to which offset (i.e. per consumer group, per topic and per partition). Each subscriber will be able to monitor and change its offsets, e.g. in case an event was lost and needs to be re-processed. (Again, the producer should never republish an event once it has confirmation that the event has been received by the bus)
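
For instance, a subscriber that lost or mishandled some events can rewind its own offset and re-read them without touching the publisher or any other consumer group. A sketch with kafka-python (the offset and names are hypothetical):

```python
# Rewind-and-reprocess sketch: only this consumer group is affected.
from kafka import KafkaConsumer, TopicPartition

def reprocess(record) -> None:
    """Hypothetical placeholder for re-applying the event."""
    ...

consumer = KafkaConsumer(bootstrap_servers="localhost:9092", group_id="billing")
tp = TopicPartition("orders", 0)
consumer.assign([tp])
consumer.seek(tp, 1200)          # hypothetical offset of the first missed event
for record in consumer:
    reprocess(record)
```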

Or should the subscribing microservices be able to read the outbox?

There are at least two common approaches to event driven enterprise architectures:

  • 'Minimal information' events, e.g. Customer Y has purchased Product Z. In this case, many of the subscribers will find the information contained in the event insufficient to complete downstream workflows, and will need to enrich the event data, typically by calling an API close to the publisher, in order to retrieve the rest of the data they require. This approach has security benefits (since the API can authenticate the request for more data), but can lead to high I/O load on the API.
  • 'Deep graph' events, where each event message has all the information that any subscriber could ever hope to need (this is surprisingly difficult to future-proof!). Although the event message sizes will be bloated, it does save a lot of triggered I/O, as the subscribers shouldn't need to perform further enrichment from the producer. (Both styles are sketched below.)
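
To make the contrast concrete, a sketch of the two styles (all field names, payload values, and the enrichment call are hypothetical):

```python
# 'Minimal information' vs 'deep graph' event payloads.
minimal_event = {
    "type": "ProductPurchased",
    "customer_id": "Y",
    "product_id": "Z",
}

deep_graph_event = {
    "type": "ProductPurchased",
    "customer": {"id": "Y", "name": "...", "tier": "gold"},
    "product": {"id": "Z", "name": "...", "price": 19.99},
    "order": {"id": "...", "total": 19.99, "currency": "EUR"},
}

def fetch_customer(customer_id: str) -> dict:
    """Hypothetical call to an API close to the publisher."""
    ...

def enrich(event: dict) -> dict:
    # What a subscriber of the minimal event typically has to do before it
    # can complete its downstream workflow.
    return {**event, "customer": fetch_customer(event["customer_id"])}
```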
Inventive answered 2/1, 2021 at 11:27 Comment(6)
Great additional info! I learned some things.Scaife
It should also be noted that, to the extent that the downstream subscribers are themselves event-driven (e.g. maintaining their own aggregate states), the minimal information model doesn't require querying the publisher.Buatti
Thank you for the extensive additional information. If you choose not to leave messages on the topic for an indefinite period of time, how would you make it possible for new consumers to read all messages without republishing all events?Synonymize
There's a mindset shift when moving from traditional 'queue' (MSMQ, MQ Series, etc.) and 'service bus' (Azure Service Bus, RabbitMQ) technologies to Kafka / Event Hubs. In the latter, event messages on a topic/partition don't belong to any one subscribing consumer, and frequent deletion is seen as a performance impediment, hence the rationale for leaving the messages on cheap storage for prolonged periods.Inventive
I understand messages (or published events) are not short lived in event hub technology. Can I conclude that most event driven architectures store all published events in event hub (or Kafka) forever/indefinitely?Synonymize
Azure Event Hubs can retain event messages for a maximum of 7 days. Kafka can be configured to retain events indefinitely, which means it can act as a source of truth for audits, and it also has applications in Data Lakes / Big Data (although you'll typically need to retain an index of events if you need random access to them)Inventive
