Just to add to @xander's excellent answer, I believe that you may be using an inappropriate technology for your event bus. You should find that Azure Event Hubs or Apache Kafka are better candidates for event publish / subscribe architectures. Benefits of a dedicated Event Bus technology over the older Service Bus approaches include:
- There is only ever one copy of each event message (whereas Azure Service Bus or RabbitMQ make deep copies of each message for each subscriber)
- Messages are not deleted after consumption by any one subscriber. Instead, messages are left on the topic for a defined period of time (which can be indefinite, in Kafka's case).
- Each subscriber (consumer group) tracks its own committed offset. This allows a subscriber to reconnect and rewind if it has lost messages, independently of the publisher and of other subscribers (i.e. isolated).
- New consumers can subscribe AFTER messages have been published, and will still be able to receive ALL messages available (i.e. rewind to the start of available events)
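To make the points above concrete, here is a minimal in-memory sketch of a log-style topic. `EventLog` and its methods are hypothetical names invented for illustration (this is not the Kafka or Event Hubs client API), and it models a single partition only.

```python
from collections import defaultdict


class EventLog:
    """Toy in-memory stand-in for one topic partition (not a real client)."""

    def __init__(self):
        self._events = []                   # one stored copy of each event
        self._committed = defaultdict(int)  # consumer group -> next offset to read

    def publish(self, event):
        self._events.append(event)
        return len(self._events) - 1        # offset assigned to the event

    def poll(self, group):
        """Return events the group has not committed yet; nothing is deleted."""
        return self._events[self._committed[group]:]

    def commit(self, group, next_offset):
        self._committed[group] = next_offset

    def seek(self, group, offset):
        """Rewind (or skip ahead) for one group only."""
        self._committed[group] = offset


log = EventLog()
for name in ("OrderPlaced", "OrderPaid", "OrderShipped"):
    log.publish(name)

billing = log.poll("billing")
log.commit("billing", len(billing))         # billing has consumed everything

late_joiner = log.poll("analytics")         # subscribed after publication
log.seek("billing", 1)                      # billing rewinds; analytics unaffected
replayed = log.poll("billing")
```

The `analytics` group still sees all three events even though it joined late, and rewinding `billing` changes nothing for any other group, because the only per-subscriber state is an offset.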
With this in mind:
Should I trust my message bus in case of an application error?
Yes, for the reasons xander provided. Once the publisher has confirmation that the event bus has accepted the event, the publisher's job is done, and it should never send that same event again.
Nitpicky, but since you are in a publish subscribe architecture (i.e. 0..N subscribers), you should refer to the bus as an event bus (not a message bus), irrespective of the technology used.
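As a rough sketch of the publisher's side of that contract, assuming the outbox pattern from the question: the relay below sends only rows that have not yet been confirmed, and marks a row as published as soon as the bus acknowledges it. `FakeBus`, `relay_outbox` and the row layout are hypothetical and stand in for whatever client and outbox table you actually use.

```python
class FakeBus:
    """Hypothetical bus client: send() returns True once the event is accepted."""

    def __init__(self, fail_first=0):
        self.accepted = []
        self._failures_left = fail_first

    def send(self, event):
        if self._failures_left > 0:
            self._failures_left -= 1
            return False                    # no confirmation, e.g. a timeout
        self.accepted.append(event)
        return True


def relay_outbox(outbox, bus):
    """Publish unconfirmed outbox rows; mark a row once it is acknowledged."""
    for row in outbox:
        if row["published"]:
            continue                        # confirmed earlier: never send again
        if bus.send(row["event"]):
            row["published"] = True
        else:
            break                           # stop to preserve order; retry later


outbox = [
    {"event": "OrderPlaced", "published": False},
    {"event": "OrderPaid", "published": False},
]
bus = FakeBus(fail_first=1)
relay_outbox(outbox, bus)   # first send is not confirmed, nothing is marked
relay_outbox(outbox, bus)   # retry: both events accepted and marked
relay_outbox(outbox, bus)   # no-op: confirmed events are not sent again
```

Retrying is limited to events that were never confirmed. If a confirmation is lost in transit, a retry can still put a duplicate on the bus, so it is worth making subscribers tolerant of that.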
Is this a usecase for dead letter queues?
Dead letter queues are more usually an artifact of point-to-point queues or service bus delivery architecture, i.e. where there is a command message intended (transactionally) for a single, or possibly finite number of recipients. In a pub-sub event bus topology, it would be unfair to the publisher to expect it to monitor the delivery of all subscribers.
Instead, the subscriber should take on responsibility for resilient delivery. In technologies like Azure Event Hubs and Apache Kafka, events are numbered sequentially within each partition, and each consumer group tracks its own position, so the subscriber can be alerted to a missed message by monitoring message offsets.
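A minimal sketch of that kind of offset monitoring on the subscriber side follows. It assumes offsets increase by exactly one per event within a partition, which may not hold everywhere (for example on compacted or transactional Kafka topics), and the `Subscriber` class and its fields are hypothetical.

```python
class Subscriber:
    """Tracks the last processed offset per partition and records any gaps."""

    def __init__(self):
        self.last_processed = {}            # partition -> last processed offset
        self.alerts = []                    # (partition, [missed offsets])

    def on_event(self, partition, offset, payload):
        last = self.last_processed.get(partition)
        if last is not None:
            if offset <= last:
                return                      # already processed; ignore the replay
            missed = list(range(last + 1, offset))
            if missed:
                # Assumption: a real system would raise a monitoring alert here
                # and then rewind its own offset to re-read the missing events.
                self.alerts.append((partition, missed))
        self.last_processed[partition] = offset


sub = Subscriber()
sub.on_event(0, 0, "OrderPlaced")
sub.on_event(0, 1, "OrderPaid")
sub.on_event(0, 4, "OrderShipped")          # offsets 2 and 3 never arrived
```

The publisher is not involved at any point: the subscriber notices the gap from the offsets alone and can recover by re-reading from the bus.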
On republishing events, should all messages be republished to all topics or would it be possible to only republish a subset?
No, an event publisher should never republish an event, as this will corrupt the chain of events for all observing subscribers. Remember that there may be N subscribers to each published event, some of which may be external to your organisation / outside of your control. Events should be regarded as 'facts' which have happened at a point in time. The event publisher shouldn't care whether there are zero or 100 subscribers to an event. It is up to each subscriber to decide how the event message should be interpreted.
e.g. Different types of subscribers could do any of the following with an event:
- Simply log the event for analytics purposes
- Translate the event into a command (or Actor Model message) and handle it as a transaction specific to the subscriber
- Pass the event into a Rules engine to reason over the wider stream of events, e.g. trigger counter-fraud actions if a specific customer is performing an unusually large number of transactions
- etc.
So you can see that republishing events for the benefit of one flaky subscriber would corrupt the data flow for other subscribers.
Should the service republishing events be able to access publisher and subscriber databases to know the message offset?
As xander said, systems and microservices shouldn't share databases. However, systems can expose APIs (RESTful, gRPC etc.).
The Event Bus itself should track which subscriber has read up to which offset (i.e. per consumer group, per topic and per partition). Each subscriber will be able to monitor and change its own offsets, e.g. in case an event was lost and needs to be re-processed. (Again, the producer should never republish an event once it has confirmation that the event has been received by the bus.)
Or should the subscribing microservices be able to read the outbox?
There are at least two common approaches to event driven enterprise architectures:
- 'Minimal information' events, e.g. "Customer Y has purchased Product Z". In this case, many of the subscribers will find the information contained in the event insufficient to complete downstream workflows, and will need to enrich the event data, typically by calling an API close to the publisher, in order to retrieve the rest of the data they require. This approach has security benefits (since the API can authenticate the request for more data), but can lead to high I/O load on the API.
- 'Deep graph' events, where each event message has all the information that any subscriber should ever hope to need (this is surprisingly difficult to future proof!). Although the event messages will be bloated, this does save a lot of triggered I/O, as the subscribers shouldn't need to perform further enrichment from the producer.
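The trade-off between the two approaches can be sketched as follows. All field names, the example data and the dictionary-backed "APIs" are hypothetical placeholders; in practice the enrichment call would be an authenticated request to a service owned by the publisher.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MinimalPurchaseEvent:
    customer_id: str
    product_id: str


@dataclass(frozen=True)
class DeepPurchaseEvent:
    customer_id: str
    product_id: str
    customer_email: str
    product_name: str
    price: float


# Hypothetical stand-ins for an authenticated API close to the publisher.
CUSTOMER_API = {"Y": {"email": "y@example.com"}}
PRODUCT_API = {"Z": {"name": "Widget", "price": 9.99}}
api_calls = 0


def enrich(event):
    """Subscriber-side enrichment: extra lookups for the data the event lacks."""
    global api_calls
    api_calls += 2                          # one customer and one product lookup
    customer = CUSTOMER_API[event.customer_id]
    product = PRODUCT_API[event.product_id]
    return DeepPurchaseEvent(event.customer_id, event.product_id,
                             customer["email"], product["name"], product["price"])


minimal = MinimalPurchaseEvent("Y", "Z")
enriched = enrich(minimal)                  # every subscriber pays this I/O cost

# A deep graph event carries the same data up front, so no lookups are needed.
deep = DeepPurchaseEvent("Y", "Z", "y@example.com", "Widget", 9.99)
```

With minimal events the lookup cost is multiplied by the number of subscribers and events. With deep graph events it drops to zero, at the price of larger messages and a schema that has to anticipate what subscribers will need.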