How to restore state in an event-based, message-driven microservice architecture in a failure scenario

In the context of a microservice architecture, a message-driven, asynchronous, event-based design seems to be gaining popularity (see here and here for some examples, as well as the Reactive Manifesto's "Message Driven" trait) as opposed to a synchronous (possibly REST-based) mechanism.

Taking that context and imagining an overly simplified ordering system, as depicted below:

ordering system

and the following message flow:

  • Order is placed from some source (web/mobile etc.)
  • Order service accepts order and publishes a CreateOrderEvent
  • The InventoryService reacts to the CreateOrderEvent, does some inventory stuff, and publishes an InventoryUpdatedEvent when it's done
  • The Invoice service then reacts to the InventoryUpdatedEvent, sends an invoice, and publishes an EmailInvoiceEvent

All services are up and we happily process orders... Everyone is happy. Then, the Inventory service goes down for some reason 😬

Assume that the events on the event bus flow in a "non-blocking" manner, i.e. messages are published to a central topic and do not pile up in a queue if no service is reading them (what I'm trying to convey is an event bus where a published event flows "straight through" rather than queuing up; ignore what messaging platform/technology is used at this point). That would mean that if the Inventory service were down for 5 minutes, the CreateOrderEvents passing through the event bus during that time are now "gone", or never seen by the Inventory service, because in our overly simplified system no other service is interested in those events.
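The failure mode described above can be sketched with a toy bus (all names are hypothetical, and any real broker would differ): events published while a subscriber is down are simply lost, never queued.

```python
# Toy "fire and forget" bus: delivers only to currently connected
# subscribers and persists nothing, so a downed service misses events.

class FireAndForgetBus:
    def __init__(self):
        self.subscribers = {}          # topic -> list of callbacks

    def subscribe(self, topic, callback):
        self.subscribers.setdefault(topic, []).append(callback)

    def unsubscribe(self, topic, callback):
        self.subscribers[topic].remove(callback)

    def publish(self, topic, event):
        # No persistence: only subscribers connected right now see it.
        for callback in self.subscribers.get(topic, []):
            callback(event)

bus = FireAndForgetBus()
seen = []
handler = seen.append

bus.subscribe("orders", handler)
bus.publish("orders", "CreateOrderEvent#1")

bus.unsubscribe("orders", handler)           # InventoryService goes down
bus.publish("orders", "CreateOrderEvent#2")  # lost: nobody is listening

bus.subscribe("orders", handler)             # service comes back up
bus.publish("orders", "CreateOrderEvent#3")

print(seen)  # order #2 was never seen
```

Running this prints `['CreateOrderEvent#1', 'CreateOrderEvent#3']`, which is exactly the gap the question is about.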

My question then is: how does the Inventory service (and the system as a whole) restore state in such a way that no orders are missed or left unprocessed?

Counterfoil answered 10/6, 2016 at 13:6 Comment(0)

Good question! So there are basically three forces at play here:

  1. if a service goes down, any of the events it may have missed need to be replayed to keep it consistent
  2. the events, as they happen in "time", have a "this happened before that" ordering to them
  3. there may be (but doesn't have to be) another party interested in overseeing a cloud of events to make sure a certain state is achieved.

For both #1 and #2 you want some sort of persistent log of events. A traditional message queue/topic may provide this, though you have to consider cases where messages may be processed out of order with respect to transaction/exception/fault behaviors. A simpler log like Apache BookKeeper, Apache Kafka, or AWS Kinesis can store/persist these types of events in sequence and leave it to the consumers to process them in order, filter out duplicates, partition streams, etc.
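A toy version of such a Kafka-style log (class and method names are illustrative, not any real client API) shows the key idea: events are appended in sequence, and each consumer tracks its own committed offset, so a consumer that was down can replay everything it missed.

```python
# Minimal persistent-log sketch: an append-only sequence of events plus
# a consumer that remembers the next offset it needs to read.

class EventLog:
    def __init__(self):
        self.entries = []

    def append(self, event):
        self.entries.append(event)

    def read_from(self, offset):
        # Return (offset, event) pairs from the given offset onward.
        return list(enumerate(self.entries[offset:], start=offset))

class Consumer:
    def __init__(self, log):
        self.log = log
        self.committed = 0             # next offset to read
        self.processed = []

    def poll(self):
        for offset, event in self.log.read_from(self.committed):
            self.processed.append(event)
            self.committed = offset + 1

log = EventLog()
inventory = Consumer(log)

log.append("CreateOrderEvent#1")
inventory.poll()

# InventoryService is down while these two arrive:
log.append("CreateOrderEvent#2")
log.append("CreateOrderEvent#3")

# On restart it resumes from its last committed offset and catches up:
inventory.poll()
print(inventory.processed)  # all three events, in order
```

Because the log outlives the consumer, downtime only delays processing rather than losing events.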

Number 3 to me is a state machine. How you implement the state machine is really up to you. Basically, this state machine keeps track of what events have happened and transitions to allowed states (and potentially participates in emitting events/commands) based on the events in the other systems.

For example, a real-world analogue might be "escrow" when you're trying to close on a house. The escrow company doesn't just handle the financial transaction; usually they work with the real-estate agent to coordinate getting papers in order, papers signed, money transferred, etc. After each event, the escrow changes state from "waiting for buyer signature" to "waiting for seller signature" to "waiting for funds" to "closed success"... they even have deadlines for these events and can transition to another state if money doesn't get transferred, like "transaction closed, not finished" or something.

This state machine in your example would listen on the pub/sub channels, capture this state, run timers, emit other events to move the systems involved forward, etc. It doesn't necessarily "orchestrate" them per se, but it does track progress and enforce timeouts and compensations where needed. This could be implemented as a stream processor, as a process engine, or (IMHO the best place to start) just a simple "escrow" service.
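A minimal sketch of such an "escrow" state machine for the order flow in the question (state and event names are made up for illustration): it tracks each order's progress and rejects events that are not valid in the current state.

```python
# Per-order state machine: only the listed transitions are allowed,
# so out-of-order or unexpected events are caught instead of applied.

TRANSITIONS = {
    ("AWAITING_INVENTORY", "InventoryUpdatedEvent"): "AWAITING_INVOICE",
    ("AWAITING_INVOICE", "EmailInvoiceEvent"): "COMPLETED",
}

class OrderEscrow:
    def __init__(self):
        self.orders = {}               # order_id -> current state

    def handle(self, order_id, event):
        if event == "CreateOrderEvent":
            self.orders[order_id] = "AWAITING_INVENTORY"
            return self.orders[order_id]
        state = self.orders[order_id]
        next_state = TRANSITIONS.get((state, event))
        if next_state is None:
            raise ValueError(f"{event} not allowed in state {state}")
        self.orders[order_id] = next_state
        return next_state

escrow = OrderEscrow()
escrow.handle(42, "CreateOrderEvent")
escrow.handle(42, "InventoryUpdatedEvent")
final = escrow.handle(42, "EmailInvoiceEvent")
print(final)  # COMPLETED
```

A real version would also attach deadlines to each state (the "timers" above) and emit a compensating event when one expires.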

There's actually more to keep track of, like what happens if the "escrow" service itself goes down/fails, how it handles duplicates, how it handles unexpected events given its state, how it contributes to duplicate events, etc... but hopefully this is enough to get started.
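One common way to handle the duplicate-event concern just mentioned is to make the consumer idempotent: record the IDs of events already processed and skip repeats. A sketch (event shape and names are assumptions for illustration):

```python
# Idempotent consumer: redelivered events are detected by ID and
# ignored, so each event changes state exactly once.

class IdempotentConsumer:
    def __init__(self):
        self.seen_ids = set()
        self.applied = []

    def handle(self, event):
        if event["id"] in self.seen_ids:
            return False               # duplicate: ignore
        self.seen_ids.add(event["id"])
        self.applied.append(event["type"])
        return True

consumer = IdempotentConsumer()
consumer.handle({"id": "e1", "type": "CreateOrderEvent"})
consumer.handle({"id": "e1", "type": "CreateOrderEvent"})  # redelivery
consumer.handle({"id": "e2", "type": "InventoryUpdatedEvent"})
print(consumer.applied)  # each event applied exactly once
```

In a real system the `seen_ids` set would need to be persisted (and eventually pruned) so that deduplication survives a restart of the consumer itself.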

Sulfathiazole answered 10/6, 2016 at 20:5 Comment(1)
Thanks for the prompt response. Options 1 and 2 in the context of Kafka are quite interesting. Given that all services are idempotent, then, using my example, once the Inventory service came back up it could just start reading from the last known offset and "catch up". Option 3 seems to hint at the Saga pattern in CQRS; is that a fair assumption?Counterfoil

I am going to give an architect's answer rather than drill down into details. I hope you don't mind.

The first suggestion is to decouple all of the concepts: events, messages, bus and/or queue, and async. This opens up possibilities, provided you have not already decided on the software you are using to implement your bus.

From an architecture standpoint, if you require a "must deliver" type of scenario, you will persist the messages when a service fails. Yes, you will likely need some sort of cleanup in the system, as stuff happens, but focus on the guaranteed-delivery problem first. I see two basic options offhand which can be expanded on (there are likely more, but these are sufficient to start thinking about the problem).

  • The inventory service handles pulling the message from a queue. In this method, the service spins back up and finds any messages.
  • The "bus" guarantees delivery. When there is a failure, it waits until the service is back up (it could ping to see if the service is up again, or the service can re-register as a subscriber, as in an Enterprise Service Bus type of scenario).

Just because the system is async and event-based does not mean you cannot implement some type of guaranteed delivery. A queue is an option (you seem to discard this idea?), but a bus that persists on failure and retries once subscribers are up again is another. And you can persist without blocking.
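The second option can be sketched as a bus that persists undelivered messages per subscriber and flushes them when the subscriber re-registers, roughly what a durable subscription gives you (all names here are hypothetical, and a real ESB would persist to disk rather than memory):

```python
# Bus with per-subscriber persistence: messages published while a
# subscriber is down are queued for it and redelivered on re-register.

class DurableBus:
    def __init__(self):
        self.handlers = {}             # name -> callback (None if down)
        self.pending = {}              # name -> messages awaiting delivery

    def register(self, name, callback):
        self.handlers[name] = callback
        self.pending.setdefault(name, [])
        # Flush anything persisted while the subscriber was down.
        while self.pending[name]:
            callback(self.pending[name].pop(0))

    def deregister(self, name):
        self.handlers[name] = None     # subscriber down; keep its queue

    def publish(self, message):
        for name, callback in self.handlers.items():
            if callback is None:
                self.pending[name].append(message)  # persist, retry later
            else:
                callback(message)

bus = DurableBus()
delivered = []
bus.register("inventory", delivered.append)

bus.publish("CreateOrderEvent#1")
bus.deregister("inventory")                   # service fails
bus.publish("CreateOrderEvent#2")             # persisted, not lost
bus.register("inventory", delivered.append)   # back up: redelivered
print(delivered)  # both events arrive despite the outage
```

Note the publish path never blocks on a down subscriber; it just appends to that subscriber's pending list, which is the "persist without blocking" point above.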

One other issue is what tokens the messages use to get them synced back to the business function at hand, but I assume you have this handled in the system somehow. The only concept you may not have is having all systems respect the token, and respect the other systems, when returning messages in failure cases.

Note that asynchronous communication, from the business standpoint, does not mean fire-and-forget at the point of contact. You can return messages without using the async method on every single bit of information. What I mean here is that the inventory system, when starting up, may process a message and send it to the application on the UI end, and that application can return "forget about it, you were too slow", so the transaction is returned to its original state (nonexistent?).

I don't have enough information (or time?) to suggest which method is best for your architecture, as the details are still a bit too high level, but hopefully this stirs some thought.

Hope this makes sense, as I basically did a brain to keyboard maneuver in my ADHD state. ;-)

Assumpsit answered 10/6, 2016 at 18:15 Comment(2)
Thanks for taking the time respond. The system defined in my question is purely fictional and only used to set the context. I intentionally excluded a queue based option to explore how the failure case could be handled by the system itself and not specifically relying on the messaging broker used for the event bus.Counterfoil
The way I look at software, as a consultant, is you have a concept that can either be solved by your own programming or by a product. Whether you build in the fault tolerance or leave it to the bus is a decision that requires more information. But, I hope my answer helped you think about your exercise and come up with something that works.Assumpsit

First of all, the systems we are building have a purpose: typically to increase revenue and profit by making customers happy so they come back. So messages/events which originated from customer actions must be processed (assuming the company in question prioritizes customer experience... as in, it is willing to invest money into it).

By the way, the customer-enterprise relationship is the one relationship in the whole system that we want to be tightly coupled, unlike all the internal ones. So in this case, it is an example of "authority" rather than autonomy. We guarantee an SLA, represented by the brand.

But the spectrum of message importance should be more fine-grained than "must deliver" or not, instead reflecting priorities, similar to capabilities themselves becoming more fine-grained (microservices). More about this later.

So the goal of ensuring messages/events are getting processed by subscribers can either be achieved by ensuring that services are never down (like the "virtual actor" concept in MS Orleans), or by putting more error handling logic into the delivery mechanism.

The latter option seems more centralistic/coupled rather than autonomous/decoupled. But if you assume that services are not always available (as we should), then you need to consider removing your other assumption about "transient" messages.

The first option leaves the decisions of how to guarantee availability to the service and therefore to the agile team owning the service, while performance is measured via output metrics.

Besides, if services as encapsulated capabilities guarantee a high service level ("never down"), then the outcome of the overall system (= enterprise) can be continuously adapted by adjusting message priorities as well as by injecting new services and events into the system.

The other important aspect is the fact that synchronous architectures (=call stack based) provide three features that async architectures (event driven) don't exhibit for the sake of dependency reduction: coordination, continuation and context (see Hohpe, "Programming without a call stack", 2006).

We still need these features for our customers at the business level, hence they need to be covered elsewhere. Hohpe suggests that configuring and monitoring the behaviour of a loosely coupled system requires an additional code layer that is as important as the core business capabilities (Complex Event Processing, to understand the relationships between events).

These modern CEP systems, which have to deal with massive amounts of data and with differing velocities, structures, and correctness levels, could be implemented on top of modern data-processing and big-data systems (e.g. Spark), to support understanding, decision making, and optimisation both by agile teams (to improve their service) and by management teams at their level.

Sihon answered 17/7, 2016 at 12:0 Comment(0)
