How to create a replay mechanism within event-driven microservices

Asked 22/12, 2017 at 9:17 Answered 6/1, 2018 at 22:57

Solved java architecture transactions microservices event-driven-design

We have 7 microservices that communicate via an event bus. We have a real-time transaction sequence:

Service 1 -> service 2 -> service 3 (and so on) until the transaction is considered complete.

We must make sure every transaction completes.

Of course, failures can happen at any point, so we are thinking about a mechanism to replay "half-baked" transactions to completion.

It's getting tricky. We thought of two approaches:

1. Have another service (a supervisor service) that logs each step of our real-time sequence and is smart enough to detect when a transaction has not completed (timed out) and to work out how to continue from the point where it stopped.

Disadvantages: lots of "smart" logic in one central service.

2. Have a retry mechanism in every service, so that each service takes care of its own work and replays it until it succeeds or its retries are exhausted.

Disadvantages: lots of duplicated retry code in each service.

What do you experts think?

Thanks

Libretto answered 22/12, 2017 at 9:17 Comment(3)

Related to https://mcmap.net/q/805230/-microservice-architecture-carry-message-through-services-when-order-doesn-39-t-matter – Oneirocritic 22/12, 2017 at 12:12

@ConstantinGalbenu I read your answer, which was highly detailed. By its definitions, my current architecture is choreography. Could you please explain how you would handle a "transaction" failure and replay a transaction, considering the following: 1. How to know where to start from. 2. How to avoid duplication, given that we have half-baked transactions; we don't want a replay to create duplicates in some services. 3. How to collect all error outputs in one place that can orchestrate the replay. Thank you. – Libretto 22/12, 2017 at 16:13

Two-way service communication can do a better job in this scenario. Your messages need a unique ID for this, i.e. each service sends a message to the other services in case of failure. On receiving a failure message, the services themselves redo their task based on the unique message ID. The publisher service can republish the message with the required fixes (if possible) or notify someone about the failure reason. If everything goes well, the last service could send a "completed" message to the others (the other services will consider it a failure if they don't get the completed message within a minimal time). – Outskirts 5/1, 2018 at 7:40

What you seem to be talking about is how to deal with transactions in a distributed architecture.

This is an extensive topic, and entire books could be written about it. Your question seems to be just about retrying transactions, but I believe that alone is probably not enough to solve the problem of a distributed transactional workflow.

I believe you could probably benefit from learning more about concepts like compensating transactions, the Try/Cancel/Confirm pattern, and sagas.

The idea behind compensating transactions is that every yin has its yang: if you have one transaction that can place an order, then you can undo it with a transaction that cancels the order. This latter transaction is a compensating transaction. So, if you carry out a number of successful transactions and then one of them fails, you can retrace your steps, compensate every successful transaction you did and, as a result, revert their side effects.
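As a minimal sketch of that idea (not a definitive implementation), the following Java code runs a list of steps in order and, when one fails, compensates the completed ones in reverse order. The names `CompensableStep`, `CompensatingWorkflow` and `SimpleStep` are hypothetical and exist only for this illustration; a real system would also have to persist its progress and handle a compensation that itself fails.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

/** One workflow step plus the action that undoes it (hypothetical interface). */
interface CompensableStep {
    String name();
    void execute() throws Exception;
    void compensate();
}

final class CompensatingWorkflow {
    /**
     * Runs the steps in order. If a step fails, the steps that already
     * succeeded are compensated in reverse order and false is returned.
     */
    static boolean run(List<? extends CompensableStep> steps, List<String> log) {
        Deque<CompensableStep> completed = new ArrayDeque<>();
        for (CompensableStep step : steps) {
            try {
                step.execute();
                completed.push(step); // most recently completed step ends up on top
                log.add("done:" + step.name());
            } catch (Exception e) {
                log.add("failed:" + step.name());
                while (!completed.isEmpty()) {
                    CompensableStep undo = completed.pop();
                    undo.compensate();
                    log.add("compensated:" + undo.name());
                }
                return false;
            }
        }
        return true;
    }
}

/** Trivial step used for illustration: it either succeeds or always fails. */
final class SimpleStep implements CompensableStep {
    private final String name;
    private final boolean fails;
    boolean compensated = false;

    SimpleStep(String name, boolean fails) {
        this.name = name;
        this.fails = fails;
    }

    public String name() { return name; }

    public void execute() throws Exception {
        if (fails) {
            throw new Exception(name + " failed");
        }
    }

    public void compensate() { compensated = true; }
}
```

Calling `CompensatingWorkflow.run` with three steps where the third one fails leaves the first two compensated, last completed first.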

I particularly liked a chapter in the book REST: From Research to Practice. Chapter 23 (Towards Distributed Atomic Transactions over RESTful Services) explains the Try/Cancel/Confirm pattern in depth.

In general terms, it implies that when you do a group of transactions, their side effects do not take effect until a transaction coordinator gets confirmation that all of them were successful. For example, if you make a reservation on Expedia and your flight has two legs with different airlines, then one transaction would reserve a flight with American Airlines and another one would reserve a flight with United Airlines. If your second reservation fails, then you want to compensate the first one. But not only that: you also want to keep the first reservation from taking effect until you have been able to confirm both. So the initial transaction makes the reservation but keeps its side effects pending confirmation, and the second reservation does the same. Once the transaction coordinator knows everything is reserved, it can send a confirmation message to all parties so that they confirm their reservations. If reservations are not confirmed within a sensible time window, they are automatically reversed by the affected system.
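The reservation example can be sketched roughly as follows in Java. This is only an illustration under simplifying assumptions: `TccParticipant`, `TccCoordinator` and `Airline` are hypothetical names, everything runs in one process, and the automatic reversal of unconfirmed reservations after a time window is left out.

```java
import java.util.ArrayList;
import java.util.List;

/** A participant that holds a tentative reservation until told otherwise (hypothetical). */
interface TccParticipant {
    boolean tryReserve(String txId); // tentative: side effects stay pending
    void confirm(String txId);       // make the reservation effective
    void cancel(String txId);        // release the tentative reservation
}

final class TccCoordinator {
    /** Confirms only when every participant reserved; otherwise cancels what was reserved. */
    static boolean execute(String txId, List<? extends TccParticipant> participants) {
        List<TccParticipant> reserved = new ArrayList<>();
        for (TccParticipant p : participants) {
            if (p.tryReserve(txId)) {
                reserved.add(p);
            } else {
                for (TccParticipant r : reserved) {
                    r.cancel(txId);
                }
                return false;
            }
        }
        for (TccParticipant p : participants) {
            p.confirm(txId);
        }
        return true;
    }
}

/** Toy airline that either has seats available or does not. */
final class Airline implements TccParticipant {
    enum State { NONE, PENDING, CONFIRMED, CANCELLED }

    private final boolean hasSeats;
    State state = State.NONE;

    Airline(boolean hasSeats) { this.hasSeats = hasSeats; }

    public boolean tryReserve(String txId) {
        if (!hasSeats) {
            return false;
        }
        state = State.PENDING;
        return true;
    }

    public void confirm(String txId) { state = State.CONFIRMED; }

    public void cancel(String txId) { state = State.CANCELLED; }
}
```

With one airline that has seats and one that does not, `TccCoordinator.execute` returns false and the first airline ends up in the `CANCELLED` state rather than `CONFIRMED`.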

The book Enterprise Integration Patterns has some basic ideas on how to implement this kind of event coordination (e.g. see the process manager pattern and compare it with the routing slip pattern; the two are similar in spirit to orchestration vs. choreography in the microservices world).

As you can see, being able to compensate transactions can get complicated, depending on how complex your distributed workflow is. The process manager may need to keep track of the state of every step and know when the whole thing needs to be undone. This is pretty much the idea of sagas in the microservices world.

The book Microservices Patterns has an entire chapter called Managing Transactions with Sagas that goes into detail on how to implement this type of solution.
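To make the "keep track of the state of every step" part more concrete, here is a small in-memory sketch in Java. The `ProcessManager` class and its method names are assumptions made for this example, not an API from any of the books mentioned. It records which steps of a fixed sequence have completed for each transaction ID and reports the step a replay would have to resume from.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Tracks per-transaction progress through a fixed sequence of steps (hypothetical). */
final class ProcessManager {
    private final List<String> stepOrder;
    // Number of steps completed so far, keyed by transaction ID.
    private final Map<String, Integer> completedCount = new HashMap<>();

    ProcessManager(List<String> stepOrder) {
        this.stepOrder = new ArrayList<>(stepOrder);
    }

    /** Records a "step completed" event; duplicate or out-of-order events are ignored. */
    void onStepCompleted(String txId, String step) {
        int done = completedCount.getOrDefault(txId, 0);
        if (done < stepOrder.size() && stepOrder.get(done).equals(step)) {
            completedCount.put(txId, done + 1);
        }
    }

    /** Returns the step a replay should resume from, or null if the transaction is complete. */
    String nextStep(String txId) {
        int done = completedCount.getOrDefault(txId, 0);
        return done < stepOrder.size() ? stepOrder.get(done) : null;
    }
}
```

For a sequence of `service1`, `service2`, `service3`, a transaction for which only `service1` has reported completion would resume at `service2`. A real process manager would persist this state and combine it with timeouts to detect stalled transactions.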

A few other aspects I also typically consider are the following:

Idempotency

I believe that a key to a successful implementation of your service transactions in a distributed system consists in making them idempotent. Once you can guarantee a given service is idempotent, then you can safely retry it without worrying about causing additional side effects. However, just retrying a failed transaction won't solve your problems.
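One common way to get idempotency is to give every message a unique ID and have the handler remember which IDs it has already applied. The Java sketch below assumes exactly that; `IdempotentPaymentHandler` and its in-memory set are hypothetical, and a real service would normally store the processed IDs durably, ideally in the same local transaction as the side effect.

```java
import java.util.HashSet;
import java.util.Set;

/** Applies each message at most once, keyed by a unique message ID (hypothetical). */
final class IdempotentPaymentHandler {
    private final Set<String> processedMessageIds = new HashSet<>();
    private long totalChargedCents = 0;

    /**
     * Applies the charge the first time a message ID is seen.
     * Returns true if the charge was applied, false if the message was a duplicate.
     */
    synchronized boolean handle(String messageId, long amountCents) {
        if (!processedMessageIds.add(messageId)) {
            return false; // already processed: a replay has no additional side effect
        }
        totalChargedCents += amountCents;
        return true;
    }

    synchronized long totalChargedCents() {
        return totalChargedCents;
    }
}
```

Replaying the same message twice charges only once, which is what makes a retry or replay safe.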

Transient vs Persistent Errors

When it comes to retrying a service transaction, you shouldn't retry just because it failed. You must first know why it failed; depending on the error, it may or may not make sense to retry. Some types of errors are transient: for example, if a transaction fails due to a query timeout, it is probably fine to retry, and it will most likely succeed the second time. But if you get a database constraint violation error (e.g. because a DBA added a check constraint to a field), then there is no point in retrying that transaction: no matter how many times you try, it will fail.
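A rough Java sketch of that distinction follows. It assumes the calling code can classify failures into two hypothetical exception types, `TransientFailureException` and `PermanentFailureException`; how you map real errors (timeouts, constraint violations and so on) onto them depends on your stack. Backoff between attempts is omitted to keep the example short.

```java
import java.util.concurrent.Callable;

/** Failure that may go away on its own, e.g. a timeout (hypothetical type). */
class TransientFailureException extends Exception {
    TransientFailureException(String message) { super(message); }
}

/** Failure that will repeat on every attempt, e.g. a constraint violation (hypothetical type). */
class PermanentFailureException extends Exception {
    PermanentFailureException(String message) { super(message); }
}

final class Retrier {
    /**
     * Calls the action up to maxAttempts times. Only transient failures are
     * retried; any other exception propagates immediately.
     */
    static <T> T callWithRetry(Callable<T> action, int maxAttempts) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        TransientFailureException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.call();
            } catch (TransientFailureException e) {
                last = e; // worth another attempt
            }
        }
        throw last; // retries exhausted
    }
}
```

An action that times out twice and then succeeds returns normally on the third attempt, while an action that throws `PermanentFailureException` is attempted only once.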

Embrace Error as an Alternative Flow

In cases of inter-service communication (computer-to-computer interactions), when a given step of your workflow fails, you don't necessarily need to undo everything you did in previous steps. You can embrace error as part of your workflow. Catalog the possible causes of failure and make them an alternative flow of events that merely requires human intervention. It is just another step in the full orchestration, one that requires a person to intervene to make a decision, resolve an inconsistency in the data or just approve which way to go.

For example, maybe when you're processing an order, the payment service fails because there aren't enough funds. There is no point in undoing everything else. All we need is to put the order in a state in which some problem solver can address it in the system; once it is fixed, you can continue with the rest of the workflow.

Transaction and Data Model State are Key

I have discovered that this type of transactional workflow requires a good design of the different states your model has to go through. As in the case of the Try/Cancel/Confirm pattern, this implies initially applying the side effects without necessarily making the data model available to the users.

For example, when you place an order, maybe you add it to the database in a "Pending" status that will not appear in the UI of the warehouse systems. Once payments have been confirmed the order will then appear in the UI such that a user can finally process its shipments.

The difficulty here is discovering how to design transaction granularity in a way that even if one step of your transaction workflow fails, the system remains in a valid state from which you can resume once the cause of the failure is corrected.
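A small Java sketch of such a state model is shown here. The statuses and allowed transitions are assumptions chosen to match the order example (a "Pending" order hidden from the warehouse, plus a parked state for problems that need a person); your own model will likely differ.

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical order statuses for illustration. */
enum OrderStatus { PENDING, NEEDS_ATTENTION, CONFIRMED, CANCELLED }

final class Order {
    // Allowed transitions; anything not listed is rejected.
    private static final Map<OrderStatus, Set<OrderStatus>> ALLOWED =
            new EnumMap<>(OrderStatus.class);
    static {
        ALLOWED.put(OrderStatus.PENDING,
                EnumSet.of(OrderStatus.CONFIRMED, OrderStatus.NEEDS_ATTENTION, OrderStatus.CANCELLED));
        // A parked order can be resumed once the problem is fixed, or cancelled.
        ALLOWED.put(OrderStatus.NEEDS_ATTENTION,
                EnumSet.of(OrderStatus.PENDING, OrderStatus.CANCELLED));
        ALLOWED.put(OrderStatus.CONFIRMED, EnumSet.noneOf(OrderStatus.class));
        ALLOWED.put(OrderStatus.CANCELLED, EnumSet.noneOf(OrderStatus.class));
    }

    private OrderStatus status = OrderStatus.PENDING;

    OrderStatus status() { return status; }

    /** Only confirmed orders are shown to the warehouse. */
    boolean visibleToWarehouse() { return status == OrderStatus.CONFIRMED; }

    void transitionTo(OrderStatus next) {
        if (!ALLOWED.get(status).contains(next)) {
            throw new IllegalStateException(status + " -> " + next + " is not allowed");
        }
        status = next;
    }
}
```

An order whose payment fails can move to `NEEDS_ATTENTION`, go back to `PENDING` once someone resolves the problem, and only then be `CONFIRMED`; at every point the order is in a valid, resumable state.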

Designing for Distributed Transactional Workflows

So, as you can see, designing a distributed system that works in this way is a bit more complicated than individually invoking distributed transactional services. Now every service invocation may fail for a number of reasons and leave your distributed workflow in an inconsistent state. And retrying the transaction may not always solve the problem. And your data needs to be modeled like a state machine, such that side effects are applied but not confirmed until the entire orchestration is successful.

That's why the whole thing may need to be designed differently from a typical monolithic client-server application. Your users may now be part of the designed solution when it comes to resolving conflicts, and you should expect that transactional orchestrations could take hours or even days to complete, depending on how their conflicts are resolved.

As I was initially saying, the topic is way too broad, and it would require a more specific question to discuss, perhaps, just one or two of these aspects in detail.

At any rate, I hope this somehow helped you with your investigation.

Klute answered 6/1, 2018 at 22:57 Comment(1)

this answer is pure microservices experience. Good work! – Bloodthirsty 10/1, 2020 at 11:26

As far as I can tell (and as you may already know), it seems you're trying to implement the Circuit Breaker pattern and are deciding whether to implement it as a central service or as part of your business transaction logic.

One factor in deciding whether to have it as a separate service is whether you have only one such transaction or several. If there is more than one, then it may be better to pull the circuit breaker out of your actual business logic. It could be a sort of utility component included in the different services, or a standalone microservice. In the case of a standalone service, one option is to use an off-the-shelf product, library or framework. I don't know much about your environment and limitations, but you could even think about using something like Camel or a light BPM engine for this purpose.

In my opinion it would be better anyway to separate this non business logic from your actual transactional business, either as a utility component added as a library or a separate service.

Chlorella answered 22/12, 2017 at 10:32 Comment(2)

1. Separate it into a different service, or put it inside each service as a common utility? 2. It's not easy to actually create a common utility, as each service has its own concerns, and "catching" errors in the right place in order to replay an execution isn't something you can easily make generic across services. – Libretto 22/12, 2017 at 11:58

Also, I don't think it's a circuit breaker issue. I am looking for a mechanism to recover half-baked transactions once the issues are gone. – Libretto 22/12, 2017 at 12:5
