CQRS/Eventual Consistency - Handling Read Side Update Failure

I am interested in how others have handled a Read Side DB update failure in CQRS/Event Sourcing eventually consistent systems.

I have such a system that could append an event to my event store, and then for some reason fail to update a corresponding read side DB, leading to a state of inconsistency.

I have read this post and also this one, which really focus on having a singleton/global aggregate that manages constraints before storing events.

But how do you proceed when update failures are not related to constraints (temporary hardware failure for example)?

Another solution mentioned was manual intervention, but I am trying to avoid that. At a high level, I am thinking of triggering some sort of job to rebuild my entire read-side DB from the event store, while temporarily suspending and queuing the commands and event handlers that normally update the read side.
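
Roughly, the sketch I have in mind looks like this (the callables are hypothetical stand-ins for whatever the real event store, read DB, and handler infrastructure provide):

```python
from typing import Any, Callable, Iterable

def rebuild_read_model(
    read_all_events: Callable[[], Iterable[Any]],  # replays the full event stream
    reset_read_db: Callable[[], None],             # clears the read-side tables
    project: Callable[[Any], None],                # the same projection logic used live
    pause_handlers: Callable[[], None],            # suspend subscriptions, queue incoming commands
    resume_handlers: Callable[[], None],           # resume and drain whatever was queued
) -> None:
    """Rebuild sketch: suspend live projection, wipe the read model,
    replay every stored event, then resume normal handling."""
    pause_handlers()
    try:
        reset_read_db()
        for event in read_all_events():
            project(event)
    finally:
        resume_handlers()
```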

Does anyone else do something similar to this? Is there a better way?

Thanks!

Knuckleduster answered 12/7/2018 at 15:44

But how do you proceed when update failures are not related to constraints (temporary hardware failure for example)?

Part of the point of teasing apart the read and write models is that we can update the read model asynchronously.

So if you want to keep the cached representations of your read models hot, you can schedule refreshes of the read model to run at some regular interval. A transient failure will be mitigated by the next scheduled update. No big deal.
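
As a minimal sketch of such a scheduled refresh, assuming the event store can be read forward from a checkpoint position (the function names here are illustrative, not a real API):

```python
import logging
import time
from typing import Any, Callable, Iterable, Tuple

log = logging.getLogger("read-model-refresh")

def run_scheduled_refresh(
    read_events_since: Callable[[int], Iterable[Tuple[int, Any]]],  # yields (position, event)
    project: Callable[[Any], None],   # applies one event to the read model
    interval_seconds: float = 60.0,
) -> None:
    """Poll the event store on a fixed interval and apply any events
    recorded after the last successfully projected position."""
    checkpoint = 0
    while True:
        try:
            for position, event in read_events_since(checkpoint):
                project(event)
                checkpoint = position  # advance only after a successful apply
        except Exception:
            log.exception("refresh failed; will retry on the next tick")
        time.sleep(interval_seconds)
```

Because the checkpoint only advances after a successful apply, a tick that fails part-way simply re-reads the same events on the next interval.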

Another alternative is to treat the read model like a cache; before returning a result to the client, you first check that the cached representation is still valid, and then decide which alternative you want to take:

  • report to the client that the query is not currently available,
  • report to the client the latest available information, with annotations explaining that the data is stale,
  • try to refresh the read model on demand (taking the latency hit, and possibly requiring a fallback to one of the other approaches if that fails).

It's often useful to think about making timeliness explicit in your queries -- "I need an answer less than 10 minutes old". Then everybody is on the same page about whether or not the available representation is good enough.
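
A sketch of what an explicit-timeliness query might look like, covering two of the alternatives above (stale-but-annotated, and refresh-on-demand); all names here are hypothetical:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Any, Callable, Tuple

@dataclass
class QueryResult:
    data: Any
    as_of: datetime
    stale: bool  # True when the answer is older than the caller asked for

def query_with_max_age(
    read_cached: Callable[[], Tuple[Any, datetime]],  # representation plus its source timestamp
    refresh: Callable[[], Tuple[Any, datetime]],      # on-demand rebuild; may raise
    max_age: timedelta = timedelta(minutes=10),
) -> QueryResult:
    """The caller states how fresh an answer must be. If the cached
    representation is too old, try an on-demand refresh (latency hit),
    and fall back to annotated stale data if the refresh fails."""
    data, as_of = read_cached()
    if datetime.now(timezone.utc) - as_of <= max_age:
        return QueryResult(data, as_of, stale=False)
    try:
        data, as_of = refresh()
        return QueryResult(data, as_of, stale=False)
    except Exception:
        return QueryResult(data, as_of, stale=True)  # latest available, marked stale
```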

how would I avoid overwriting any read side updates that may come through while I am running the refresh?

Think of a conditional PUT, where the condition predicate is that the source data of the incoming representation is newer than the source data of the copy that is currently stored.

If your source data is newer than the source data of the previously stored representation, then you can replace it. If your source data is older, then you throw away the work. So you store, along with the representation, metadata that allows you to compare one source to another.
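
A small sketch of that conditional PUT, using the event-store position as the "how new is the source data" metadata and an in-memory dict as a stand-in store:

```python
from typing import Any, Dict

def conditional_put(
    read_db: Dict[str, Dict[str, Any]],  # in-memory stand-in for the read-side store
    key: str,
    representation: Any,
    source_position: int,  # event-store position this representation was built from
) -> bool:
    """Store the representation together with the position of the source
    data it was built from; only overwrite when the incoming copy was
    built from newer source data."""
    current = read_db.get(key)
    if current is not None and current["source_position"] >= source_position:
        return False  # built from older data: throw away the work
    read_db[key] = {
        "representation": representation,
        "source_position": source_position,
    }
    return True
```

In a real store, the check-and-write would need to be atomic, e.g., a conditional update or compare-and-set keyed on the stored source position.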

Meal answered 12/7/2018 at 18:04 Comment(2)
Thanks for the comment, on this Q and some of my others. I suppose running scheduled refreshes could work, but how would I avoid overwriting any read side updates that may come through while I am running the refresh? – Knuckleduster
Read-side repos in CQRS/ES are populated by projecting events. Would a refresh resend events (essentially a replay)? That might confuse other event handlers with duplicates and would require idempotency. And it would be impractical and non-performant to replay all past events periodically, although a periodic snapshot event might help with that. – Maximinamaximize

Read-side projection failures will happen in CQRS/ES (because we're human after all), so you need a strategy.

  • You can try to ensure that the events being projected are not removed from the queue until the processor confirms removal with the broker after the projection has been completed. The issue here is that if there's a non-transient failure, that message will poison the processor until it's removed. Thus...
  • You'll need to remove any queued message that causes a non-transient failure, e.g., due to a processor bug, to avoid completely stalling projections. Retry logic with a dead-letter step is probably a good way to go here (see the sketch after this list). Of course, this means that your read-side projection is no longer in sync with the domain state, though probably for just a single identity. In my experience, something like this eventually happens over the long haul, so you need a strategy to mitigate it. Starting with...
  • Make sure the failure is observable so that you can take action, like rolling back a buggy release, manually correcting data, and/or identifying a bug fix. This leads to...
  • Make your read-side repos self-healing. Try to translate your manual recovery steps into an automated process that runs periodically or is triggered by error handling. In my experience this is unfortunately hard to anticipate and ends up being driven by manual recovery from unanticipated failures. We successfully implemented several data-integrity auto-fixes this way, and it was an effective solution.
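
As a rough sketch of the retry-then-dead-letter idea from the list above (the broker, dead-letter queue, and projection hooks are hypothetical stand-ins):

```python
import logging
from typing import Any, Callable

log = logging.getLogger("projections")

def process_with_retry(
    message: Any,
    project: Callable[[Any], None],      # applies the event to the read model
    dead_letter: Callable[[Any], None],  # parks poison messages for inspection
    ack: Callable[[], None],             # confirms removal with the broker
    max_attempts: int = 5,
) -> None:
    """Retry transient projection failures a few times, then dead-letter
    the message so one poison event cannot stall all projections. The
    broker ack happens only after projection (or dead-lettering) is done."""
    for attempt in range(1, max_attempts + 1):
        try:
            project(message)
            break
        except Exception:
            log.exception("projection attempt %d/%d failed", attempt, max_attempts)
    else:
        dead_letter(message)  # make the failure observable for manual/automated repair
    ack()
```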
Maximinamaximize answered 4/2/2022 at 17:24
