How to handle microservice interaction when one of the microservices is down

I am new to microservice architecture and am currently using Spring Boot for my microservices. If one of the microservices is down, how should the failover mechanism work?

For example, suppose we have three microservices: M1, M2 and M3. M1 interacts with M2, and M2 interacts with M3. If the M2 cluster is down, how should we handle the situation?

Sudhir answered 28/5, 2018 at 8:52 Comment(1)
There is no single answer to this. One option is to use Hystrix or another fault-tolerance mechanism and fall back to some predefined setup/values. – Kempe

When one microservice is down, how the remaining services interact becomes critical, because failure isolation, resilience and fault tolerance are key characteristics of any microservice-based architecture.

I agree with @jayant's answer. In your case, implementing a proper fallback mechanism makes the most sense; the fallback logic depends on your use case and on the dependencies between M1, M2 and M3. You can also raise events from your fallback if needed.

Since you are new to microservices, it helps to know the common techniques and architecture patterns for resilience and fault tolerance that address the situation in your question. Because you are using Spring Boot, you can easily add Netflix OSS to your microservices.

Netflix has released Hystrix, a library designed to control points of access to remote systems, services and 3rd party libraries, providing greater tolerance of latency and failure.

It includes the following important characteristics:

  • Circuit breaker and fallback mechanism:

Hystrix implements the circuit breaker pattern which is useful when a service failure can cause cascading failure all the way up to the user. When calls to a particular service exceed circuitBreaker.requestVolumeThreshold (default: 20 requests) and the failure percentage is greater than circuitBreaker.errorThresholdPercentage (default: >50%) in a rolling window defined by metrics.rollingStats.timeInMilliseconds (default: 10 seconds), the circuit opens and further calls are not made.

When a call fails or the circuit is open, the developer can provide a fallback. Fallbacks may be chained, so that the first fallback makes some other business call, which in turn has its own fallback. See the Hystrix documentation on fallback implementation.
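To make the circuit breaker and fallback idea concrete, here is a minimal sketch in plain Java. It is not the Hystrix implementation: it opens after a number of consecutive failures instead of using a failure percentage over a rolling window, and the class name `SimpleCircuitBreaker` and its parameters are made up for illustration.

```java
import java.util.function.Supplier;

/** Illustrative circuit breaker; a simplified stand-in, NOT Hystrix. */
class SimpleCircuitBreaker {
    private final int failureThreshold; // consecutive failures before the circuit opens
    private final long openMillis;      // how long the circuit stays open
    private int consecutiveFailures = 0;
    private long openedAt = -1;         // -1 means the circuit is closed

    SimpleCircuitBreaker(int failureThreshold, long openMillis) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
    }

    /** True while calls should be short-circuited. */
    synchronized boolean isOpen() {
        return openedAt >= 0 && System.currentTimeMillis() - openedAt < openMillis;
    }

    /** Runs remoteCall unless the circuit is open; uses fallback on failure or open circuit. */
    <T> T call(Supplier<T> remoteCall, Supplier<T> fallback) {
        if (isOpen()) {
            return fallback.get(); // do not touch the failing service at all
        }
        try {
            T result = remoteCall.get();
            onSuccess();
            return result;
        } catch (RuntimeException e) {
            onFailure();
            return fallback.get();
        }
    }

    private synchronized void onSuccess() {
        consecutiveFailures = 0;
        openedAt = -1; // close the circuit again
    }

    private synchronized void onFailure() {
        consecutiveFailures++;
        if (consecutiveFailures >= failureThreshold) {
            openedAt = System.currentTimeMillis(); // open (or re-open) the circuit
        }
    }
}
```

In M1 you would wrap the call to M2 in something like `breaker.call(() -> callM2(), () -> cachedDefault())`, where `callM2` and `cachedDefault` are placeholders for your own code. Once `openMillis` has elapsed, the next call is let through as a trial; if it fails, the circuit opens again.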

  • Retry:

When a request fails, you may want it to be retried automatically; Ribbon does this for us. Be careful, though: in a distributed system, one retry can trigger further requests or retries in downstream services and start a cascading effect.

Here are some Ribbon properties to look at:

# Max number of retries on the same server (excluding the first try)
sample-client.ribbon.MaxAutoRetries=1

# Max number of next servers to retry (excluding the first server)
sample-client.ribbon.MaxAutoRetriesNextServer=1

# Whether all operations can be retried for this client
sample-client.ribbon.OkToRetryOnAllOperations=true

# Interval to refresh the server list from the source
sample-client.ribbon.ServerListRefreshInterval=2000

More details: see the Ribbon properties documentation.
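As a rough illustration of what the two retry properties mean, the sketch below tries each server up to `maxAutoRetries + 1` times before moving on to the next one, and moves on at most `maxAutoRetriesNextServer` times. This is plain Java written for this answer, not Ribbon's code; `RetryingClient` and the server names are hypothetical.

```java
import java.util.List;
import java.util.function.Function;

/** Illustrative retry loop; mimics the idea behind the two Ribbon settings, NOT Ribbon itself. */
class RetryingClient {
    static <T> T callWithRetries(List<String> servers,
                                 int maxAutoRetries,
                                 int maxAutoRetriesNextServer,
                                 Function<String, T> request) {
        RuntimeException last = null;
        // The first server plus at most maxAutoRetriesNextServer further servers.
        int serversToTry = Math.min(servers.size(), maxAutoRetriesNextServer + 1);
        for (int s = 0; s < serversToTry; s++) {
            String server = servers.get(s);
            // The first try plus at most maxAutoRetries retries on the same server.
            for (int attempt = 0; attempt <= maxAutoRetries; attempt++) {
                try {
                    return request.apply(server);
                } catch (RuntimeException e) {
                    last = e; // remember the failure and keep trying
                }
            }
        }
        throw new IllegalStateException("all retries exhausted", last);
    }
}
```

With both values set to 1, as in the properties above, a request is attempted at most four times in total. Retrying every operation is only safe if the operations are idempotent, so think about that before enabling retries for all of them.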

  • Bulkhead Pattern:

In general, the goal of the bulkhead pattern is to prevent a fault in one part of a system from taking the entire system down.

The bulkhead implementation in Hystrix limits the number of concurrent calls to a component. This way, the number of resources (typically threads) that are waiting for a reply from the component is limited.

Assume you have a request-based, multi-threaded application (for example, a typical web application) that uses three different components: M1, M2 and M3. If requests to component M3 start to hang, eventually all request-handling threads will hang waiting for an answer from M3.

This would make the application entirely unresponsive. If requests to M3 are handled slowly, we have a similar problem when the load is high enough. See the linked reference for implementation details.
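The core of a bulkhead can be sketched with a `java.util.concurrent.Semaphore`: each downstream component gets a fixed number of permits, and calls that cannot get a permit fail fast instead of tying up another thread. This is only an illustration of the pattern, not Hystrix's own code, and `SemaphoreBulkhead` is a made-up name.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

/** Illustrative semaphore-based bulkhead; a sketch of the pattern, NOT Hystrix. */
class SemaphoreBulkhead {
    private final Semaphore permits;

    SemaphoreBulkhead(int maxConcurrentCalls) {
        this.permits = new Semaphore(maxConcurrentCalls);
    }

    /** Runs remoteCall if a permit is free; otherwise returns the fallback immediately. */
    <T> T call(Supplier<T> remoteCall, Supplier<T> fallback) {
        if (!permits.tryAcquire()) {
            return fallback.get(); // limit reached: reject instead of blocking a thread
        }
        try {
            return remoteCall.get();
        } finally {
            permits.release(); // always hand the permit back
        }
    }
}
```

If you create one bulkhead per downstream service (say, one for M3 with 10 permits), a hanging M3 can block at most 10 request threads, and the rest of the application keeps serving requests that do not need M3.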

So, these are some of the factors you need to consider when handling microservice interaction while one of the microservices is down.

Noetic answered 28/5, 2018 at 18:57 Comment(1)
Good points regarding fallback chaining and Ribbon retries. Does adding a broker between two services also count as a strategy, since the services won't be directly coupled for communication? That brings its own complexities, though, such as what happens when the broker itself goes down. – Cantone

As mentioned in the comment, there are many ways you can go about it:

Case 1: all services are independent. This is the trivial case and there is nothing special to do. Call all the services in a blocking or non-blocking way; in both cases the call to service 2 will result in a timeout.

Case 2: the services are dependent; M2 depends on M1 and M3 depends on M2.

Option a) M1 can wait for M2 to come back up, either by pinging it periodically or by checking with the registry or naming server whether M2 is up.

Option b) Use Hystrix as a circuit breaker implementation and handle the fallback gracefully in M3 or in your orchestrator (the component that calls M1, M2 and M3 in order).

Cantone answered 28/5, 2018 at 9:32 Comment(0)
3. Service Discovery and Load Balancing

Ensure M1 and M2 use a service discovery mechanism. If M2 has multiple instances, the service discovery can help route requests to available instances if some are down.

Chariness answered 27/9, 2024 at 8:49 Comment(0)
