Unexpected behaviour using nested Retry and Circuit Breaker policies in Polly (.NET)

I coded a resilience strategy based on retry and circuit-breaker policies. It is working now, but with an issue in its behavior.

I noticed that when the circuit-breaker is half-open and the onBreak() event fires again to re-open the circuit, one additional retry is triggered by the retry policy (besides the health-check call made for the half-open state).

Let me explain step by step:

I've defined two strongly-typed policies, one for retry and one for circuit-breaker:

static Policy<HttpResponseMessage> customRetryPolicy;
static Policy<HttpResponseMessage> customCircuitBreakerPolicy;

static HttpStatusCode[] httpStatusesToProcess = new HttpStatusCode[]
{
   HttpStatusCode.ServiceUnavailable,  //503
   HttpStatusCode.InternalServerError, //500
};

The retry policy works this way: two (2) retries per request, waiting five (5) seconds between each retry. If the inner circuit-breaker is open, it must not retry. Retry only for HTTP 500 and 503 statuses.

customRetryPolicy = Policy<HttpResponseMessage>

    // Do not retry if the circuit is open: x here is always a BrokenCircuitException,
    // so this predicate returns false and a broken circuit is never treated as retryable
    .Handle<BrokenCircuitException>(x =>
    {
        return !(x is BrokenCircuitException);
    })

    // Likewise, stop if the inner exception is a BrokenCircuitException
    .OrInner<AggregateException>(x =>
    {
        return !(x.InnerException is BrokenCircuitException);
    })

    // Retry if the HTTP status is one of the handled ones (500, 503)
    .OrResult(x => { return httpStatusesToProcess.Contains(x.StatusCode); })

    // Retry the request two times, waiting 5 seconds between each retry
    .WaitAndRetry(2, retryAttempt => TimeSpan.FromSeconds(5),
        (exception, timeSpan, retryCount, context) =>
        {
            System.Console.WriteLine("Retrying... " + retryCount);
        }
    );

The circuit-breaker policy works this way: allow at most three (3) failures in a row, then open the circuit for thirty (30) seconds. Open the circuit ONLY for HTTP 500.

customCircuitBreakerPolicy = Policy<HttpResponseMessage>

    // Handle the exception case: an AggregateException wrapping an HttpRequestException
    .Handle<AggregateException>(x =>
        { return x.InnerException is HttpRequestException; })

    // Only break when the server error is InternalServerError (500)
    .OrResult(x => { return (int)x.StatusCode == 500; })

    // Break after 3 handled failures in a row,
    // and hold the circuit open for 30 seconds
    .CircuitBreaker(3, TimeSpan.FromSeconds(30),
        onBreak: (lastResponse, breakDelay) =>
        {
            System.Console.WriteLine("\n Circuit broken!");
        },
        onReset: () =>
        {
            System.Console.WriteLine("\n Circuit Reset!");
        },
        onHalfOpen: () =>
        {
            System.Console.WriteLine("\n Circuit is Half-Open");
        });

Finally, those two policies are nested like this:

try
{
    customRetryPolicy.Execute(() =>
        customCircuitBreakerPolicy.Execute(() =>
        {
            // For testing purposes, "api/values" always returns 500
            HttpResponseMessage msResponse
                = GetHttpResponseAsync("api/values").Result;

            // This just prints messages to the console
            PrintHttpResponseAsync(msResponse);

            return msResponse;
        }));
}
catch (BrokenCircuitException e)
{
    System.Console.WriteLine("CB Error: " + e.Message);
}

Here is what happens, step by step:

  1. The first server response is HTTP 500 (as expected)
  2. Retry #1 fails (as expected)
  3. Retry #2 fails (as expected)
  4. With three faults in a row, the circuit-breaker is now open (as expected)
  5. GREAT! This is working perfectly!
  6. The circuit-breaker stays open for the next thirty (30) seconds (as expected)
  7. Thirty seconds later, the circuit-breaker is half-open (as expected)
  8. One attempt is made to check the endpoint's health (as expected)
  9. The server response is HTTP 500 (as expected)
  10. The circuit-breaker is open again for the next thirty (30) seconds (as expected)
  11. HERE IS THE PROBLEM: an additional retry is launched when the circuit-breaker is already open!

(Console output screenshots omitted; they show the additional retry after the circuit re-opens.)

I'm trying to understand this behavior. Why is an additional retry executed when the circuit-breaker opens for the second, third, ..., Nth time?

I've reviewed the state machine models for the retry and circuit-breaker policies, but I don't understand why this additional retry is performed.

Flow for the circuit-breaker: https://github.com/App-vNext/Polly/wiki/Circuit-Breaker#putting-it-all-together-

Flow for the retry policy: https://github.com/App-vNext/Polly/wiki/Retry#how-polly-retry-works

This really matters, because the retry delay is awaited (5 seconds in this example), and in the end this is a waste of time under high concurrency.

Any help / direction will be appreciated. Many thanks.

Impart answered 12/10, 2021 at 3:50 Comment(8)
Note that the [asp.net] tag refers to ASP.NET (.NET Framework) not ASP.NET Core.Foggia
If my understanding is correct, your assumption is that the retry will be blocked while the CB is Open. Unfortunately, retry does not work that way. Retry issues a new attempt whenever it is triggered, after waiting the delay penalty. It is unaware of whether the CB is Open, Closed or Half-Open. So, you have to either increase the retry count to exceed the CB penalty or define the retry's delays with greater timespans.Migrate
My understanding is this: if the CB is HALF-OPEN, a health call is performed to check whether the resource behind it is working again. If it is working, the CB is CLOSED and the result of the request is returned. If it is not working yet, the circuit remains OPEN, i.e. the CB state is set to OPEN again.Impart
Retry is outer to the CB, so one retry should be launched, BUT the CB is already OPEN, so the retry must not fire because the CB is OPEN again. I'm planning to handle this additional retry using the CB status, but... this will be hard to handle when you have one CB per downstream. Am I misunderstanding this, @PeterCsala?Impart
@dfbarr Yes, from Half-Open the CB can transition to Open or Closed depending on the result of the probe. But when the CB is Open (durationOfBreak), all incoming requests are rejected immediately. You can define a sleepDurationProvider which receives a Context that can be shared between retry and CB. So, it is doable, but a bit tricky and hacky. Do you want me to provide an example?Migrate
Yes @PeterCsala, if you can do it I would appreciate it. This behavior in my resilience strategy is a waste of resources, because every time the CB checks its health, aside from the natural health-check request, the retry pattern is activated, since the health-check request fails with HTTP 500. Maybe this is because the retry and the CB handle the same HTTP 500: when the CB gets an HTTP 500 it transitions back to OPEN, but this HTTP 500 is processed again by the retry (because the retry is set up to handle it).Impart
Imagine this behavior in a production environment with a long wait-time-to-retry: every time the CB probes whether to CLOSE or re-OPEN after the first break, one additional retry with that long delay is executed, even though the CB already knows the resource behind it is failing. I think a different approach to handling a strongly-typed Policy<TResult> could give a different result, but my business needs require handling HTTP statuses. I will keep trying in the meantime.Impart
@DiegoFerb I've just posted a proposed solution where the retry "respects" that the CB is in the Open state and will not trigger until it transitions to Half-Open. I hope it helps you a bit.Migrate

With Polly.Context you can exchange information between the two policies (in your case: Retry and Circuit Breaker). The Context is basically a Dictionary<string, object>.

So, the trick is to set a key in onBreak and then use that value inside the sleepDurationProvider.

Let's start with the inner Circuit Breaker policy:

static IAsyncPolicy<HttpResponseMessage> GetCircuitBreakerPolicy()
{
    return Policy<HttpResponseMessage>
        .HandleResult(res => res.StatusCode == HttpStatusCode.InternalServerError)
        .CircuitBreakerAsync(3, TimeSpan.FromSeconds(2),
           onBreak: (dr, ts, ctx) => { ctx[SleepDurationKey] = ts; },
           onReset: (ctx) => { ctx[SleepDurationKey] = null; });
}
  • It breaks after 3 subsequent failed requests
  • It stays in the Open state for 2 seconds before it transitions to HalfOpen
  • It sets a key on the context with the durationOfBreak value
  • When the CB goes back to the "normal" Closed state (onReset), it clears this value

Now, let's continue with the Retry policy:

static IAsyncPolicy<HttpResponseMessage> GetRetryPolicy()
{
    return Policy<HttpResponseMessage>
    .HandleResult(res => res.StatusCode == HttpStatusCode.InternalServerError)
    .Or<BrokenCircuitException>()
    .WaitAndRetryAsync(4,
        sleepDurationProvider: (c, ctx) =>
        {
            if (ctx.ContainsKey(SleepDurationKey))
                return (TimeSpan)ctx[SleepDurationKey];
            return TimeSpan.FromMilliseconds(200);
        },
        onRetry: (dr, ts, ctx) =>
        {
            Console.WriteLine($"Context: {(ctx.ContainsKey(SleepDurationKey) ? "Open" : "Closed")}");
            Console.WriteLine($"Waits: {ts.TotalMilliseconds}");
        });
}
  • It triggers either when the StatusCode is 500
    • or when there was a BrokenCircuitException
  • It retries at most 4 times (so, all in all, 5 attempts)
  • It sets the sleep duration based on the context
    • If the key is not present in the context (the CB is not in the Open state), it returns 200 milliseconds
    • If the key is present in the context (the CB is in the Open state), it returns the value from the context
      • NOTE: You can add a few hundred milliseconds to this value to avoid a race condition (see the sketch after this list)
  • It prints some values to the console inside onRetry, only for debugging purposes
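
To make that NOTE concrete, here is a minimal sketch (an optional variant, not required by the demo below; the method name and the 300 ms buffer are just illustrative): the only change is that the break delay read from the context gets a small buffer, so the retry fires only after the circuit has had time to transition to Half-Open.

static IAsyncPolicy<HttpResponseMessage> GetRetryPolicyWithBuffer()
{
    return Policy<HttpResponseMessage>
        .HandleResult(res => res.StatusCode == HttpStatusCode.InternalServerError)
        .Or<BrokenCircuitException>()
        .WaitAndRetryAsync(4,
            sleepDurationProvider: (retryAttempt, ctx) =>
            {
                // Wait slightly longer than the break duration reported by the CB,
                // so the next attempt arrives when the circuit is already Half-Open
                if (ctx.ContainsKey(SleepDurationKey) && ctx[SleepDurationKey] is TimeSpan breakDelay)
                    return breakDelay + TimeSpan.FromMilliseconds(300);

                return TimeSpan.FromMilliseconds(200);
            },
            onRetry: (dr, ts, ctx) =>
                Console.WriteLine($"Waits: {ts.TotalMilliseconds}"));
}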

Finally, let's wire up the policies and test it:

const string SleepDurationKey = "Broken"; 
static HttpClient client = new HttpClient();
static async Task Main()
{
    var strategy = Policy.WrapAsync(GetRetryPolicy(), GetCircuitBreakerPolicy());
    await strategy.ExecuteAsync(async () => await Get());
}

static Task<HttpResponseMessage> Get()
{
    return client.GetAsync("https://httpstat.us/500");
}
  • It uses the http://httpstat.us website to emulate an overloaded downstream
  • It combines/chains the two policies (CB inner, Retry outer)
  • It calls the Get method in an asynchronous way
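
One detail worth noting: no explicit Context is passed above. When none is supplied, Polly creates a fresh Context for each ExecuteAsync call and flows it through the whole PolicyWrap, which is why onBreak and the sleepDurationProvider end up reading the same dictionary. If you want to inspect it after the call, you can also pass one in yourself; a minimal, purely illustrative sketch:

// Pass an explicit Context so it can be examined after the call; the same
// instance flows through both the retry and the circuit breaker policies
var context = new Context();
var response = await strategy.ExecuteAsync(ctx => Get(), context);

Console.WriteLine($"Final status: {(int)response.StatusCode}");
Console.WriteLine(context.ContainsKey(SleepDurationKey)
    ? $"Last recorded break delay: {context[SleepDurationKey]}"
    : "The circuit never broke during this execution.");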

When handledEventsAllowedBeforeBreaking is 2, the output is:

Context: Closed
Waits: 200
Context: Open
Waits: 2000
Context: Open
Waits: 2000
Context: Open
Waits: 2000

When handledEventsAllowedBeforeBreaking is 3, the output is:

Context: Closed
Waits: 200
Context: Closed
Waits: 200
Context: Open
Waits: 2000
Context: Open
Waits: 2000

When handledEventsAllowedBeforeBreaking is 4, the output is:

Context: Closed
Waits: 200
Context: Closed
Waits: 200
Context: Closed
Waits: 200
Context: Open
Waits: 2000
Migrate answered 13/10, 2021 at 7:39 Comment(1)
I will try this solution; meanwhile, I will share something that I did to work around the issue.Impart

I tried this same scenario with the sample provided by Polly, AsyncDemo06_WaitAndRetryNestingCircuitBreaker.cs, found in Polly-Samples/PollyTestClient/Samples/.

See the sample here: Official Samples provided by Polly

The execution confirmed that this behavior does not only happen with my own sample.

(Console output screenshot omitted.)

After this confirmation, I added one additional condition to the retry policy's evaluation, so that it retries depending on the circuit-breaker state. This works!

Pay attention to the new condition in the .OrResult() delegate:

customRetryPolicy = Policy<HttpResponseMessage>

    .Handle<BrokenCircuitException>(
        x => { return !(x is BrokenCircuitException); })

    .OrInner<AggregateException>(
        x => { return !(x.InnerException is BrokenCircuitException); })

    // Retry only if the HTTP status is handled AND the circuit-breaker is Closed
    .OrResult(x =>
    {
        return httpStatusesToProcess.Contains(x.StatusCode)

            // This condition evaluates the current state of the
            // circuit-breaker before each retry
            && ((CircuitBreakerPolicy<HttpResponseMessage>)
                customCircuitBreakerPolicy).CircuitState == CircuitState.Closed;
    })
    .WaitAndRetry(2, retryAttempt => TimeSpan.FromSeconds(1),
        (exception, timeSpan, retryCount, context) =>
        {
            System.Console.WriteLine("Retrying... " + retryCount);
        }
    );

This is the result:

(Console output screenshot of the result omitted.)

My assumption was that the retry policy (the outermost policy) would be able to control whether or not to retry by checking the circuit-breaker status. And it does, BUT for some reason, when the circuit-breaker is half-open and the health request comes back with a negative response, during the transition from half-open back to open, the penalty of one extra try is executed (just like @peter-csala said before).

I forced the retry policy to evaluate the circuit-breaker state when this happens, but I consider that Polly should perform this check by itself.
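
As a side note, here is a sketch (with simplified handling clauses, only to illustrate the idea): the .CircuitBreaker(...) call already returns a CircuitBreakerPolicy<HttpResponseMessage>, so declaring the field with that concrete type exposes CircuitState directly and removes the cast inside the .OrResult() delegate.

// Keeping the concrete type on the field makes the circuit state readable without a cast
static CircuitBreakerPolicy<HttpResponseMessage> customCircuitBreakerPolicy =
    Policy<HttpResponseMessage>
        .HandleResult(r => (int)r.StatusCode == 500)
        .CircuitBreaker(3, TimeSpan.FromSeconds(30));

// The retry condition can then be written simply as:
//   httpStatusesToProcess.Contains(x.StatusCode)
//       && customCircuitBreakerPolicy.CircuitState == CircuitState.Closed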

Impart answered 13/10, 2021 at 16:17 Comment(2)
Yes, this is also a viable option, but I think it's a bit more complicated. You are explicitly referring to the customCircuitBreakerPolicy inside your retry policy definition. This coupling makes the solution a bit rigid (maybe that's not the best word, but anyway). I'm not saying that using a Context to exchange information is much better, but the propagation of the Context is done by the built-in escalation chain.Migrate
"But I consider that Polly should perform this check by itself." Yes and no. Let me start with the no: Polly provides a toolbox to build a resilience strategy to overcome transient failures. One of the strengths of Polly is the flexibility with which you can combine multiple policies in so many different ways. On the other hand, there are common patterns which are implemented over and over again by many Polly consumers. Polly.Contrib is a dedicated place to share such solutions with the community.Migrate
