Intro:
I am building a single-node web crawler in a .NET Core console application that simply validates that URLs return 200 OK. I have a collection of URLs at different hosts, to which I am sending requests with HttpClient. I am fairly new to using Polly and TPL Dataflow.
Requirements:
- I want to support sending multiple HTTP requests in parallel with a configurable MaxDegreeOfParallelism.
- I want to limit the number of parallel requests to any given host to 1 (or configurable). This is in order to gracefully handle per-host 429 TooManyRequests responses with a Polly policy. Alternatively, I could maybe use a Circuit Breaker to cancel concurrent requests to the same host on receipt of one 429 response and then proceed one-at-a-time to that specific host?
- I am perfectly fine with not using TPL Dataflow at all in favor of maybe using a Polly Bulkhead or some other mechanism for throttled parallel requests, but I am not sure what that configuration would look like in order to implement requirement #2.
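To make requirement #2 concrete, here is a minimal sketch of the kind of per-host gate I have in mind. It is only an illustration under assumptions: PerHostThrottle and RunAsync are hypothetical names of my own, not part of Polly or TPL Dataflow, and I am assuming that keying a SemaphoreSlim by Uri.Host is an acceptable definition of "same host".

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper (not a Polly or TPL Dataflow type): caps the number of
// concurrent operations per host by keeping one SemaphoreSlim per Uri.Host.
public sealed class PerHostThrottle
{
    private readonly ConcurrentDictionary<string, SemaphoreSlim> _gates =
        new ConcurrentDictionary<string, SemaphoreSlim>(StringComparer.OrdinalIgnoreCase);

    private readonly int _maxPerHost;

    public PerHostThrottle(int maxPerHost = 1)
    {
        if (maxPerHost < 1) throw new ArgumentOutOfRangeException(nameof(maxPerHost));
        _maxPerHost = maxPerHost;
    }

    // Runs the action for the URL once a slot for its host is free.
    public async Task<T> RunAsync<T>(
        Uri url,
        Func<Uri, Task<T>> action,
        CancellationToken cancellationToken = default)
    {
        // GetOrAdd may build a spare semaphore under a race, but every caller
        // receives the single instance that ends up stored for the host.
        var gate = _gates.GetOrAdd(url.Host, _ => new SemaphoreSlim(_maxPerHost, _maxPerHost));

        await gate.WaitAsync(cancellationToken).ConfigureAwait(false);
        try
        {
            return await action(url).ConfigureAwait(false);
        }
        finally
        {
            // Always free the slot, even if the action throws.
            gate.Release();
        }
    }
}
```

The idea would be to wrap the HTTP call inside the block's delegate with throttle.RunAsync(u, ...). One drawback I can already see: a request waiting on a busy host's gate would still occupy one of the MaxDegreeOfParallelism slots, so a queue dominated by one host could starve the others.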
Current Implementation:
My current implementation works, except that I often see x parallel requests to the same host return 429 at about the same time. Then they all pause for the retry policy, and then they all slam the same host again at the same time, often still receiving 429s. Even if I distribute multiple instances of the same host evenly throughout the queue, my URL collection is overweighted with a few specific hosts that still start generating 429s eventually.
After receiving a 429, I think I only want to send one concurrent request to that host going forward, to respect the remote host and pursue 200s.
Validator Method:
public async Task<int> GetValidCount(IEnumerable<Uri> urls, CancellationToken cancellationToken)
{
    var validator = new TransformBlock<Uri, bool>(
        async u => (await _httpClient.GetAsync(u, HttpCompletionOption.ResponseHeadersRead, cancellationToken)).IsSuccessStatusCode,
        new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = MaxDegreeOfParallelism }
    );

    foreach (var url in urls)
        await validator.SendAsync(url, cancellationToken);
    validator.Complete();

    var validUrlCount = 0;
    while (await validator.OutputAvailableAsync(cancellationToken))
    {
        if (await validator.ReceiveAsync(cancellationToken))
            validUrlCount++;
    }

    await validator.Completion;
    return validUrlCount;
}
The Polly policy applied to the HttpClient instance used in GetValidCount() above:
IAsyncPolicy<HttpResponseMessage> waitAndRetryTooManyRequests = Policy
    .HandleResult<HttpResponseMessage>(r => r.StatusCode == HttpStatusCode.TooManyRequests)
    .WaitAndRetryAsync(3,
        // RetryAfter is null when the header is absent, so it needs ?. as well.
        (retryCount, response, context) =>
            response.Result?.Headers.RetryAfter?.Delta ?? TimeSpan.FromMilliseconds(120),
        async (response, timespan, retryCount, context) =>
        {
            // log stuff
        });
Question:
How can I modify or replace this solution so that it also satisfies requirement #2?