.NET's HttpClient Throttling
I'm developing a .NET 4 application that has to make requests to third-party servers in order to get information from them. I'm using HttpClient to make these HTTP requests.

I have to create hundreds or thousands of requests in a short period of time. I would like to throttle the creation of these requests to a limit (defined by a constant or something) so the other servers don't receive a flood of requests.

I've checked out this link, which shows how to limit the number of tasks running at any time.

Here is my non-working approach:

// create the factory
var factory = new TaskFactory(new LimitedConcurrencyLevelTaskScheduler(level));

// use the factory to create a new task that will create the request to the third-party server
var task = factory.StartNew(() => {
    return new HttpClient().GetAsync(url);
}).Unwrap();

Of course, the problem here is that even though only one task is created at a time, a lot of requests will still be created and processed at the same time, because the requests themselves run on another scheduler. I could not find a way to make HttpClient use my scheduler.

How should I handle this situation? I would like to limit the number of requests created to a certain limit, but without blocking while waiting for these requests to finish.

Is this possible? Any ideas?

Overdraft answered 29/11, 2012 at 3:41 Comment(2)
How are you calling the code you posted? Do you have a collection of URLs that you're using in a foreach loop, or something like that? – Douty
Exactly, I have a collection of URLs and I convert them into a collection of Tasks. Each mapping is performed using the code posted above. – Overdraft
If you can use .Net 4.5, one way would be to use TransformBlock from TPL Dataflow and set its MaxDegreeOfParallelism. Something like:

// share a single HttpClient instead of creating one per request
var client = new HttpClient();

// the block runs at most `level` downloads concurrently
var block = new TransformBlock<string, byte[]>(
    url => client.GetByteArrayAsync(url),
    new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = level });

foreach (var url in urls)
    block.Post(url);

block.Complete();

var result = new List<byte[]>();

// receive results as they become available
while (await block.OutputAvailableAsync())
    result.Add(block.Receive());

There is also another way of looking at this, through ServicePointManager. Using that class, you can set limits on MaxServicePoints (how many servers you can be connected to at once) and DefaultConnectionLimit (how many connections there can be to each server). This way, you could start all your Tasks at the same moment, but only a limited number of them would actually do something. Although limiting the number of Tasks (e.g. by using TPL Dataflow, as I suggested above) will most likely be more efficient.
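A minimal sketch of the ServicePointManager approach; the property names are the real .NET APIs, but the limit values are placeholders you would tune for your workload:

```csharp
using System.Net;

// Configure this before any requests are made: the limits are
// captured when a ServicePoint is created for a host.
ServicePointManager.MaxServicePoints = 10;       // at most 10 distinct servers at once
ServicePointManager.DefaultConnectionLimit = 2;  // at most 2 connections per server

// All Tasks can now be started at the same moment; HTTP requests
// beyond the connection limit queue inside the ServicePoint until
// a connection frees up.
```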

Douty answered 29/11, 2012 at 18:38 Comment(5)
Great. This seems to be the solution I was looking for, but I cannot upgrade to .NET 4.5. Is there any way to port this to a .NET 4 application? – Overdraft
As far as I know, there isn't any way to run TDF on .Net 4.0. – Douty
Ok. Thanks anyway, but I cannot upgrade to .NET 4.5. Any other ideas? – Overdraft
Have you read the second part of my answer (about ServicePointManager)? That's not related to TDF (maybe I didn't say that clearly) and it will work on .Net 4.0. – Douty
Nope, I totally overlooked it, sorry. I'll give it a try. That sounds interesting. – Overdraft
First, you should consider partitioning the workload by website, or at least expose an abstraction that lets you choose how to partition the list of URLs. For example, one strategy could be to group by second-level domain (e.g. yahoo.com, google.com).

The other thing is that if you are doing serious crawling, you may want to consider doing it in the cloud instead. That way, each node can crawl a different partition. When you say "short period of time", you are already setting yourself up for failure. You need hard numbers for what you want to attain.

The other key benefit to partitioning well is you can also avoid hitting servers during their peak hours and risking IP bans at their router level, in the case that the site doesn't simply throttle you.

Anemometer answered 29/11, 2012 at 4:13 Comment(1)
Yes, thanks for the tips, but your answer does not address my main point. Thanks anyway. – Overdraft
You might consider launching a fixed set of threads. Each thread performs its client network operations serially, perhaps also pausing at certain points to throttle further. This gives you specific control over loading; you can change your throttle policy and the number of threads independently.
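A rough sketch of that idea on .NET 4 (the thread count and pause interval are placeholders, and `listOfUrls` and `Process` are hypothetical names for your own URL collection and response handler):

```csharp
using System.Collections.Concurrent;
using System.Net;
using System.Threading;

var urls = new ConcurrentQueue<string>(listOfUrls);
var threads = new Thread[4]; // fixed worker count = maximum concurrency

for (int i = 0; i < threads.Length; i++)
{
    threads[i] = new Thread(() =>
    {
        string url;
        while (urls.TryDequeue(out url))
        {
            // each worker issues its requests serially
            using (var client = new WebClient())
                Process(client.DownloadData(url));

            Thread.Sleep(100); // optional pause to throttle further
        }
    });
    threads[i].Start();
}
```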

Jest answered 29/11, 2012 at 4:25 Comment(3)
Yes, I know that spawning my own thread pool is a way to go. But I was looking for a solution that involves the async .NET framework. Is this clear? – Overdraft
You can keep a counter of async requests. Increment when adding an async net operation and decrement from the completion handler (or whatever it's called; I'm a little rusty on this). To throttle, you have to somehow defer new async requests when your counter exceeds n. You might have a single background thread just for this purpose. – Jest
Good point. Is there any way to get notified when an HTTP request is completed? I mean, I can easily increment this counter whenever I create a new HttpClient, but... when do I decrement this counter? I haven't found a hook in HttpClient that will be invoked when the request is done. – Overdraft
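One way to get that completion hook, sketched on the assumption that requests are started with HttpClient.GetAsync: attach a continuation to the returned Task, which runs whether the request succeeded or failed:

```csharp
using System.Net.Http;
using System.Threading;

int pending = 0;
var client = new HttpClient();

Interlocked.Increment(ref pending);
client.GetAsync(url).ContinueWith(t =>
{
    // runs when the request completes, faults, or is cancelled
    Interlocked.Decrement(ref pending);
});
```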
You might consider creating a new DelegatingHandler to sit in the request/response pipeline of the HttpClient that could keep count of the number of pending requests.
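A minimal sketch of such a handler; DelegatingHandler and SendAsync are the real HttpClient pipeline APIs, but using a SemaphoreSlim as the throttling policy is an assumption, and the async/await form shown assumes C# 5 (on .NET 4 you would use ContinueWith instead):

```csharp
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

public class ThrottlingHandler : DelegatingHandler
{
    private readonly SemaphoreSlim _semaphore;

    public ThrottlingHandler(int maxPending)
        : base(new HttpClientHandler())
    {
        _semaphore = new SemaphoreSlim(maxPending);
    }

    protected override async Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken cancellationToken)
    {
        // wait for a free slot before the request goes on the wire
        await _semaphore.WaitAsync(cancellationToken);
        try
        {
            return await base.SendAsync(request, cancellationToken);
        }
        finally
        {
            _semaphore.Release(); // free the slot when the request completes
        }
    }
}

// usage: var client = new HttpClient(new ThrottlingHandler(10));
```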

Generally, a single HttpClient instance is used to process multiple requests. Unlike HttpWebRequest, disposing an HttpClient instance closes the underlying TCP/IP connection, so if you want to reuse connections you really need to reuse HttpClient instances.

Bravin answered 4/12, 2012 at 13:33 Comment(2)
Not quite right... (11 years later!) Keep in mind that DelegatingHandlers are pooled and disposed by DefaultHttpClientFactory unless you use SetHandlerLifetime(Timeout.InfiniteTimeSpan)... And it's the HTTP message handlers that own the network connections, not HttpClient (which can be reused all day long). – Teahouse
Yes, when they implemented the DefaultHttpClientFactory they chose to pool the pipeline instead of just keeping the HttpClient instance around. However, for quite a while, you could only use the DefaultHttpClientFactory if you were building an ASP.NET Web App. Many of us were/are not. IMO it was a bad decision to create HttpClientFactory, because HttpClient was intended to be reused, and now DefaultHttpClientFactory creates a new instance for every call, making properties like DefaultRequestHeaders pretty much useless. – Bravin
