HttpClient with multiple proxies while handling socket exhaustion and DNS recycling
We are working on a fun project with a friend and we have to execute hundreds of HTTP requests, all using different proxies. Imagine that it is something like the following:

for (int i = 0; i < 20; i++)
{
    HttpClientHandler handler = new HttpClientHandler { Proxy = new WebProxy(randomProxy, true) };

    using (var client = new HttpClient(handler))
    {
        using (var request = new HttpRequestMessage(HttpMethod.Get, "http://x.com"))
        {
            var response = await client.SendAsync(request);

            if (response.IsSuccessStatusCode)
            {
                string content = await response.Content.ReadAsStringAsync();
            }
        }

        using (var request2 = new HttpRequestMessage(HttpMethod.Get, "http://x.com/news"))
        {
            var response = await client.SendAsync(request2);

            if (response.IsSuccessStatusCode)
            {
                string content = await response.Content.ReadAsStringAsync();
            }
        }
    }
}

By the way, we are using .NET Core (Console Application for now). I know there are many threads about socket exhaustion and handling DNS recycling, but this particular one is different, because of the multiple proxy usage.

If we use a singleton instance of HttpClient, just like everyone suggests:

  • We can't set more than one proxy, because it is being set during HttpClient's instantiation and cannot be changed afterwards.
  • It doesn't respect DNS changes. A reused HttpClient holds on to its socket until the socket is closed, so if a DNS record for the server is updated, the client won't notice until then. One workaround is to disable keep-alive so that the socket is closed after each request, but that hurts performance. The second way is to use ServicePoint:
ServicePointManager.FindServicePoint("http://x.com")  
    .ConnectionLeaseTimeout = Convert.ToInt32(TimeSpan.FromSeconds(15).TotalMilliseconds);

ServicePointManager.DnsRefreshTimeout = Convert.ToInt32(TimeSpan.FromSeconds(5).TotalMilliseconds);
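
For reference, the keep-alive workaround mentioned above can be sketched roughly as follows. This is only an illustration under assumptions, not a recommendation: the `KeepAliveDemo` class and its method name are hypothetical, and the trade-off is a fresh connection for every request.

```csharp
using System.Net.Http;

// Hypothetical helper, for illustration only.
public static class KeepAliveDemo
{
    public static HttpClient CreateNonPersistentClient()
    {
        var client = new HttpClient();

        // Adds "Connection: close" to every request sent by this client,
        // so connections are not kept alive between requests.
        client.DefaultRequestHeaders.ConnectionClose = true;

        return client;
    }
}
```

Because every request then pays for a new TCP (and possibly TLS) handshake, this tends to be slower than pooling connections.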

On the other hand, disposing HttpClient after each use (as in my example above), in other words creating many HttpClient instances, leaves many sockets in the TIME_WAIT state. TIME_WAIT indicates that the local endpoint (this side) has closed the connection.

I'm aware of SocketsHttpHandler and IHttpClientFactory, but they don't solve the problem of using a different proxy per request.

var socketsHandler = new SocketsHttpHandler
{
    PooledConnectionLifetime = TimeSpan.FromMinutes(10),
    PooledConnectionIdleTimeout = TimeSpan.FromMinutes(5),
    MaxConnectionsPerServer = 10
};

// Cannot set a different proxy for each request
var client = new HttpClient(socketsHandler);

What is the most sensible decision that can be made?

Karee answered 1/8, 2020 at 9:24 Comment(13)
HttpClientFactory fixes all the DNS and socket exhaustion issues. - Lashelllasher
'but they can't solve the different proxies.' What do you mean by this? - Lashelllasher
Create a single separate HttpClient instance per proxy and reuse it? - Euphemize
Ah OK, I think I see the issue. How many proxies are we talking about here? - Lashelllasher
Note: ServicePointManager doesn't affect HttpClient in .NET Core, because it is intended for use with HttpWebRequest, which HttpClient doesn't use in .NET Core. And yes, an HttpClient instance per proxy looks like a reasonable solution. IHttpClientFactory will fix the socket and DNS problems at the same time. - Cock
@TheGeneral, let's say around 100. - Karee
Ah OK, 100 proxies; I thought we were talking about a lot more. Just keep the clients around. You will run into the DNS issue, though (but if this is not a persistent enterprise app, do you really care?). There are a few things you can do with HttpClientFactory, but for your experiment I wouldn't bother. - Lashelllasher
@aepot, I think you were right about ServicePoint. It looks like HttpClient overrides the settings from ServicePoint. However, the solution I provided above performs horribly, judging by the output of netstat -ano | findstr ip. - Karee
@TheGeneral, that was just an example; there can be more at some point. It's not a persistent app, but there has to be a proper solution. What are you referring to with IHttpClientFactory? - Karee
One more thing: HttpResponseMessage is IDisposable. Apply a using statement to it. It will affect how sockets are used. - Cock
Or if you really want to go medieval on this, just build it out of vanilla sockets, yeehaa. - Lashelllasher
One more question: how often do you receive a non-success status code? If often, then you may save some sockets by not reading the whole response: using HttpResponseMessage response = await client.SendAsync(request, HttpCompletionOption.ResponseHeadersRead); (C# 8.0 syntax) - Cock
@aepot, oh, that's true. I forgot to add the using on HttpResponseMessage. I often receive an unsuccessful status code, so I will add HttpCompletionOption.ResponseHeadersRead. - Karee
K
2

First of all, I want to mention that @Stephen Cleary's example works fine if the proxies are known at compile-time, but in my case they are known at runtime. I forgot to mention that in the question, so it's my fault.

Thanks to @aepot for pointing those things out.

Here's the solution I came up with (credit to @mcont):

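using System;
using System.Collections.Concurrent;
using System.Net;
using System.Net.Http;
using Flurl.Http;

/// <summary>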
/// A wrapper class for <see cref="FlurlClient"/>, which solves socket exhaustion and DNS recycling.
/// </summary>
public class FlurlClientManager
{
    /// <summary>
    /// Static collection, which stores the clients that are going to be reused.
    /// </summary>
    private static readonly ConcurrentDictionary<string, IFlurlClient> _clients = new ConcurrentDictionary<string, IFlurlClient>();

    /// <summary>
    /// Gets the available clients.
    /// </summary>
    /// <returns></returns>
    public ConcurrentDictionary<string, IFlurlClient> GetClients()
        => _clients;

    /// <summary>
    /// Creates a new client or gets an existing one.
    /// </summary>
    /// <param name="clientName">The client name.</param>
    /// <param name="proxy">The proxy URL.</param>
    /// <returns>The <see cref="FlurlClient"/>.</returns>
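    public IFlurlClient CreateOrGetClient(string clientName, string proxy = null)
    {
        // Pass a factory delegate so that a new client (and handler) is not
        // created on every call when one is already cached.
        return _clients.AddOrUpdate(
            clientName,
            _ => CreateClient(proxy),
            (_, client) => client.IsDisposed ? CreateClient(proxy) : client);
    }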

    /// <summary>
    /// Disposes a client. This leaves a socket in TIME_WAIT state for 240 seconds but it's necessary in case a client has to be removed from the list.
    /// </summary>
    /// <param name="clientName">The client name.</param>
    /// <returns>Returns true if the operation is successful.</returns>
    public bool DeleteClient(string clientName)
    {
        // Remove first, so that a missing key returns false instead of throwing.
        if (!_clients.TryRemove(clientName, out var client))
        {
            return false;
        }

        client.Dispose();
        return true;
    }

    private IFlurlClient CreateClient(string proxy = null)
    {
        var handler = new SocketsHttpHandler()
        {
            Proxy = proxy != null ? new WebProxy(proxy, true) : null,
            PooledConnectionLifetime = TimeSpan.FromMinutes(10)
        };

        var client = new HttpClient(handler);

        return new FlurlClient(client);
    }
}

A proxy per request means an additional socket for each request (another HttpClient instance).

In the solution above, a ConcurrentDictionary stores the HttpClients so that I can reuse them, which is the whole point of HttpClient. I can use the same proxy for about 5 requests before it gets blocked by API limits. I forgot to mention that in the question as well.
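
As a Flurl-free illustration of the same idea, here is a minimal sketch that keeps one long-lived HttpClient per proxy URL. The `ProxyClientPool` name and the 10-minute connection lifetime are assumptions for illustration only. Wrapping each entry in `Lazy<T>` is one way to make sure that only a single client is ever created per key, even if several threads ask for the same proxy at once.

```csharp
using System;
using System.Collections.Concurrent;
using System.Net;
using System.Net.Http;

// Hypothetical helper: one reusable HttpClient per proxy URL.
public sealed class ProxyClientPool : IDisposable
{
    // Lazy<T> (thread-safe by default) guarantees that the client for a
    // given key is constructed at most once, even under contention.
    private readonly ConcurrentDictionary<string, Lazy<HttpClient>> _clients =
        new ConcurrentDictionary<string, Lazy<HttpClient>>();

    public HttpClient GetClient(string proxyUrl)
    {
        return _clients
            .GetOrAdd(proxyUrl, url => new Lazy<HttpClient>(() => Create(url)))
            .Value;
    }

    private static HttpClient Create(string proxyUrl)
    {
        var handler = new SocketsHttpHandler
        {
            Proxy = new WebProxy(proxyUrl),
            UseProxy = true,
            // Assumed value: recycle pooled connections periodically so that
            // DNS changes are eventually picked up.
            PooledConnectionLifetime = TimeSpan.FromMinutes(10)
        };

        return new HttpClient(handler, disposeHandler: true);
    }

    public void Dispose()
    {
        // Dispose only the clients that were actually created.
        foreach (var entry in _clients.Values)
        {
            if (entry.IsValueCreated)
            {
                entry.Value.Dispose();
            }
        }

        _clients.Clear();
    }
}
```

Calling `GetClient` twice with the same proxy URL returns the same instance, so its pooled connections are reused across requests.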

As you've seen, there are two solutions solving socket exhaustion and DNS recycling: IHttpClientFactory and SocketsHttpHandler. The first one doesn't suit my case, because the proxies I'm using are known at runtime, not at compile-time. The solution above uses the second way.

If you have the same problem, you can read the following issue on GitHub. It explains everything.

I'm open to improvements, so let me know if you have any.

Karee answered 3/8, 2020 at 10:14 Comment(0)
R
6

The point of reusing HttpClient instances (or more specifically, reusing the last HttpMessageHandler) is to reuse the socket connections. Different proxies mean different socket connections, so it doesn't make sense to try to reuse an HttpClient/HttpMessageHandler on a different proxy, because it would have to be a different connection.

"we have to execute hundreds of HTTP requests, all using different proxies"

If every request is truly a unique proxy, and no proxies are shared across any other requests, then you may as well just keep the individual HttpClient instances and live with the TIME_WAIT.

However, if multiple requests may go through the same proxy, and you want to re-use those connections, then that is certainly possible.

I would recommend using IHttpClientFactory. It allows you to define named HttpClient instances (again, technically the last HttpMessageHandler instances) that can be pooled and reused. Just make one for each proxy:

var proxies = new Dictionary<string, IWebProxy>(); // TODO: populate with proxies.
foreach (var proxy in proxies)
{
  services.AddHttpClient(proxy.Key)
      .ConfigurePrimaryHttpMessageHandler(() => new HttpClientHandler { Proxy = proxy.Value });
}

The ConfigurePrimaryHttpMessageHandler controls how the IHttpClientFactory creates the primary HttpMessageHandler instances that are pooled. I copied HttpClientHandler from the code in your question, but most modern apps use SocketsHttpHandler, which also has Proxy/UseProxy properties.

Then, when you want to use one, call IHttpClientFactory.CreateClient and pass the name of the HttpClient you want:

for (int i = 0; i < 20; i++)
{
  var client = _httpClientFactory.CreateClient(randomProxyName);
  ...
}
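
Putting the two snippets above together, a self-contained sketch might look like the following. It assumes the Microsoft.Extensions.Http NuGet package is referenced; the `ProxiedClientFactory` class name and the "proxy-a" client name mentioned below are hypothetical, and SocketsHttpHandler is used in place of HttpClientHandler.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Net.Http;
using Microsoft.Extensions.DependencyInjection;

// Hypothetical helper that registers one named client per proxy.
public static class ProxiedClientFactory
{
    public static ServiceProvider Build(IReadOnlyDictionary<string, IWebProxy> proxies)
    {
        var services = new ServiceCollection();

        foreach (var proxy in proxies)
        {
            // The named client's primary handler routes through this proxy.
            services.AddHttpClient(proxy.Key)
                .ConfigurePrimaryHttpMessageHandler(() => new SocketsHttpHandler
                {
                    Proxy = proxy.Value,
                    UseProxy = true
                });
        }

        // The caller owns the provider and should dispose it when done.
        return services.BuildServiceProvider();
    }
}
```

A caller would build the provider once, resolve IHttpClientFactory from it with GetRequiredService, and then call CreateClient("proxy-a") (or whichever names were registered) each time a client is needed.
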
Rehnberg answered 1/8, 2020 at 14:31 Comment(11)
Is it a good idea when the proxies will be given at runtime? Something like this: pastebin.com/k1uBW52S - Karee
@Electron: While that would work, it's unusual; DI is intended for one-time setup. That code doesn't allow sharing connections across multiple button clicks. But for a fully dynamic situation like yours, that's one solution. Another would be to create your own IHttpClientFactory implementation (starting with a copy/paste from DefaultHttpClientFactory) that can add new named clients at runtime. - Rehnberg
Thanks for your answer, anyway! I will see what I can think of. HttpClient isn't designed for this, and IHttpClientFactory isn't made for runtime configuration. github.com/dotnet/runtime/issues/35992 - Karee
Could you look at the solution I came up with? Thanks. - Karee
@Electron: LGTM - Rehnberg
@StephenCleary What if we want to make concurrent requests with "proxied" named clients? I have something similar to the OP's setup, plus concurrency. I can make only 10 concurrent requests via "proxied" named clients (I have tried it with one IP and with many different IPs from my proxy provider), but with no proxy involved I can make as many requests as I want. Is there any limitation? - Urumchi
@ggeorge: Look into ServicePointManager.DefaultConnectionLimit. - Rehnberg
@StephenCleary Thanks for the reply! I didn't know that I could use ServicePointManager with System.Net.Http.HttpClient. I will try it. - Urumchi
@ggeorge: HttpClient uses that old stack on .NET Framework. On (modern versions of) .NET Core, it uses a fully managed stack and doesn't have that throttling anymore. - Rehnberg
@StephenCleary I use .NET 5. Do you have any idea why I can't make more than 10 requests concurrently using a proxy? Could the proxy provider be responsible for that? (Maybe I should open a new question!) - Urumchi
@ggeorge: Yes, the proxy itself can certainly be throttling you. You can verify that with Fiddler. And opening a new question is probably best. - Rehnberg
C
1

I've collected my comments into an answer. These are improvement suggestions rather than a solution, because your question is strongly context-dependent: how many proxies, how many requests per minute, the average duration of each request, etc.

Disclaimer: I'm not familiar with IHttpClientFactory, but AFAIK it's the only way to solve the socket exhaustion and DNS problems.

Note: ServicePointManager doesn't affect HttpClient in .NET Core, because it is intended for use with HttpWebRequest, which HttpClient doesn't use in .NET Core.

As suggested by @GuruStron, an HttpClient instance per proxy looks like a reasonable solution.

HttpResponseMessage is IDisposable. Apply a using statement to it; this affects how sockets are used.

You may pass HttpCompletionOption.ResponseHeadersRead to SendAsync so that the call returns as soon as the headers are read, without buffering the whole response body. Then you can skip reading the body if the server returned an unsuccessful status code.

To improve performance slightly, you may also append .ConfigureAwait(false) to the SendAsync() and ReadAsStringAsync() calls. It's mostly useful if the current SynchronizationContext is not null (e.g. it's not a console app).

Here's somewhat optimized code (C# 8.0):

private static async Task<string> GetHttpResponseAsync(HttpClient client, string url)
{
    using HttpResponseMessage response = await client.GetAsync(url, HttpCompletionOption.ResponseHeadersRead).ConfigureAwait(false);
    if (response.IsSuccessStatusCode)
    {
        return await response.Content.ReadAsStringAsync().ConfigureAwait(false);
    }
    return null;
}

Pass a pooled HttpClient and the URL to the method.

Cock answered 1/8, 2020 at 11:14 Comment(3)
Thanks for pointing those things out. About the context dependence: imagine thousands of proxies, which means thousands of HttpClient instances, each with a different proxy setup and each executing two HttpRequestMessages. Just like in the example above, nothing more, nothing less. - Karee
@Electron With IHttpClientFactory you don't have to keep all the clients alive; you could keep something like a cache of the 100 most recently used clients, or use any other scheme. You may also set up the connection to close the socket immediately after the response is received and use an HttpClient per request, carefully. Only you can decide exactly how to implement it; there's no silver bullet or stable best practice here, because the objective is too complex. - Cock
@Electron "will execute two HttpRequestMessages": the code of the method shown will be the same then. DRY: Don't Repeat Yourself. You can easily manage the clients outside of the method. - Cock
