.NET Does NOT Have Reliable Asynchronous Socket Communication?

I once wrote a crawler in .NET. In order to improve its scalability, I tried to take advantage of the asynchronous APIs of .NET.

System.Net.HttpWebRequest has the asynchronous API pair BeginGetResponse/EndGetResponse. However, this pair only retrieves the HTTP response headers and a Stream instance from which the HTTP response content can be extracted. So my strategy was to use BeginGetResponse/EndGetResponse to asynchronously get the response Stream, then use BeginRead/EndRead to asynchronously read bytes from that Stream instance.
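
In code, that strategy looks roughly like the following (a minimal sketch; the class names and buffer size are illustrative, and error handling is omitted):

    using System;
    using System.IO;
    using System.Net;

    class AsyncFetch
    {
        public static void Start(string url)
        {
            var request = (HttpWebRequest)WebRequest.Create(url);
            // Step 1: asynchronously obtain the response headers and Stream.
            request.BeginGetResponse(OnGetResponse, request);
        }

        static void OnGetResponse(IAsyncResult ar)
        {
            var request = (HttpWebRequest)ar.AsyncState;
            var response = (HttpWebResponse)request.EndGetResponse(ar);
            var state = new ReadState { Response = response, Stream = response.GetResponseStream() };
            // Step 2: asynchronously pull bytes from the response Stream.
            // Each BeginRead pins state.Buffer until that read completes.
            state.Stream.BeginRead(state.Buffer, 0, state.Buffer.Length, OnRead, state);
        }

        static void OnRead(IAsyncResult ar)
        {
            var state = (ReadState)ar.AsyncState;
            int bytesRead = state.Stream.EndRead(ar);
            if (bytesRead > 0)
            {
                // ... consume state.Buffer[0..bytesRead) here ...
                state.Stream.BeginRead(state.Buffer, 0, state.Buffer.Length, OnRead, state);
            }
            else
            {
                state.Stream.Close();
                state.Response.Close();
            }
        }

        class ReadState
        {
            public HttpWebResponse Response;
            public Stream Stream;
            public byte[] Buffer = new byte[8192];
        }
    }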

Everything seemed perfect until the crawler went into stress testing. Under stress, the crawler suffers from high memory usage. I checked the memory with WinDbg+SoS and found that lots of byte arrays are pinned by System.Threading.OverlappedData instances. After some searching on the internet, I found this KB from Microsoft: http://support.microsoft.com/kb/947862

According to the KB, the number of outstanding asynchronous I/O operations should have an "upper bound", but it doesn't suggest a concrete bound value. So, in my eyes, this KB helps nothing. This is obviously a .NET bug. Finally, I had to drop the idea of asynchronously extracting bytes from the response Stream and just do it synchronously.

The .NET library that allows Asynchronous IO with dot net sockets (Socket.BeginSend / Socket.BeginReceive / NetworkStream.BeginRead / NetworkStream.BeginWrite) must have an upper bound on the amount of buffers outstanding (either send or receive) with their asynchronous IO.

The network application should have an upper bound on the number of outstanding asynchronous IO that it posts.

Edit: Added some question marks.

Does anybody have experience doing asynchronous I/O on Socket & NetworkStream? Generally speaking, do production crawlers do their internet I/O synchronously or asynchronously?

Aaronson answered 25/10, 2008 at 9:49 Comment(1)
Not a single question mark except in the subject... A bad sign.Tiphanie

This is not a .NET Framework problem. The linked KB article could have been a bit more explanatory: every .NET program needs to run on an operating system and deal with its limitations. An operating system does not publish what it can do or what resources are left, so there is no way to count down from what you've consumed. Necessarily so; its resources need to be shared by all the programs that run on it.

And this is not limited to sockets; something as basic as memory is not endlessly available either, as the name of this website reminds us. If you use too much, you'll find out: the OS fails the request and you see that back as an exception in your program.

Resource management is still very much our job. A basic workaround in a stress test is to use a SemaphoreSlim: call its Wait() method before you start a request and Release() when it completes. There is more than one OS limit. The TCP/IP design can't have more than 65,535 active ports. The buffer used for the transfer needs to be pinned so the network device driver can write to it; that's the limit you hit here. For a stress test, initializing the semaphore to 1000 is a decent and very high limit. Experiment to see how high you can go.
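
The pattern looks roughly like this (a minimal sketch; the limit of 1000 matches the suggestion above, and error handling is reduced to the essentials):

    using System;
    using System.Net;
    using System.Threading;

    class ThrottledDownloader
    {
        // Cap the number of outstanding async requests so the number of
        // pinned buffers stays bounded.
        static readonly SemaphoreSlim _gate = new SemaphoreSlim(1000);

        public static void Download(string url)
        {
            _gate.Wait();                   // block until a slot is free
            var request = (HttpWebRequest)WebRequest.Create(url);
            request.BeginGetResponse(ar =>
            {
                try
                {
                    using (var response = request.EndGetResponse(ar))
                    {
                        // ... read the response stream here ...
                    }
                }
                catch (WebException) { /* log and continue */ }
                finally
                {
                    _gate.Release();        // free the slot for the next request
                }
            }, null);
        }
    }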

Spears answered 25/10, 2008 at 14:56 Comment(3)
But how can I tell the "upper bound" in my program? The fact is that .NET doesn't release the pinned byte array even after the application has aborted the BeginXXX operation on timeout. I still believe this is a .NET bug.Aaronson
Can't see how this is a helpful answer?!Mitzimitzie
Calling EndXxxx to release resources is a hard requirement. Do not skip that. Clearly that's easy to skip by accident when you implement a timeout scheme.Spears

This isn't limited to .Net.

It's a simple fact that each async request (file, network, etc.) uses memory and (at some point, for networking requests at least) non-paged pool (see here for details of the problems you can get in unmanaged code). The number of outstanding requests is therefore limited by the amount of memory. Pre-Vista, there were some seriously low non-paged pool limits that would cause you problems well before you ran out of memory, but in a post-Vista environment things are much better for non-paged pool usage (see here).

It's a little more complex in managed code as, in addition to the issues you get in the unmanaged world, you also have to deal with the fact that the memory buffers you use for async requests are pinned until those requests complete. It sounds like you're having these problems with reads, but it's just as bad, if not worse, for writes (as soon as TCP flow control kicks in on a connection, those send completions start taking longer to occur, and so those buffers stay pinned for longer and longer - see here and here).

The problem isn't that the .Net async stuff is broken, just that the abstraction makes it all look much easier than it really is. For example, to avoid the pinning issue, allocate all of your buffers in a single, large, contiguous block at program start-up rather than on demand...
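
That pre-allocation idea might look something like this (a minimal sketch; the buffer size and pool size are illustrative). For a large enough pool the backing array lands on the large object heap, where the GC does not move it, so pinning segments of it for overlapped I/O cannot fragment the rest of the heap:

    using System;
    using System.Collections.Concurrent;

    class IoBufferPool
    {
        const int BufferSize = 4096;
        readonly byte[] _block;                 // one contiguous allocation, made at start-up
        readonly ConcurrentQueue<int> _freeOffsets = new ConcurrentQueue<int>();

        public IoBufferPool(int bufferCount)
        {
            _block = new byte[BufferSize * bufferCount];
            for (int i = 0; i < bufferCount; i++)
                _freeOffsets.Enqueue(i * BufferSize);
        }

        // Hand out a segment of the big block. Throwing when the pool is
        // empty doubles as the upper bound on outstanding I/O.
        public ArraySegment<byte> Rent()
        {
            int offset;
            if (!_freeOffsets.TryDequeue(out offset))
                throw new InvalidOperationException("Buffer pool exhausted");
            return new ArraySegment<byte>(_block, offset, BufferSize);
        }

        public void Return(ArraySegment<byte> segment)
        {
            _freeOffsets.Enqueue(segment.Offset);
        }
    }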

Personally I'd write such a crawler in unmanaged code, but that's just me ;) You will still face many of the issues, but you have a bit more control over them.

Revealment answered 20/5, 2011 at 17:16 Comment(1)
Totally agree with this. I'm leaving .Net because of this. It's a high-level language that thinks it can abstract away complexity and ends up shooting itself in the foot.Mellie

You obviously want to limit the number of concurrent requests, no matter whether your crawler is synchronous or asynchronous. That limit is not fixed; it depends on your hardware, network, ...

I'm not so sure what your question is here, as the .NET implementation of HTTP/Sockets is "ok". There are some holes (see my post about controlling timeouts properly), but it gets the job done (we have a production crawler that fetches hundreds of pages per second).

BTW, we use synchronous I/O, just for convenience's sake. Every task has a thread, and we limit the number of concurrent threads. For thread management, we used the Microsoft CCR.
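
Stripped of the CCR, the shape of that approach is roughly this (a minimal sketch; the worker count and timeout are illustrative values to tune):

    using System;
    using System.Collections.Concurrent;
    using System.IO;
    using System.Net;
    using System.Threading;

    class SyncCrawler
    {
        const int WorkerCount = 64;        // the concurrency limit
        readonly BlockingCollection<string> _urls = new BlockingCollection<string>();

        public void Start()
        {
            for (int i = 0; i < WorkerCount; i++)
                new Thread(Worker) { IsBackground = true }.Start();
        }

        public void Enqueue(string url)
        {
            _urls.Add(url);
        }

        void Worker()
        {
            foreach (var url in _urls.GetConsumingEnumerable())
            {
                try
                {
                    var request = (HttpWebRequest)WebRequest.Create(url);
                    request.Timeout = 30000;   // synchronous calls honor timeouts; no manual cancel needed
                    using (var response = request.GetResponse())
                    using (var reader = new StreamReader(response.GetResponseStream()))
                    {
                        string html = reader.ReadToEnd();
                        // ... parse html, Enqueue() discovered links ...
                    }
                }
                catch (WebException) { /* log and move on */ }
            }
        }
    }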

Rash answered 25/10, 2008 at 9:57 Comment(3)
I have no doubt that synchronous I/O on Socket works fine in .NET. I just don't trust its asynchronous I/O API.Aaronson
The problem is aborting/canceling ops; it never works well in .NET. You should always prefer the synchronous API (with timeouts); that way you don't need to cancel the op yourself.Rash
I would also suggest wrapping a synchronous WebRequest in a Task. Additionally, do not use Threads but Task<T>, which will protect you from excessive thread creation by using a thread pool. If you additionally use a CancellationTokenSource, you can easily cancel running Tasks.Gyromagnetic

No KB article can give you an upper bound. Upper bounds vary depending on the hardware available: what is an upper bound for a machine with 2 GB of memory will be different for a machine with 16 GB of RAM. It will also depend on the size of the GC heap, how fragmented it is, etc.

What you should do is come up with a metric of your own using back-of-the-envelope calculations. Figure out how many pages you want to download per minute. That should determine how many async requests you want outstanding (N).

Once you know N, create a piece of code (like the consumer end of a producer-consumer pipeline) that can keep N async download requests outstanding. As soon as a request finishes (either due to timeout or success), kick off another async request by pulling a work item from the queue.

You also need to make sure that the queue does not grow without bound if, for example, downloading becomes slow for whatever reason.
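
Put together, the pipeline might look roughly like this (a minimal sketch; N and the queue bound are illustrative placeholders for the numbers your own calculations produce):

    using System;
    using System.Collections.Concurrent;
    using System.Net;
    using System.Threading;

    class DownloadPipeline
    {
        const int N = 200;              // target number of outstanding requests
        const int MaxQueued = 10000;    // keeps the work queue from growing without bound
        static readonly BlockingCollection<string> _queue =
            new BlockingCollection<string>(MaxQueued);
        static readonly SemaphoreSlim _outstanding = new SemaphoreSlim(N);

        public static void Enqueue(string url)
        {
            _queue.Add(url);            // blocks once MaxQueued items are waiting
        }

        public static void Run()
        {
            foreach (var url in _queue.GetConsumingEnumerable())
            {
                _outstanding.Wait();    // never more than N requests in flight
                var request = (HttpWebRequest)WebRequest.Create(url);
                request.BeginGetResponse(ar =>
                {
                    try
                    {
                        using (var response = request.EndGetResponse(ar))
                        {
                            // ... read and process the response ...
                        }
                    }
                    catch (WebException) { /* a failed download still frees a slot */ }
                    finally
                    {
                        _outstanding.Release();  // slot free: the loop pulls the next work item
                    }
                }, null);
            }
        }
    }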

Sabadell answered 14/9, 2009 at 22:53 Comment(0)

This happens when you use the async Send (BeginSend) method of a socket. If you use your own custom thread pool and send the data over a thread with the synchronous Send method, that mostly solves this problem. Tested and proven.
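
The workaround could be sketched like this (an illustrative minimal version with one sender thread per socket; a real pool would share threads across connections):

    using System.Collections.Concurrent;
    using System.Net.Sockets;
    using System.Threading;

    class QueuedSender
    {
        readonly Socket _socket;
        readonly BlockingCollection<byte[]> _outgoing = new BlockingCollection<byte[]>();

        public QueuedSender(Socket socket)
        {
            _socket = socket;
            new Thread(SendLoop) { IsBackground = true }.Start();
        }

        // Callers hand off the buffer and never block on the network.
        public void Enqueue(byte[] data)
        {
            _outgoing.Add(data);
        }

        void SendLoop()
        {
            foreach (var data in _outgoing.GetConsumingEnumerable())
            {
                // Synchronous Send: the buffer is only pinned for the duration
                // of the call, never left pinned by an overlapped operation.
                _socket.Send(data);
            }
        }
    }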

Pieplant answered 20/5, 2011 at 10:18 Comment(0)
