Download HTML pages concurrently using the Async CTP

private readonly ConcurrentStack<LinkItem> _LinkStack; private readonly Int32 _MaxStackSize; private readonly WebClient client = new WebClient(); Func<string, string, Task<List<LinkItem>>> DownloadFromLink = async (BaseURL, uri) => { string html = await client.DownloadStringTaskAsync(uri); return LinkFinder.Find(html, BaseURL); }; Action<LinkItem> DownloadAndPush = async (o) => { List<LinkItem> result = await DownloadFromLink(o.BaseURL, o.Href); if (this._LinkStack.Count() + result.Count <= this._MaxStackSize) { this._LinkStack.PushRange(result.ToArray()); o.Processed = true; } }; Parallel.ForEach(this._LinkStack, (o) => { DownloadAndPush(o); });

I think the best way to do something like this using C# 5/.Net 4.5 is to use TPL Dataflow. There even is a walkthrough on how to implement web crawler using it.

Basically, you create one "block" that takes care of downloading one URL and getting the link from it:

var cts = new CancellationTokenSource();

Func<LinkItem, Task<IEnumerable<LinkItem>>> downloadFromLink =
    async link =>
            {
                // WebClient is not guaranteed to be thread-safe,
                // so we shouldn't use one shared instance
                var client = new WebClient();
                string html = await client.DownloadStringTaskAsync(link.Href);

                return LinkFinder.Find(html, link.BaseURL);
            };

var linkFinderBlock = new TransformManyBlock<LinkItem, LinkItem>(
    downloadFromLink,
    new ExecutionDataflowBlockOptions
    { MaxDegreeOfParallelism = 4, CancellationToken = cts.Token });

You can set MaxDegreeOfParallelism to any value you want. It says at most how many URLs can be downloaded concurrently. If you don't want to limit it at all, you can set it to DataflowBlockOptions.Unbounded.

Then you create one block that processes all the downloaded links somehow, like storing them all in a list. It can also decide when to cancel downloading:

var links = new List<LinkItem>();

var storeBlock = new ActionBlock<LinkItem>(
    linkItem =>
    {
        links.Add(linkItem);
        if (links.Count == maxSize)
            cts.Cancel();
    });

Since we didn't set MaxDegreeOfParallelism, it defaults to 1. That means using collection that is not thread-safe should be okay here.

We create one more block: it will take a link from linkFinderBlock, and pass it both to storeBlock and back to linkFinderBlock.

var broadcastBlock = new BroadcastBlock<LinkItem>(li => li);

The lambda in its constructor is a "cloning function". You can use it to create a clone of the item if you want to, but it shouldn't be necessary here, since we don't modify the LinkItem after creation.

Now we can connect the blocks together:

linkFinderBlock.LinkTo(broadcastBlock);
broadcastBlock.LinkTo(storeBlock);
broadcastBlock.LinkTo(linkFinderBlock);

Then we can start processing by giving the first item to linkFinderBlock (or broadcastBlock, if you want to also send it to storeBlock):

linkFinderBlock.Post(firstItem);

And finally wait until the processing is complete:

try
{
    linkFinderBlock.Completion.Wait();
}
catch (AggregateException ex)
{
    if (!(ex.InnerException is TaskCanceledException))
        throw;
}

Recommended topics

Hot tags