I am currently working on a research project which involves indexing a large number of files (240k); they are mostly html, xml, doc, xls, zip, rar, pdf, and text with filesizes ranging from a few KB to more than 100 MB.
With all the zip and rar files extracted, I get a final total of one million files.
I am using Visual Studio 2010, C# and .NET 4.0 with support for TPL Dataflow and Async CTP V3. To extract the text from these files I use Apache Tika (converted with ikvm) and I use Lucene.net 2.9.4 as indexer. I would like the use the new TPL dataflow library and asynchronous programming.
I have a few questions:
Would I get performance benefits if I use TPL? It is mainly an I/O process and from what I understand, TPL doesn't offer much benefit when you heavily use I/O.
Would a producer/consumer approach be the best way to deal with this type of file processing or are there any other models that are better? I was thinking of creating one producer with multiple consumers using blockingcollections.
Would the TPL dataflow library be of any use for this type of process? It seems TPL Dataflow is best used in some sort of messaging system...
Should I use asynchronous programming or stick to synchronous in this case?