Dealing with a very large number of files

I am currently working on a research project which involves indexing a large number of files (240k); they are mostly HTML, XML, DOC, XLS, ZIP, RAR, PDF, and text files, with file sizes ranging from a few KB to more than 100 MB.

With all the zip and rar files extracted, I get a final total of one million files.

I am using Visual Studio 2010, C# and .NET 4.0 with support for TPL Dataflow and the Async CTP V3. To extract the text from these files I use Apache Tika (converted with IKVM), and I use Lucene.Net 2.9.4 as the indexer. I would like to use the new TPL Dataflow library and asynchronous programming.

I have a few questions:

  1. Would I get performance benefits if I use TPL? It is mainly an I/O process and, from what I understand, TPL doesn't offer much benefit for heavily I/O-bound work.

  2. Would a producer/consumer approach be the best way to deal with this type of file processing, or are there other models that are better suited? I was thinking of creating one producer with multiple consumers using BlockingCollection (rough sketch after this list).

  3. Would the TPL Dataflow library be of any use for this type of process? It seems TPL Dataflow is best used in some sort of messaging system...

  4. Should I use asynchronous programming or stick to synchronous in this case?
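
Here is a rough sketch of what I have in mind for question 2, where ExtractAndIndex is a placeholder for the Tika extraction and Lucene indexing step (Task.Factory.StartNew because this targets .NET 4.0):

using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

static void Run(IEnumerable<string> files, int consumerCount)
{
    // Bounded so the producer cannot run far ahead of the consumers.
    using (var queue = new BlockingCollection<string>(boundedCapacity: 100))
    {
        var producer = Task.Factory.StartNew(() =>
        {
            foreach (var file in files)
                queue.Add(file);          // blocks while the queue is full
            queue.CompleteAdding();       // lets the consumers drain and stop
        });

        var consumers = new Task[consumerCount];
        for (int i = 0; i < consumerCount; i++)
        {
            consumers[i] = Task.Factory.StartNew(() =>
            {
                // Blocks until items arrive; ends once CompleteAdding has
                // been called and the queue is empty.
                foreach (var file in queue.GetConsumingEnumerable())
                    ExtractAndIndex(file);
            }, TaskCreationOptions.LongRunning);
        }

        producer.Wait();
        Task.WaitAll(consumers);
    }
}

// Placeholder for the Tika text extraction + Lucene indexing of one file.
static void ExtractAndIndex(string file) { }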

Parboil answered 5/5, 2012 at 14:15 Comment(7)
Yes, one million files is fundamentally an I/O bottleneck. Throwing CPU cycles at the problem isn't going to be very effective, and I'd assume that Google hardware isn't in reach. Thoroughly test your code on only a few files. Then turn it loose on the million and take a day off at the beach.Lyra
I would give a go to "a few threads extracting text from docs, PDFs etc. and putting the results into a blocking collection" and "a few indexer threads (sharing the same instance of IndexWriter) indexing the docs".Tisman
L.B, I was thinking the same thing. @Hans I was hoping to index this amount of files in less than 12 hours. I may not have Google hardware, but I'm not stuck on one machine.Parboil
Just a random thought: if you are really bottlenecked by I/O, you could buy a few small, cheap HDDs and RAID them into one partition; this should easily increase your I/O rate severalfold.Nicolanicolai
@Nicolanicolai Yes, of course, but that's not why I started this research. I want to see if it is possible to run this on an average machine while using the latest technology the .NET Framework has to offer. So far I have set up a Dataflow producer/consumer and the first results look promising.Parboil
If you find a solution, please share some info; it would be very interesting to see which solution you end up with.Mig
I am still searching for the best solution; the one I am using now contains multiple ActionBlocks, a BufferBlock and multiple producers and consumers. I am still looking for a way to run multiple consumers so that several files are processed at once. My project is too large to post here; maybe I will create a GitHub project, I will let you know.Parboil
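
For reference, a minimal sketch of the multi-consumer setup described in the last comment: a single ActionBlock with MaxDegreeOfParallelism above 1 processes several files at once, so separate consumer tasks are not needed. IndexFileAsync is a hypothetical stand-in for the per-file work:

using System.Collections.Generic;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

static async Task IndexAllAsync(IEnumerable<string> paths)
{
    // One block acts as a pool of consumers: MaxDegreeOfParallelism controls
    // how many files are processed concurrently, BoundedCapacity keeps the
    // producer from queueing the whole million files in memory.
    var indexBlock = new ActionBlock<string>(
        path => IndexFileAsync(path),
        new ExecutionDataflowBlockOptions
        {
            MaxDegreeOfParallelism = 4,
            BoundedCapacity = 100
        });

    foreach (var path in paths)
        await indexBlock.SendAsync(path);   // waits while the block is full

    indexBlock.Complete();
    await indexBlock.Completion;
}

// Hypothetical stand-in for the Tika extraction + Lucene indexing of one file.
static async Task IndexFileAsync(string path)
{
    await Task.Yield();
}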

async/await definitely helps when dealing with external resources - typically web requests, file system or db operations. The interesting problem here is that you need to fulfill multiple requirements at the same time:

  • consume as little CPU as possible (this is where async/await will help)
  • perform multiple operations at the same time, in parallel
  • control the number of tasks that are started (!) - if you do not take this into account, you will likely run out of threads when dealing with many files (a minimal sketch follows this list).
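
To illustrate the last point, a minimal sketch of such throttling without any library, using a SemaphoreSlim as the gate; processAsync stands for whatever per-file work you do. Note that this still allocates one (mostly idle) task per file, which the tree walker below avoids:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

static async Task ProcessAllAsync(
    IEnumerable<string> files, Func<string, Task> processAsync, int maxConcurrency)
{
    using (var gate = new SemaphoreSlim(maxConcurrency))
    {
        var tasks = files.Select(async file =>
        {
            // The semaphore is the throttle: at most maxConcurrency
            // invocations of processAsync run at the same time.
            await gate.WaitAsync();
            try { await processAsync(file); }
            finally { gate.Release(); }
        }).ToList();

        await Task.WhenAll(tasks);
    }
}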

You may take a look at a small project I published on GitHub:

Parallel tree walker

It is able to enumerate any number of files in a directory structure efficiently. You can define the async operation to perform on every file (in your case indexing it) while still controlling the maximum number of files that are processed at the same time.

For example:

await TreeWalker.WalkAsync(root, new TreeWalkerOptions
{
    // At most 10 files are processed at the same time.
    MaxDegreeOfParallelism = 10,
    // The async operation to run for every element found while walking.
    ProcessElementAsync = async (element) =>
    {
        var el = element as FileSystemElement;
        var path = el.Path;
        var isDirectory = el.IsDirectory;

        await DoStuffAsync(el);
    }
});

(If you cannot use the tool directly as a DLL, you may still find some useful examples in the source code.)

Swagerty answered 11/10, 2016 at 20:42 Comment(0)

You could use Everything Search. The SDK is open source and has a C# example. It's the fastest way to index files on Windows that I've seen.

From FAQ:

1.2 How long will it take to index my files?

"Everything" only uses file and folder names and generally takes a few seconds to build its > database. A fresh install of Windows XP SP2 (about 20,000 files) will take about 1 second to index. 1,000,000 files will take about 1 minute.

I'm not sure if you can use TPL with it though.

Felicita answered 16/5, 2012 at 19:55 Comment(1)
Thanks for the reply, but this isn't what I am looking for... I already know the files that need to be indexed; I need a way to extract the text and index each file's content without stressing my machine too much.Parboil