Performant File Copy in C#?

I have a huge directory of about 500k jpg files, and I'd like to archive all files that are older than a certain date. Currently, the script takes hours to run.

This has a lot to do with the piss-poor performance of GoGrid's storage servers, but at the same time I'm sure there's a much more efficient way, RAM/CPU-wise, to accomplish what I'm doing.

Here's the code I have:

var dirInfo = new DirectoryInfo(PathToSource);
var fileInfo = dirInfo.GetFiles("*.*");
var filesToArchive = fileInfo.Where(f => 
    f.LastWriteTime.Date < StartThresholdInDays.Days().Ago().Date
      && f.LastWriteTime.Date >= StopThresholdInDays.Days().Ago().Date
);

foreach (var file in filesToArchive)
{
    file.CopyTo(Path.Combine(PathToTarget, file.Name)); // Path.Combine avoids a missing-separator bug
}

The Days().Ago() stuff is just syntactic sugar.

Ozenfant answered 4/11, 2009 at 22:49 Comment(7)
That relies on the host operating system, which should be top-notch. – Lauderdale
Ya, the truth is there could be millions of files in there; I'm unable even to get a count of the directory through Windows Explorer because of similar performance problems. – Ozenfant
The grammar Nazi says: "Performant" is not a word :) – Bursary
Performant is so a word. dictionary.reference.com/browse/performant – Lazarolazaruk
Well, it is because it is used, and a dictionary is a living, changing thing. But in the technical sense it is as much a word as "Homie". – Bursary
Every word was established through use at some point; resisting the evolution of language by making rules about what is 'technically' a word or not is the linguistic equivalent of refusing to adopt new technologies. The real test of a word is whether the reader understands what the writer means. – Onstad
+1 for a good practical question that is bound to affect most large websites eventually. – Pitchblende

The only part I think you could improve is dirInfo.GetFiles("*.*"). In .NET 3.5 and earlier it returns an array with all the file names, which takes time to build and uses lots of RAM. In .NET 4.0 there is a new Directory.EnumerateFiles method that returns an IEnumerable<string> instead, streaming results lazily as they are read from the disk. This could improve performance a bit, but don't expect miracles...
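For illustration, a sketch of the question's loop rewritten against the .NET 4.0 API, reusing the names from the question; the Days().Ago() sugar is replaced with plain DateTime arithmetic, so treat the cutoff variables as my assumption:

using System;
using System.IO;
using System.Linq;

// EnumerateFiles streams names one at a time instead of building
// a 500k-element array up front.
DateTime startCutoff = DateTime.Today.AddDays(-StartThresholdInDays);
DateTime stopCutoff  = DateTime.Today.AddDays(-StopThresholdInDays);

var filesToArchive = Directory.EnumerateFiles(PathToSource)
    .Select(path => new FileInfo(path))
    .Where(f => f.LastWriteTime.Date < startCutoff
             && f.LastWriteTime.Date >= stopCutoff);

foreach (var file in filesToArchive)
{
    file.CopyTo(Path.Combine(PathToTarget, file.Name));
}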

Abra answered 4/11, 2009 at 22:58 Comment(2)
Actually that is exactly what needs to be done: EnumerateFiles returns an enumerator, not the whole list, so you save all the memory needed for the array. Say it's 500k files * 100 bytes = 50 MB of RAM; with Enumerate you only hold about 100 bytes at a time, because you get one file at a time. – Aronarondel
+1, .NET 4.0 has lots of really nice features in System.IO. Not sure if it will improve the situation with a million files in a directory :-D – Ketty

While .NET 4.0 provides the lazy Directory.EnumerateFiles, you can do this right now on .NET 3.5:
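In sketch form (my reconstruction of the idea, not necessarily the answerer's exact code), a lazy enumerator built on FindFirstFile/FindNextFile looks like this; the declarations follow the usual P/Invoke signatures for those Win32 functions:

using System;
using System.Collections.Generic;
using System.IO;
using System.Runtime.InteropServices;

static class LazyDirectory
{
    [StructLayout(LayoutKind.Sequential, CharSet = CharSet.Auto)]
    struct WIN32_FIND_DATA
    {
        public FileAttributes dwFileAttributes;
        public System.Runtime.InteropServices.ComTypes.FILETIME ftCreationTime;
        public System.Runtime.InteropServices.ComTypes.FILETIME ftLastAccessTime;
        public System.Runtime.InteropServices.ComTypes.FILETIME ftLastWriteTime;
        public uint nFileSizeHigh;
        public uint nFileSizeLow;
        public uint dwReserved0;
        public uint dwReserved1;
        [MarshalAs(UnmanagedType.ByValTStr, SizeConst = 260)]
        public string cFileName;
        [MarshalAs(UnmanagedType.ByValTStr, SizeConst = 14)]
        public string cAlternateFileName;
    }

    [DllImport("kernel32.dll", CharSet = CharSet.Auto, SetLastError = true)]
    static extern IntPtr FindFirstFile(string lpFileName, out WIN32_FIND_DATA lpFindFileData);

    [DllImport("kernel32.dll", CharSet = CharSet.Auto, SetLastError = true)]
    static extern bool FindNextFile(IntPtr hFindFile, out WIN32_FIND_DATA lpFindFileData);

    [DllImport("kernel32.dll", SetLastError = true)]
    static extern bool FindClose(IntPtr hFindFile);

    static readonly IntPtr INVALID_HANDLE_VALUE = new IntPtr(-1);

    // Yields one file path at a time; memory use stays flat no matter
    // how many files the directory holds.
    public static IEnumerable<string> EnumerateFiles(string path)
    {
        WIN32_FIND_DATA findData;
        IntPtr handle = FindFirstFile(Path.Combine(path, "*"), out findData);
        if (handle == INVALID_HANDLE_VALUE)
            yield break;
        try
        {
            do
            {
                // Skip subdirectories (this also skips "." and "..").
                if ((findData.dwFileAttributes & FileAttributes.Directory) == 0)
                    yield return Path.Combine(path, findData.cFileName);
            }
            while (FindNextFile(handle, out findData));
        }
        finally
        {
            FindClose(handle);
        }
    }
}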

Masry answered 4/11, 2009 at 23:30 Comment(2)
Thanks Mauricio... this works for the RAM problem, but not CPU. It still takes hours to complete, but at least the RAM doesn't balloon on me. – Ozenfant
That works well enough to solve my problem. It takes about 2 hours, but now it can run in the background with a maximum of 4 MB of RAM, whereas before it would use hundreds of megs. – Ozenfant

I'd keep the 80/20 rule in mind: if the bulk of the slowdown is file.CopyTo, and that slowdown far outweighs the cost of the LINQ query, then I wouldn't worry. You can test this by removing the file.CopyTo line and replacing it with a Console.WriteLine. Time that against the real copy; the difference is the GoGrid overhead versus the rest of the operation. My hunch is there won't be any realistic big gains on your end.
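A minimal timing sketch of that test, assuming the question's filesToArchive query is in scope:

var sw = System.Diagnostics.Stopwatch.StartNew();
foreach (var file in filesToArchive)
{
    Console.WriteLine(file.Name);   // stand-in for the real file.CopyTo(...)
}
Console.WriteLine("Enumerate + filter only: " + sw.Elapsed);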

EDIT: Ok, so the 80% is the GetFiles operation, which isn't surprising if in fact there are a million files in the directory. Your best bet may be to use the Win32 API directly (FindFirstFile and family) via P/Invoke:

[DllImport("kernel32.dll", CharSet=CharSet.Auto)]
static extern IntPtr FindFirstFile(string lpFileName, 
    out WIN32_FIND_DATA lpFindFileData);

I'd also suggest, if possible, altering the directory structure to decrease the number of files per directory. This will improve the situation immensely.
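If it helps, one hypothetical way to split things up is to bucket files into per-month subfolders keyed on LastWriteTime (the folder scheme here is my assumption, not something from the question):

foreach (var file in filesToArchive)
{
    // e.g. "...\archive\2009-11\photo.jpg"
    string bucket = Path.Combine(PathToTarget, file.LastWriteTime.ToString("yyyy-MM"));
    Directory.CreateDirectory(bucket);   // no-op if the folder already exists
    file.CopyTo(Path.Combine(bucket, file.Name));
}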

EDIT2: I'd also consider changing from GetFiles("*.*") to just GetFiles(). Since you're asking for everything anyway, there's no sense having it apply globbing rules to each entry.

Ketty answered 4/11, 2009 at 22:56 Comment(2)
The bulk of the operation is the dirInfo.GetFiles("*.*") statement. I'm doing a test with only 5 days' worth of files, and I run out of RAM/patience before I can even get a count of the files in the directory from which to do the LINQ query. Is there a better way to GetFiles(), like having it return only files within a range instead of returning them all? At least that way I could break the operation into 10% chunks this first time, and then have the archiver run every night. As it stands now, I can't really get anywhere. – Ozenfant
Yes, altering the directory structure is what I'm trying to do, but first I need to access the files without waiting all day and timing out the server :) – Ozenfant

You should consider using a third-party utility to perform the copying for you. Something like robocopy may speed up your processing significantly. See also https://serverfault.com/questions/54881/quickest-way-of-moving-a-large-number-of-files
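For example, robocopy's /MINAGE and /MAXAGE switches filter by file age in days, which maps onto the question's thresholds. A rough sketch of shelling out to it from C#, reusing the question's placeholder names:

using System.Diagnostics;

var psi = new ProcessStartInfo
{
    FileName = "robocopy",
    // /MINAGE excludes files newer than the start threshold,
    // /MAXAGE excludes files older than the stop threshold.
    Arguments = string.Format(
        "\"{0}\" \"{1}\" *.jpg /MINAGE:{2} /MAXAGE:{3} /NP",
        PathToSource, PathToTarget, StartThresholdInDays, StopThresholdInDays),
    UseShellExecute = false
};
using (var robocopy = Process.Start(psi))
{
    robocopy.WaitForExit();
    // robocopy exit codes below 8 indicate success.
}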

Null answered 4/11, 2009 at 23:9 Comment(1)
And robocopy is included in Win7 and Server 2008 by default! – Usa

You could experiment with using a (limited) number of threads to perform the CopyTo(). Right now the whole operation is limited to one core.

This will only improve performance if the job is currently CPU-bound. But if this runs on a RAID, it may work.
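A sketch of what that might look like on .NET 3.5, with a semaphore capping the number of in-flight copies (maxWorkers and the error handling are my assumptions):

using System.Collections.Generic;
using System.IO;
using System.Threading;

// Copies files with at most maxWorkers copies in flight at once.
static void CopyInParallel(IEnumerable<FileInfo> files, string targetDir, int maxWorkers)
{
    var gate = new Semaphore(maxWorkers, maxWorkers);
    foreach (var file in files)
    {
        gate.WaitOne();                  // block until a worker slot is free
        var f = file;                    // local copy; the loop variable is shared across closures
        ThreadPool.QueueUserWorkItem(delegate
        {
            try { f.CopyTo(Path.Combine(targetDir, f.Name)); }
            catch (IOException) { /* a real version would log the failure */ }
            finally { gate.Release(); }
        });
    }
    // Drain: reclaim every slot so we know all queued copies have finished.
    for (int i = 0; i < maxWorkers; i++)
        gate.WaitOne();
}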

Orvie answered 4/11, 2009 at 22:53 Comment(1)
I believe GoGrid is "in the Cloud". There may be limitations on active connections. Regardless, good advice. – Ketty

Take a listen to this Hanselminutes podcast. Scott talks to Aaron Bockover, the author of the Banshee media player; they ran into this exact issue and discuss it at 8:20 in the podcast.

If you can use .NET 4.0, then use Directory.EnumerateFiles as mentioned by Thomas Levesque. If not, you may need to write your own directory-walking code using the native Win32 APIs, as they did in Mono.Posix.

Usa answered 4/11, 2009 at 23:25 Comment(0)
