Directory file size calculation - how to make it faster?
K

8

20

Using C#, I am finding the total size of a directory. The logic is this: get the files inside the folder, sum up their sizes, check whether there are subdirectories, and then search those recursively.
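Roughly, a simplified sketch of that recursion (plain System.IO):

    // Simplified sketch of the recursive approach described above.
    static long GetDirectorySize(string path)
    {
        long size = 0;

        // Sum the sizes of the files directly in this folder.
        foreach (string file in Directory.GetFiles(path))
            size += new FileInfo(file).Length;

        // Then recurse into each subdirectory.
        foreach (string subDir in Directory.GetDirectories(path))
            size += GetDirectorySize(subDir);

        return size;
    }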

I also tried another way to do this: using FSO (obj.GetFolder(path).Size). There is not much difference in time between these two approaches.

Now the problem is, I have tens of thousands of files in a particular folder and it takes at least 2 minutes to find the folder size. Also, if I run the program again, it happens very quickly (5 seconds). I think Windows is caching the file sizes.

Is there any way I can bring down the time taken when I run the program the first time?

Kittle answered 5/6, 2010 at 6:34 Comment(6)
Is your method any slower than it takes "explorer" to do this the first time?Sardanapalus
I think that's normal. You might use a lower level API to do the recursion at the file system level, but I doubt that would be significantly faster.Inflow
@Marc, no it is not considerably different. Also, I have tried WinApi but not much of a difference.Kittle
Defragmenting the file system with the option to group folders will speed up the initial search; AFAIK there is no other way to speed it up; you could use SSD drives...Atheistic
@MarcGravell can you do that in your machine and let me know if it works?Kittle
Check this answer: https://mcmap.net/q/223968/-what-39-s-the-best-way-to-calculate-the-size-of-a-directory-in-net It's 4 times faster.Ratsbane
B
36

I fiddled with it for a while, trying to parallelize it, and surprisingly it sped up here on my machine (up to 3 times on a quad core). I don't know if that holds in all cases, but give it a try...

.NET 4.0 code (or use 3.5 with the Task Parallel Library):

    private static long DirSize(string sourceDir, bool recurse)
    {
        long size = 0;
        string[] fileEntries = Directory.GetFiles(sourceDir);

        foreach (string fileName in fileEntries)
        {
            Interlocked.Add(ref size, (new FileInfo(fileName)).Length);
        }

        if (recurse)
        {
            string[] subdirEntries = Directory.GetDirectories(sourceDir);

            // Recurse into subdirectories in parallel; each partition keeps a
            // thread-local subtotal that is merged into 'size' at the end.
            Parallel.For<long>(0, subdirEntries.Length, () => 0, (i, loop, subtotal) =>
            {
                // Skip reparse points (junctions/symbolic links) to avoid cycles and double counting.
                if ((File.GetAttributes(subdirEntries[i]) & FileAttributes.ReparsePoint) != FileAttributes.ReparsePoint)
                {
                    subtotal += DirSize(subdirEntries[i], true);
                    return subtotal;
                }
                return 0;
            },
                (x) => Interlocked.Add(ref size, x)
            );
        }
        return size;
    }
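Usage (the path here is just a placeholder):

    long total = DirSize(@"C:\SomeLargeFolder", true);
    Console.WriteLine("{0:N0} bytes", total);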
Bunnell answered 5/6, 2010 at 17:2 Comment(5)
At least it probably optimizes the user-mode operations.Pancratium
When I was at the Microsoft Visual Studio 2010 launch event (UK Tech Days) the example used to demonstrate the new Parallel LINQ methods was exactly this: calculating directory size. IIRC we saw at least a 2x speed increase when using PLINQ on his quad core laptop. It's in one of the videos here but I can't remember which one: microsoft.com/uk/techdays/resources.aspxViviparous
Can you please also explain why you checked for ReparsePoint? If I comment out that line, the speed increases more than 5 times.Gambeson
@Gambeson because in my Opinion Reparse Points are not a real file. MSDN: "The file contains a reparse point, which is a block of user-defined data associated with a file or a directory.". But as always, it depends on your needs & requirements.Bunnell
Some directories cause a System.UnauthorizedAccessException. How can I avoid that?Olcott
H
10

Hard disks are an interesting beast: sequential access (reading a big contiguous file, for example) is super zippy, figure 80 megabytes/sec. Random access, however, is very slow. That is what you are bumping into here: recursing into the folders won't read much data (in terms of quantity), but it will require many random reads. The reason you see zippy performance the second time around is that the MFT is still in RAM (you are correct about the caching).

The best mechanism I've seen to achieve this is to scan the MFT yourself. The idea is you read and parse the MFT in one linear pass building the information you need as you go. The end result will be something much closer to 15 seconds on a HD that is very full.

Some good reading:

  • NTFSInfo.exe - http://technet.microsoft.com/en-us/sysinternals/bb897424.aspx

  • Windows Internals - http://www.amazon.com/Windows%C2%AE-Internals-Including-Windows-PRO-Developer/dp/0735625301/ref=sr_1_1?ie=UTF8&s=books&qid=1277085832&sr=8-1

FWIW: this method is very complicated as there really isn't a great way to do this in Windows (or any OS I'm aware of) - the problem is that the act of figuring out which folders/files are needed requires much head movement on the disk. It'd be very tough for Microsoft to build a general solution to the problem you describe.

Hartz answered 21/6, 2010 at 2:5 Comment(0)
C
7

The short answer is no. The way Windows could make the directory size computation faster would be to update the directory size, and all parent directory sizes, on each file write. However, that would make file writes slower. Since file writes are much more common than reading directory sizes, this is a reasonable tradeoff.

I am not sure what exact problem is being solved, but if it is file system monitoring, it might be worth checking out FileSystemWatcher: http://msdn.microsoft.com/en-us/library/system.io.filesystemwatcher.aspx

Calabria answered 5/6, 2010 at 6:34 Comment(0)
P
2

Performance will suffer using any method when scanning a folder with tens of thousands of files.

  • Using the Windows API FindFirstFile... and FindNextFile... functions provides the fastest access.

  • Due to marshalling overhead, even if you use the Windows API functions, performance will not increase. The framework already wraps these API functions, so there is no sense doing it yourself.

  • How you handle the results for any file access method determines the performance of your application. For instance, even if you use the Windows API functions, updating a list-box is where performance will suffer.

  • You cannot compare the execution speed to Windows Explorer. From my experimentation, I believe Windows Explorer reads directly from the file-allocation-table in many cases.

  • I do know that the fastest access to the file system is the DIR command. You cannot compare performance to this command. It definitely reads directly from the file-allocation-table (probably using BIOS).

  • Yes, the operating-system caches file access.

Suggestions

  • I wonder if BackupRead would help in your case?

  • What if you shell out to DIR, capture its output, and then parse it? (You are not really parsing, because each DIR line is fixed-width, so it is just a matter of calling Substring.) A rough sketch of this appears after the list below.

  • What if you shell out to DIR /B > NUL on a background thread and then run your program? While DIR is running, you will benefit from the cached file access.
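A rough sketch of the shell-out idea (my own illustration, not tested code): it runs DIR with /-C to suppress digit grouping and pulls the byte count from the final "File(s) ... bytes" summary line. Note that the summary format is locale-dependent.

    using System;
    using System.Diagnostics;
    using System.Text.RegularExpressions;

    static long DirSizeViaDir(string path)
    {
        // /s = recurse, /a = include hidden/system files, /-c = no thousands separators
        var psi = new ProcessStartInfo("cmd.exe", "/c dir /s /a /-c \"" + path + "\"")
        {
            RedirectStandardOutput = true,
            UseShellExecute = false,
            CreateNoWindow = true
        };

        using (var p = Process.Start(psi))
        {
            string output = p.StandardOutput.ReadToEnd();
            p.WaitForExit();

            // The last "NNN File(s) XXX bytes" line belongs to the
            // "Total Files Listed:" summary at the end of the /s output.
            var matches = Regex.Matches(output, @"File\(s\)\s+(\d+)\s+bytes");
            if (matches.Count == 0) return -1;   // unexpected or localized output format
            return long.Parse(matches[matches.Count - 1].Groups[1].Value);
        }
    }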

Presentday answered 16/6, 2010 at 2:22 Comment(6)
This is incorrect. DIR does not read from the file allocation table. Neither does Windows Explorer. Both make calls that go through Kernel32 and NTDLL and are handled by the filesystem drivers in kernel mode. I ran the dependency walker (depends.exe) on cmd.exe and determined that the DIR command makes calls to the Kernel32.dll routines FindFirstFileW and FindNextFileW. So shelling out to the DIR command will be slower than just calling these yourself.Choline
First, it is not possible to use "depends" to determine what API calls the DIR command uses.Presentday
Second, if you monitor the DIR command using "Process Monitor" you will notice only QueryDirectory operations are performed. If you create a simple console application in .NET that calls GetFileSystemInfos and GetDirectories you will notice the same operations are performed more often, including numerous CloseFile and CreateFile operations. These .NET methods call the API routines. Therefore, you can infer the DIR command is not calling these API functions.Presentday
Third, do what I did. Create a console application using C/C++. This application only calls the API routines and recurses down a folder structure. It does not output any content. Compare its execution time to the same DIR command redirected to NUL or to a file. The DIR command is always significantly faster. All access must go through the filesystem drivers, but DIR, and in some cases Windows Explorer, read directly from the file allocation table. See Chris Gray's answer.Presentday
Lastly, if you really want to disprove that DIR reads directly from the "fat", use DEBUG and debug CMD. I chose to write the test application to verify the behavior I was experiencing. It is my opinion that DIR has some kind of a "hook" that allows it to read the file-allocation-table in "blocks". (Most likely it uses the technique in Chris Gray's answer.) There is no other explanation for its ability to read file information from the hard drive so quickly.Presentday
+1 for the DIR idea. I measured that getting the file size of EXE files is much slower than the same for other files (within the same directory). This indicates that the real-time anti-virus scanner kicks in, even though the files are not opened, just stat'ed. DIR avoids this, which confirms it accesses only the directory information.Moise
B
2

Based on the answer by spookycoder, I found this variation (using DirectoryInfo) to be at least 2 times faster (and up to 10 times faster on complex folder structures!):

    public static long CalcDirSize(string sourceDir, bool recurse = true)
    {
        return _CalcDirSize(new DirectoryInfo(sourceDir), recurse);
    }

    private static long _CalcDirSize(DirectoryInfo di, bool recurse = true)
    {
        long size = 0;
        FileInfo[] fiEntries = di.GetFiles();
        foreach (var fiEntry in fiEntries)
        {
            Interlocked.Add(ref size, fiEntry.Length);
        }

        if (recurse)
        {
            DirectoryInfo[] diEntries = di.GetDirectories("*.*", SearchOption.TopDirectoryOnly);
            System.Threading.Tasks.Parallel.For<long>(0, diEntries.Length, () => 0, (i, loop, subtotal) =>
            {
                // Skip reparse points (junctions/symbolic links) to avoid cycles and double counting.
                if ((diEntries[i].Attributes & FileAttributes.ReparsePoint) == FileAttributes.ReparsePoint) return 0;
                subtotal += _CalcDirSize(diEntries[i], true);
                return subtotal;
            },
                (x) => Interlocked.Add(ref size, x)
            );

        }
        return size;
    }
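Called the same way as before, with a placeholder path:

    long total = CalcDirSize(@"C:\SomeLargeFolder");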
Bergen answered 13/11, 2018 at 12:36 Comment(0)
C
1

I don't think it will change a lot, but it might go a little faster if you use the API functions FindFirstFile and FindNextFile to do it.

I don't think there's any really quick way of doing it, however. For comparison purposes you could try doing dir /a /x /s > dirlist.txt and listing the directory in Windows Explorer to see how fast they are, but I think they will be similar to FindFirstFile.

PInvoke has a sample of how to use the API.
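For reference, a rough sketch of the recursion over FindFirstFile/FindNextFile (simplified, error handling mostly omitted):

    using System;
    using System.IO;
    using System.Runtime.InteropServices;

    [StructLayout(LayoutKind.Sequential, CharSet = CharSet.Auto)]
    public struct WIN32_FIND_DATA
    {
        public FileAttributes dwFileAttributes;
        public System.Runtime.InteropServices.ComTypes.FILETIME ftCreationTime;
        public System.Runtime.InteropServices.ComTypes.FILETIME ftLastAccessTime;
        public System.Runtime.InteropServices.ComTypes.FILETIME ftLastWriteTime;
        public uint nFileSizeHigh;
        public uint nFileSizeLow;
        public uint dwReserved0;
        public uint dwReserved1;
        [MarshalAs(UnmanagedType.ByValTStr, SizeConst = 260)]
        public string cFileName;
        [MarshalAs(UnmanagedType.ByValTStr, SizeConst = 14)]
        public string cAlternateFileName;
    }

    public static class NativeDirSize
    {
        [DllImport("kernel32.dll", CharSet = CharSet.Auto, SetLastError = true)]
        static extern IntPtr FindFirstFile(string lpFileName, out WIN32_FIND_DATA lpFindFileData);

        [DllImport("kernel32.dll", CharSet = CharSet.Auto, SetLastError = true)]
        static extern bool FindNextFile(IntPtr hFindFile, out WIN32_FIND_DATA lpFindFileData);

        [DllImport("kernel32.dll", SetLastError = true)]
        static extern bool FindClose(IntPtr hFindFile);

        static readonly IntPtr INVALID_HANDLE_VALUE = new IntPtr(-1);

        public static long DirSize(string path)
        {
            long size = 0;
            WIN32_FIND_DATA fd;
            IntPtr handle = FindFirstFile(Path.Combine(path, "*"), out fd);
            if (handle == INVALID_HANDLE_VALUE) return 0;   // inaccessible or empty

            try
            {
                do
                {
                    if (fd.cFileName == "." || fd.cFileName == "..") continue;

                    if ((fd.dwFileAttributes & FileAttributes.Directory) != 0)
                    {
                        // Skip reparse points to avoid cycles.
                        if ((fd.dwFileAttributes & FileAttributes.ReparsePoint) == 0)
                            size += DirSize(Path.Combine(path, fd.cFileName));
                    }
                    else
                    {
                        // The size comes back split into two 32-bit halves.
                        size += ((long)fd.nFileSizeHigh << 32) | fd.nFileSizeLow;
                    }
                } while (FindNextFile(handle, out fd));
            }
            finally
            {
                FindClose(handle);
            }
            return size;
        }
    }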

Coinsure answered 5/6, 2010 at 6:55 Comment(0)
M
0

With tens of thousands of files, you're not going to win with a head-on assault. You need to try to be a bit more creative with the solution. With that many files you could probably even find that in the time it takes you to calculate the size, the files have changed and your data is already wrong.

So, you need to move the load to somewhere else. For me, the answer would be to use System.IO.FileSystemWatcher and write some code that monitors the directory and updates an index.

It should take only a short time to write a Windows Service that can be configured to monitor a set of directories and write the results to a shared output file. You can have the service recalculate the file sizes on startup, but then just monitor for changes whenever a Create/Delete/Changed event is fired by the System.IO.FileSystemWatcher. The benefit of monitoring the directory is that you are only interested in small changes, which means that your figures have a higher chance of being correct (remember all data is stale!)

Then, the only thing to look out for would be multiple processes trying to access the resulting output file at the same time. So just make sure that you take that into account.
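A minimal in-process sketch of that idea (illustrative only; a real service would persist the index and handle renames, event overflow and access errors):

    using System;
    using System.Collections.Concurrent;
    using System.IO;
    using System.Linq;

    class DirectorySizeIndex
    {
        // path -> last known size; kept current by FileSystemWatcher events
        private readonly ConcurrentDictionary<string, long> _sizes =
            new ConcurrentDictionary<string, long>(StringComparer.OrdinalIgnoreCase);
        private readonly FileSystemWatcher _watcher;

        public DirectorySizeIndex(string root)
        {
            // Seed the index once at startup - this is the expensive part.
            foreach (var fi in new DirectoryInfo(root).EnumerateFiles("*", SearchOption.AllDirectories))
                _sizes[fi.FullName] = fi.Length;

            _watcher = new FileSystemWatcher(root)
            {
                IncludeSubdirectories = true,
                NotifyFilter = NotifyFilters.FileName | NotifyFilters.Size | NotifyFilters.LastWrite
            };
            _watcher.Created += (s, e) => Update(e.FullPath);
            _watcher.Changed += (s, e) => Update(e.FullPath);
            _watcher.Deleted += (s, e) => { long dummy; _sizes.TryRemove(e.FullPath, out dummy); };
            _watcher.Renamed += (s, e) => { long dummy; _sizes.TryRemove(e.OldFullPath, out dummy); Update(e.FullPath); };
            _watcher.EnableRaisingEvents = true;
        }

        private void Update(string path)
        {
            try
            {
                var fi = new FileInfo(path);
                if (fi.Exists) _sizes[path] = fi.Length;   // directory events simply fall through
            }
            catch (IOException) { /* file gone or locked; slightly stale data is expected */ }
        }

        // Cheap to call at any time; the figure is only eventually correct.
        public long TotalSize { get { return _sizes.Values.Sum(); } }
    }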

Moult answered 17/6, 2010 at 11:6 Comment(1)
Please don't do this; you'll end up hogging resources for all other apps. Not to mention this trick is very fragile.Hartz
K
0

I gave up on the .NET implementations (for performance reasons) and used the Native function GetFileAttributesEx(...)

Try this:

[StructLayout(LayoutKind.Sequential)]
public struct WIN32_FILE_ATTRIBUTE_DATA
{
    public uint fileAttributes;
    public System.Runtime.InteropServices.ComTypes.FILETIME creationTime;
    public System.Runtime.InteropServices.ComTypes.FILETIME lastAccessTime;
    public System.Runtime.InteropServices.ComTypes.FILETIME lastWriteTime;
    public uint fileSizeHigh;
    public uint fileSizeLow;
}

public enum GET_FILEEX_INFO_LEVELS
{
    GetFileExInfoStandard,
    GetFileExMaxInfoLevel
}

public class NativeMethods {
    [DllImport("KERNEL32.dll", CharSet = CharSet.Auto)]
    public static extern bool GetFileAttributesEx(string path, GET_FILEEX_INFO_LEVELS  level, out WIN32_FILE_ATTRIBUTE_DATA data);

}

Now simply do the following:

WIN32_FILE_ATTRIBUTE_DATA data;
if(NativeMethods.GetFileAttributesEx("[your path]", GET_FILEEX_INFO_LEVELS.GetFileExInfoStandard, out data)) {

     // Combine the high and low 32-bit halves into a 64-bit size.
     long size = ((long)data.fileSizeHigh << 32) | data.fileSizeLow;
}
Karyosome answered 21/6, 2010 at 10:5 Comment(3)
Not working on my machine. File-size-high and file-size-low are always Zero for folders.Presentday
Have you tried it with GET_FILEEX_INFO_LEVELS.GetFileExMaxInfoLevel? Also, is there no trailing '\' at the end of the path?Karyosome
Doesn't work for me either. GetFileAttributesEx returns true but fileSizeHigh and fileSizeLow are always zero. Tried with and without trailing slash.Ungovernable
