Faster way to get the total space taken by the directory containing 5 million files in linux
Asked Answered
S

2

7

I have a target board running linux which has approximately around 5 million+ files in the directory. (This directory doesn't have any sub directories) If i execute this program it takes several minutes to get the total space information. Is there a faster way to accomplish this? Thanks

#include <stdio.h>
#include <dirent.h>
#include <string.h>
#include <stdlib.h>
#include <limits.h>
#include <sys/stat.h>
#include <errno.h>


void calcSpace(char *path,  long long int *totalSpace) 
{
    DIR *dir;                /* dir structure we are reading */
    struct dirent *ent;      /* directory entry currently being processed */
    char absPath[200];
    struct stat statbuf;     /* buffer for stat()*/
    long long int fileCount=0;

    fprintf(stderr, "Opening dir %s\n", path);
    dir = opendir(path);
    if(NULL == dir) {
        perror(path);
        return;
    }
    while((ent = readdir(dir))) 
    {
       fileCount++;
       sprintf(absPath, "%s/%s", path, ent->d_name);
       if(stat(absPath, &statbuf)) {
          perror(absPath);
          return;
       }
       *totalSpace= (*totalSpace) + statbuf.st_size;
    }

    fprintf(stderr, "Closing dir %s\n", path);
    printf("fileCount=%lld.\n", fileCount);
    closedir(dir);
}

int main(int argc, char *argv[]) 
{
    char *dir;
    long long int totalSpace=0;
    if(argc > 1)
        dir = argv[1];
    else
        dir = ".";

    calcSpace(dir,  &totalSpace);
    printf("totalSpace=%lld\n", totalSpace);
    return 0;
}
Scrimp answered 16/11, 2019 at 8:59 Comment(23)
change your hard drive? use ram disk? use a data base?Cavin
Its not possible to change as this is a embedded target board.Scrimp
You could chdir to the target directory to avoid a lot of string operations, but that's probably not going to gain you much.Maurya
Maybe you can employ ftw to avoid extra call to stat. it would also be a bit helpful to avoid dereferencing *totalSpace on each iterationPointed
Have you tried system ("du -s") after changing to the directory (or popen)?Clearance
Yes tried "du -s ." directly on the cmd prompt and it too takes more time.Scrimp
Unfortunately dirent does not contain the file size, even though readdir reads the directory entry, whch contains the file size. So there is no way to prevent the exta call to stat. Unless you would modify readdir and dirent to contain the file size, which could be possible as this is an embedded board with probably its own library implementation.Dilly
@PaulOgilvie: That's a good point, maybe OP can copy the code of readdir and customize it in this way.Year
Before start doing any optimization I would suggest to profile the code to make clear what should be optimizedPotamic
How about making a separate filesystem just for this directory, and symlinking to it from its current location? Then you could just ask the filesystem for the space used.Year
Any optimization such as chdir, no dereferencing or even profiling won't help because the cost are in the calls to readdir and stat.Dilly
@PaulOgilvie: I agreeScrimp
How many cores does the hardware provide.Zhukov
@alk: Hw provides 2 coresScrimp
Then symlink the files to two dummy directories and scan those in parallel?Zhukov
If the remainder of the filesystem is fairly static in size, maybe you could approximate by subtracting free space and static size from total filesystem space.Offertory
Total space taken by a file may have nothing to do with its size. E.g. if you create a new file, seek to 1 TiB, then write one byte and close it you'll probably end up with a 1 TIB file that consumes 4 KiB of space (not 1 TiB and not 1 byte).Convulsion
@MarkSetchell: Remaining file system has other folders which get updated dynamically. Files may get created in other folders as well.Scrimp
You did not specify the type of the file system. If your file system support this (e.g., ext3), consider turning on directory indexing if not already enabled. See: linux-tips.com/t/…Laubin
Do you know something about the files ? Can you tell if all files can be extended (or resized). If only a small set of files is being modified, you can cache results between calls.Laubin
@dash-o: Its an ext-4 FS. If i have to try the cmd e2fsck , i have to try at bootup i guess. I cannot unmount the FS when the target is running. And i dont intend to reboot it often.Scrimp
Use a dedicated partition on that drive for that purpose (or use a distinct drive). Then size used would essentially be partition size minus empty space.Oblige
@Phil1970: Since this target system's Filesystem is already mounted and files are present, we don't want to lose them.Scrimp
I
1

As stated in the comments, it seems that the main cost are the calls for stat and readdir.

Optimizing readdir calls

We can save some serious costs on the readdir costs using the getdents(2) syscall instead. This syscall is similar to 'readdir', but you can use it in order to read multiple directory entries in each syscall - greatly reducing the overhead of calling the readdir syscall for each entry.

A code example can be found in the man page I linked to. The importat thing to note is that you probably should fiddle around with the amount of entries read at a time using getdents (the count paramater - in the example it is 1024), to find the sweet spot for your configuration and machine (this step will probably give you the performance improvement you want over readdir).

Optimizing stat calls

It is recommended to use the fstatat(2) function instead of the regular stat you used (relevant man page for both). This is because the first paramter of fstatat(2) is a dirfd - file descriptor for the directory the file you are stating is under.

That means that you can open a file descriptor to the directory once (using open(2)), and then all fstatat calls will be done with this dirfd. This will optimize the stating process in the kernel (as the reference for the entire path and the directory itself shouldn't be resolved for every stat syscall anymore), and would probably make your code simpler and a little bit faster (as path concatenation will not be needed anymore).

Insolvable answered 16/11, 2019 at 13:12 Comment(2)
I do not believe that switching from readdir to direct calls to getdents is going to make a difference here. Readdir is implemented with by issuing getdents for multiple entries, and the major cost is the actual IO. Impact of larger blocks is minimalLaubin
Likewise, using fstatat over stat in the current folder is minimal. The main cost of the fstat is the search for filename inside the 5M folder. Locating the current folder is a small component, likely to be cached in the process. Benchmark of readdir/getdents and stat/fstatat performance does not support the case for switching (on my Mint 19). Might be different on the embedded system that is used by the OP.Laubin
L
0

While technically this is not an 'answer', some comments about the issue here:

Behind the scenes, most Linux file system represent files as 'inodes', with the directory representing list of names->inode. Depending on the file system, the list may be simple linear list, balanced tree, or a hash - which will improve on the performance of lookup, and the penalty for working with folders with large number of files.

Older operating systems (Vax VMS, and it's predecessors FILES-11) had to ability to open a file by a unique ID (file number, sequence number), in addition to opening a file by file path. In Linux space this will be equivalent to opening a file by inode number. This approach make it possible to open, or query for meta data with little overhead. Unfortunately, Unix/Linux does not have similar feature at the application level, and as much as I know, there are no plans to create such an interface. It will require significant upgrade to all file system drivers.

Another approach will be to implement multi-file stat system calls, which will benefit from being able to perform the file lookup for all files during a single scan. While type of system calls will speed up various utilities, it will not benefit most applications, which will usually stick to POSIX calls. it will be hard to implement such a feature, without changes to various file system drivers. I believe it's unlikely to be available soon.

Laubin answered 16/11, 2019 at 17:36 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.