Efficiently Traverse Directory Tree with opendir(), readdir() and closedir()
Asked Answered
T

5

20

The C routines opendir(), readdir() and closedir() provide a way for me to traverse a directory structure. However, each dirent structure returned by readdir() does not seem to provide a useful way for me to obtain the set of pointers to DIR that I would need to recurse into the directory subdirectories.

Of course, they give me the name of the files, so I could either append that name to the directory path and stat() and opendir() them, or I could change the current working directory of the process via chdir() and roll it back via chdir("..").

The problem with the first approach is that if the length of the directory path is great enough, then the cost to pass a string containing it to opendir() will overweight the cost of opening a directory. If you are a bit more theoretical, you could say your complexity could increase beyond linear time (in the total character count of the (relative) filenames in the directory tree).

Also, the second approach has a problem. Since each process has a single current working directory, all but one thread will have to block in a multithreaded application. Also, I don't know if the current working directory is just a mere convenience (i.e., the relative path will be appended to it prior to a filesystem query). If it is, this approach will be inefficient too.

I am accepting alternatives to these functions. So how is it one can traverse a UNIX directory tree efficiently (linear time in the total character count of the files under it)?

Toothsome answered 22/2, 2010 at 15:57 Comment(1)
The maximum length of a filename or sub-directory is set by MAXCOMPLEN, which is traditionally 255 and hardly ever more than 512. So if you make your function recursive, you will not have huge strings, certainly nowhere near the point that allocating and managing the strings holding directory paths affects the overall complexity of the traversal.Tenderize
S
18

Have you tried ftw() aka File Tree Walk ?

Snippit from man 3 ftw:

int ftw(const char *dir, int (*fn)(const char *file, const struct stat *sb, int flag), int nopenfd);

ftw() walks through the directory tree starting from the indicated directory dir. For each found entry in the tree, it calls fn() with the full pathname of the entry, a pointer to the stat(2) structure for the entry and an int flag

Sometimes answered 22/2, 2010 at 20:8 Comment(4)
And nftw() sometimes - there's a subtle difference between the two, but I'd have to go manual bashing to find it...opengroup.org/onlinepubs/9699919799/functions/nftw.html ("The nftw() function shall recursively descend the directory hierarchy rooted in path. The nftw() function has a similar effect to ftw() except that it takes an additional argument flags...").Rockabilly
Thanks for reminding me of nftw(). I do remember using that over ftw() because the former allows you to pass a flag to tell it not to recurse over symlinks (among other things).Sometimes
Can we use ftw() to do in-order traversal through a directory ? (so that we can delete files/dirs from the bottom up of the directory tree structure)Oppidan
@D-Shan Yes. See the FTW_DEPTH flag at the man page: linux.die.net/man/3/nftw.Thiol
O
6

You seem to be missing one basic point: directory traversal involves reading data from the disk. Even when/if that data is in the cache, you end up going through a fair amount of code to get it from the cache into your process. Paths are also generally pretty short -- any more than a couple hundred bytes is pretty unusual. Together these mean that you can pretty reasonably build up strings for all the paths you need without any real problem. The time spent building the strings is still pretty minor compared to the time to read data from the disk. That means you can normally ignore the time spent on string manipulation, and work exclusively at optimizing disk usage.

My own experience has been that for most directory traversal a breadth-first search is usually preferable -- as you're traversing the current directory, put the full paths to all sub-directories in something like a priority queue. When you're finished traversing the current directory, pull the first item from the queue and traverse it, continuing until the queue is empty. This generally improves cache locality, so it reduces the amount of time spent reading the disk. Depending on the system (disk speed vs. CPU speed, total memory available, etc.) it's nearly always at least as fast as a depth-first traversal, and can easily be up to twice as fast (or so).

Omura answered 22/2, 2010 at 16:28 Comment(3)
Why use a priority queue and not something simpler like a FIFO queue? What do you use as the priority attribute?Slotnick
@Andrew: Good question. A FIFO will work perfectly well. A PQ simply makes it easy to produce results in order sorted by name, which the user generally prefers (certainly, I prefer it when I'm using it...)Omura
thanks, that makes sense, I hadn't considered the output format.Slotnick
C
3

The way to use opendir/readdir/closedir is to make the function recursive! Have a look at the snippet here on Dreamincode.net.

Hope this helps.

EDIT Thanks R.Sahu, the linky has expired, however, found it via wayback archive and took the liberty to add it to gist. Please remember, to check the license accordingly and attribute the original author for the source! :)

Carmelinacarmelita answered 22/2, 2010 at 16:14 Comment(1)
The link for the code is not valid any more. Just FYI.Calvano
D
3

Instead of opendir(), you can use a combination of openat(), dirfd() and fdopendir() and construct a recursive function to walk a directory tree:

#include <dirent.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>

void dir_recurse (DIR *parent, int level) {
    struct dirent *ent;
    if (!parent) {
        return;
    }
    while ((ent = readdir(parent)) != NULL) {
        if ((strcmp(ent->d_name, ".") == 0) ||
            (strcmp(ent->d_name, "..") == 0)) {
            continue;
        }

        int parent_fd = dirfd(parent);
        if (parent_fd < 0) {
            perror("dirfd");
            continue;
        }
        int fd = openat(parent_fd, ent->d_name, O_RDONLY | O_DIRECTORY);
        if (fd != -1) { /* Directory */
            printf("%*s%s/\n", level, "", ent->d_name);
            DIR *child = fdopendir(fd);
            if (child) {
                dir_recurse(child, level + 1);
                closedir(child);
            } else {
                perror("fdopendir");
            }
        } else if (errno == ENOTDIR) { /* Regular file */
            printf("%*s%s\n", level, "", ent->d_name);
        } else {
            perror("openat");
        }
    }
}

int main (int argc, const char **argv) {
    DIR *root = opendir("..");
    if (root) {
        dir_recurse(root, 0);
        closedir(root);
    } else {
        perror("opendir");
    }

    return 0;
}

Here readdir() is still used to get the next directory entry. If the next entry is a directory, then we find the parent directory fd with dirfd() and pass this, along with the child directory name to openat(). The resulting fd refers to the child directory. This is passed to fdopendir() which returns a DIR * pointer for the child directory, which can then be passed to our dir_recurse() where it again will be valid for use with readdir() calls.

This program recurses over the whole directory tree rooted at .. Entries are printed, indented by 1 space per directory level. Directories are printed with a trailing /.

On Compiler Explorer.

Deach answered 10/7, 2019 at 18:13 Comment(3)
d_type is a BSD and glibc extension, i.e. this is not POSIX-compatible. Also, even on Linux, d_type is not supported by all filesystems.Bertelli
@Bertelli Fair enough. I've removed the use of d_type and check the openat() errno for ENOTDIR insteadDeach
The nice thing about using this approach is that there's zero dependency on [mis]using PATH_MAX - which doesn't even have to exist (and arguably shouldn't exist on Linux anyway...: "A definition of one of the symbolic constants in the following list shall be omitted from the <limits.h> header on specific implementations where the corresponding value is equal to or greater than the stated minimum, but where the value can vary depending on the file to which it is applied."Bodega
B
2

Probably overkill for your application, but here's a library designed to traverse a directory tree with hundreds of millions of files.

https://github.com/hpc/libcircle

Bertie answered 9/2, 2012 at 16:42 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.