Directory recursion and symlinks

If you recursively traverse a directory tree by the obvious method, you'll run into trouble with infinite recursion when a symlink points to a parent directory.

An obvious solution would be to just check for symlinks and not follow them at all. But that might be an unpleasant surprise for a user who doesn't expect something that otherwise behaves like a perfectly normal directory to be silently ignored.

An alternative solution might be to keep a hash table of all directories visited so far, and use this to check for loops. But this would require there to be some canonical representation, some way to get the identity, of the directory you are currently looking at (regardless of the path by which you reached it).

Would Unix users typically regard the second solution as less surprising?

If so, is there a way to obtain such a canonical representation/identity of a directory, that's portable across Unix systems? (I'd like it to work across Linux, BSD, Mac OS, Solaris etc. I expect to have to write separate code for Windows.)

Remand answered 11/9, 2011 at 10:54 Comment(0)
A
3

The most frequently overlooked API in this field is probably nftw.

nftw has a flag, FTW_PHYS, that stops it from following symlinks, and it can do much more than that. Here is a simple example from the man page itself:

#define _XOPEN_SOURCE 500
#include <ftw.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>

static int
display_info(const char *fpath, const struct stat *sb,
             int tflag, struct FTW *ftwbuf)
{
    printf("%-3s %2d %7jd   %-40s %d %s\n",
           (tflag == FTW_D) ?   "d"   : (tflag == FTW_DNR) ? "dnr" :
           (tflag == FTW_DP) ?  "dp"  : (tflag == FTW_F) ?   "f" :
           (tflag == FTW_NS) ?  "ns"  : (tflag == FTW_SL) ?  "sl" :
           (tflag == FTW_SLN) ? "sln" : "???",
           ftwbuf->level, (intmax_t) sb->st_size,
           fpath, ftwbuf->base, fpath + ftwbuf->base);
    return 0;           /* To tell nftw() to continue */
}

int
main(int argc, char *argv[])
{
    int flags = 0;

    if (argc > 2 && strchr(argv[2], 'd') != NULL)
        flags |= FTW_DEPTH;
    if (argc > 2 && strchr(argv[2], 'p') != NULL)
        flags |= FTW_PHYS;

    if (nftw((argc < 2) ? "." : argv[1], display_info, 20, flags)
            == -1)
    {
        perror("nftw");
        exit(EXIT_FAILURE);
    }
    exit(EXIT_SUCCESS);
}

Afoul answered 11/9, 2011 at 11:50 Comment(2)
The man page for nftw says to use the fts functions instead, which are also available on both Linux and BSD (macOS) and work more efficiently. See also my answer below (which doesn't provide more info, though). - Milurd
Thanks @ThomasTempelmann for adding perspective. I never actually use this function in practice (I suppose I might in a "quick" program that needed to be in C. I did once ace an interview question with it :)). It's good to re-read man pages when re-using snippets after 10 years! - Afoul
F
3

The canonical absolute path of the directory (with every symlink, . and .. component resolved) is such a representation. You can get it with the realpath function, which is defined in the POSIX standard, so it should work on any POSIX-compliant system. See man 3 realpath.
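To make that concrete, here is a minimal sketch of comparing two paths by their canonical form. The helper name same_canonical_path is made up for illustration, and passing NULL as the second argument to realpath assumes a POSIX.1-2008 system (older systems need a caller-supplied PATH_MAX buffer).

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not part of any library): returns 1 if both paths
 * resolve to the same canonical path, 0 if they differ, and -1 if either
 * path cannot be resolved (for example, because it does not exist). */
static int same_canonical_path(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);

    /* With a NULL buffer, realpath() allocates the result with malloc(). */
    char *ra = realpath(a, NULL);
    char *rb = realpath(b, NULL);

    int result = -1;
    if (ra != NULL && rb != NULL)
        result = (strcmp(ra, rb) == 0);

    free(ra);   /* free(NULL) is harmless */
    free(rb);
    return result;
}
```

Note that canonical paths only catch loops created by symlinks. Two different paths to the same directory through hard links or bind mounts would still compare as different.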

Fetal answered 11/9, 2011 at 11:3 Comment(0)
M
3

There is also the Linux/BSD function fts_open().

It gives you an easy-to-use iterator for traversing the contents of all subdirectories while also detecting such symlink recursions.

In fact, the man page (on macOS) for nftw says that it's an old function which is now superseded by the fts API I mention here:

These functions are provided for compatibility with legacy code. New code should use the fts(3) functions.
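A rough sketch of what such a traversal could look like, assuming a system that ships fts.h (glibc, the BSDs, macOS; it is not part of POSIX). The function name walk_with_fts is hypothetical. FTS_LOGICAL asks fts to follow symlinks, and entries reported as FTS_DC are directories that would cause a cycle.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fts.h>
#include <stdio.h>

/* Hypothetical helper: walks `root` while following symlinks, prints every
 * entry it can visit, and returns the number of cycles fts reported
 * (or -1 if the traversal could not be started). */
static int walk_with_fts(const char *root)
{
    assert(root != NULL);

    /* fts_open() takes a NULL-terminated array of non-const strings. */
    char *const paths[] = { (char *)root, NULL };
    FTS *tree = fts_open(paths, FTS_LOGICAL | FTS_NOCHDIR, NULL);
    if (tree == NULL)
        return -1;

    int cycles = 0;
    FTSENT *ent;
    while ((ent = fts_read(tree)) != NULL) {
        switch (ent->fts_info) {
        case FTS_DC:    /* directory that causes a cycle; fts does not descend */
            cycles++;
            fprintf(stderr, "cycle skipped: %s\n", ent->fts_path);
            break;
        case FTS_DP:    /* second (post-order) visit of a directory */
            break;
        case FTS_DNR:   /* unreadable directory */
        case FTS_ERR:   /* other error */
        case FTS_NS:    /* stat() failed */
            fprintf(stderr, "cannot read: %s\n", ent->fts_path);
            break;
        default:        /* files, directories in pre-order, and so on */
            printf("%s\n", ent->fts_path);
            break;
        }
    }
    fts_close(tree);
    return cycles;
}
```

Called on a directory that contains a symlink pointing back at one of its ancestors, this should report that link as a cycle rather than recurse forever.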

Milurd answered 28/7, 2021 at 12:8 Comment(0)
K
2

Not only symlinks, but hard links as well. Hard-linked directories are not very common, but not forbidden everywhere (where they are allowed at all, only root can create them; Linux refuses them outright). The only thing that is canonical is the pair {device_number, inode_number}. But network filesystems can misbehave.

Katherine answered 11/9, 2011 at 11:19 Comment(0)
B
2

This problem of identical files must be solved by many applications, for example a checker for duplicate files (identical contents, different names) and utilities acting on whole directory hierarchies, like tar.

A good implementation wouldn't want to give false positives for hard linked files and symlinked files, either through symlinks to parent directories or to files.

The most portable approach is to identify files with the POSIX stat/fstat functions, using the st_dev and st_ino members of the struct stat they fill in. A real-world file dupe checker in C that employs this strategy is samefile (a different implementation of which was a winning entry of the 1998 IOCCC :-).
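Applied to the original question, the same idea could look roughly like the sketch below: remember the (st_dev, st_ino) pair of every directory entered and refuse to enter one twice. The names dir_id, visited, visited_add and walk are invented for this example, and the visited set is a plain array with linear search to keep it short; a real tool would more likely use a hash table.

```c
#include <assert.h>
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Identity of a directory: the (device, inode) pair from struct stat. */
struct dir_id { dev_t dev; ino_t ino; };

/* Hypothetical visited set: growable array, linear search. */
struct visited { struct dir_id *ids; size_t len, cap; };

/* Returns 1 if newly added, 0 if already present, -1 if out of memory. */
static int visited_add(struct visited *v, dev_t dev, ino_t ino)
{
    assert(v != NULL);
    for (size_t i = 0; i < v->len; i++)
        if (v->ids[i].dev == dev && v->ids[i].ino == ino)
            return 0;
    if (v->len == v->cap) {
        size_t cap = v->cap ? v->cap * 2 : 16;
        struct dir_id *p = realloc(v->ids, cap * sizeof *p);
        if (p == NULL)
            return -1;
        v->ids = p;
        v->cap = cap;
    }
    v->ids[v->len].dev = dev;
    v->ids[v->len].ino = ino;
    v->len++;
    return 1;
}

/* Walks `path`, following symlinks, and returns how many directories were
 * skipped because they had been visited already (-1 if out of memory). */
static int walk(const char *path, struct visited *v)
{
    struct stat sb;
    if (stat(path, &sb) != 0 || !S_ISDIR(sb.st_mode))
        return 0;                       /* not a directory we can enter */

    int added = visited_add(v, sb.st_dev, sb.st_ino);
    if (added <= 0)
        return added == 0 ? 1 : -1;     /* seen before: count it, do not recurse */
    printf("%s\n", path);

    DIR *d = opendir(path);
    if (d == NULL)
        return 0;
    int skipped = 0;
    struct dirent *e;
    while ((e = readdir(d)) != NULL) {
        if (strcmp(e->d_name, ".") == 0 || strcmp(e->d_name, "..") == 0)
            continue;
        char child[4096];
        int n = snprintf(child, sizeof child, "%s/%s", path, e->d_name);
        if (n < 0 || (size_t)n >= sizeof child)
            continue;                   /* path too long for this sketch */
        int r = walk(child, v);
        if (r < 0) {
            closedir(d);
            return -1;
        }
        skipped += r;
    }
    closedir(d);
    return skipped;
}
```

Because stat (unlike lstat) follows symlinks, a link back to a parent directory yields the parent's own device and inode numbers, so the loop is caught regardless of the path used to reach it. The caller starts with a zero-initialized struct visited and frees its ids member afterwards.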

Brighten answered 11/9, 2011 at 11:38 Comment(0)
W
1

Since you haven't specified what language you're working with (if any), let's start with just the shell: if you're on a system with GNU readlink, just use readlink -f <path> to canonicalize it.

If you're on a Mac (which has a non-GNU readlink that behaves differently), see How can I get the behavior of GNU's readlink -f on a Mac? for the way to accomplish the same task.

The other option is to use inode ids to track unique files (via stat or similar), but that'll require first following all of the symlinks anyway (since symlinks themselves do have their own unique inode id), and the simplest way to follow all of the symlinks is, well, readlink.


Alternatively, many programming languages have bindings to the POSIX realpath function, which essentially performs the same function as readlink -f (but as a library call). For instance, Python has os.path.realpath(), C has it as a function in stdlib.h, et cetera.

If you're already working in a language that has such a function, using it is highly recommended, since you'll often get cross-platform compatibility for free (assuming your language is cross-platform).

Wb answered 11/9, 2011 at 10:59 Comment(0)
