How to list first level directories only in C?

M

5

4

In a terminal I can call ls -d */. Now I want a C program to do that for me, like this:

#include <sys/types.h>
#include <sys/wait.h>
#include <stdio.h>
#include <unistd.h>

int main( void )
{
    int status;

    char *args[] = { "/bin/ls", "-l", NULL };

    if ( fork() == 0 )
        execv( args[0], args );
    else
        wait( &status ); 

    return 0;
}

This will ls -l everything. However, when I try:

char *args[] = { "/bin/ls", "-d", "*/",  NULL };

I get a runtime error:

ls: */: No such file or directory

Multidisciplinary answered 10/9, 2016 at 19:24 Comment(5)

Just call system. Globs on Unixes are expanded by the shell. system will give you a shell. – Westbound 10/9, 2016 at 19:30

Thanks @PSkocik, that did it! Would you like to post an answer? system("/bin/ls -d */"); Explaining why execv() couldn't do the trick ;) – Multidisciplinary 10/9, 2016 at 19:34

Remember that if you use system(), you shouldn't also fork(). – Acquah 10/9, 2016 at 19:39

Correct @unwind, I wrote the code, 3 lines of code in the body of main(). – Multidisciplinary 10/9, 2016 at 19:40

Avoid system() and use execv() wherever possible. system() requires proper quoting and is the source of many security problems. Your problem is that '*' is expanded by the shell, not by ls. You can try executing find -type d instead. – Cordillera 10/9, 2016 at 19:42
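
To make the point from these comments concrete: execv() hands its argument strings to the new program untouched, so a pattern is only expanded if a shell sits in between. The following is a minimal sketch rather than anyone's final code; the helper name run_via_shell is made up for illustration, and it assumes a POSIX shell lives at /bin/sh.

```c
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: run `command` through /bin/sh, so that the shell
   expands any glob patterns before the target program sees them.
   Returns the command's exit status, or -1 on failure. */
int run_via_shell(const char *command)
{
    char *const args[] = { "/bin/sh", "-c", (char *)command, NULL };
    int status;

    pid_t pid = fork();
    if (pid == -1) {
        perror("fork");
        return -1;
    }
    if (pid == 0) {
        execv(args[0], args);
        perror("execv");        /* only reached if execv() failed */
        _exit(127);
    }
    if (waitpid(pid, &status, 0) == -1) {
        perror("waitpid");
        return -1;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling run_via_shell("ls -d */") lists the directories, whereas putting "*/" directly into the execv() argument vector for /bin/ls reproduces the error from the question. The quoting and security caveats raised about system() apply to this helper as well, since the string is still interpreted by a shell.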

A

5

Unfortunately, all solutions based on shell expansion are limited by the maximum command line length, which varies between systems (run true | xargs --show-limits to find out); on my system, it is about two megabytes. Yes, many will argue that it suffices, as Bill Gates supposedly did about 640 kilobytes, once.

(When running certain parallel simulations on non-shared filesystems, I do occasionally have tens of thousands of files in the same directory, during the collection phase. Yes, I could do that differently, but that happens to be the easiest and most robust way to collect the data. Very few POSIX utilities are actually silly enough to assume "X is sufficient for everybody".)

Fortunately, there are several solutions. One is to use find instead:

system("/usr/bin/find . -mindepth 1 -maxdepth 1 -type d");

You can also format the output as you wish, not depending on locale:

system("/usr/bin/find . -mindepth 1 -maxdepth 1 -type d -printf '%p\\n'");

If you want to sort the output, use \0 as the separator (since filenames are allowed to contain newlines), and pass -z to sort (GNU sort) so that it uses \0 as the separator, too. tr will convert them to newlines for you. Note that the backslashes must be doubled in the C string literal, because a bare \0 would end the string early:

system("/usr/bin/find . -mindepth 1 -maxdepth 1 -type d -printf '%p\\0' | sort -z | tr '\\0' '\\n'");

If you want the names in an array, use glob() function instead.

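Finally, as I like to harp every now and then, one can use the POSIX nftw() function to implement this internally. (The FTW_ACTIONRETVAL flag and FTW_SKIP_SUBTREE return value used below are glibc extensions, which is why _GNU_SOURCE is defined.)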

#define _GNU_SOURCE
#include <stdio.h>
#include <ftw.h>

#define NUM_FDS 17

int myfunc(const char *path,
           const struct stat *fileinfo,
           int typeflag,
           struct FTW *ftwinfo)
{
    const char *file = path + ftwinfo->base;
    const int depth = ftwinfo->level;

    /* We are only interested in first-level directories.
       Note that depth==0 is the directory itself specified as a parameter.
    */
    if (depth != 1 || (typeflag != FTW_D && typeflag != FTW_DNR))
        return 0;

    /* Don't list names starting with a . */
    if (file[0] != '.')
        printf("%s/\n", path);

    /* Do not recurse. */
    return FTW_SKIP_SUBTREE;
}

and the nftw() call to use the above is obviously something like

if (nftw(".", myfunc, NUM_FDS, FTW_ACTIONRETVAL)) {
    /* An error occurred. */
}

The only "issue" in using nftw() is to choose a good number of file descriptors the function may use (NUM_FDS). POSIX says a process must always be able to have at least 20 open file descriptors. If we subtract the standard ones (input, output, and error), that leaves 17. The above is unlikely to use more than 3, though.

You can find the actual limit using sysconf(_SC_OPEN_MAX), and subtracting the number of descriptors your process may use at the same time. In current Linux systems, it is typically limited to 1024 per process.
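
As a rough sketch of that calculation (the helper name nftw_fd_budget and the cap of 1024 are assumptions made for this example, not part of any API):

```c
#include <unistd.h>

/* Illustrative helper: how many descriptors to let nftw() use, given how
   many the rest of the program keeps open at the same time (`reserved`). */
int nftw_fd_budget(int reserved)
{
    long open_max = sysconf(_SC_OPEN_MAX);
    if (open_max < 0)
        open_max = 20;          /* the minimum POSIX guarantees */

    long budget = open_max - reserved;
    if (budget < 1)
        budget = 1;             /* nftw() needs at least one descriptor */
    if (budget > 1024)
        budget = 1024;          /* arbitrary cap chosen for this sketch */
    return (int)budget;
}
```

The result can be passed as the third argument of nftw() in place of NUM_FDS, for example nftw(".", myfunc, nftw_fd_budget(3), FTW_ACTIONRETVAL) when only the three standard streams are open.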

The good thing is, as long as that number is at least 4 or 5 or so, it only affects the performance: it just determines how deep nftw() can go in the directory tree structure, before it has to use workarounds.

If you want to create a test directory with lots of subdirectories, use something like the following Bash:

mkdir lots-of-subdirs
cd lots-of-subdirs
for ((i=0; i<100000; i++)); do mkdir directory-$i-has-a-long-name-since-command-line-length-is-limited ; done

On my system, running

ls -d */

in that directory yields a bash: /bin/ls: Argument list too long error, while the find command and the nftw()-based program both run just fine.

You also cannot remove the directories using rmdir directory-*/ for the same reason. Use

find . -name 'directory-*' -type d -print0 | xargs -r0 rmdir

instead. Or just remove the entire directory and subdirectories,

cd ..
rm -rf lots-of-subdirs

Aquamarine answered 10/9, 2016 at 21:38 Comment(5)

find -delete would be even easier for that special case. But xargs -0 is a good example. For GNU find, find -exec rmdir {} + would batch args together into maximum-size groups (unlike find -exec rmdir {} \;), so it can often replace xargs. – Zoosperm 10/9, 2016 at 22:49

@PeterCordes: Agreed. I was wondering whether to wax about using handle = popen("find ... -print0", "r"); or handle = popen("find ... -printf '%p\n'") with getdelim(&name, &namesize, '\0', handle) to find specific files, as it is a good KISS way to do it (assuming we do not care if the user has done something weird to the find utility or PATH). – Aquamarine 10/9, 2016 at 22:59

Edit: or handle = popen("find ... -printf '\p\0'"); above, of course. – Aquamarine 10/9, 2016 at 23:6

Careful with your quoting. \0 inside a C double-quoted string-literal terminates it early. I think you mean "... -printf '\\p\\0'". – Zoosperm 10/9, 2016 at 23:8

@PeterCordes: Gaaaaaaah. No, it should've been handle = popen("find ... -printf '%p\\0'", "r");. Anyway, the approach is especially nice if you allow plugins or templates either in the plugins directory or in a subdirectory, with a specific filename suffix denoting its type. Very user-friendly. In practice, it ends up looking like handle = popen(FIND_CMD " " PLUGIN_DIRS " " FIND_PLUGIN_SPEC " -printf '%p\\0'", "r"); or something, with the macros determined at compile time (in case e.g. somebody wants to use /usr/bin/find explicitly on their distro). – Aquamarine 10/9, 2016 at 23:17
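
For reference, the popen()/getdelim() idea discussed in these comments might look roughly like the sketch below. It assumes a find that supports -mindepth, -maxdepth and -print0 is on the PATH; the function name list_dirs_with_find is invented for this example.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* for popen(), pclose(), getdelim() */
#endif
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

/* Illustrative helper: run find in the current directory, print every
   first-level directory it reports to `out`, and return how many there
   were, or -1 on error. */
long list_dirs_with_find(FILE *out)
{
    /* -print0 separates names with NUL bytes, so names containing
       newlines are safe. The GNU-only alternative -printf '%p\\0' needs
       the backslash doubled in C source, as noted in the comments. */
    FILE *handle = popen("find . -mindepth 1 -maxdepth 1 -type d -print0", "r");
    if (!handle)
        return -1;

    char *name = NULL;
    size_t namesize = 0;
    long count = 0;

    /* Each call reads up to and including the next NUL delimiter. */
    while (getdelim(&name, &namesize, '\0', handle) > 0) {
        fprintf(out, "%s/\n", name);
        count++;
    }
    free(name);

    return (pclose(handle) == 0) ? count : -1;
}
```

Because the names arrive through a pipe rather than on a command line, this is not subject to the argument length limit described in the answer.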

Z

13

The lowest-level way to do this is with the same Linux system calls ls uses.

So look at the output of strace -efile,getdents ls:

execve("/bin/ls", ["ls"], [/* 72 vars */]) = 0
...
openat(AT_FDCWD, ".", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 3
getdents(3, /* 23 entries */, 32768)    = 840
getdents(3, /* 0 entries */, 32768)     = 0
...

getdents is a Linux-specific system call. The man page says that it's used under the hood by libc's readdir(3) POSIX API function.

The lowest-level portable way (portable to POSIX systems), is to use the libc functions to open a directory and read the entries. POSIX doesn't specify the exact system call interface, unlike for non-directory files.

These functions:

DIR *opendir(const char *name);
struct dirent *readdir(DIR *dirp);

can be used like this:

// print all directories, and symlinks to directories, in the CWD.
// like sh -c 'ls -1UF -d */'  (single-column output, no sorting, append a / to dir names),
// except that ., .. and hidden directories are listed too.
// tested and works on Linux, with / without working d_type

#define _GNU_SOURCE    // includes _BSD_SOURCE for DT_UNKNOWN etc.
#include <dirent.h>
#include <stdint.h>

#include <sys/types.h>
#include <sys/stat.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    DIR *dirhandle = opendir(".");     // POSIX doesn't require this to be a plain file descriptor.  Linux uses open(".", O_DIRECTORY); to implement this
    if (!dirhandle) {
        perror("opendir");
        return EXIT_FAILURE;
    }
    struct dirent *de;
    while ((de = readdir(dirhandle)) != NULL) { // NULL means end of directory (or an error)
        _Bool is_dir;
    #ifdef _DIRENT_HAVE_D_TYPE
        if (de->d_type != DT_UNKNOWN && de->d_type != DT_LNK) {
           // don't have to stat if we have d_type info, unless it's a symlink (since we stat, not lstat)
           is_dir = (de->d_type == DT_DIR);
        } else
    #endif
        {  // the only method if d_type isn't available,
           // otherwise this is a fallback for FSes where the kernel leaves it DT_UNKNOWN.
           struct stat stbuf;
           // stat follows symlinks, lstat doesn't.
           // If stat fails (e.g. a dangling symlink), treat the entry as a non-directory.
           is_dir = (stat(de->d_name, &stbuf) == 0) && S_ISDIR(stbuf.st_mode);
        }

        if (is_dir) {
           printf("%s/\n", de->d_name);
        }
    }
    closedir(dirhandle);
    return EXIT_SUCCESS;
}

There's also a fully compilable example of reading directory entries and printing file info in the Linux stat(3posix) man page. (not the Linux stat(2) man page; it has a different example).

The man page for readdir(3) says the Linux declaration of struct dirent is:

   struct dirent {
       ino_t          d_ino;       /* inode number */
       off_t          d_off;       /* not an offset; see NOTES */
       unsigned short d_reclen;    /* length of this record */
       unsigned char  d_type;      /* type of file; not supported
                                      by all filesystem types */
       char           d_name[256]; /* filename */
   };

d_type is either DT_UNKNOWN, in which case you need to stat the entry to learn whether it is itself a directory, or it is DT_DIR or some other specific type, in which case you know whether it is a directory without having to stat it. (The exception is DT_LNK: if you want a symlink to a directory to count as a directory, you still have to stat it.)

Some filesystems, like EXT4 I think, and very recent XFS (with the new metadata version), keep type info in the directory, so it can be returned without having to load the inode from disk. This is a huge speedup for find -name: it doesn't have to stat anything to recurse through subdirs. But for filesystems that don't do this, d_type will always be DT_UNKNOWN, because filling it in would require reading all the inodes (which might not even be loaded from disk).

Sometimes you're just matching on filenames, and don't need type info, so it would be bad if the kernel spent a lot of extra CPU time (or especially I/O time) filling in d_type when it's not cheap. d_type is just a performance shortcut; you always need a fallback (except maybe when writing for an embedded system where you know what FS you're using and that it always fills in d_type, and that you have some way to detect the breakage when someone in the future tries to use this code on another FS type.)

Zoosperm answered 10/9, 2016 at 20:32 Comment(4)

With dirfd (3) and fstatat (2) you can work with any directory. not only the current one. – Draught 1/10, 2017 at 18:19

@Igor What about the code above suggests to you that only the current directory can be used? – Soundless 2/8, 2018 at 15:22

@ChristopherSchultz: I used stat(de->d_name, &stbuf);, i.e. using the dir entry straight from readdir as a relative path, i.e. relative to the current directory. Using dirfd and fstatat is a great suggestion for using them relative to another directory, instead of doing string manipulation to create paths to that directory. – Zoosperm 2/8, 2018 at 19:24

@PeterCordes Aah, thanks for pointing that out. I was assuming that string-manipulation was not a problem, and that @Igor was claiming that calling chdir would be necessary to use stat. – Soundless 3/8, 2018 at 16:47
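
The dirfd()/fstatat() suggestion from these comments could be sketched as follows. This is an illustration, not the answer's tested code: the function name list_subdirs is made up, and unlike the program above it skips names starting with a dot so that its output mirrors the */ glob.

```c
#define _GNU_SOURCE             /* for the DT_* constants on glibc */
#include <dirent.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>

/* Illustrative helper: print the non-hidden subdirectories (and symlinks
   to directories) of `path` to `out`. Returns the number printed, or -1
   if the directory cannot be opened. */
long list_subdirs(const char *path, FILE *out)
{
    DIR *dir = opendir(path);
    if (!dir)
        return -1;

    long count = 0;
    struct dirent *de;
    while ((de = readdir(dir)) != NULL) {
        int is_dir;

        if (de->d_name[0] == '.')
            continue;           /* skip ., .. and hidden names */
#ifdef _DIRENT_HAVE_D_TYPE
        if (de->d_type != DT_UNKNOWN && de->d_type != DT_LNK) {
            is_dir = (de->d_type == DT_DIR);
        } else
#endif
        {
            /* Resolve the name relative to the open directory, so no
               path string has to be built and no chdir() is needed. */
            struct stat st;
            is_dir = (fstatat(dirfd(dir), de->d_name, &st, 0) == 0)
                     && S_ISDIR(st.st_mode);
        }
        if (is_dir) {
            fprintf(out, "%s/\n", de->d_name);
            count++;
        }
    }
    closedir(dir);
    return count;
}
```

A call such as list_subdirs("/usr", stdout) then works regardless of the process's current directory.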

W

4

Just call system. Globs on Unixes are expanded by the shell. system will give you a shell.

You can avoid the whole fork-exec thing by doing the glob(3) yourself:

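#include <glob.h>
#include <stdio.h>

int main(void)
{
    glob_t gbuf;
    int ec = glob("*/", 0, NULL, &gbuf);
    if (ec == 0) {
        /* gl_pathv is a NULL-terminated array of the matches */
        for (char **p = gbuf.gl_pathv; *p; p++)
            printf("%s\n", *p);
        globfree(&gbuf);
    } else if (ec != GLOB_NOMATCH) {
        /* handle glob error */
        return 1;
    }
    return 0;
}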

You could pass the results to a spawned ls, but there's hardly a point in doing that.

(If you do want to do fork and exec, you should start with a template that does proper error checking -- each of those calls may fail.)

Westbound answered 10/9, 2016 at 19:39 Comment(8)

As I just got it to work with supplying just a single directory, and was rather flummoxed with finding out the problem with *, can you replace 'globs' with 'wildcards' – and explain why those are a problem for ls? – Mighell 10/9, 2016 at 19:43

Really low level would just be fd = opendir("."), and readdir(fd). Use stat() on the entries if readdir doesn't return filetype info; when it does, you can find the directories without stat()ing every dirent. – Zoosperm 10/9, 2016 at 19:44

@RadLexus: ls and other normal Unix programs don't treat their args as wildcards. So in the shell, you could run ls '*' to pass a literal * to ls. Use strace ls * to see the args ls actually gets when you run that. Some programs ported from DOS (or that use globs for a special purpose) will have glob-handling built-in, so you have to use an extra layer of quoting to protect meta-characters from the shell and from the program the shell passes them to, if you want to deal with arbitrary filenames. – Zoosperm 10/9, 2016 at 19:45

added an answer using POSIX opendir and d_type with a fallback to stat. I'll leave it for someone else to write an answer using the Linux getdents() system call directly. Using glob for this special case seems silly to me. – Zoosperm 10/9, 2016 at 20:33

As usual, I believe nftw() is the proper answer, with a simple helper function. glob() is useful if you need the names in an array; for just printing, I too think it is silly. opendir()/readdir() in this particular case is okay, because no recursion is done. – Aquamarine 10/9, 2016 at 21:41

@NominalAnimal readdir on linux is technically better. it usually allows you to avoid stat calls which are expensive. nftw will stat unconditionally. For the recursive case, a spawned GNU find is usually faster than nftw. – Westbound 10/9, 2016 at 22:6

@NominalAnimal As far as library-based solutions are concerned, this guy github.com/tavianator/bfs seems to be doing the recursive version right (no stats unless needed) but it's nontrivial because of the need to work within the file descriptor limit. – Westbound 10/9, 2016 at 22:12

@PSkocik: As I said, readdir() in this particular case is okay. The only truly working method of avoiding the file descriptor limit without races is to spawn helper slave processes to hold earlier descriptors in escrow. Speed is irrelevant when exchanged for reliability! You may consider fast but sometimes incorrect "technically better", but I do not. – Aquamarine 10/9, 2016 at 22:52

M

4

If you are looking for a simple way to get a list of folders into your program, I'd rather suggest the spawnless way: instead of calling an external program, use the standard POSIX opendir/readdir functions.

It's almost as short as your program, but has several additional advantages:

you get to pick folders and files at will by checking the d_type
you can elect to early discard system entries and (semi)hidden entries by testing the first character of the name for a .
you can immediately print out the result, or store it in memory for later use
you can do additional operations on the list in memory, such as sorting and removing other entries that don't need to be included.

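#include <stdio.h>
#include <sys/types.h>
#include <dirent.h>

int main( void )
{
    DIR *dirp;
    struct dirent *dp;

    dirp = opendir(".");
    if (dirp == NULL)
    {
        perror("opendir");
        return 1;
    }
    while ((dp = readdir(dirp)) != NULL)
    {
        /* d_type holds a single type code, not a bit mask, so compare with ==.
           Filesystems that report DT_UNKNOWN need a stat() fallback. */
        if (dp->d_type == DT_DIR)
        {
            /* exclude common system entries and (semi)hidden names */
            if (dp->d_name[0] != '.')
                printf ("%s\n", dp->d_name);
        }
    }
    closedir(dirp);

    return 0;
}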
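
The last two advantages in the list (keeping the names in memory and sorting them) could be sketched like this. The helper name collect_subdirs is invented for the example, and for simplicity it always calls fstatat() rather than consulting d_type, so it also works on filesystems that only report DT_UNKNOWN.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* for dirfd(), fstatat(), strdup() */
#endif
#include <dirent.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>

/* qsort() comparison callback for an array of C strings. */
static int compare_names(const void *a, const void *b)
{
    return strcmp(*(char *const *)a, *(char *const *)b);
}

/* Illustrative helper: return a sorted, malloc()ed array holding the
   non-hidden subdirectory names of `path`, and store their number in
   *count. Returns NULL if the directory cannot be opened or is empty. */
char **collect_subdirs(const char *path, size_t *count)
{
    char **names = NULL;
    size_t used = 0, cap = 0;
    *count = 0;

    DIR *dirp = opendir(path);
    if (!dirp)
        return NULL;

    struct dirent *dp;
    while ((dp = readdir(dirp)) != NULL) {
        struct stat st;
        if (dp->d_name[0] == '.')
            continue;                   /* skip ., .. and hidden names */
        if (fstatat(dirfd(dirp), dp->d_name, &st, 0) != 0 || !S_ISDIR(st.st_mode))
            continue;                   /* not a directory */
        if (used == cap) {
            size_t new_cap = cap ? cap * 2 : 16;
            char **grown = realloc(names, new_cap * sizeof *names);
            if (!grown)
                break;                  /* out of memory: keep what we have */
            names = grown;
            cap = new_cap;
        }
        names[used] = strdup(dp->d_name);
        if (!names[used])
            break;
        used++;
    }
    closedir(dirp);

    if (used > 0)
        qsort(names, used, sizeof *names, compare_names);
    *count = used;
    return names;
}
```

The caller owns the result: free each names[i] and then the array itself once the list is no longer needed.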

Mighell answered 10/9, 2016 at 19:53 Comment(4)

Using d_type without checking for DT_UNKNOWN is an error. Your program will never find directories on typical XFS filesystems, because mkfs.xfs doesn't enable -n ftype=1, so the filesystem doesn't cheaply provide filetype info, so it sets d_type=DT_UNKNOWN. (And of course any other FS that always has DT_UNKNOWN). See my answer for a fallback to stat for DT_UNKNOWN, and for symlinks (in case they're symlinks to directories, preserving that part of the semantics of */, too.) As usual, the lower-level higher performance APIs hide less of the complexity than higher-level APIs. – Zoosperm 10/9, 2016 at 20:38

@PeterCordes: I just noticed your much more complete answer! (I came here to upvote and chew bubblegum, but alas, I'm all out of votes.) – Mighell 10/9, 2016 at 20:41

I think you posted yours after I started working on mine, probably just after I finished reading the existing answers (neither of which were even close to what I'd call "low-level"). I mean, my answer still isn't in assembly language with direct syscalls instead of using glibc function calls, and I even used printf! – Zoosperm 10/9, 2016 at 21:2

Nice approach too @RadLexus! – Multidisciplinary 11/9, 2016 at 4:45

M

1

Another less low-level approach, with system():

#include <stdlib.h>

int main(void)
{
    system("/bin/ls -d */");
    return 0;
}

Notice with system(), you don't need to fork(). However, I recall that we should avoid using system() when possible!

As Nominal Animal said, this will fail when the number of subdirectories is too big! See his answer for more...

Multidisciplinary answered 10/9, 2016 at 19:42 Comment(2)

This won't work if the directory contains so many subdirectories that listing them all would exceed the maximum command line length. This affects all the answers that rely on shell doing the globbing, and providing them as parameters to a single command like ls. See my answer for details. – Aquamarine 10/9, 2016 at 21:44

Thank you @NominalAnimal for letting me know. However, I won't delete, since it can be used for simple use. :) Updated! :) – Multidisciplinary 11/9, 2016 at 4:46
