How to list first level directories only in C?
M

5

4

In a terminal I can call ls -d */. Now I want a program to do that for me, like this:

#include <sys/types.h>
#include <sys/wait.h>
#include <stdio.h>
#include <unistd.h>

int main( void )
{
    int status;

    char *args[] = { "/bin/ls", "-l", NULL };

    if ( fork() == 0 )
        execv( args[0], args );
    else
        wait( &status ); 

    return 0;
}

This runs ls -l on everything. However, when I try:

char *args[] = { "/bin/ls", "-d", "*/",  NULL };

I get a runtime error:

ls: */: No such file or directory

Multidisciplinary answered 10/9, 2016 at 19:24 Comment(5)
Just call system. Globs on Unixes are expanded by the shell. system will give you a shell.Westbound
Thanks @PSkocik, that did it! Would like to post an answer? system("/bin/ls -d */"); Explaining why execv() couldn't do the trick ;)Multidisciplinary
Remember that if you use system(), you shouldn't also fork().Acquah
Correct @unwind, I wrote the code, 3 lines of code in the body of main().Multidisciplinary
avoid system() and use execv() wherever possible. system() requires proper quoting and is the source of many security problems. Your problem is that '*' is expanded by the shell but not by ls. You can try to execute find -type d instead of.Cordillera
A
5

Unfortunately, all solutions based on shell expansion are limited by the maximum command line length, which varies (run true | xargs --show-limits to find out); on my system, it is about two megabytes. Yes, many will argue that it suffices -- as did Bill Gates on 640 kilobytes, once.

(When running certain parallel simulations on non-shared filesystems, I do occasionally have tens of thousands of files in the same directory, during the collection phase. Yes, I could do that differently, but that happens to be the easiest and most robust way to collect the data. Very few POSIX utilities are actually silly enough to assume "X is sufficient for everybody".)

Fortunately, there are several solutions. One is to use find instead:

system("/usr/bin/find . -mindepth 1 -maxdepth 1 -type d");

You can also format the output as you wish, not depending on locale:

system("/usr/bin/find . -mindepth 1 -maxdepth 1 -type d -printf '%p\n'");

If you want to sort the output, use \0 as the separator (since filenames are allowed to contain newlines), and pass -z to sort so that it uses \0 as the separator, too. tr will convert them to newlines for you. Note that the backslashes have to be doubled inside the C string literal, because a bare \0 would terminate the string early:

system("/usr/bin/find . -mindepth 1 -maxdepth 1 -type d -printf '%p\\0' | sort -z | tr -s '\\0' '\\n'");

If you want the names in an array, use glob() function instead.
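
If the names are needed inside the program rather than on standard output, one possible approach is to read NUL-separated output from find through popen() and getdelim(). This is only a sketch under assumptions: the helper name print_nul_records() is made up for illustration, and the find options in the usage line below (-mindepth, -maxdepth, -print0) assume a find that supports them, as GNU find does.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

/* Hypothetical helper: runs `command` through the shell, treats its output
   as NUL-separated records, and writes one record per line to `out`.
   Returns the number of records, or -1 if the command could not be run
   or exited with a non-zero status. */
long print_nul_records(const char *command, FILE *out)
{
    FILE *handle = popen(command, "r");
    if (handle == NULL)
        return -1;

    char   *name = NULL;   /* buffer managed by getdelim() */
    size_t  size = 0;
    long    count = 0;

    /* getdelim() returns the record length including the delimiter,
       and always NUL-terminates the buffer. */
    while (getdelim(&name, &size, '\0', handle) > 0) {
        fprintf(out, "%s\n", name);
        count++;
    }
    free(name);

    if (pclose(handle) != 0)
        return -1;
    return count;
}
```

A call such as print_nul_records("find . -mindepth 1 -maxdepth 1 -type d -print0", stdout) would then list the first-level directories; a real program would store or process each name instead of printing it.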

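Finally, as I like to harp every now and then, one can use the POSIX nftw() function to implement this internally (the FTW_ACTIONRETVAL flag and the FTW_SKIP_SUBTREE return value used below are GNU extensions, hence the _GNU_SOURCE define):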

#define _GNU_SOURCE
#include <stdio.h>
#include <ftw.h>

#define NUM_FDS 17

int myfunc(const char *path,
           const struct stat *fileinfo,
           int typeflag,
           struct FTW *ftwinfo)
{
    const char *file = path + ftwinfo->base;
    const int depth = ftwinfo->level;

    /* We are only interested in first-level directories.
       Note that depth==0 is the directory itself specified as a parameter.
    */
    if (depth != 1 || (typeflag != FTW_D && typeflag != FTW_DNR))
        return 0;

    /* Don't list names starting with a . */
    if (file[0] != '.')
        printf("%s/\n", path);

    /* Do not recurse. */
    return FTW_SKIP_SUBTREE;
}

and the nftw() call to use the above is obviously something like

if (nftw(".", myfunc, NUM_FDS, FTW_ACTIONRETVAL)) {
    /* An error occurred. */
}

The only "issue" in using nftw() is to choose a good number of file descriptors the function may use (NUM_FDS). POSIX says a process must always be able to have at least 20 open file descriptors. If we subtract the standard ones (input, output, and error), that leaves 17. The above is unlikely to use more than 3, though.

You can find the actual limit using sysconf(_SC_OPEN_MAX), and subtracting the number of descriptors your process may use at the same time. In current Linux systems, it is typically limited to 1024 per process.

The good thing is, as long as that number is at least 4 or 5 or so, it only affects the performance: it just determines how deep nftw() can go in the directory tree structure, before it has to use workarounds.
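
The descriptor arithmetic described above could be wrapped in a small helper. This is a sketch, not part of the original program: the name nftw_fd_budget() and the lower bound of 4 are assumptions taken from the "at least 4 or 5 or so" remark, and reserved is assumed to be non-negative.

```c
#include <limits.h>
#include <unistd.h>

/* Hypothetical helper: how many descriptors nftw() may use, given how many
   the rest of the process needs at the same time (`reserved` >= 0). */
int nftw_fd_budget(long reserved)
{
    long limit = sysconf(_SC_OPEN_MAX);
    if (limit < 0)
        limit = 20;              /* fall back to the POSIX minimum */

    long budget = limit - reserved;
    if (budget < 4)
        budget = 4;              /* assumed floor; smaller values only slow nftw() down */
    if (budget > INT_MAX)
        budget = INT_MAX;        /* nftw() takes an int */
    return (int)budget;
}
```

The call would then look something like nftw(".", myfunc, nftw_fd_budget(3), FTW_ACTIONRETVAL), with 3 standing for standard input, output, and error.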

If you want to create a test directory with lots of subdirectories, use something like the following Bash:

mkdir lots-of-subdirs
cd lots-of-subdirs
for ((i=0; i<100000; i++)); do mkdir directory-$i-has-a-long-name-since-command-line-length-is-limited ; done

On my system, running

ls -d */

in that directory yields bash: /bin/ls: Argument list too long error, while the find command and the nftw() based program all run just fine.
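
A program can estimate in advance whether a list of names is likely to hit this limit. The following is a rough sketch under assumptions: the function name args_probably_fit() is made up, the 2048 bytes of headroom is a conventional safety margin, and the accounting (string bytes plus one pointer per string, for both arguments and environment) is only an approximation of what the kernel actually charges.

```c
#include <string.h>
#include <unistd.h>

extern char **environ;

/* Hypothetical helper: returns 1 if `count` arguments of about `avg_len`
   bytes each are likely to fit into one exec call, 0 if not, and -1 if
   the limit cannot be determined. */
int args_probably_fit(unsigned long long count, unsigned long long avg_len)
{
    long limit = sysconf(_SC_ARG_MAX);
    if (limit < 0)
        return -1;

    /* The environment shares the same space as the arguments. */
    unsigned long long used = 2048;   /* assumed headroom */
    for (char **e = environ; e != NULL && *e != NULL; e++)
        used += strlen(*e) + 1 + sizeof(char *);

    if (used >= (unsigned long long)limit)
        return 0;

    unsigned long long per_arg = avg_len + 1 + sizeof(char *);
    return count <= ((unsigned long long)limit - used) / per_arg;
}
```

If this returns 0, the names have to be processed in batches (which is what xargs does) or read directly from the directory.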

You also cannot remove the directories using rmdir directory-*/ for the same reason. Use

find . -name 'directory-*' -type d -print0 | xargs -r0 rmdir

instead. Or just remove the entire directory and subdirectories,

cd ..
rm -rf lots-of-subdirs
Aquamarine answered 10/9, 2016 at 21:38 Comment(5)
find -delete would be even easier for that special case. But xargs -0 is a good example. For GNU find, find -exec rmdir {} + would batch args together into maximum-size groups (unlike find -exec rmdir {} \;), so it can often replace xargs.Zoosperm
@PeterCordes: Agreed. I was wondering whether to wax about using handle = popen("find ... -print0", "r"); or handle = popen("find ... -printf '%p\n'") with getdelim(&name, &namesize, '\0', handle) to find specific files, as it is a good KISS way to do it (assuming we do not care if the user has done something weird to the find utility or PATH).Aquamarine
Edit: or handle = popen("find ... -printf '\p\0'"); above, of course.Aquamarine
Careful with your quoting. \0 inside a C double-quoted string-literal terminates it early. I think you mean "... -printf '\\p\\0'".Zoosperm
@PeterCordes: Gaaaaaaah. No, it should've been handle = popen("find ... -printf '%p\\0'", "r");. Anyway, the approach is especially nice if you allow plugins or templates either in the plugins directory or in a subdirectory, with a specific filename suffix denoting its type. Very user-friendly. In practice, it ends up looking like handle = popen(FIND_CMD " " PLUGIN_DIRS " " FIND_PLUGIN_SPEC " -printf '%p\\0'", "r"); or something, with the macros determined at compile time (in case e.g. somebody wants to use /usr/bin/find explicitly on their distro).Aquamarine
Z
13

The lowest-level way to do this is with the same Linux system calls ls uses.

So look at the output of strace -efile,getdents ls:

execve("/bin/ls", ["ls"], [/* 72 vars */]) = 0
...
openat(AT_FDCWD, ".", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 3
getdents(3, /* 23 entries */, 32768)    = 840
getdents(3, /* 0 entries */, 32768)     = 0
...

getdents is a Linux-specific system call. The man page says that it's used under the hood by libc's readdir(3) POSIX API function.
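
For illustration, here is a Linux-only sketch that calls getdents64 directly through syscall(). The function name print_dirs_getdents() is made up, and the record structure is declared by hand following the layout documented in getdents(2), since glibc does not export it. It only reports entries whose d_type is DT_DIR (including . and ..); entries reported as DT_UNKNOWN or DT_LNK would need a stat() fallback, which is omitted here.

```c
#define _GNU_SOURCE
#include <dirent.h>       /* DT_DIR */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Record layout filled in by the getdents64 system call. */
struct linux_dirent64 {
    uint64_t       d_ino;
    int64_t        d_off;
    unsigned short d_reclen;   /* size of this whole record */
    unsigned char  d_type;
    char           d_name[];   /* NUL-terminated name */
};

/* Hypothetical helper: prints entries of `path` that the kernel reports
   as directories. Returns their number, or -1 on error. */
long print_dirs_getdents(const char *path, FILE *out)
{
    int fd = open(path, O_RDONLY | O_DIRECTORY | O_CLOEXEC);
    if (fd == -1)
        return -1;

    _Alignas(struct linux_dirent64) char buf[32768];
    long count = 0;

    for (;;) {
        long nread = syscall(SYS_getdents64, fd, buf, sizeof buf);
        if (nread == -1) {
            close(fd);
            return -1;
        }
        if (nread == 0)            /* end of directory */
            break;

        /* The buffer holds variable-length records back to back. */
        for (long pos = 0; pos < nread; ) {
            struct linux_dirent64 *d = (struct linux_dirent64 *)(buf + pos);
            if (d->d_type == DT_DIR) {
                fprintf(out, "%s/\n", d->d_name);
                count++;
            }
            pos += d->d_reclen;
        }
    }
    close(fd);
    return count;
}
```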


The lowest-level portable way (portable to POSIX systems) is to use the libc functions to open a directory and read the entries. POSIX doesn't specify the exact system call interface for directories, unlike for non-directory files.

These functions:

DIR *opendir(const char *name);
struct dirent *readdir(DIR *dirp);

can be used like this:

// print all directories, and symlinks to directories, in the CWD.
// like sh -c 'ls -1UF -d */'  (single-column output, no sorting, append a / to dir names)
// tested and works on Linux, with / without working d_type

#define _GNU_SOURCE    // includes _BSD_SOURCE for DT_UNKNOWN etc.
#include <dirent.h>
#include <stdint.h>

#include <sys/types.h>
#include <sys/stat.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    DIR *dirhandle = opendir(".");     // POSIX doesn't require this to be a plain file descriptor.  Linux uses open(".", O_DIRECTORY); to implement this
    if (!dirhandle) {
        perror("opendir");
        return EXIT_FAILURE;
    }
    struct dirent *de;
    while ((de = readdir(dirhandle)) != NULL) { // NULL means end of directory (or an error)
        _Bool is_dir;
    #ifdef _DIRENT_HAVE_D_TYPE
        if (de->d_type != DT_UNKNOWN && de->d_type != DT_LNK) {
           // don't have to stat if we have d_type info, unless it's a symlink (since we stat, not lstat)
           is_dir = (de->d_type == DT_DIR);
        } else
    #endif
        {  // the only method if d_type isn't available,
           // otherwise this is a fallback for FSes where the kernel leaves it DT_UNKNOWN.
           struct stat stbuf;
           // stat follows symlinks, lstat doesn't.
           if (stat(de->d_name, &stbuf) != 0)
               continue;                          // e.g. a dangling symlink: not a directory
           is_dir = S_ISDIR(stbuf.st_mode);
        }

        if (is_dir) {
           // note: unlike the shell glob */, this also prints . and .. and other dot-names
           printf("%s/\n", de->d_name);
        }
    }
    closedir(dirhandle);
    return EXIT_SUCCESS;
}

There's also a fully compilable example of reading directory entries and printing file info in the Linux stat(3posix) man page. (not the Linux stat(2) man page; it has a different example).


The man page for readdir(3) says the Linux declaration of struct dirent is:

   struct dirent {
       ino_t          d_ino;       /* inode number */
       off_t          d_off;       /* not an offset; see NOTES */
       unsigned short d_reclen;    /* length of this record */
       unsigned char  d_type;      /* type of file; not supported
                                      by all filesystem types */
       char           d_name[256]; /* filename */
   };

d_type is either DT_UNKNOWN, in which case you need to stat the entry to learn whether it is itself a directory, or it is DT_DIR or something else, in which case you can tell whether it is a directory without having to stat it. (The exception is DT_LNK: to treat a symlink to a directory as a directory, you still need to stat it.)

Some filesystems, like EXT4 I think, and very recent XFS (with the new metadata version), keep type info in the directory, so it can be returned without having to load the inode from disk. This is a huge speedup for find -name: it doesn't have to stat anything to recurse through subdirs. But for filesystems that don't do this, d_type will always be DT_UNKNOWN, because filling it in would require reading all the inodes (which might not even be loaded from disk).

Sometimes you're just matching on filenames, and don't need type info, so it would be bad if the kernel spent a lot of extra CPU time (or especially I/O time) filling in d_type when it's not cheap. d_type is just a performance shortcut; you always need a fallback (except maybe when writing for an embedded system where you know what FS you're using and that it always fills in d_type, and that you have some way to detect the breakage when someone in the future tries to use this code on another FS type.)
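
To list a directory other than the current one without building path strings, the same d_type-plus-fallback logic can be combined with dirfd() and fstatat(). This is a sketch under assumptions: the function name print_subdirs() is made up, d_type is used without the _DIRENT_HAVE_D_TYPE guard (so it assumes Linux/glibc or a similar libc), and names starting with a dot are skipped to mimic the */ glob.

```c
#define _GNU_SOURCE
#include <dirent.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>

/* Hypothetical helper: prints the non-hidden subdirectories of `path`
   (following symlinks). Returns their number, or -1 if `path` cannot be opened. */
long print_subdirs(const char *path, FILE *out)
{
    DIR *dir = opendir(path);
    if (dir == NULL)
        return -1;

    int fd = dirfd(dir);      /* descriptor for fstatat(); owned by `dir` */
    long count = 0;
    struct dirent *de;

    while ((de = readdir(dir)) != NULL) {
        if (de->d_name[0] == '.')
            continue;         /* skips ".", ".." and hidden names */

        int is_dir = 0;
        if (de->d_type == DT_DIR) {
            is_dir = 1;
        } else if (de->d_type == DT_UNKNOWN || de->d_type == DT_LNK) {
            /* Resolve the name relative to the open directory, not the CWD. */
            struct stat st;
            is_dir = fstatat(fd, de->d_name, &st, 0) == 0 && S_ISDIR(st.st_mode);
        }

        if (is_dir) {
            fprintf(out, "%s/\n", de->d_name);
            count++;
        }
    }
    closedir(dir);
    return count;
}
```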

Zoosperm answered 10/9, 2016 at 20:32 Comment(4)
With dirfd (3) and fstatat (2) you can work with any directory. not only the current one.Draught
@Igor What about the code above suggests to you that only the current directory can be used?Soundless
@ChristopherSchultz: I used stat(de->d_name, &stbuf);, i.e. using the dir entry straight from readdir as a relative path, i.e. relative to the current directory. Using dirfd and fstatat is a great suggestion for using them relative to another directory, instead of doing string manipulation to create paths to that directory.Zoosperm
@PeterCordes Aah, thanks for pointing that out. I was assuming that string-manipulation was not a problem, and that @Igor was claiming that calling chdir would be necessary to use stat.Soundless
W
4

Just call system. Globs on Unixes are expanded by the shell. system will give you a shell.

You can avoid the whole fork-exec thing by doing the glob(3) yourself:

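#include <glob.h>
#include <stdio.h>

int main(void)
{
    int ec;
    glob_t gbuf;
    if(0==(ec=glob("*/", 0, NULL, &gbuf))){
        char **p = gbuf.gl_pathv;
        if(p){
            while(*p)
                printf("%s\n", *p++);
        }
        globfree(&gbuf);   /* release the memory glob() allocated */
    }else{
        /* handle glob error; ec is GLOB_NOMATCH if there are no subdirectories */
    }
    return 0;
}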

You could pass the results to a spawned ls, but there's hardly a point in doing that.

(If you do want to do fork and exec, you should start with a template that does proper error checking -- each of those calls may fail.)
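
A minimal sketch of such a template follows, assuming the goal is still to let a shell expand the glob. The function name run_in_shell() is made up for illustration; it does roughly what system() does, but with each call checked explicitly.

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: runs `command` via /bin/sh -c in a child process.
   Returns the child's exit status, or -1 if forking or waiting failed or
   the child was killed by a signal. */
int run_in_shell(const char *command)
{
    pid_t pid = fork();
    if (pid == -1) {
        perror("fork");
        return -1;
    }

    if (pid == 0) {
        /* Child: the shell, not ls, is what expands the glob. */
        char *args[] = { "/bin/sh", "-c", (char *)command, NULL };
        execv(args[0], args);
        perror("execv");       /* only reached if execv() failed */
        _exit(127);
    }

    int status;
    if (waitpid(pid, &status, 0) == -1) {
        perror("waitpid");
        return -1;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

For example, run_in_shell("ls -d */") lists the first-level directories. The command string is interpreted by the shell, so it must not be built from untrusted input without proper quoting.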

Westbound answered 10/9, 2016 at 19:39 Comment(8)
As I just got it to work with supplying just a single directory, and was rather flummoxed with finding out the problem with *, can you replace 'globs' with 'wildcards' – and explain why those are a problem for ls?Mighell
Really low level would be just fd = opendir("."), and readdir(fd). Use stat() on the entries if readdir doesn't return filetype info that lets you find the directories without stat-ing every dirent.Zoosperm
@RadLexus: ls and other normal Unix programs don't treat their args as wildcards. So in the shell, you could run ls '*' to pass a literal * to ls. Use strace ls * to see the args ls actually gets when you run that. Some programs ported from DOS (or that use globs for a special purpose) will have glob-handling built-in, so you have to use an extra layer of quoting to protect meta-characters from the shell and from the program the shell passes them too, if you want to deal with arbitrary filenames.Zoosperm
added an answer using POSIX opendir and d_type with a fallback to stat. I'll leave it for someone else to write an answer using the Linux getdents() system call directly. Using glob for this special case seems silly to me.Zoosperm
As usual, I believe nftw() is the proper answer, with a simple helper function. glob() is useful if you need the names in an array; for just printing, I too think it is silly. opendir()/readdir() in this particular case is okay, because no recursion is done.Aquamarine
@NominalAnimal readdir on linux is technically better. it usually allows you to avoid stat calls which are expensive. nftw will stat unconditionally. For the recursive case, a spawned GNU find is usually faster than nftw.Westbound
@NominalAnimal As far as library based solutions are correct, this guy github.com/tavianator/bfs seems to be doing the recursive version right (no stats unless needed) but it's nontrivial because of the need to work within the file descriptor limit.Westbound
@PSkocik: As I said, readdir() in this particular case is okay. The only truly working method of avoiding the file descriptor limit without races is to spawn helper slave processes to hold earlier descriptors in escrow. Speed is irrelevant when exchanged for reliability! You may consider fast but sometimes incorrect "technically better", but I do not.Aquamarine
M
4

If you are looking for a simple way to get a list of folders into your program, I'd rather suggest the spawnless way, not calling an external program, and use the standard POSIX opendir/readdir functions.

It's almost as short as your program, but has several additional advantages:

  • you get to pick folders and files at will by checking the d_type
  • you can elect to early discard system entries and (semi)hidden entries by testing the first character of the name for a .
  • you can immediately print out the result, or store it in memory for later use
  • you can do additional operations on the list in memory, such as sorting and removing other entries that don't need to be included.

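#include <stdio.h>
#include <sys/types.h>
#include <dirent.h>

int main( void )
{
    DIR *dirp;
    struct dirent *dp;

    dirp = opendir(".");
    if (dirp == NULL)
    {
        perror("opendir");
        return 1;
    }
    while ((dp = readdir(dirp)) != NULL)
    {
        /* d_type holds a single value, not a bit mask, so compare with ==.
           Filesystems that report DT_UNKNOWN need a stat() fallback. */
        if (dp->d_type == DT_DIR)
        {
            /* exclude common system entries and (semi)hidden names */
            if (dp->d_name[0] != '.')
                printf ("%s\n", dp->d_name);
        }
    }
    closedir(dirp);

    return 0;
}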
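
The last two points in the list above (keeping the names in memory and sorting them) could look roughly like this. It is a sketch under assumptions: the names collect_subdirs() and compare_names() are made up, and it calls fstatat() for every entry instead of relying on d_type, which is simpler and also works on filesystems that report DT_UNKNOWN, at the cost of one extra system call per entry.

```c
#define _GNU_SOURCE
#include <dirent.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>

/* qsort() comparator for an array of char pointers. */
static int compare_names(const void *a, const void *b)
{
    return strcmp(*(char *const *)a, *(char *const *)b);
}

/* Hypothetical helper: returns a malloc()ed, sorted array of the non-hidden
   subdirectory names of `path` and stores their number in *count.
   Returns NULL on error. The caller frees each name and the array. */
char **collect_subdirs(const char *path, size_t *count)
{
    DIR *dir = opendir(path);
    if (dir == NULL)
        return NULL;

    size_t n = 0, cap = 16;
    char **names = malloc(cap * sizeof *names);
    int failed = (names == NULL);
    struct dirent *de;

    while (!failed && (de = readdir(dir)) != NULL) {
        struct stat st;
        if (de->d_name[0] == '.')
            continue;
        if (fstatat(dirfd(dir), de->d_name, &st, 0) != 0 || !S_ISDIR(st.st_mode))
            continue;

        if (n == cap) {                      /* grow the array when full */
            char **grown = realloc(names, 2 * cap * sizeof *names);
            if (grown == NULL) { failed = 1; break; }
            names = grown;
            cap *= 2;
        }
        names[n] = strdup(de->d_name);       /* readdir() reuses its buffer */
        if (names[n] == NULL) { failed = 1; break; }
        n++;
    }
    closedir(dir);

    if (failed) {
        while (n > 0)
            free(names[--n]);
        free(names);
        return NULL;
    }

    qsort(names, n, sizeof *names, compare_names);
    *count = n;
    return names;
}
```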
Mighell answered 10/9, 2016 at 19:53 Comment(4)
Using d_type without checking for DT_UNKNOWN is an error. Your program will never find directories on typical XFS filesystems, because mkfs.xfs doesn't enable -n ftype=1, so the filesystem doesn't cheaply provide filetype info, so it sets d_type=DT_UNKNOWN. (And of course any other FS that always has DT_UNKNOWN). See my answer for a fallback to stat for DT_UNKNOWN, and for symlinks (in case they're symlinks to directories, preserving that part of the semantics of */, too.) As usual, the lower-level higher performance APIs hide less of the complexity than higher-level APIs.Zoosperm
@PeterCordes: I just noticed your much more complete answer! (I came here to upvote and chew bubblegum, but alas, I'm all out of votes.)Mighell
I think you posted yours after I started working on mine, probably just after I finished reading the existing answers (neither of which were even close to what I'd call "low-level"). I mean, my answer still isn't in assembly language with direct syscalls instead of using glibc function calls, and I even used printf!Zoosperm
Nice approach too @RadLexus!Multidisciplinary
M
1

Another less low-level approach, with system():

#include <stdlib.h>

int main(void)
{
    system("/bin/ls -d */");
    return 0;
}

Notice with system(), you don't need to fork(). However, I recall that we should avoid using system() when possible!


As Nominal Animal said, this will fail when the number of subdirectories is too large! See his answer for more...

Multidisciplinary answered 10/9, 2016 at 19:42 Comment(2)
This won't work if the directory contains so many subdirectories that listing them all would exceed the maximum command line length. This affects all the answers that rely on shell doing the globbing, and providing them as parameters to a single command like ls. See my answer for details.Aquamarine
Thank you @NominalAnimal for letting me know. However, I won't delete, since it can be used for simple use. :) Updated! :)Multidisciplinary
