All of the other answers focus on individually checking every file, but if the files are all in one directory (folder), it might be much more efficient to just read the directory and check for the existence of every file name you want.
This might be more efficient even if the files are spread across several directories, depending on the exact ratio of directories to files. Once you start getting closer to each target file being in its own directory, or to there being lots of other files in the same directories that you don't want to check for, I'd expect it to finally tip over into being less efficient than checking each file individually.
A good heuristic: working on a bunch of data you already have is much faster than asking the operating system for any amount of data. System call overhead is huge relative to individual machine instructions. So it is almost always going to be faster to ask the OS "give me the entire list of files in this directory" and then to dig through that list, and slower to ask the OS "give me information on this file", "okay now give me information on this other file", "now give me information on ...", and so on.
Every good C library implements its "iterate over all files in a directory" APIs in an efficient way, just like buffered I/O - internally it reads up a big list of directory entries from the OS at once, even though the APIs look like asking the OS for each entry individually.
So if I had this requirement, I would
- do everything possible to encourage a design and usage pattern where all the files live in one directory, and no other files are in that directory,
- put the list of file names that need to be present into an in-memory data structure with O(1) or at least O(log(n)) lookup and delete times (like a hash map or a binary tree),
- list the files in that directory, and "check off" (delete) each one I encounter from that in-memory structure (sketched in code further below).
Depending on the exact use case, instead of deleting entries from a hash map or tree, I might keep a "do I have this file?" boolean for each entry, and pick a data structure that makes it O(1) to ask "do I have every file?". For example, a binary tree where each non-leaf node also stores a boolean that is the logical AND of its children's booleans. That scales well: after setting a boolean in a leaf node, you just walk up the tree, setting each ancestor's boolean to the && of its two children's booleans (and you don't need to recurse down the other children, because if you do this consistently every time you set a leaf to true, an inner node's boolean will be true if and only if all of its descendants' are).
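To make that concrete, here's a minimal sketch of that AND-tree idea, assuming C++11 or later and that the set of required names is known up front (the class name and layout are my own illustration, not a standard structure):

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Sketch of the AND-tree described above: a complete binary tree stored
// in an array, where each leaf is "have I seen this file?" and each
// inner node is the logical AND of its two children. Marking a file as
// seen is O(log n); "do I have every file?" is one read of the root.
class HaveAllTree {
public:
    explicit HaveAllTree(const std::vector<std::string>& names) {
        while (leaves_ < names.size()) leaves_ *= 2;  // round up to a power of 2
        have_.assign(2 * leaves_, false);
        for (size_t i = 0; i < names.size(); ++i)
            index_[names[i]] = leaves_ + i;
        // Padding leaves (past the real names) count as already seen.
        for (size_t i = names.size(); i < leaves_; ++i)
            mark(leaves_ + i);
    }

    // Mark a file as seen; names we aren't looking for are ignored.
    void saw(const std::string& name) {
        auto it = index_.find(name);
        if (it != index_.end()) mark(it->second);
    }

    // True iff every required file has been seen.
    bool complete() const { return have_[1]; }

private:
    void mark(size_t i) {
        have_[i] = true;
        // Walk up, setting each ancestor to the && of its children;
        // stop as soon as the sibling subtree is still incomplete.
        for (i /= 2; i >= 1 && have_[2 * i] && have_[2 * i + 1]; i /= 2)
            have_[i] = true;
    }

    size_t leaves_ = 1;                        // leaf count (power of two)
    std::vector<char> have_;                   // 1-indexed heap layout
    std::unordered_map<std::string, size_t> index_;
};
```

(In practice, a hash set you erase from, with `empty()` as the "have everything?" query, gets the same O(1) answer with less machinery; the tree is mainly interesting if entries can also flip back to false.)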
Sadly, there was no standard way to list a directory until C++17, which added `std::filesystem::directory_iterator`.
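Here is a minimal sketch of the whole plan on top of it, assuming C++17; the directory path and file names are placeholders:

```cpp
#include <filesystem>
#include <iostream>
#include <string>
#include <unordered_set>

namespace fs = std::filesystem;

int main() {
    // Hypothetical directory and required file names, for illustration.
    const fs::path dir = "/tmp/mydata";
    std::unordered_set<std::string> wanted = {"a.dat", "b.dat", "c.dat"};

    // One pass over the directory: check off (erase) each name we see,
    // stopping early once there is nothing left to find.
    for (const auto& entry : fs::directory_iterator(dir)) {
        wanted.erase(entry.path().filename().string());
        if (wanted.empty()) break;
    }

    std::cout << (wanted.empty() ? "all present\n" : "some missing\n");
}
```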
Of course there is a corresponding `boost::filesystem::directory_iterator`, which I presume will work in older versions of C++.
The closest thing to a standard C way is `opendir` and `readdir` from `dirent.h`. That is a standard C interface, it's just standardized in POSIX rather than in the C standard itself. It is available out-of-the-box on Mac OS, Linux, all the BSDs, other UNIX/UNIX-like systems, and any other POSIX/SUS system. For Windows, there is a `dirent.h` implementation that you just have to download and drop into your include path.
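The same sketch on top of `opendir`/`readdir`, assuming a POSIX system (paths and names are again placeholders):

```cpp
#include <dirent.h>

#include <cstdio>
#include <string>
#include <unordered_set>

int main() {
    // Hypothetical directory and required file names, for illustration.
    const char* dirpath = "/tmp/mydata";
    std::unordered_set<std::string> wanted = {"a.dat", "b.dat", "c.dat"};

    DIR* dir = opendir(dirpath);
    if (!dir) { perror("opendir"); return 1; }

    // readdir() looks like one OS round trip per entry, but the libc
    // implementation fetches directory entries from the kernel in batches.
    while (dirent* entry = readdir(dir)) {
        wanted.erase(entry->d_name);
        if (wanted.empty()) break;   // everything found; stop early
    }
    closedir(dir);

    std::printf(wanted.empty() ? "all present\n" : "some missing\n");
}
```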
However, since you're looking for the fastest way, you might want to look beyond the portable/standard stuff.
On Linux, you might be able to optimize your performance by manually specifying the buffer size with the raw system call `getdents64`.
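I haven't benchmarked this, but a minimal Linux-only sketch looks something like the following, calling `syscall(2)` directly with the record layout from the `getdents64(2)` man page (the flexible array member is a GCC/Clang extension in C++, and the buffer size is the tuning knob; paths and names are placeholders):

```cpp
#include <fcntl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#include <cstdio>
#include <string>
#include <unordered_set>

// Layout of the records getdents64 writes into the buffer, as
// documented in the getdents64(2) man page.
struct linux_dirent64 {
    ino64_t        d_ino;
    off64_t        d_off;
    unsigned short d_reclen;   // length of this whole record
    unsigned char  d_type;
    char           d_name[];   // NUL-terminated name (flexible array)
};

int main() {
    // Hypothetical directory and names; BUF_SIZE is the knob to tune.
    const char* dirpath = "/tmp/mydata";
    std::unordered_set<std::string> wanted = {"a.dat", "b.dat", "c.dat"};
    constexpr size_t BUF_SIZE = 1 << 20;   // 1 MiB of entries per syscall

    int fd = open(dirpath, O_RDONLY | O_DIRECTORY);
    if (fd < 0) { perror("open"); return 1; }

    alignas(linux_dirent64) static char buf[BUF_SIZE];
    for (;;) {
        long n = syscall(SYS_getdents64, fd, buf, BUF_SIZE);
        if (n < 0) { perror("getdents64"); break; }
        if (n == 0) break;                 // end of directory
        // Walk the variable-length records in this batch.
        for (long off = 0; off < n;) {
            auto* d = reinterpret_cast<linux_dirent64*>(buf + off);
            wanted.erase(d->d_name);
            off += d->d_reclen;
        }
        if (wanted.empty()) break;
    }
    close(fd);

    std::printf(wanted.empty() ? "all present\n" : "some missing\n");
}
```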
On Windows, after a bit of digging, it looks like for maximum performance you want to use `FindFirstFileEx` with `FindExInfoBasic` and `FIND_FIRST_EX_LARGE_FETCH` when you can, which a lot of the open-source libraries like the above `dirent.h` for Windows don't seem to do. But for code that needs to work with stuff older than the last couple Windows versions, you might as well just use the straightforward `FindFirstFile` without the extra flags.
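A sketch of the Windows version, assuming Windows 7 or later for those two flags (directory and names are placeholders):

```cpp
#include <windows.h>

#include <cstdio>
#include <string>
#include <unordered_set>

int main() {
    // Hypothetical directory and required file names, for illustration.
    // The trailing \* wildcard asks for every entry in the directory.
    std::unordered_set<std::wstring> wanted = {L"a.dat", L"b.dat", L"c.dat"};

    WIN32_FIND_DATAW ffd;
    HANDLE h = FindFirstFileExW(L"C:\\mydata\\*",
                                FindExInfoBasic,          // skip 8.3 short names
                                &ffd,
                                FindExSearchNameMatch,
                                nullptr,
                                FIND_FIRST_EX_LARGE_FETCH); // bigger batches
    if (h == INVALID_HANDLE_VALUE) {
        std::printf("FindFirstFileEx failed: %lu\n", GetLastError());
        return 1;
    }

    do {
        wanted.erase(ffd.cFileName);   // check off each entry we see
        if (wanted.empty()) break;
    } while (FindNextFileW(h, &ffd));
    FindClose(h);

    std::printf(wanted.empty() ? "all present\n" : "some missing\n");
}
```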
Plan 9 won't be covered by any of the above; there you'll want `dirread` or `dirreadall` (the latter if you can safely assume you have enough memory for the entire directory contents). If you want more control over the buffer size for performance, use plain `read` and decode the directory entry data yourself - it's in a documented machine-independent format, and I think helper functions are provided.
I don't know about any other operating systems.
I might edit this answer with some tests later. Others are welcome to edit in test results as well.
Comments:

- `boost::filesystem` seems to use `stat()`. (Assuming from the documentation.) I don't think you can do much faster for FS calls. The way to make what you're doing fast is "avoid looking at thousands of files." – Upholster
- `git push` probably doesn't bother to make sure you're not touching the working tree after the initial dirty check. – Upholster
- `unistd.h` is POSIX, not the pedantically defined libc. Of course, I can't think of a C/C++ implementation that wouldn't have it, or `stat()`, except for some sort of weird embedded environment which doesn't really fit the "thousands of files" requirement. That makes the "no OS APIs" restriction kind of dumb, seeing as fiddling with the filesystem is indeed the domain of the OS. – Upholster