I'll try to qualitatively answer how fast file existence tests are on tmpfs, and then I can suggest how you can make your whole program run faster.
First, tmpfs directory lookups rely (in the kernel) on directory entry cache hash table lookups, which aren't that sensitive to the number of files in your directory. They are affected, but sub-linearly. It has to do with the fact that properly-done hash table lookups take some constant time, O(1)
, regardless of the number of items in the hash table.
To explain, we can look at the work that is done by test -f
, or [ -f X ]
, from coreutils (gitweb):
case 'e':
unary_advance ();
return stat (argv[pos - 1], &stat_buf) == 0;
...
case 'f': /* File is a file? */
unary_advance ();
/* Under POSIX, -f is true if the given file exists
and is a regular file. */
return (stat (argv[pos - 1], &stat_buf) == 0
&& S_ISREG (stat_buf.st_mode));
So it uses stat()
on the filename directly. No directory listing is done explicitly by test
, but the runtime of stat
may be affected by the number of files in the directory. The completion time for the stat
call will depend on the unterlying filesystem implementation.
For every filesystem, stat will split up the path into directory components, and walk it down. For instance, for the path /tmp/hashes/the_md5
: first /
, gets its inode, then looks up tmp
inside it, gets that inode (it's a new mountpoint), then gets hashes
inode, and finally then the test filename and its inode. You can expect the inodes all the way to /tmp/hashes/
to be cached because they are repeated at each iteration, so those lookups are fast and likely don't require disk access. Each lookup will depend on the filesystem the parent directory is on. After the /tmp/
portion, lookups happen on tmpfs (which is all in memory, except if you ever run out of memory and need to use swap).
tmpfs in linux relies on simple_lookup
to obtain the inode of a file in a directory. tmpfs is located under its old name in the tree linux mm/shmem.c . tmpfs, much like ramfs, doesn't seem to be implementing data structures of its own to keep track of virtual data, it simply relies on VFS directory entry caches (under Directory Entry Caches).
Therefore, I suspect the lookup for a file's inode, in a directory, is as simple as a hash table lookup. I'd say that as long as all your temporary files fit in your memory, and you use tmpfs/ramfs, it doesn't matter how many files are there -- it's O(1) lookup each time.
Other filesystems like Ext2/3, however, will incur a penalty linear with the number of files present in the directory.
storing them in memory
As others have suggested, you may also store MD5s in memory by storing them in bash variables, and avoid the filesystem (and associated syscall) penalties. Storing them on a filesystem has the advantage that you could resume from where you left if you were to interrupt your loop (your md5 could be a symlink to the file whose digest matches, which you could rely on, on subsequent runs), but is slower.
MD5=d41d8cd98f00b204e9800998ecf8427e
let SEEN_${MD5}=1
...
digest=$(md5hash_of <filename>)
let exists=SEEN_$digest
if [[ "$exists" == 1 ]]; then
# already seen this file
fi
faster tests
And you may use [[ -f my_file ]]
instead of [ -f my_file ]
. The command [[
is a bash built-in, and is much faster than spawning a new process (/usr/bin/[
) for each comparison. It will make an even bigger difference.
what is /usr/bin/[
/usr/bin/test
and /usr/bin/[
are two different programs, but the source code for [
(lbracket.c) is the same as test.c (again in coreutils):
#define LBRACKET 1
#include "test.c"
so they're interchangeable.
md5sum *
and compare the two text files you obtain. – Unbeknownmd5sum * | sort -u -k 1
but this is a clever idea, indeed, I admit. – Kevindiff
on the files before deleting them. – Kevin