Time complexity of finding duplicate files in bash
H

4

6

I had to write a Bash script to delete duplicate files today, using their md5 hashes. I stored those hashes as files in a temporary directory:

for i in * ; do
    hash=$(md5sum "$i" | cut -d " " -f1) ;
    if [ -f "/tmp/hashes/$hash" ] ;
    then
        echo "Deleted $i" ;
        mv "$i" /tmp/deleted ;
    else
        touch "/tmp/hashes/$hash" ;
    fi ;
done

It worked perfectly, but led me to wonder: is it a time-efficient way of doing that? I initially thought of storing the MD5 hashes in a file, but then I thought "no, because checking whether a given MD5 is in this file requires re-reading it entirely every time". Now, I wonder: is it the same with the "create files in a directory" method? Does the Bash [ -f ] check have linear or quasi-constant complexity when there are lots of files in the same directory?

If it depends on the filesystem, what's the complexity on tmpfs?

Hustings answered 1/8, 2015 at 17:37 Comment(15)
Use Awk with an associative array if the file system gets too slow.Unbeknown
I'd hope for it to be roughly logarithmic (in the number of files) for any decent file system but you'll still be much faster storing the hashes in an in-memory hash table. If you can use Python, for example, this would be a trivial thing to do.Kevin
Or just do md5sum * and compare the two text files you obtain.Unbeknown
@Unbeknown What two files? I think you mean something like md5sum * | sort -u -k 1 but this is a clever idea, indeed, I admit.Kevin
@Ted: there may be two different files with the same md5sum.Frizzly
You should use something like SHA-256 instead of MD5, because MD5 is known to have hash collisions.Landri
One from the original set of files and another from the set you are comparing against. If it's all just one set, just sort and remove adjacent duplicates.Unbeknown
@Unbeknown As I understand, OP is interested in the second scenario. Hence my comment about sorting.Kevin
@NayukiMinase MD5 is probably good enough for this. It's not secure against a knowledgeable adversary, but the chance that two random files will have a collision is still only 1 in 2^128, or approximately 1 in 3.4e38. (Though granted, if you are on a fast enough system, feel free to use a slower checksum if you're extremely paranoid.)Unbeknown
@NayukiMinase I think MD5 is well-suited for this task since it is fast and collisions are expected to be extremely rare. If you want to be sure, I'd rather do a diff on the files before deleting them.Kevin
Also, all hashes will generate collisions, since the space of possible hashes is far smaller than the set of objects you might want to hash. The problem with MD5 is that there are techniques for generating a file with the same hash as a given file.Barthol
If you need your record of deleted files to persist on disk then using file names in a known directory is probably as good an approach as any. I don't see any reason to think that recording them in a regular file instead would provide better performance.Bassett
As for suggestions to make it faster: I know that it was not an optimal solution (I don't know the complexity, and there's useless I/O) and there are better ones. I don't think the use of MD5 is an issue, as according to this link, I would need 2^64 files to get to a 50% chance of a random collision (unless someone is actively trying to generate collisions, but this is not a likely scenario here). My question was curiosity about the complexity of file checking, not asking for a better option =)Hustings
As an aside, a good tool for the job (and there are many, already written and actively maintained by 3rd parties) will not hash all files, but only run hashes at all in the case where multiple files exist with the same size.Shane
Also, a reasonably-intelligent algorithm will check that two potentially-identical directory entries don't point to the same inode number on the same filesystem before bothering to read contents -- because if they do, you already know they're identical, so you don't need to go into contents.Shane
V
1

I'll try to qualitatively answer how fast file existence tests are on tmpfs, and then I can suggest how you can make your whole program run faster.

First, tmpfs directory lookups rely (in the kernel) on directory entry cache hash table lookups, which aren't that sensitive to the number of files in your directory. They are affected, but sub-linearly. It has to do with the fact that properly-done hash table lookups take some constant time, O(1), regardless of the number of items in the hash table.

To explain, we can look at the work that is done by test -f, or [ -f X ], from coreutils (gitweb):

case 'e':
   unary_advance ();
   return stat (argv[pos - 1], &stat_buf) == 0;
... 
case 'f':                   /* File is a file? */
   unary_advance ();
   /* Under POSIX, -f is true if the given file exists
      and is a regular file. */
   return (stat (argv[pos - 1], &stat_buf) == 0
           && S_ISREG (stat_buf.st_mode));

So it uses stat() on the filename directly. No directory listing is done explicitly by test, but the runtime of stat may be affected by the number of files in the directory. The completion time for the stat call will depend on the underlying filesystem implementation.
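If you want to see this for yourself, you can trace the external test binary and watch it make a single stat-family call on the target path (a quick illustration of my own, not from the coreutils source above; the exact syscall you see, e.g. stat, statx or newfstatat, depends on your libc and architecture):

# Trace the external binary; the bash builtin [ would not be traced this way.
strace /usr/bin/test -f /tmp/hashes/d41d8cd98f00b204e9800998ecf8427e 2>&1 | grep stat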

For every filesystem, stat splits the path into directory components and walks it down. For instance, for the path /tmp/hashes/the_md5: it starts at /, gets its inode, then looks up tmp inside it and gets that inode (it's a new mountpoint), then gets the hashes inode, and finally the inode of the tested filename itself. You can expect the inodes all the way down to /tmp/hashes/ to be cached, because they are looked up at each iteration, so those lookups are fast and likely don't require disk access. Each lookup depends on the filesystem the parent directory is on. After the /tmp/ portion, lookups happen on tmpfs (which is all in memory, except if you ever run out of memory and need to use swap).

tmpfs in Linux relies on simple_lookup to obtain the inode of a file in a directory. tmpfs lives under its old name, shmem, in the kernel tree at mm/shmem.c. tmpfs, much like ramfs, doesn't implement data structures of its own to keep track of virtual data; it simply relies on the VFS directory entry cache (see Directory Entry Caches).

Therefore, I suspect the lookup for a file's inode, in a directory, is as simple as a hash table lookup. I'd say that as long as all your temporary files fit in your memory, and you use tmpfs/ramfs, it doesn't matter how many files are there -- it's O(1) lookup each time.

Other filesystems, like ext2 or ext3 without directory indexing (dir_index), will incur a penalty linear in the number of files present in the directory.
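If you'd rather not take my word for it, here is a rough benchmark sketch of my own (not a rigorous measurement; it assumes /dev/shm is a tmpfs mount, which is the case on most Linux distributions, and that GNU seq and xargs are available):

#!/bin/bash
# Grow a tmpfs directory and time a fixed batch of existence tests at each size.
dir=/dev/shm/lookup_bench
rm -rf "$dir" && mkdir -p "$dir"

for n in 1000 10000 100000; do
    # pad the directory out to n entries (files f1 .. fn)
    seq -f "$dir/f%g" 1 "$n" | xargs touch
    echo "== $n entries =="
    # 5000 existence checks against files that are all present
    time for i in $(seq 1 5000); do
        [[ -f "$dir/f$i" ]]
    done
done

rm -rf "$dir"

If the lookups really are hash-table based, the three timings should be close to each other; a linear directory scan would make the last run noticeably slower.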

storing them in memory

As others have suggested, you may also store the MD5s in memory, in bash variables, and avoid the filesystem (and associated syscall) penalties. Storing them on a filesystem has the advantage that you could resume from where you left off if you were to interrupt your loop (the hash file could be a symlink to the file whose digest matches, which you could rely on in subsequent runs), but it is slower.

MD5=d41d8cd98f00b204e9800998ecf8427e
let SEEN_${MD5}=1                          # mark this digest as seen
...
digest=$(md5sum "$filename" | cut -d " " -f1)
let exists=SEEN_$digest                    # 1 if this digest was marked, 0 otherwise
if [[ "$exists" == 1 ]]; then
   # already seen this file
fi

faster tests

You may also use [[ -f my_file ]] instead of [ -f my_file ]. In bash, [[ is a shell keyword and [ is a builtin, so neither spawns a new process; [[ is still a little faster because it is parsed specially and skips some of the argument processing that [ performs. The external /usr/bin/[ binary only comes into play in shells that lack a builtin version, and there forking a process for each comparison really would dominate the runtime.
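To get a feel for the relative costs on your own machine, here is a small sketch (my addition; it assumes coreutils provides an external /usr/bin/[, which is the usual case on Linux):

f=$(mktemp)                                               # any existing file will do
time for i in {1..1000}; do [ -f "$f" ]; done             # builtin [
time for i in {1..1000}; do [[ -f "$f" ]]; done           # [[ keyword
time for i in {1..1000}; do /usr/bin/[ -f "$f" ]; done    # external binary, one fork per test
rm -f "$f"

The first two should be within the same order of magnitude; the third is typically orders of magnitude slower.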

what is /usr/bin/[

/usr/bin/test and /usr/bin/[ are two different programs, but the source code for [ (lbracket.c) is the same as test.c (again in coreutils):

#define LBRACKET 1
#include "test.c"

so they're interchangeable.

Varicocele answered 2/12, 2015 at 2:24 Comment(0)
B
2

I'm a fan of using the right tool for the job. In this case, you only want to find duplicate files. I've tested this against several thousand files at my disposal, and re-reading the hash file did not seem to cause any problems. (As it turned out, I had hundreds of duplicate files.) When I stored hashes in separate files and then processed this large quantity of files, my system slowly crept along once about 10,000 hash files had accumulated in one directory. Keeping all of the hashes in a single file sped this up greatly.

# This uses md5deep.  An alternate is presented later.
md5deep -r some_folder > hashes.txt

# If you do not have md5deep
find . -type f -exec md5sum {} \; > hashes.txt

This gives you hashes of everything.

cut -b -32 hashes.txt | sort | uniq -d > dupe_hashes.txt

That will use cut to get the hash for each file, sort the hashes, then find any duplicated hashes. Those are written to dupe_hashes.txt without the filenames attached. Now we need to map hashes back to the files.

(for hash in $(cat dupe_hashes.txt); do
    grep "^$hash" hashes.txt | tail -n +2 | cut -b 35-
done) > dupe_files.txt

This does not appear to run slowly for me. The Linux kernel does a very good job of keeping files like this in memory instead of reading them from the disk frequently. If you prefer to force this to be in memory, you could just use /dev/shm/hashes.txt instead of hashes.txt. I've found that it was unnecessary in my tests.

That gives you every file that's a duplicate. So far, so good. You'll probably want to review this list. If you want to list the original one as well, remove the tail -n +2 | bit from the command.

When you are comfortable that you can delete every listed file you can pipe things to xargs. This will delete the files in groups of 50.

xargs -L 50 rm < dupe_files.txt
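If any of the duplicate paths contain spaces, plain xargs will split them apart. A possibly safer variant, assuming GNU xargs (for the -d option) and one path per line in dupe_files.txt:

# Treat each input line as a single argument; delete 50 files per rm invocation.
xargs -d '\n' -n 50 rm -- < dupe_files.txt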
Bohemia answered 30/9, 2015 at 19:43 Comment(3)
Why would you calculate the md5sum of a file if nothing else has exactly the same size? There's no possible duplicate. Tools like fdupes already know this.Shane
And if there are only two files with a given size, you're better off comparing them in chunks so you can stop without reading either in entirety should you find a place where they differ early. There are lots of optimizations that the naive let's-just-md5sum-everything approach discards.Shane
(Checking that two directory entries for which stat returns identical size don't point to the same inode before hashing them twice is another such simple, cheap optimization).Shane
C
0

The choice between reading the contents of a file containing hashes and finding a hash in a directory of filenames that are the hashes basically comes down to "is the kernel quicker at reading a directory or your program at reading a file". Both are going to involve a linear search for each hash, so you end up with much the same behaviour. You can probably argue that the kernel should be a little quicker, but the margin won't be great. Note that most often, the linear search will be exhaustive because the hash won't exist (unless you have lots of duplicate files). So, if you're processing a few thousand files, the searches will process a few million entries overall — it is quadratic behaviour.

If you have many hundreds or thousands of files, you'd probably do better with a two-level hierarchy — for example, a directory containing two-character sub-directories 00 .. FF, and then storing the rest of the name (or the full name) in the sub-directory. A minor variation of this technique is used in the terminfo directories, for example. The advantage is that the kernel only has to read relatively small directories to find whether the file is present or not.
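As a sketch of what that layout might look like in the original script (my illustration, adapting the loop from the question; not tested code from this answer):

hash=$(md5sum "$i" | cut -d " " -f1)
prefix=${hash:0:2}                        # first two hex digits pick the sub-directory
mkdir -p "/tmp/hashes/$prefix"
if [ -f "/tmp/hashes/$prefix/$hash" ]; then
    mv "$i" /tmp/deleted
else
    touch "/tmp/hashes/$prefix/$hash"
fi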

Cameron answered 4/8, 2015 at 22:45 Comment(0)
S
0

I haven't "hashed" this out, but I'd try storing your md5sums in a bash hash.

See How to define hash tables in Bash?

Store the md5sum as the key and, if you want, the filename as the value. For each file, just see if the key already exists in the hash table. If so, you don't care about the value, but you could use it to print out the original duplicate file's name. Then delete the current file (the one with the duplicate key). Not being a bash expert, that's where I'd start looking.
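A minimal sketch of that idea, assuming bash 4 or later for associative arrays (declare -A) and reusing the layout of the script from the question:

#!/bin/bash
declare -A seen                           # hash -> first filename seen with that digest
mkdir -p /tmp/deleted
for i in * ; do
    [ -f "$i" ] || continue
    hash=$(md5sum "$i" | cut -d " " -f1)
    if [[ -n "${seen[$hash]}" ]]; then
        echo "Deleted $i (duplicate of ${seen[$hash]})"
        mv -- "$i" /tmp/deleted/
    else
        seen[$hash]=$i
    fi
done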

Stunsail answered 6/8, 2015 at 15:43 Comment(0)
