diff files comparing only first n characters of each line

I have two files. Let us call them md5s1.txt and md5s2.txt. Both contain the output of a

find . -type f -print0 | xargs -0 md5sum | sort > md5s.txt

command, run in different directories. Many files were renamed, but their content stayed the same, so they should have the same md5sum. I want to generate a diff like

diff md5s1.txt md5s2.txt

but it should compare only the first 32 characters of each line, i.e. only the md5sum, not the filename. Lines with equal md5sum should be considered equal. The output should be in normal diff format.

Retentive answered 18/5, 2011 at 15:15 Comment(0)

Easy starter:

diff <(cut -d' ' -f1 md5s1.txt) <(cut -d' ' -f1 md5s2.txt)

Also, consider just

diff -EwburqN folder1/ folder2/
Niggle answered 18/5, 2011 at 15:43 Comment(2)
Extending this answer: if you really want the first n characters, use something like diff <(cut -b-80 dump.csv) <(cut -b-80 dump2.csv) (here, n=80). – Dialectology
Extending the above comment: if you just want to compare the md5, which is a 32-character hex string (128 bits), the cut would be diff <(cut -c-32 f1.txt | sort) <(cut -c-32 f2.txt | sort). This could also be written as cut -b-32 or cut -c1-32, etc., but cut -d' ' -f1 is convenient in that you don't have to count characters. Also note that not all of those diff options are available everywhere (e.g. macOS diff has no -E), and that diff doesn't solve the OP's problem anyway. Finally, for the original problem I actually use fdupes. – Symphony
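The process-substitution approach can be sanity-checked with a tiny pair of files (the checksums and file names below are made up for illustration):

```shell
# Two hypothetical checksum files: 'aaa' was renamed between runs,
# while 'bbb' vs 'ccc' are genuinely different content.
printf 'aaa  old_name.txt\nbbb  common.txt\n' > md5s1.txt
printf 'aaa  new_name.txt\nccc  extra.txt\n'  > md5s2.txt

# Compare only the first (checksum) field: the rename produces no
# output, and the differing checksums appear in normal diff format.
diff <(cut -d' ' -f1 md5s1.txt) <(cut -d' ' -f1 md5s2.txt)
```

This prints `2c2`, `< bbb`, `---`, `> ccc`: only the second line differs once file names are stripped.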

Compare only the md5 column by running diff on <(cut -c -32 md5sums.sort.XXX), and tell diff to print just the line numbers of added or removed lines, using --old/new-line-format='%dn'$'\n'. Pipe this into ed -s md5sums.sort.XXX: a bare line number is an ed command that prints that line, so only those lines of the md5sums.sort.XXX file are written out (-s suppresses ed's byte-count diagnostics, which would otherwise pollute the output).

diff \
    --new-line-format='%dn'$'\n' \
    --old-line-format='' \
    --unchanged-line-format='' \
    <(cut -c -32 md5sums.sort.old) \
    <(cut -c -32 md5sums.sort.new) \
    | ed -s md5sums.sort.new \
    > files-added
diff \
    --new-line-format='' \
    --old-line-format='%dn'$'\n' \
    --unchanged-line-format='' \
    <(cut -c -32 md5sums.sort.old) \
    <(cut -c -32 md5sums.sort.new) \
    | ed -s md5sums.sort.old \
    > files-removed

The problem with ed is that it loads the entire file into memory, which can be a problem if you have a lot of checksums. Instead of piping the output of diff into ed, pipe it into the following command, which uses far less memory.

diff … | (
    lnum=0
    while read -r lprint; do
        while [ "$lnum" -lt "$lprint" ]; do
            read -r line <&3
            lnum=$((lnum + 1))
        done
        printf '%s\n' "$line"
    done
) 3<md5sums.sort.XXX
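The same line-picking can also be written as a short awk program, which streams the checksum file and keeps only the wanted line numbers in memory (a sketch; `md5sums.sort.XXX` stands for either file, as above):

```shell
# First input (stdin, '-'): line numbers emitted by diff, one per line.
# Second input: the checksum file to pick those lines from.
# NR == FNR is true only while reading the first input, so those lines
# fill the 'want' set; afterwards, lines whose number is in the set print.
diff … | awk 'NR == FNR { want[$1]; next } FNR in want' - md5sums.sort.XXX
```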
Brazilein answered 18/9, 2011 at 12:28 Comment(0)

If you are looking for duplicate files, fdupes can do this for you:

$ fdupes --recurse .

On Ubuntu you can install it with

$ sudo apt-get install fdupes
Drew answered 18/9, 2011 at 14:5 Comment(0)