Compare images and remove duplicates

Asked 5/5, 2015 at 17:23 Answered 20/6, 2022 at 19:52

I have two folders of images, all PNGs. One folder is a copy of the other with some images changed and some added. The filenames are the same but the image contents may be different. Other attributes like time stamps are completely random, unfortunately.

In the newer folder, I want to remove the duplicates (by content) and keep only the updated and the new images.

I installed ImageMagick to use the compare command but I can't figure it out. :-( Can you help me please? Thanks in advance!

Added: I'm on Mac OS X.

Stevie answered 5/5, 2015 at 17:23 Comment(3)

From the looks of it, ImageMagick's compare requires an output filename (that you can look at to see where the difference is). If the image is identical, why not use md5sum instead? – Encipher 5/5, 2015 at 17:28

If that works I'd be set. How would I do this with md5sum? – Stevie 5/5, 2015 at 18:16

md5sum will be fooled by timestamps and such. Use "identify" instead (see Mark's answer, below). – Koenraad 5/5, 2015 at 18:51

You don't say whether you are on OSX/Linux or Windows, but I can get you started. ImageMagick can calculate a hash (checksum) of all the pixel data in an image, regardless of date or timestamp, like this:

identify -format "%# %f\n" *.png

25a3591a58550edd2cff65081eab11a86a6a62e006431c8c4393db8d71a1dfe4 blue.png
304c0994c751e75eac86bedac544f716560be5c359786f7a5c3cd6cb8d2294df green.png
466f1bac727ac8090ba2a9a13df8bfb6ada3c4eb3349087ce5dc5d14040514b5 grey.png
042a7ebd78e53a89c0afabfe569a9930c6412577fcf3bcfbce7bafe683e93e8a hue.png
d819bfdc58ac7c48d154924e445188f0ac5a0536cd989bdf079deca86abb12a0 lightness.png
b63ad69a056033a300f23c31f9425df6f469e79c2b9f3a5c515db3b52c323a65 montage.png
a42a5f0abac3bd2f6b4cbfde864342401847a120dacae63294edb45b38edd34e red.png
10bf63fd725c5e02c56df54f503d0544f14f754d852549098d5babd8d3daeb84 sample.png
e95042f227d2d7b2b3edd4c7eec05bbf765a09484563c5ff18bc8e8aa32c1a8e sat.png

So, if you do that in each folder you will have the checksums of all the files with their names beside them in a separate file for each folder.

If you then merge the two files and sort them you can find duplicates quite easily since the duplicated files will come up next to each other.
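As a minimal sketch of that merge-and-sort step, assuming two hash lists in the "hash filename" format shown above. The file names a.txt and b.txt and the shortened hashes are made up for illustration, so the example runs without ImageMagick:

```shell
# Hypothetical hash lists, one "hash filename" pair per line.
tmp=$(mktemp -d)
printf '%s\n' 'aaa111 blue.png' 'bbb222 green.png' > "$tmp/a.txt"
printf '%s\n' 'aaa111 blue.png' 'ccc333 green.png' 'ddd444 new.png' > "$tmp/b.txt"

# Merge both lists and sort them so equal hashes land on adjacent lines,
# then print the filename of every line whose hash equals the previous one.
dups=$(sort "$tmp/a.txt" "$tmp/b.txt" | awk '$1 == prev {print $2} {prev = $1}')
echo "$dups"
rm -r "$tmp"
```

Here only blue.png has the same hash in both lists; green.png shares a name but not its content, so it is not reported.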

Let's say you run the identify command in two folders, dira and dirb (assumed here to sit side by side), like this:

cd dira
identify -format "%# %f\n" *.png > $HOME/dira

cd ../dirb
identify -format "%# %f\n" *.png > $HOME/dirb

Then you could do something like this in awk:

awk 'FNR==NR{name[$1]=$2;next}
            { 
               if($1 in name){print $2 " duplicates " name[$1]}
            }' $HOME/dir*

So, the $HOME/dir* part passes both files into awk. The block in {} after FNR==NR applies only to the first file read: as it is read, we build an associative array name[], indexed by hash, that holds the filenames. Then, while reading the second file, we check whether each hash has already been seen. If it has, we report a duplicate, printing the name from the second file ($2) and the name we saved from the first file (name[$1]).
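To go from that report to actually removing files, one option is to print only the duplicate names from the second list, review them, and then delete them from the newer folder. This is a rough sketch, not a tested tool: it assumes dira is the old folder, dirb the new one, and that filenames contain no whitespace. The temporary directory, the hashes_a/hashes_b/to_delete file names and the shortened hashes are made up so the example runs without ImageMagick:

```shell
# Stand-ins for the real folders and hash lists (hypothetical sample data).
work=$(mktemp -d)
mkdir "$work/dira" "$work/dirb"
touch "$work/dira/blue.png" "$work/dira/green.png"
touch "$work/dirb/blue.png" "$work/dirb/green.png" "$work/dirb/new.png"
printf '%s\n' 'aaa111 blue.png' 'bbb222 green.png' > "$work/hashes_a"
printf '%s\n' 'aaa111 blue.png' 'ccc333 green.png' 'ddd444 new.png' > "$work/hashes_b"

# First file: remember every hash. Second file: print names whose hash was seen.
awk 'FNR==NR {seen[$1]; next} $1 in seen {print $2}' \
    "$work/hashes_a" "$work/hashes_b" > "$work/to_delete"

# Review to_delete before this step; it removes files from the newer folder only.
while IFS= read -r name; do
    rm -- "$work/dirb/$name"
done < "$work/to_delete"

remaining=$(ls "$work/dirb")
echo "$remaining"
rm -r "$work"
```

With the sample data, blue.png is removed from dirb, while the changed green.png and the added new.png stay. Back up the folder before running anything like this on real images.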

This won't work with filenames with spaces in them, so if that is a problem, change the identify command to put a colon between the hash and the filename like this:

identify -format "%#:%f\n" *.png

and change the awk to awk -F":" and it should work again.
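A small sketch of the colon-separated variant, using made-up hash lists (a.txt, b.txt and the shortened hashes are illustrative only) to show that filenames with spaces survive. It assumes no filename contains a colon:

```shell
# Hypothetical colon-separated hash lists; note the spaces in the filenames.
tmp=$(mktemp -d)
printf '%s\n' 'aaa111:my photo.png' 'bbb222:green.png' > "$tmp/a.txt"
printf '%s\n' 'aaa111:copy of photo.png' 'ccc333:green.png' > "$tmp/b.txt"

# -F":" splits on the colon, so $2 holds the whole filename, spaces included.
report=$(awk -F":" 'FNR==NR {name[$1]=$2; next}
                    $1 in name {print $2 " duplicates " name[$1]}' "$tmp/a.txt" "$tmp/b.txt")
echo "$report"
rm -r "$tmp"
```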

Grillparzer answered 5/5, 2015 at 18:14 Comment(2)

Thank you, I will try this. I'm on OSX, by the way. – Stevie 5/5, 2015 at 18:18

Another version of the same idea: identify -format "%# %f\n" '*glob*' | sort -u -k1,1 | cut -d' ' -f2 – Zoologist 28/8, 2016 at 10:24

Here’s my ugly solution for PowerShell (which is now multi-platform). I wrote it for a one-off, but it should work. I tried to comment it a bit to compensate for how bad it is.

I’d back up your images before doing this, though. Just in case.

The catch here is that it only detects if each file is a duplicate of the previous one — if you need to check if each file is a duplicate of any other, you’ll want to nest another for() loop in there, which should be easy enough.

#get the list of files with imagemagick
#powershell handily populates $files as an array, split by line
#this will take a bit
$files = identify -format "%# %f\n" *.png

$arr = @()
foreach($line in $files) {
    #add 2 keys to the new array per line (hash and then filename)
    $arr += @($line.Split(" "))
}

#for every 2 keys (eg each hash)
for($i = 2; $i -lt $arr.Length; $i += 2) {
    #compare it to the last hash
    if($arr[$i] -eq $arr[$i-2]) {
        #print a helpful message and then delete
        echo "$($arr[$i].Substring(0,16)) = $($arr[$i-2].Substring(0,16)) (removing $($arr[$i+1]))"
        remove-item ($arr[$i+1])
    }
}
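For comparison, the any-to-any check mentioned above can be sketched without a nested loop by remembering every hash seen so far. This is written in shell with awk rather than PowerShell, and the sample list with shortened hashes is made up for illustration:

```shell
# Hypothetical output of: identify -format "%# %f\n" *.png
list=$(printf '%s\n' 'aaa111 a.png' 'bbb222 b.png' 'aaa111 c.png' 'bbb222 d.png' 'aaa111 e.png')

# seen[] counts how often each hash has appeared; any line whose hash was
# already counted duplicates some earlier file, adjacent or not.
later_dups=$(printf '%s\n' "$list" | awk 'seen[$1]++ {print $2}')
echo "$later_dups"
```

The first file with each hash (a.png and b.png here) is kept, and every later file with the same hash is listed as a duplicate.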

Bonus: To delete any images with a particular hash (an all black 640×480 png in my case):

for($i = 2; $i -lt $arr.Length; $i += 2) {
    if($arr[$i] -eq "f824c1a8a1128713f17dd8d1190d70e6012b509606d986e7a6c81e40b628df2b") {
        echo "$($arr[$i+1])"
        remove-item ($arr[$i+1])
    }
}

Double bonus: C code to check if a written image collides with a given hash in a hash/ folder and delete it if so — written for Windows/MinGW but shouldn’t be too hard to port if necessary. Might be superfluous but I figured I’d throw it out there in case it’s useful to anyone.

char filename[256] = "output/UNINITIALIZED.ppm";
unsigned long int timeint = time(NULL);
sprintf(filename, "../output/image%lu.ppm", timeint);
if(
    writeppm(
        filename,
        SCREEN_WIDTH,
        SCREEN_HEIGHT,
        screenSurface->pixels
        ) != 0
) {
    printf("image write error!\n");
    return;
}
char shacmd[256];
sprintf(shacmd, "sha256sum %s", filename);
FILE *file = popen(shacmd, "r");
if(file == NULL) {
    printf("failed to get image hash!\n");
    return;
}
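//the hash is 64 characters but we need a 0 at the end too
char sha[96];
int i;
int c; //int rather than char, so EOF can be distinguished from data
//get hash until the first space (0x20)
for(i = 0; i < 64; i++) {
    c = fgetc(file);
    if(c == EOF || c == 0x20) break;
    sha[i] = (char)c;
}
sha[i] = 0; //terminate the string before using it with %s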
pclose(file);

char hashfilename[256];
sprintf(hashfilename, "../output/hash/%s", sha);

if(_access(hashfilename, 0) != -1) {
    //file exists, delete img
    if(unlink(filename) != 0) {
        printf("image delete error!\n");
    }
} else {
    FILE *hashfile = fopen(hashfilename, "w");
    if(hashfile == NULL)
        printf("hash file write error!\nfilename: %s\n", hashfilename);
    else
        fclose(hashfile); //only close the file if it was actually opened
}

Erbil answered 26/8, 2016 at 21:33 Comment(0)

For macOS
- install fdupes with Homebrew
```
brew install fdupes
```
- delete duplicates immediately as they are encountered in current directory
```
fdupes -dI .
```
- read the options
```
fdupes -h
```

Weka answered 20/6, 2022 at 19:52 Comment(0)
