Reverse image search (for image duplicates) on a local computer

I have a bunch of poor quality photos that I extracted from a PDF. Somebody I know has the good quality photos somewhere on her computer (a Mac), but it's my understanding that it will be difficult to find them.

I would like to

  • loop through each poor quality photo
  • perform a reverse image search using each poor quality photo as the query image and this person's computer as the database to search for the higher quality images
  • and create a copy of each high quality image in one destination folder.

Example pseudocode

for each image in poorQualityImages:
    search ./macComputer for a higherQualityImage of image
    copy higherQualityImage to ./higherQualityImages

I need to perform this action once. I am looking for a tool, GitHub repo, or library that can perform this task, rather than a deep understanding of content-based image retrieval.


There's a post on reddit where someone was trying to do something similar

imgdupes is a program that seems to almost achieve this, but I do not want to delete the duplicates; I want to copy the highest quality duplicate to a destination folder.


Update

Emailed my previous image processing prof and he sent me this

Off the top of my head, nothing out of the box.

No guaranteed solution here, but you can narrow the search space. You’d need a little program that outputs the MSE or SSIM similarity index between two images, and then write another program or shell script that scans the hard drive and computes the MSE between each image on the hard drive and each query image, then check the images with the top X percent similarity score.

Something like that. Still not maybe guaranteed to find everything you want. And if the low quality images are of different pixel dimensions than the high quality images, you’d have to do some image scaling to get the similarity index. If the poor quality images have different aspect ratios, that’s even worse.

So I think it’s not hard but not trivial either. The degree of difficulty is partly dependent on the nature of the corruption in the low quality images.
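
A minimal sketch of the comparison he describes, assuming a plain rescale to a common size is acceptable before computing the score; the paths, the target size, and the use of MSE rather than SSIM are placeholder assumptions on my part, not part of his suggestion:

import numpy as np
from PIL import Image

def mse(path_a, path_b, size=(256, 256)):
    # Rescale both images to the same dimensions so the arrays line up,
    # then average the squared per-pixel difference (lower = more similar).
    a = np.asarray(Image.open(path_a).convert("RGB").resize(size), dtype=np.float64)
    b = np.asarray(Image.open(path_b).convert("RGB").resize(size), dtype=np.float64)
    return ((a - b) ** 2).mean()

# Rank every candidate on the drive against one query image, then inspect
# the best-scoring candidates by hand.
# scores = sorted((mse(query_path, c), c) for c in candidate_paths)

Swapping SSIM in for the MSE line (e.g. scikit-image's structural_similarity) would follow the same pattern.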


UPDATE

Github project I wrote which achieves what I want

Abiogenetic answered 2/5, 2020 at 3:1 Comment(5)
How are you planning to connect to the remote computer? This seems like two sub-tasks: without the networking part it is a manageable task, but if you're thinking of connecting to a remote device, more networking details are needed. – Kt
@ZabirAlNazi I'd make an executable with Automator or something and just send it to that person. – Abiogenetic
You can use imgdupes with the --dry-run option to avoid deleting the images, then process the output in a script to copy files as needed. Also, I'm not sure what the reason is for the tensorflow, keras or pytorch tags; please avoid using tags unrelated to the question. – Dickey
First of all you have to settle on a clear definition of quality, an index for comparison. In other words, you have to quantify the notion of quality. As an example, you can use algorithms developed for auto-focusing in cameras, i.e. methods which discriminate between a focused (clear) image and an unfocused (blurred) image. Of course there are plenty of features that can be used to quantify quality, including edges, feature points and so on. Choosing the right features depends on the texture of your images (of which you didn't share any!). – Ious
@jdehesa You should post that as an answer. – Abiogenetic

What you are looking for is called image hashing. In this answer you will find a basic explanation of the concept, as well as a go-to GitHub repo for a plug-and-play application.

Basic concept of Hashing

From the repo page: "We have developed a new image hash based on the Marr wavelet that computes a perceptual hash based on edge information with particular emphasis on corners. It has been shown that the human visual system makes special use of certain retinal cells to distinguish corner-like stimuli. It is the belief that this corner information can be used to distinguish digital images that motivates this approach. Basically, the edge information attained from the wavelet is compressed into a fixed length hash of 72 bytes. Binary quantization allows for relatively fast hamming distance computation between hashes. The following scatter plot shows the results on our standard corpus of images. The first plot shows the distances between each image and its attacked counterpart (e.g. the intra distances). The second plot shows the inter distances between altogether different images. While the hash is not designed to handle rotated images, notice how slight rotations still generally fall within a threshold range and thus can usually be matched as identical. However, the real advantage of this hash is for use with our mvp tree indexing structure. Since it is more descriptive than the dct hash (being 72 bytes in length vs. 8 bytes for the dct hash), there are much fewer false matches retrieved for image queries. "

Another blogpost for an in-depth read, with an application example.

Available Code and Usage

A GitHub repo can be found here; there are obviously more to be found. After importing the package, you can use it to generate and compare hashes:

>>> from PIL import Image
>>> import imagehash
>>> hash = imagehash.average_hash(Image.open('test.png'))
>>> print(hash)
d879f8f89b1bbf
>>> otherhash = imagehash.average_hash(Image.open('other.bmp'))
>>> print(otherhash)
ffff3720200ffff
>>> print(hash == otherhash)
False
>>> print(hash - otherhash)
36

The demo script find_similar_images, also in the mentioned GitHub repo, illustrates how to find similar images in a directory.
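
A condensed sketch (not the find_similar_images script itself) of how these hashes could drive the copy step from the question: hash every candidate image on the disk once, then copy the closest match for each low quality query. The directory names, file extensions and the distance threshold are placeholder assumptions:

import shutil
from pathlib import Path

from PIL import Image
import imagehash

# Hash every candidate image in the search tree once (placeholder directory).
candidates = {}
for p in Path("./macComputer").rglob("*"):
    if p.is_file() and p.suffix.lower() in {".jpg", ".jpeg", ".png", ".bmp"}:
        candidates[p] = imagehash.average_hash(Image.open(p))

dest = Path("./higherQualityImages")
dest.mkdir(exist_ok=True)

# For each poor quality query, copy the candidate whose hash is closest
# (smallest Hamming distance), provided it is close enough to plausibly
# be the same picture. The threshold of 10 is an arbitrary placeholder.
for query in Path("./poorQualityImages").iterdir():
    if not query.is_file():
        continue
    query_hash = imagehash.average_hash(Image.open(query))
    best = min(candidates, key=lambda p: query_hash - candidates[p])
    if query_hash - candidates[best] <= 10:
        shutil.copy(best, dest / best.name)

imagehash also ships phash, dhash and whash, which may be worth trying if average_hash produces too many false matches.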

Auburta answered 20/5, 2020 at 8:58 Comment(7)
Thanks! I adjusted the find_similar_images script to my specific use case and uploaded it to this GitHub repo: github.com/samgermain/Copy_Duplicate_Images – Abiogenetic
Feel free to use the repo as an example for your repo if you like. – Abiogenetic
Do you know how I could add functionality to my script so that, of the duplicates copied, only the highest quality one is saved? – Abiogenetic
Oh ok, you said "We have developed a new image hash..." so that's why I thought that. – Abiogenetic
So it turns out that most of the images I have are cropped. Do you have any advice to make my program work? – Abiogenetic
So you mean an image pair consists of two crops from the same image, but the crops themselves are different? – Auburta
Here's an example: imgur.com/a/dOs0567. I'm also just realizing that it looks like an inch of white space (snow) was added at the bottom of the lower quality image. – Abiogenetic

Premise

I'll focus my answer on the image processing part, as I believe implementation details, e.g. traversing a file system, are not the core of your problem. Also, all that follows is just my humble opinion; I am sure there are better ways to retrieve your images of which I am not aware. Anyway, I agree with what your prof said, and I'll follow the same line of thought, so I'll share some ideas on possible similarity indexes you might use.

Answer

  • MSE and SSIM - This is a possible solution, as suggested by your prof. As I assume the low quality images also have a different resolution than the good ones, remember to downsample the good ones (and not upsample the bad ones).
  • Image subtraction (1-norm distance) - Subtract the two images: if they are equal you'll get a black image, and if they are slightly different, the non-black pixels (or the sum of the pixel intensities) can be used as a similarity index. This is actually the 1-norm distance (a small sketch of this and the histogram index follows after this list).
  • Histogram distance - You can refer to this paper: https://www.cse.huji.ac.il/~werman/Papers/ECCV2010.pdf. Comparing the two images' histograms might be sufficiently robust for your task. Check out this question too: Comparing two histograms
  • Embedding learning - As I see you included tensorflow, keras or pytorch as tags, let's consider deep learning. This paper came to mind: https://arxiv.org/pdf/1503.03832.pdf. The idea is to learn a mapping from image space to a Euclidean space, i.e. to compute an embedding of the image. In the embedding hyperspace, images are points. The paper learns an embedding function by minimizing the triplet loss, which is meant to maximize the distance between images of different classes and minimize the distance between images of the same class. You could train such a model on a dataset like ImageNet, and augment the dataset by lowering the quality of the images in order to make the model "invariant" to differences in image quality (e.g. down-sampling followed by up-sampling, image compression, adding noise, etc.). Once you can compute embeddings, you can compute the Euclidean distance between them (as a substitute for the MSE). This might work better than MSE/SSIM as a similarity index. Repo of FaceNet: https://github.com/timesler/facenet-pytorch. Another general purpose approach (not related to faces) which might help you: https://github.com/zegami/image-similarity-clustering.
  • Siamese networks for predicting a similarity score - I am referring to this paper on face verification: http://bmvc2018.org/contents/papers/0410.pdf. The siamese network takes two images as input and outputs a value in [0, 1], which can be interpreted as the probability that the two images belong to the same class. You can train a model of this kind to predict 1 for image pairs of the form (good quality image, artificially degraded image). To degrade the images you can, again, combine e.g. down-sampling followed by up-sampling, image compression, adding noise, etc. Let the model predict 0 for pairs of different images. The output of the network can be used as a similarity index.
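
For concreteness, a small sketch of the image subtraction and histogram indexes above, assuming a plain rescale to a common size is acceptable; the file names, image size and bin count are placeholders:

import numpy as np
from PIL import Image

def load_gray(path, size=(256, 256)):
    # Grayscale + fixed size so the two arrays are directly comparable.
    return np.asarray(Image.open(path).convert("L").resize(size), dtype=np.float64)

def l1_distance(a, b):
    # Image subtraction: identical images give 0, similar ones a small value.
    return np.abs(a - b).mean()

def histogram_distance(a, b, bins=64):
    # Bin-wise L1 difference between normalized intensity histograms.
    ha, _ = np.histogram(a, bins=bins, range=(0, 255), density=True)
    hb, _ = np.histogram(b, bins=bins, range=(0, 255), density=True)
    return np.abs(ha - hb).sum()

a, b = load_gray("query.jpg"), load_gray("candidate.jpg")
print(l1_distance(a, b), histogram_distance(a, b))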

Remark 1

These different approaches can also be combined. They all provide you with similarity indexes, so you can very easily average the outcomes.

Remark 2

If you only need to do this once, the effort of implementing and training deep models might not be justified, so I would not suggest it. Still, you can consider it if you can't find any other solution and that Mac is REALLY FULL of images, making a manual search impossible.

Rupp answered 16/5, 2020 at 21:10 Comment(2)
These are good suggestions, but I think the OP was not only referring to the matching criteria here. – Kt
You can quickly compare the aspect ratio of the pictures, the average value, the standard deviation and other higher-order statistical moments. This will immediately discard unequal pairs of images. – Precautionary

If you look at the documentation of imgdupes you will see there is the following option:

--dry-run

dry run (do not delete any files)

So if you run imgdupes with --dry-run, you will get a listing of all the duplicate images, but it will not actually delete anything. You should be able to process that output to copy or move the images as you need.
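
A hedged sketch of that post-processing step. It assumes (an assumption, not checked against the imgdupes documentation) that the dry-run output has been redirected to a text file and lists each group of duplicates as consecutive file paths separated by blank lines, in the style of fdupes; the largest-file heuristic for "highest quality" and all paths are also placeholders:

import shutil
from pathlib import Path

dest = Path("./higherQualityImages")
dest.mkdir(exist_ok=True)

group = []
# The trailing "" flushes the last group at end of file.
for line in Path("dupes.txt").read_text().splitlines() + [""]:
    line = line.strip()
    if line:
        group.append(Path(line))
    elif group:
        # Crude quality proxy: keep the largest file in each duplicate group.
        best = max(group, key=lambda p: p.stat().st_size)
        shutil.copy(best, dest / best.name)
        group = []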

Dickey answered 26/5, 2020 at 10:18 Comment(0)

Try the similar image finder I have developed to address this problem. There is an explanation and the algorithm there, so you can implement your own version if needed.

Occasionalism answered 8/9, 2020 at 22:5 Comment(0)
