Near-Duplicate Image Detection [closed]

What's a fast way to sort a given set of images by their similarity to each other?

At the moment I have a system that does histogram analysis between two images, but this is a very expensive operation and seems like overkill.

Optimally I am looking for an algorithm that would give each image a score (for example an integer score, such as the RGB average) so I can just sort by that score. Identical scores or scores next to each other are possible duplicates.

0299393
0599483
0499994 <- possible dupe
0499999 <- possible dupe
1002039
4995994
6004994 

An RGB average per image sucks; is there something similarly simple that works better?

Monadism answered 23/6, 2009 at 20:15 Comment(4)
A key question, thinking about what you've written and about some of the answers to the related question that Naaff pointed out, you might want to more clearly define what "similarity" means. Would an image that is identical, but five pixels offset, be "similar"? Visually yes...but to an algorithm...probably not, unless you've thought of it, and accounted for it. Can you provide any more details? Would the duplicates be exact, or just "close"? Are you looking at scans where they could differ by a slight angle measure? How about intensity? There are a lot of variables here...Groping
How do 'duplicates' differ? e.g. Would they be images of the same location with a different pose / shift? You seem to want something that is O(n log n) in the number of images. Does anyone know if this is possible? It seems like it may be.Definite
@The Unknown: If you're not satisfied with any of the current answers, could you give us some more guidance? We've done our best to answer your question, but without any feedback we're unlikely to come up with something better.Intonation
This is currently one of the great unsolved problems in Computer Science. Good luck buddy.Kyanite

There has been a lot of research on image searching and similarity measures. It's not an easy problem. In general, a single int won't be enough to determine if images are very similar. You'll have a high false-positive rate.

However, since there has been a lot of research done, you might take a look at some of it. For example, this paper (PDF) gives a compact image fingerprinting algorithm that is suitable for finding duplicate images quickly and without storing much data. It seems like this is the right approach if you want something robust.

If you're looking for something simpler, but definitely more ad-hoc, this SO question has a few decent ideas.

Intonation answered 23/6, 2009 at 20:59 Comment(1)
that paper is from 2004, not sure if this is still the best answer?Subscribe

I would recommend moving away from using just an RGB histogram.

You can get a better digest of your image by taking a 2D Haar wavelet transform of it (it's a lot easier than it sounds: it's just a lot of averaging, plus some square roots used to weight the coefficients), retaining only the k largest-weighted coefficients as a sparse vector, normalizing that vector, and saving it to reduce its size. At a minimum you should rescale R, G, and B using perceptual weights beforehand, but I'd recommend switching to YIQ (or YCoCg, to avoid quantization noise) so you can sample the chrominance information with reduced importance.
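As an aside, the RGB-to-YIQ conversion is available in Python's standard library. A minimal sketch of down-weighting chrominance before building a digest might look like this (the function name and the 0.25 chroma weight are my own illustrative choices, not from the answer):

```python
import colorsys

def to_weighted_yiq(pixel, chroma_weight=0.25):
    """Convert an (r, g, b) pixel with components in [0, 1] to YIQ,
    scaling the chrominance channels (I, Q) down so luminance dominates
    any later similarity comparison."""
    r, g, b = pixel
    y, i, q = colorsys.rgb_to_yiq(r, g, b)
    return (y, i * chroma_weight, q * chroma_weight)
```

You would apply this per pixel before the wavelet step, so that most of the digest's energy comes from luminance structure rather than color noise.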

You can now use the dot product of two of these sparse normalized vectors as a measure of similarity. The image pairs with the largest dot products are going to be very similar in structure. This has the benefit of being slightly resistant to resizing, hue shifting and watermarking, and being really easy to implement and compact.

You can trade off storage and accuracy by increasing or decreasing k.
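A rough, pure-Python sketch of the digest described above (grayscale only, square images with power-of-two sides; the function names and the k=8 default are my own):

```python
import math

def haar_1d(vec):
    """Full 1D Haar transform (orthonormal): repeatedly replace the
    low-pass half with scaled averages and keep scaled differences."""
    vec = list(vec)
    n = len(vec)
    while n > 1:
        half = n // 2
        avg = [(vec[2*i] + vec[2*i+1]) / math.sqrt(2) for i in range(half)]
        dif = [(vec[2*i] - vec[2*i+1]) / math.sqrt(2) for i in range(half)]
        vec[:n] = avg + dif
        n = half
    return vec

def haar_2d(img):
    """2D Haar transform: rows first, then columns. img is a square
    list-of-lists whose side is a power of two."""
    rows = [haar_1d(r) for r in img]
    cols = [haar_1d(c) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]

def sparse_signature(img, k=8):
    """Keep the k largest-magnitude coefficients as a unit-length
    sparse vector {(row, col): value}."""
    coeffs = haar_2d(img)
    flat = [((r, c), v) for r, row in enumerate(coeffs)
                        for c, v in enumerate(row)]
    top = sorted(flat, key=lambda kv: abs(kv[1]), reverse=True)[:k]
    norm = math.sqrt(sum(v * v for _, v in top)) or 1.0
    return {pos: v / norm for pos, v in top}

def similarity(sig_a, sig_b):
    """Dot product of two sparse signatures; 1.0 means identical structure."""
    return sum(v * sig_b.get(pos, 0.0) for pos, v in sig_a.items())
```

Identical images score 1.0; structurally unrelated images score near 0. Increasing k trades storage for accuracy exactly as described.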

Sorting by a single numeric score is going to be intractable for this sort of classification problem. If you think about it, it would require images to be able to 'change' along only one axis, but they don't. This is why you need a vector of features. In the Haar wavelet case it's approximately where the sharpest discontinuities in the image occur. You can compute a distance between images pairwise, but since all you have is a distance metric, a linear ordering has no way to express a 'triangle' of 3 images that are all equally distant. (i.e. think of an image that is all green, an image that is all red and an image that is all blue.)

That means that any real solution to your problem will need O(n^2) operations in the number of images you have. Whereas if it had been possible to linearize the measure, you could require just O(n log n), or O(n) if the measure was suitable for, say, a radix sort. That said, you don't need to spend O(n^2), since in practice you don't need to sift through the whole set; you just need to find the stuff that's nearer than some threshold. So by applying one of several techniques to partition your sparse vector space you can obtain much faster asymptotics for the 'find the k images more similar than a given threshold' problem than naively comparing every image against every image, giving you what you likely need... if not precisely what you asked for.

In any event, I used this personally a few years ago to good effect when trying to minimize the number of different textures I was storing, and there has also been a lot of research in this space showing its efficacy (in this case comparing it to a more sophisticated form of histogram classification):

http://www.cs.princeton.edu/cass/papers/spam_ceas07.pdf

If you need better accuracy in detection, the minHash and tf-idf algorithms can be used with the Haar wavelet (or the histogram) to deal with edits more robustly:

http://cmp.felk.cvut.cz/~chum/papers/chum_bmvc08.pdf

Finally, Stanford has an image search based on a more exotic variant of this kind of approach, based on doing more feature extraction from the wavelets to find rotated or scaled sections of images, etc, but that probably goes way beyond the amount of work you'd want to do.

http://wang14.ist.psu.edu/cgi-bin/zwang/regionsearch_show.cgi

Socha answered 2/7, 2009 at 20:23 Comment(4)
It seems like you're indirectly describing kd-trees and the like for searching the space for potential candidates. It might be worth noting this.Loney
Well, the reason I didn't specify techniques beyond sort of a vague allusion is that kd-trees work well when you have a relatively small number of dimensions in your space. Here you likely have ~128 or more dimensions which are sparsely populated. Since they are sparse the majority of the values will be zero, so going round-robin across the dimensions to partition in kd-style is actually almost useless. By the same token R-trees break down, leaving most likely as your best bet: X-trees. Unfortunately, they are also near the limit of their performance when faced with that many dimensions.Socha
" and just retain the k largest weighted coefficients in the wavelet as a sparse vector, " - retain per row or for entire wavelet?Fraktur
"You should rescale R G and B using perceptual weights beforehand at least or I'd recommend switching to YIQ (or YCoCg, to avoid quantization noise) so you can sample chrominance information with reduced importance." - and what then? Do wavelet for Y only or do it for all channels? If do for all channels - how to measure similarity of images with multiple channels? add dot products of each channel and account this as similarity measure or should be some weighted addition?Fraktur

I implemented a very reliable algorithm for this called Fast Multiresolution Image Querying. My (ancient, unmaintained) code for that is here.

What Fast Multiresolution Image Querying does is split the image into three channels based on the YIQ colorspace (better for matching differences than RGB). Then the image is essentially compressed using a wavelet algorithm until only the most prominent features from each channel remain. These points are stored in a data structure. Query images go through the same process, and the prominent features in the query image are matched against those in the stored database. The more matches, the more likely the images are similar.

The algorithm is often used for "query by sketch" functionality. My software only allowed entering query images via URL, so there was no user interface. However, I found it worked exceptionally well for matching thumbnails to the large version of that image.
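A sketch of the matching step described above, assuming the wavelet coefficients are already computed (the real FMIQ scoring also weights matches by coefficient bin; this simplified version, with names of my own choosing, just counts shared features):

```python
def prominent_features(coeffs, m=40):
    """Given a flat list of wavelet coefficients, keep only the positions
    and signs of the m largest-magnitude nonzero ones, as a set of
    (position, is_positive) pairs."""
    ranked = sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]),
                    reverse=True)
    return {(i, coeffs[i] > 0) for i in ranked[:m] if coeffs[i] != 0}

def match_score(query_feats, stored_feats):
    """Count shared (position, sign) features; more matches = more similar."""
    return len(query_feats & stored_feats)
```

Because only positions and signs are stored, the per-image footprint stays tiny, which is what makes database-scale matching practical.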

Heliogravure answered 2/7, 2009 at 7:20 Comment(2)
Can it still recognize rotated images?Obnubilate
I doubt it would work very well for that. You'd probably want to encode the images for each rotation to maximize pertinent matches.Heliogravure

A picture has many features, so unless you narrow yourself to one, like average brightness, you are dealing with an n-dimensional problem space.

If I asked you to assign a single integer to the cities of the world, so I could tell which ones are close, the results wouldn't be great. You might, for example, choose time zone as your single integer and get good results with certain cities. However, a city near the north pole and another city near the south pole can also be in the same time zone, even though they are at opposite ends of the planet. If I let you use two integers, you could get very good results with latitude and longitude. The problem is the same for image similarity.

All that said, there are algorithms that try to cluster similar images together, which is effectively what you're asking for. This is what happens when you do face detection with Picasa. Even before you identify any faces, it clusters similar ones together so that it's easy to go through a set of similar faces and give most of them the same name.

There is also a technique called Principal Component Analysis, which lets you reduce n-dimensional data down to any smaller number of dimensions. So a picture with n features could be reduced to one feature. However, this is still not the best approach for comparing images.
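For illustration, here is a toy PCA that projects points onto their first principal component using power iteration (stdlib only; real code would use a linear-algebra library, and the function name is my own):

```python
import math

def first_principal_component(points, iters=100):
    """Return one score per point: its coordinate along the direction of
    greatest variance, found by power iteration on the covariance matrix."""
    n, d = len(points), len(points[0])
    means = [sum(p[j] for p in points) / n for j in range(d)]
    centered = [[p[j] - means[j] for j in range(d)] for p in points]
    # Covariance matrix of the centered data.
    cov = [[sum(row[a] * row[b] for row in centered) / n for b in range(d)]
           for a in range(d)]
    # Power iteration converges to the dominant eigenvector.
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w)) or 1.0
        v = [x / norm for x in w]
    return [sum(row[j] * v[j] for j in range(d)) for row in centered]
```

Note what this buys and what it loses: the points get a sensible one-number ordering along the axis of greatest variance, but anything varying along the other axes is collapsed, which is exactly why a single score misbehaves for images.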

Inchworm answered 30/6, 2009 at 23:23 Comment(2)
It's a moot point, but you CAN use a single integer to represent the combination of any number of features, if, for instance, feature x = 2 and feature y = 3 and feature z = 5 and feature aa = 7, et cetera, then the power to which that prime base was raised in the factorized form of a single integer would be the value of the feature for that specific image. Again, a moot point because the size of the number would be absurd. Although that size could be further reduced... we're just talking about structured data.Vagina
True. But the real point is to arrange the numbers so that similar images are close together numerically. In spite of what I said above, this is possible. In short, you could solve the Traveling Salesperson problem to find a minimum (or near-minimum) path through the images in n-dimensional space (where n is the number of features you want to use to compare the images). But that is expensive.Inchworm

There's a C library ("libphash" - http://phash.org/) that calculates a "perceptual hash" of an image and lets you detect similar images by comparing hashes, so you don't have to compare each image directly against every other image. Unfortunately, it didn't seem to be very accurate when I tried it.
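To show the idea behind perceptual hashing without the library: libphash's pHash is DCT-based, but its simpler cousin, the "average hash", can be sketched in a few lines (assuming you have already shrunk the image to a small grayscale grid):

```python
def average_hash(gray):
    """Build an integer hash from a small grayscale image (list of rows):
    each bit records whether that pixel is above the image's mean value."""
    pixels = [p for row in gray for p in row]
    mean = sum(pixels) / len(pixels)
    return sum(1 << i for i, p in enumerate(pixels) if p > mean)

def hamming(h1, h2):
    """Number of differing bits; a small distance suggests near-duplicates."""
    return bin(h1 ^ h2).count("1")
```

Comparing hashes is then a cheap XOR-and-popcount, which is what makes the "don't compare every pixel" claim work in practice.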

Helicograph answered 24/6, 2009 at 0:33 Comment(0)

You have to decide what is "similar." Contrast? Hue?

Is a picture "similar" to the same picture upside-down?

I bet you can find a lot of "close calls" by breaking each image into a 4x4 grid and getting an average color for each grid cell. You'd have sixteen scores per image. To judge similarity, you would just take a sum of squared differences between images.
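That suggestion is simple enough to sketch directly (assuming the image is a list of rows of (r, g, b) tuples; the function names are my own):

```python
def grid_signature(img, grid=4):
    """Average (r, g, b) per cell of a grid x grid partition of the image.
    Returns grid*grid averaged cells, row-major."""
    h, w = len(img), len(img[0])
    sig = []
    for gy in range(grid):
        for gx in range(grid):
            cell = [img[y][x]
                    for y in range(gy * h // grid, (gy + 1) * h // grid)
                    for x in range(gx * w // grid, (gx + 1) * w // grid)]
            n = len(cell)
            sig.append(tuple(sum(px[c] for px in cell) / n for c in range(3)))
    return sig

def ssd(sig_a, sig_b):
    """Sum of squared differences between two grid signatures; 0 = identical."""
    return sum((a - b) ** 2
               for ca, cb in zip(sig_a, sig_b)
               for a, b in zip(ca, cb))
```

A low SSD flags a candidate pair for closer inspection; it won't catch rotations or large shifts, but it is cheap and easy to tune by changing the grid size.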

I don't think a single hash makes sense, unless it's against a single concept like hue, or brightness, or contrast.

Here's your idea:

0299393
0599483
0499994 <- possible dupe
0499999 <- possible dupe
1002039
4995994
6004994

First of all, I'm going to assume these are decimal numbers that are R*(2^16)+G*(2^8)+B, or something like that. Obviously that's no good because red is weighted inordinately.

Moving into HSV space would be better. You could spread the bits of HSV out into the hash, or you could just use H or S or V individually, or you could have three hashes per image.


One more thing: if you do weight R, G, and B, weight green highest, then red, then blue, to match human visual sensitivity.
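Those sensitivity weights are standardized; for instance, the Rec. 601 luma formula:

```python
def luma(r, g, b):
    """Perceptual brightness using the Rec. 601 weights: green counts most,
    then red, then blue, matching human visual sensitivity."""
    return 0.299 * r + 0.587 * g + 0.114 * b
```

Using this instead of a plain (r + g + b) / 3 average keeps a bright-green and a bright-blue image from hashing to the same brightness score.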

Cufic answered 23/6, 2009 at 21:34 Comment(0)

In the age of web services you could try http://tineye.com

Baptize answered 2/7, 2009 at 6:16 Comment(3)
The code behind tineye seems to be exactly what the questioner is after, but I don't think it's very useful as a web service, since there's no (obvious) way to give it two images and ask "are these the same?": the second image would have to be on a web page, and indexed by tineyeImponderable
Maybe they provide an API for business users? They should be contacted about that.Baptize
There is a commercial API that provides exactly that: services.tineye.com/MatchEngine.Witting

The question Good way to identify similar images? seems to provide a solution.

Odin answered 10/7, 2010 at 0:3 Comment(0)

I assume that other duplicate-image search software performs an FFT on the images and stores the values of the different frequencies as vectors:

Image1 = (u1, u2, u3, ..., un)
Image2 = (v1, v2, v3, ..., vn)

and then you can compare two images for equality by computing the distance between the weight vectors of the two images:

distance = Sqrt(
     (u1-v1)^2 +
     (u2-v2)^2 +
     (u3-v3)^2 +
     ...
     (un-vn)^2);
Conlen answered 23/6, 2009 at 21:25 Comment(1)
Most natural images have very similar frequency content, so I doubt that this would be a very good metric.Maggee

One solution is to perform an RMS/RSS comparison on every pair of pictures, as required for a bubble sort. Second, you could perform an FFT on each image and average along some axis to retrieve a single integer per image, which you would use as an index to sort by. You might consider doing the comparison on a resized version (25%, 10%) of the original, depending on how small a difference you choose to ignore and how much speedup you require. Let me know if these solutions are interesting, and we can discuss, or I can provide sample code.

Ari answered 26/6, 2009 at 11:55 Comment(2)
FFT only provides you with color information and no information about position. Resizing ignores all features below a given size regardless of impact on the resulting image. A grey image and a checkerboard can be identical under that measure. A wavelet approach (Daubechies, Haar, etc.) has the benefits of providing both position and color information by trading off the proportion of positional and color information in each data point.Socha
No, the FFT of an image contains all the spatial information of the original. You can reconstruct the original from the FFT. homepages.inf.ed.ac.uk/rbf/HIPR2/fourier.htm A histogram, however, which may be what you were thinking of, does not.Ari

Most modern approaches to near-duplicate image detection use interest point detection and descriptors that describe the area around such points. SIFT is often used. You can then quantize the descriptors and use the clusters as a visual-word vocabulary.

The ratio of the visual words two images have in common to all the visual words of those images then estimates the similarity between them. There are a lot of interesting articles; one of them is Near Duplicate Image Detection: minHash and tf-idf Weighting
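The ratio described above is just the Jaccard index over quantized descriptor IDs. A minimal sketch (the descriptors would come from e.g. SIFT quantized against a learned vocabulary; here the word IDs are simply given):

```python
def visual_word_similarity(words_a, words_b):
    """Jaccard similarity between two images' visual-word ID lists:
    |shared words| / |all words|, in [0, 1]."""
    a, b = set(words_a), set(words_b)
    if not (a | b):
        return 0.0
    return len(a & b) / len(a | b)
```

minHash, mentioned in the cited paper, is essentially a way to estimate this same ratio without storing or intersecting the full word sets.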

Tucson answered 17/1, 2010 at 17:34 Comment(0)

For example, using the IMMI image-processing extension for RapidMiner, you can examine many different ways to measure similarity between images: http://spl.utko.feec.vutbr.cz/en/component/content/article/46-image-processing-extension-for-rapidminer-5

By selecting a method and defining a threshold, you can measure similarity.

Diva answered 24/1, 2012 at 14:31 Comment(0)
