Using pHash to search against a huge image database: what is the best approach?

Asked 15/8, 2013 at 16:56 Answered 6/12, 2014 at 6:6

Solved java image-processing duplicates cbir phash

I need to search a huge image database for possible duplicates using pHash. Assume every image record already has a hash code generated with pHash.

Now I have to generate the pHash for a new image and compare it against the existing records. But as I understand it, the hash comparison is NOT as straightforward as

hash1 - hash2 < threshold

It looks like I need to pass both hash codes to a pHash API to do the matching. So I would have to retrieve all hash codes from the DB in batches and compare them one by one using the pHash API.
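For context, here is a minimal sketch of what that pairwise comparison usually amounts to, assuming the hashes are the 64-bit values produced by pHash's DCT image hash and stored as a Java `long`. Such hashes are compared by Hamming distance (the number of differing bits), not by arithmetic subtraction. The class and method names below are hypothetical, and other pHash hash types (for example the radial hash) use different comparisons.

```java
/** Illustrative helper, not part of pHash: compares two 64-bit perceptual hashes. */
final class PHashCompare {

    /** Number of bit positions in which the two hashes differ. */
    static int hammingDistance(long hash1, long hash2) {
        // XOR leaves a 1 wherever the bits differ; bitCount counts those 1s.
        return Long.bitCount(hash1 ^ hash2);
    }

    /** True when the hashes differ in at most `threshold` bits (threshold is an assumption to tune). */
    static boolean isLikelyDuplicate(long hash1, long hash2, int threshold) {
        return hammingDistance(hash1, hash2) <= threshold;
    }
}
```

Even with a comparison this cheap, checking each new image against every stored hash is still a linear scan per query, which is the cost the question is about.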

But this does not look like the best approach if I have about 1000 images in a queue to be compared against millions of already existing images.

I need to know the following:

Is my understanding of how to use pHash to compare against an existing image DB correct?
Is there a better approach to handle this (without using CBIR libraries like LIRE)?
I heard that there is an algorithm called dHash that can also be used for image comparison with hash codes. Are there any Java libraries for it, and can it be used together with pHash to optimize this kind of large, repeated image-processing task?

Thanks in advance.

Pemmican answered 15/8, 2013 at 16:56 Comment(0)

I think some part of this question is discussed on the pHash support forum.

You will need to use the MVP-tree (mvptree) storage mechanism:

http://lists.phash.org/htdig.cgi/phash-support-phash.org/2011-May/000122.html and http://lists.phash.org/htdig.cgi/phash-support-phash.org/2010-October/000103.html

Depending on your definition of "huge", a good solution here is to implement a BK-Tree hash tree (human readable description).

I'm working on a similar project, and I implemented a BK-tree in Cython. It's fairly performant (searching with a Hamming distance of 2 takes less than 50 ms for a 12 million item dataset, and touches ~0.01-0.02% of the tree nodes).

Larger scale searches (edit distance of 8) take longer (~500 ms) and touch about 5% of the tree nodes.

This is with a 64-bit hash size.
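To make the idea concrete, here is a rough Java sketch of a BK-tree over 64-bit hashes. It is not the Cython implementation mentioned above, just a minimal illustration under the assumption that hashes are stored as `long` values and compared by Hamming distance. The class name `BkTree` and its methods are made up for this example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Minimal BK-tree keyed by Hamming distance between 64-bit hashes (illustrative only). */
final class BkTree {

    private static final class Node {
        final long hash;
        // Each child is stored under its Hamming distance to this node's hash.
        final Map<Integer, Node> children = new HashMap<>();

        Node(long hash) {
            this.hash = hash;
        }
    }

    private Node root;

    static int hamming(long a, long b) {
        return Long.bitCount(a ^ b);
    }

    /** Inserts a hash; an exact duplicate of a stored hash is ignored. */
    void add(long hash) {
        if (root == null) {
            root = new Node(hash);
            return;
        }
        Node current = root;
        while (true) {
            int d = hamming(hash, current.hash);
            if (d == 0) {
                return; // already present
            }
            Node child = current.children.get(d);
            if (child == null) {
                current.children.put(d, new Node(hash));
                return;
            }
            current = child; // descend along the edge labelled d
        }
    }

    /** Returns all stored hashes within maxDistance bits of the query. */
    List<Long> search(long query, int maxDistance) {
        List<Long> matches = new ArrayList<>();
        if (root == null) {
            return matches;
        }
        Deque<Node> stack = new ArrayDeque<>();
        stack.push(root);
        while (!stack.isEmpty()) {
            Node node = stack.pop();
            int d = hamming(query, node.hash);
            if (d <= maxDistance) {
                matches.add(node.hash);
            }
            // Triangle inequality: a match can only sit under an edge whose
            // label lies in [d - maxDistance, d + maxDistance].
            for (Map.Entry<Integer, Node> edge : node.children.entrySet()) {
                int label = edge.getKey();
                if (label >= d - maxDistance && label <= d + maxDistance) {
                    stack.push(edge.getValue());
                }
            }
        }
        return matches;
    }
}
```

A real index would store an image ID (or a list of IDs) alongside each hash rather than the bare hash, and would need to be rebuilt or persisted separately from the database. The pruning step is what keeps small-radius searches from visiting most of the tree; as the search radius grows, fewer subtrees can be skipped, which matches the slower timings reported for larger distances.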

Gar answered 6/12, 2014 at 6:6 Comment(0)
