You could use n random projections to split the pHash space into 2^n buckets; similar images are then most likely found in the same bucket. You could even XOR the hash with all 64 integers of Hamming weight 1 to check the neighboring buckets conveniently and be sure you'd find all approximate matches.
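The XOR trick above can be sketched in a few lines of Python (`neighbor_keys` is a hypothetical helper name, not anything from pHash itself):

```python
def neighbor_keys(h, bits=64):
    """Yield h itself plus every key at Hamming distance 1 from it."""
    yield h
    for i in range(bits):
        yield h ^ (1 << i)  # flip exactly one bit

# A 64-bit key expands to itself plus 64 single-bit flips:
keys = set(neighbor_keys(0b1011))
assert len(keys) == 65
```

Looking up each of these 65 keys in a hash table covers every bucket a distance-1 match could occupy.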
This is efficient only if you are interested in images with almost identical hashes (small Hamming distance). If you want to tolerate larger Hamming distances (such as 8), it gets tricky to find all matches both efficiently and accurately. I got very good performance by scanning through the whole table on a GPU; even my 3-year-old laptop's GT 650M could check 700 million hashes per second!
Edit 1: You can think of a 64-bit hash as a single corner of a 64-dimensional cube; the math is easier if you normalize the corner coordinates to -1 and 1 (this way the cube's center is at the origin). You can express m images as a matrix M of size m x 64 (one row per image, one bit of the hash per column).
The simplest way to split this into 2^n distinct groups is to generate n 64-dimensional vectors v_1, v_2, ..., v_n (picking each vector element from the normal distribution N(0,1)); this can be expressed as a matrix V of size 64 x n (one column per vector). You could enforce orthogonality as mentioned at Random projection, but I'll skip it here.
Now, by calculating A = (M * V) > 0 you get an m x n matrix (one image per row, one projection per column). Next, convert each row's binary representation to a number; you get 2^n different possibilities, and similar hashes are most likely to end up in the same bucket.
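The whole projection step can be sketched with NumPy; the matrix names M, V, and A follow the text, and the hashes here are random stand-ins rather than real pHashes:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 1000, 8                 # m images, n projections -> 2^n = 256 buckets

# Random 64-bit hashes as bit rows, mapped from {0, 1} to {-1, +1}.
bits = rng.integers(0, 2, size=(m, 64))
M = 2 * bits - 1               # m x 64, one row per image

# n random projection vectors with elements drawn from N(0, 1).
V = rng.normal(size=(64, n))   # 64 x n, one column per vector

# Project and threshold: A = (M * V) > 0, an m x n binary matrix.
A = (M @ V) > 0

# Pack each row's n bits into a bucket number in [0, 2^n).
buckets = A @ (1 << np.arange(n))
assert buckets.shape == (m,)
assert buckets.min() >= 0 and buckets.max() < 2 ** n
```

Identical hashes always land in the same bucket, and hashes at small Hamming distance usually do, since flipping one hash bit only moves each projection by a small amount.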
This algorithm works for any orthogonal representation of data (such as SURF features), not just binary strings. I'm sure there are simpler (and computationally more efficient) algorithms for binary hashes but this is one way to implement random projections.
I suggested XORing because if two images don't have identical hashes, they aren't guaranteed to end up in the same bucket. By checking all possible small deviations from the original hash you can see which other bins could hold likely matches.
In a way this is similar to how a computer game engine might split a 2D map into a grid of cells of size x; then, to find all units within a radius x of a point, you only need to check 9 cells (the one containing the point plus the 8 surrounding cells) to get a 100% accurate answer.
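That grid analogy looks like this in code (the function names `cell` and `nearby_cells` are just illustrative):

```python
def cell(p, size):
    """Map a 2D point to the integer coordinates of its grid cell."""
    return (int(p[0] // size), int(p[1] // size))

def nearby_cells(c):
    """The cell itself plus its 8 neighbors: enough for any radius <= size query."""
    cx, cy = c
    return [(cx + dx, cy + dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1)]

# 9 candidate cells cover every unit within one cell-size of the query point.
assert len(nearby_cells(cell((5.0, 5.0), 4.0))) == 9
```

The bucketing scheme plays the role of the grid, and the XOR trick plays the role of visiting the 8 surrounding cells.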