Currently I'm studying how to find a nearest neighbor using Locality-sensitive hashing. However while I'm reading papers and searching the web I found two algorithms for doing this:
1- Use L number of hash tables with L number of random LSH functions, thus increasing the chance that two documents that are similar to get the same signature. For example if two documents are 80% similar, then there's an 80% chance that they will get the same signature from one LSH function. However if we use multiple LSH functions, then there's a higher chance to get the same signature for the documents from one of the LSH functions. This method is explained in wikipedia and I hope my understanding is correct:
http://en.wikipedia.org/wiki/Locality-sensitive_hashing#LSH_algorithm_for_nearest_neighbor_search
2- The other algorithm uses a method from a paper (section 5) called: Similarity Estimation Techniques from Rounding Algorithms by Moses S. Charikar. It's based on using one LSH function to generate the signature and then apply P permutations on it and then sort the list. Actually I don't understand the method very well and I hope if someone could clarify it.
My main question is: why would anyone use the second method rather than the first method? As I find it's easier and faster.
I really hope someone can help!!!
EDIT: Actually I'm not sure if @Raff.Edward were mixing between the "first" and the "second". Because only the second method uses a radius and the first just uses a new hash family g composed of the hash family F. Please check the wikipedia link. They just used many g functions to generate different signatures and then for each g function it has a corresponding hash table. In order to find the nearest neighbor of a point you just let the point go through the g functions and check the corresponding hash tables for collisions. Thus how I understood it as more function ... more chance for collisions.
I didn't find any mentioning about radius for the first method.
For the second method they generate only one signature for each feature vector and then apply P permutations on them. Now we have P lists of permutations where each contains n signatures. Now they then sort each list from P. After that given a query point q, they generate the signature for it and then apply the P permutations on it and then use binary search on each permuted and sorted P list to find the most similar signature to the query q. I concluded this after reading many papers about it, but I still don't understand why would anyone use such a method because it doesn't seem fast in finding the hamming distance!!!!
For me I would simply do the following to find the nearest neighbor for a query point q. Given a list of signatures N, I would generate the signature for the query point q and then scan the list N and compute the hamming distance between each element in N and the signature of q. Thus I would end up with the nearest neighbor for q. And it takes O(N)!!!