I am implementing a near-neighbor search application that will find similar documents. So far I have read a good portion of the LSH-related material (the theory behind LSH is somewhat confusing and I am not able to comprehend it 100% yet).
My code is able to compute the signature matrix using the minhash functions (I am close to the end). I also apply the banding strategy to the signature matrix. However, I do not understand how to hash the signature vectors (of columns) in a band into buckets.
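To make the setup concrete, here is a minimal sketch of the banding step only. It is not my actual code: the function name `split_into_bands`, the list-of-rows layout, and the toy values are illustrative assumptions. Each band yields one short vector per document, and those vectors are what I need to hash into buckets.

```python
from typing import List, Tuple


def split_into_bands(signatures: List[List[int]], bands: int) -> List[List[Tuple[int, ...]]]:
    """Split a signature matrix into bands.

    signatures[i][j] is assumed to be the i-th minhash value of document j.
    Returns one list per band; each list holds one vector (tuple) per document.
    """
    n_rows = len(signatures)
    if bands <= 0 or n_rows % bands != 0:
        raise ValueError("number of rows must be a positive multiple of bands")
    rows_per_band = n_rows // bands
    n_docs = len(signatures[0])

    banded = []
    for b in range(bands):
        # Rows belonging to this band.
        block = signatures[b * rows_per_band:(b + 1) * rows_per_band]
        # For each document, read its column within the block as one vector.
        banded.append([tuple(row[j] for row in block) for j in range(n_docs)])
    return banded


# Toy signature matrix: 4 minhash rows, 3 documents (made-up values).
signatures = [
    [1, 1, 4],
    [2, 2, 5],
    [3, 7, 6],
    [0, 8, 9],
]
banded = split_into_bands(signatures, bands=2)
```

In this toy example, documents 0 and 1 have the same vector in the first band but different vectors in the second band.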
My last question may be the most important one, but I have to ask some introductory questions first:
q1: Will the hash function map only identical vectors to the same bucket? (Assuming we have enough buckets.)
q2: Should the hash function map similar vectors to the same bucket? If yes, what is the degree/definition of this similarity, given that I am not computing a comparison but hashing?
q3: Depending on the questions above, what kind of hash table algorithm should I use?
q4: I think my weakest point is that I have no idea how to generate a hash function that takes vectors as input and selects a bucket as output. I can implement one myself depending on q1 and q2... Any suggestions on generating hash functions for LSH bucketing?