How to hash vectors into buckets in Locality Sensitive Hashing (using jaccard distance)?

Asked 8/4, 2014 at 20:4 Answered 3/5, 2016 at 6:21

c hash machine-learning locality-sensitive-hash minhash

I am implementing a near-neighbor search application which will find similar documents. So far I have read a good portion of LSH related materials (theory behind LSH is some kind of confusing and I am not able to comphrened it 100% yet).

My code is able to compute the signature matrix using the minhash functions (I am close to the end). I also apply the banding strategy on the signature matrix. However I am not able to understand how to hash signature vectors (of columns) in a band into buckets.

My last question may be the most important one, but I have to ask some introduction questions:

q1: Will hash function map only the same vectors to the same bucket? (Assuming we have enough buckets)

q2: Should the hash function map the similar vectors to the same bucket? If yes, what is the degree/definition of this similarity, since I am not computing a comparison, but doing hashing.

q3: Depending on the questions above, what kind of hash table algorithm should I use?

q4: I think my weakiest point is that I have no idea how to generate a hash function that takes vectors as input and select a bucket as output. I can implement one by myself depending on q1 and q2... Any suggestions on generating hash functions for LSH bucketing?

Mckinney answered 8/4, 2014 at 20:4 Comment(0)

q1: You're not supposed to hash the entire vector, but parts of it. Say you have vectors of length 100 that represents each item, you can hash 5 slices of length 20.

q2: This is the main idea behind the entire thing: You measure similarity by comparing parts of things for equality. If you consider sentences in a text as vectors, it would be unlikely for 2 sentences to be exactly the same (have the same hash output). However, if you split them in parts and hashed the parts separately, the hashes for some matching individual words in the same positions, would return the same hash output, hence you could get an idea about the sentences' similarity.

The number and length of slices are important parameters here that affect the accuracy of your similarity result. Too many slices would give a lot of false positives while too few slices would only be able to identify the highest degrees of similarity.

You can find more info about this in the book "Mining of Massive Datasets" , found here: http://infolab.stanford.edu/~ullman/mmds.html

q3: You need a data structure that for every slice level, it can keep the results of the hash for every vector-slice, along with which vector produced it. Then, when want to find the similar neighbors of Vector X, you can check your data structure for each slice and see if the hash-output you got was also outputted by another vector.

q4: I'm not sure what you mean here. If you hash an object, you usually get as output a bitstring or an integer or a float, depending on the language. That's the bucket. If you get the same output with the same hash function on a different object, that means they hashed on the same bucket.

Filmy answered 9/4, 2014 at 17:23 Comment(0)

One simple way to generate a hash function for LSH is as follows:

For a given min-hash signature i for each band b, compute the sum of rows in the band, call it S_ib. Create a bucket for S_ib. For the complete set, the bucket will be appended with entries where the sum matches S_ib, otherwise a new bucket is generated.

from collections import defaultdict

....
LSHdictlist = [defaultdict(list) for b in range(bands)]
....
tempsum = np.int(np.sum(M[i,b*rowsinband:(b+1)*rowsinband]))
        LSHdictlist[b][tempsum].append(i)

you can use product instead of sum as well.

Emend answered 3/5, 2016 at 6:21 Comment(0)

Recommended topics

Hot tags