Non-empty buckets in LSH
I'm reading this survey about LSH, in particular citing the last paragraph of section 2.2.1:

To improve the recall, L hash tables are constructed, and the items lying in the L (L′, L′ < L) hash buckets h_1(q), ..., h_L(q) are retrieved as near items of q for randomized R-near neighbor search (or randomized c-approximate R-near neighbor search). To guarantee the precision, each of the L hash codes, y_i, needs to be a long code, which means that the total number of the buckets is too large to index directly. **Thus, only the nonempty buckets are retained by resorting to conventional hashing of the hash codes h_l(x).**

I have 3 questions:

  1. The bold sentence is not clear to me: what does "resorting to conventional hashing of the hash codes h_l(x)" mean?
  2. Still about the bold sentence, I'm not sure I understand the problem: I do see that h_l(x) can be a long code, so the number of possible buckets can be huge. For example, if h_l(x) is a binary code and length is h_l(x)'s length, then in total we have L*2^length possible buckets (since we use L hash tables). Is that correct?
  3. Last question: once we find which bucket the query vector q belongs to, do we have to use the original vector q and the original distance metric to find the nearest neighbor? For example, suppose the original vector q has 128 dimensions, q = [1, 0, 12, ..., 14.3]^T, and our application uses the Euclidean distance. Now suppose that the hash function used in LSH (with L = 1 for simplicity) maps this vector to a 20-dimensional binary code y = [0100...11]^T in order to decide which bucket to assign q to. So y is the index of a bucket B, which already contains 100 vectors. Now, to find the nearest neighbor, we have to compare q with all the other 100 128-dimensional vectors using the Euclidean distance. Is this correct?
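As a sketch of what question 3 describes, here is a minimal, hypothetical example using random-hyperplane LSH with a single table (L = 1). The dimensions 128 and 20 match the question; the data, names, and the choice of random hyperplanes are assumptions for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_bits = 128, 20  # original dimension and code length, as in the question

# Random-hyperplane LSH: each bit is the sign of a projection onto a random plane.
planes = rng.standard_normal((n_bits, d))

def hash_code(x):
    """Map a d-dimensional vector to a 20-bit binary code (the bucket index y)."""
    return tuple((planes @ x > 0).astype(int))

# Build one hash table: bucket code -> list of original (full-precision) vectors.
data = rng.standard_normal((1000, d))
table = {}
for x in data:
    table.setdefault(hash_code(x), []).append(x)

def query(q):
    """Find q's bucket, then re-rank its contents with the exact Euclidean distance."""
    candidates = table.get(hash_code(q), [])
    if not candidates:
        return None
    dists = [np.linalg.norm(q - c) for c in candidates]
    return candidates[int(np.argmin(dists))]
```

Note that the binary code y is used only to select the bucket; the final comparison inside the bucket is done with the original 128-dimensional vectors and the original metric, exactly as the question suggests.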
Silage answered 13/6, 2016 at 6:39

The approach they use to improve recall constructs more hash tables and essentially stores multiple copies of the ID of each reference item, hence the space cost is larger [4]. If there are a lot of empty buckets, which increases the retrieval cost, a double-hash scheme or a fast search algorithm in the Hamming space can be used to retrieve the hash buckets quickly. I think in this case they are using a double-hash (conventional hashing) scheme to retrieve the non-empty buckets.
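A minimal sketch of that double-hash idea, with hypothetical names and sizes: the long LSH code (whose key space of 2^length slots is far too large to allocate directly) is mapped by an ordinary, non-locality-sensitive hash function into a small table, so only the nonempty buckets actually consume memory:

```python
n_bits = 64        # long LSH code: 2**64 direct slots would be infeasible
table_size = 1024  # conventional hash table with O(n) slots, independent of 2**n_bits

# One dict per slot; the full code is kept as the key to resolve collisions,
# so distinct LSH buckets that land in the same slot stay separate.
buckets = [{} for _ in range(table_size)]

def conventional_hash(code):
    """Ordinary hash mapping a long LSH code to a small table slot."""
    return hash(code) % table_size

def insert(code, item_id):
    buckets[conventional_hash(code)].setdefault(code, []).append(item_id)

def lookup(code):
    # Empty LSH buckets are simply absent: they cost no memory at all.
    return buckets[conventional_hash(code)].get(code, [])

insert(0xDEADBEEF, "item-1")
insert(0xDEADBEEF, "item-2")
```

So "conventional hashing of the hash codes" just means treating the long LSH code itself as a key in an ordinary hash table, rather than as a direct array index.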

Number of buckets/memory cells [1][2][3]: O(nL)

References:

[1] http://simsearch.yury.name/russir/03nncourse-hand.pdf

[2] http://joyceho.github.io/cs584_s16/slides/lsh-12.pdf

[3] https://users.soe.ucsc.edu/~niejiazhong/slides/kumar.pdf

[4] http://research.microsoft.com/en-us/um/people/jingdw/Pubs%5CLTHSurvey.pdf

Neisa answered 1/8, 2016 at 9:56
