LSH Libraries in Java
Asked Answered
S

5

22

I'm looking for a lightweight Java library that supports Nearest Neighbor Searches by Locality Sensitive Hashing for nearly equally distributed data in a high dimensional (in my case 32) dataset with some hundreds of thousands data points.

It's totally good enough to get all entries in a bucket for a query. Which ones i really need could then be processed in a different way under consideration of some filter parameters my problem include.

I already found likelike but hope that there is something a bit smaller and without need of any other tools (like Apache Hadoop in the case of likelike).

Sawtelle answered 28/3, 2012 at 14:57 Comment(4)
Did you find anything? I was looking for the same with Euclidean distance as my metric for kNN.Eterne
Not really. But I think I'll have to come up with an implementation by myself. The question however is still how to choose good hash functions...Sawtelle
You can start with the hash function in the matlab implementation at ttic.uchicago.edu/~gregory/download.htmlEterne
@Sawtelle did you find a solution?Doublejointed
G
5

Maybe this one:

"TarsosLSH is a Java library implementing Locality-sensitive Hashing (LSH), a practical nearest neighbour search algorithm for multidimensional vectors that operates in sublinear time. It supports several Locality Sensitive Hashing (LSH) families: the Euclidean hash family (L2), city block hash family (L1) and cosine hash family. The library tries to hit the sweet spot between being capable enough to get real tasks done, and compact enough to serve as a demonstration on how LSH works."

Code can be found here

Gaffer answered 24/2, 2014 at 7:44 Comment(0)
P
2

Apache Spark has an LSH implementation: https://spark.apache.org/docs/2.1.0/ml-features.html#locality-sensitive-hashing (API).

After having played with both the tdebatty and TarsosLSH implementations, I'll likely use Spark, as it supports sparse vectors as input. The tdebatty requires a non-sparse array of booleans or int's, and the TarsosLSH Vector implementation is a non-sparse array of doubles. This severely limits the number of dimensions one can reasonably support.

This page provides links to more projects, as well as related papers and information: https://janzhou.org/lsh/.

Photoreconnaissance answered 8/1, 2019 at 17:44 Comment(0)
M
1

There is this one: http://code.google.com/p/lsh-clustering/

I haven't had time to test it but at least it compiles.

Maximamaximal answered 22/5, 2012 at 18:53 Comment(0)
C
1

Here another one: https://github.com/allenlsy/knn

It uses LSH for KNN. I'm currently investigating it's usability =)

Colville answered 3/5, 2013 at 12:12 Comment(0)
H
1

The ELKI data mining framework comes with an LSH index. It can be used with most algorithms included (anything that uses range or nn searches) and sometimes works very well.

In other cases, LSH doesn't seem to be a good approach. It can be quite tricky to get the LSH parameters right: if you choose some parameters too high, runtime grows a lot (all the way to a linear scan). If you choose them too low, the index becomes too approximative and loses to many neighbors.

It's probably the biggest challenge with LSH: finding good parameters, that yield the desired speedup and getting a good enough accuracy out of the index...

Harmaning answered 5/4, 2014 at 10:37 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.