Locality Sensitive Hash Implementation? [closed]
Asked Answered
M

4

20

Are there any relatively simple to understand (and simple to implement) locality-sensitive hash examples in C/C++/Java/C#?

I'd like to learn more about the concept and so want to try an implementation on a few text files just to see how it works, so I don't need anything high-performance or anything... just an example of a hash function that returns similar hashes for similar inputs. I can learn more from it by example afterwards. :)

Mccormick answered 24/4, 2011 at 10:10 Comment(0)
F
9

For strings you can use approximate matching algorithm.

If the strings are equidistant from a reference string then chances are that they are similar to each other. And there you go you have a locality senitive hash implementation for strings.

You can create different hash buckets for a range of distances.

EDIT: You can try other variations of string distance. A simpler algorithm would just return no. of common characters between two strings.

Fourthly answered 24/4, 2011 at 11:2 Comment(7)
+1 it seems to be exactly what I'm looking for, I'll look at it a bit more. Thanks a lot! :)Mccormick
@Hasan: I'm a bit confused... the strings "abcd" and "xyzw" are both a distance 4 from a random string like "6pGO", but they're completely different. How does that work? (It's 4 for pretty much any random string of length 4...)Mccormick
@Mehrdad this algorithm tells you how many changes you need to transform your input string into the reference (random) string. You can also try a simpler algorithm in which distance is no. of the characters common between the random string and your input string.Fourthly
@Hasan: Yes I understood what the algorithm does, but the problem is that it doesn't work well at all, like I showed. Do you happen to know of any better algorithms? (The one you just mentioned seems to be great, I'll think about it.)Mccormick
Such a hashing solution would work well similar strings but for irrelevant strings it will give you not useful results but if you look at it in that case you don't have similar strings anyway so any algorithm won't work. In the first case all irrelevant ones would fall under hashcode 4 with in that you can then maintain another hashlist for further breaking it down.Fourthly
You can get better results by generating n random strings each representing a single bucket (list). You find distance of input string with each bucket head (random string) and insert the string in the bucket with minimum distance. This is called clustering. Later on you can inspect the buckets individually and swap the head with item in the list which represents majority of strings in that list. For more algorithms try en.wikipedia.org/wiki/String_metricFourthly
Hi @MuhammadHasanKhan, thanks for your insightful answers. Can i use the same approach in this problem #54512095Tarsometatarsus
A
6

Well there is an excellent into article at MSDN blogs here: http://blogs.msdn.com/b/spt/archive/2008/06/11/locality-sensitive-hashing-lsh-and-min-hash.aspx

Also there is at least once C++ library which you can inspect the source code of here: http://sourceforge.net/projects/lshkit/

Atmospherics answered 24/4, 2011 at 10:45 Comment(1)
Same issue as with @Mehran's link, but thanks anyway, +1.Mccormick
S
2

There is also a Java Implementation on Hadoop. it does a good job on documents.

it's called LikeLike

Currently Likelike supports only Min-Wise independent permutations. Min-Wise independent permutations is applied to the recommendation of Google News

Streamlined answered 24/4, 2011 at 13:37 Comment(1)
Thanks for the link, but it seems a bit too complicated for my case... I'm trying to avoid large libraries until I understand a simple one. :-) Thanks anyway, +1.Mccormick
O
2

I realise you explicitly asked for C/C++/C#, but there is a Python port of the nilsimsa hash which might be easier to grok than other, larger libraries.

Orgasm answered 27/5, 2011 at 15:57 Comment(1)
+1 that's pretty great, I'm also learning Python so it kills two birds with one stone. Thanks for sharing this! :)Mccormick

© 2022 - 2024 — McMap. All rights reserved.