Generating Random Hash Functions for LSH Minhash Algorithm

Asked 10/7, 2014 at 12:11 Answered 10/7, 2014 at 20:27

Solved java algorithm hash locality-sensitive-hash minhash

I'm programming a minhashing algorithm in Java that requires me to generate an arbitrary number of random hash functions (240 in my case) and run any number of integers through them (2000 at the moment).

In order to do that, I've been generating random numbers a, b, and c (from the range 1 - 2001) for each of the 240 hash functions. Then, my hash function returns h = ((a*x) + b) % c, where h is the return value and x is one of the integers run through it.

Is this an efficient implementation of random hashing, or is there a more common/acceptable way to do it?

This post asked a similar question, but I'm still somewhat confused by the wording of the answer: "Minhash implementation how to find hash functions for permutations"

Dolomites answered 10/7, 2014 at 12:11 Comment(2)

Computers are never random. They're pseudorandom. This sounds like an academic issue, so it might be worth noting the distinction. – Colleague 10/7, 2014 at 12:14

Also try out bottom-k hashing, where you just use one hash function, but keep the k smallest values, rather than only one. – Heterogamete 6/10, 2016 at 12:25

When I was working with Bloom filters a few years ago, I ran across a paper that describes how to generate multiple hash functions very simply, with a minimum of code. The method it describes works very well. See Less Hashing, Same Performance: Building a Better Bloom Filter (Kirsch and Mitzenmacher).

The basic idea is to create two hash functions, call them h1 and h2, with which you can then simulate multiple hash functions, g1 through gk, using the formula:

gi(x) = h1(x) + i*h2(x)

Here i varies from 1 to k (the number of hash functions you want). In a Bloom filter, each gi(x) is then reduced modulo the table size.
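To make the formula concrete, here is a minimal Java sketch of the two-hash trick. Everything named here is an assumption for illustration, not something from the paper: the class name DoubleHashing, the choice of a MurmurHash3-style integer mixer for h1, deriving h2 from h1 with a different seed, and the modulus m.

```java
import java.util.Arrays;

// Sketch only: simulates k hash functions from two base hashes,
// g_i(x) = (h1(x) + i * h2(x)) mod m, for i = 1..k.
final class DoubleHashing {

    // Assumed base hash: the 32-bit finalizer used by MurmurHash3.
    static int h1(int x) {
        x ^= x >>> 16;
        x *= 0x85ebca6b;
        x ^= x >>> 13;
        x *= 0xc2b2ae35;
        x ^= x >>> 16;
        return x;
    }

    // Assumed second hash: same mixer on a perturbed input,
    // forced odd so that it is never zero.
    static int h2(int x) {
        return h1(x ^ 0x9e3779b9) | 1;
    }

    // Returns g_1(x) .. g_k(x), each in the range [0, m).
    static int[] hashes(int x, int k, int m) {
        long a = h1(x);
        long b = h2(x);
        int[] g = new int[k];
        for (int i = 1; i <= k; i++) {
            // long arithmetic avoids int overflow; floorMod keeps the result non-negative
            g[i - 1] = (int) Math.floorMod(a + (long) i * b, (long) m);
        }
        return g;
    }

    public static void main(String[] args) {
        // Five simulated hash values for the key 42 in a table of size 1024.
        System.out.println(Arrays.toString(hashes(42, 5, 1024)));
    }
}
```

Only two real hash computations happen per key; the other values are cheap arithmetic. Whether this is good enough for minhash signatures, as opposed to Bloom filters, is something you would want to check on your own data.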

The paper is well worth reading, even if you decide not to implement the idea. Although after reading it I can't imagine not wanting to implement it. It made my Bloom filter code a whole lot more tractable and didn't negatively impact performance.

Relate answered 10/7, 2014 at 20:27 Comment(0)

So the method that I described above was almost correct. The numbers a and b should be randomly generated for each hash function. However, c should not be random: it needs to be a prime number that is slightly larger than the maximum possible value of x. Once those numbers have been chosen, computing the hash value as h = ((a*x)+b) % c is the standard, accepted way to generate hash functions.

Also, a and b should be random numbers from the range 1 to c-1.
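For anyone who wants to see it in code, here is a rough sketch of that scheme. The names (MinHashFamily, signature) are made up for illustration, and the prime 2003 is an assumption chosen because it is the first prime above the 1 - 2001 range from the question. Inputs are assumed to be non-negative and smaller than the prime.

```java
import java.util.Arrays;
import java.util.Random;

// Sketch only: a family of hash functions h_i(x) = (a_i * x + b_i) % P
// used to build a minhash signature.
final class MinHashFamily {

    // Assumed prime modulus, slightly larger than the largest input value.
    static final long P = 2003;

    final long[] a;
    final long[] b;

    MinHashFamily(int numHashes, long seed) {
        Random rng = new Random(seed);
        a = new long[numHashes];
        b = new long[numHashes];
        for (int i = 0; i < numHashes; i++) {
            // a and b are drawn from 1 .. P-1, as described above
            a[i] = 1 + rng.nextInt((int) P - 1);
            b[i] = 1 + rng.nextInt((int) P - 1);
        }
    }

    // Value of the i-th hash function at x (0 <= x < P assumed).
    long hash(int i, long x) {
        return (a[i] * x + b[i]) % P;
    }

    // Minhash signature: for each hash function, the minimum over the set.
    long[] signature(int[] set) {
        long[] sig = new long[a.length];
        Arrays.fill(sig, Long.MAX_VALUE);
        for (int x : set) {
            for (int i = 0; i < sig.length; i++) {
                sig[i] = Math.min(sig[i], hash(i, x));
            }
        }
        return sig;
    }

    public static void main(String[] args) {
        MinHashFamily family = new MinHashFamily(240, 12345L);
        long[] sig = family.signature(new int[] {3, 17, 256, 1999});
        System.out.println(sig.length + " signature entries");
    }
}
```

Two sets can then be compared by counting the positions where their signatures agree; the fraction of matching positions estimates their Jaccard similarity.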

Dolomites answered 10/7, 2014 at 14:2 Comment(0)
