Bloom filters and its multiple hash functions
Asked Answered
V

3

16

I'm implementing a simple Bloom filter as an exercise.

Bloom filters require multiple hash functions, which for practical purposes I don't have.

Assuming I want to have 3 hash functions, isn't it enough to just take the hash of the object I'm checking membership for, hashing it (with murmur3) and then add +1, +2, +3 (for the 3 different hashes) before hashing them again?

As the murmur3 function has a very good avalanche effect (really spreads out results) wouldn't this for all purposes be reasonable?

Pseudo-code:

function generateHashes(obj) {
  long hash = murmur3_hash(obj);
  long hash1 = murmur3_hash(hash+1);
  long hash2 = murmur3_hash(hash+2);
  long hash3 = murmur3_hash(hash+3);
  (hash1, hash2, hash3)
}

If not, what would be a simple, useful approach to this? I'd like to have a solution that would allow me to easily scale for more hash functions if needed be.

Varrian answered 11/2, 2018 at 0:58 Comment(0)
A
14

AFAIK, the usual approach is to not actually use multiple hash functions. Rather, hash once and split the resulting hash into 2, 3, or how many parts you want for your Bloom filter. So for example create a hash of 128 bits and split it into 2 hashes 64 bit each.

https://github.com/Claudenw/BloomFilter/wiki/Bloom-Filters----An-overview

Antinomian answered 27/7, 2018 at 7:58 Comment(2)
Best answer and link, thanks. This is what I was searching for.Calore
Before asking a new question, this seems to be related to what I need: I want to implement Bloom Filters with nvidia CUDA on gpus and for this purpose I need multiple hash values as well. I also want to use the murmurhash function. However, the nvidia library (github.com/NVIDIA/cuCollections/blob/dev/include/cuco/detail/…) only allows for 32bit or 64bit INPUT into the hash function. Since my input data is larger, I now wonder if I can compute multiple 32bit hashes and combine them with XOR or if there is any better way?Hydromel
D
4

The hashing functions of Bloom filter should be independent and random enough. MurmurHash is great for this purpose. So your approach is correct, and you can generate as many new hashes your way. For the educational purposes it is fine.

But in real world, running hashing function multiple times is slow, so the usual approach is to create ad-hoc hashes without actually calculating the hash.

To correct @memo, this is done not by splitting the hash into multiple parts, as the width of the hash should remain constant (and you can't split 64 bit hash to more than 64 parts ;) ). The approach is to get a two independent hashes and combine them.

function generateHashes(obj) {
  // initialization phase
  long h1 = murmur3_hash(obj);
  long h2 = murmur3_hash(h1);

  int k = 3; // number of desired hash functions
  long hash[k];

  // generation phase
  for (int i=0; i<k; i++) {
    hash[i] = h1 + (i*h2);
  }

  return hash;
}

As you see, this way creating a new hash is a simple multiply-add operation.

Disinterest answered 30/7, 2018 at 8:40 Comment(4)
This approach is broken. To be a proper bloom filter, your k hash functions should be independent, but these ones are not, since the value of the first two hashes (modulo m, the size of your filter) determines the others. This necessarily increases false positives; for instance, with your system, the chance that two objects get the same h1 and h2 modulo m and thus k identical hashes is 1/m² - much higher than in a true bloom filter! I tested with k=10, m=1000, n=20 and the false positive rate with your system was several orders of magnitude higher than with independent hashes.Excursionist
For a second, you made me hesitate. Please see this paper (more than 200 citations) that explains in depth the usage of double hashing for Bloom filters. I am reverting the answer to it's previous form.Disinterest
Hmm. So in the paper, they present various hashing schemes, including yours, concede that they have a worse false positivity rate than a true Bloom filter, but prove they have an equal "asymptotic false positive rate" - which I think means the rate converges to that of an ideal Bloom filter as the size of the filter, m, tends to infinity? So for sufficiently large m, the false positive rate will be good enough? All that may be, but absent a compelling reason to use this scheme, I think I'd rather use memo's instead, and avoid having to spend time figuring out if my m is large enough.Excursionist
There always be a difference between theoretically perfect and practically "good enough" solutions. Anyways, in my answer I wanted to correct @Antinomian on splitting the hash function. You just can't "split" hashes. If you follow his link, the section "Constructing a Bloom Filter" references the same paper. The mix up is from the fact that originally they suggest to generate single 128 bit hash (run timely operation only once), split it into two parts 64 bit long (h1 and h2) and use them to generate a set of 64 bit hashes.Disinterest
L
1

It would not be a good approach. Let me try and explain. Bloom filter allows you to test if an element most likely belongs to a set, or if it absolutely doesn’t. In others words, false positives may occur, but false negatives won’t.

Reference: https://sc5.io/posts/what-are-bloom-filters-and-why-are-they-useful/

Let us consider an example:

You have an input string 'foo' and we pass it to the multiple hash functions. murmur3 hash gives the output K, and subsequent hashes on this hash value give x, y and z

Now assume you have another string 'bar' and as it happens, its murmur3 hash is also K. The remaining hash values? They will be x, y and z because in your proposed approach the subsequent hash functions are not dependent on the input, but instead on the output of first hash function.

long hash1 = murmur3_hash(hash+1);
long hash2 = murmur3_hash(hash+2);
long hash3 = murmur3_hash(hash+3);

As explained in the link, the purpose is to perform a probabilistic search in a set. If we perform search for 'foo' or for 'bar' we would say that it is 'likely' that both of them are present. So the % of false positives will increase.

In other words this bloom filter will behave like a simple hash-function. The 'bloom' aspect of it will not come into picture because only the first hash function is determining the outcome of search.

Hope I was able to explain sufficiently. Let me know in comments if you have some more follow-up queries. Would be happy to assist.

Lippi answered 30/7, 2018 at 18:10 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.