Generating k pairwise independent hash functions
Asked Answered
M

2

10

I'm trying to implement a Count-Min Sketch algorithm in Scala, and so I need to generate k pairwise independent hash functions.

This is a lower-level than anything I've ever programmed before, and I don't know much about hash functions except from Algorithms classes, so my question is: how do I generate these k pairwise independent hash functions?

Am I supposed to use a hash function like MD5 or MurmurHash? Do I just generate k hash functions of the form f(x) = ax + b (mod p), where p is a prime and a and b are random integers? (i.e., the universal hashing family everyone learns in algorithms 101)

I'm looking more for simplicity than raw speed (e.g., I'll take something 5x slower if it's simpler to implement).

Mindamindanao answered 25/8, 2012 at 8:11 Comment(1)
MD5 is cryptographic. MurmurHash is good, but not cryptographically strong.Most
M
5

Scala already has MurmurHash implemented (it's scala.util.MurmurHash). It's very fast and very good at distributing values. A cryptographic hash is overkill--you'll just take tens or hundreds of times longer than you need to. Just pick k different seeds to start with and, since it's nearly cryptographic in quality, you'll get k largely independent hash codes. (In 2.10, you should probably switch to using scala.util.hashing.MurmurHash3; the usage is rather different but you can still do the same thing with mixing.)

If you only need near values to be mapped to randomly far values this will work; if you want to avoid collisions (i.e. if A and B collide using hash 1 they will probably not also collide using hash 2), then you'll need to go at least one more step and hash not the whole object but subcomponents of it so there's an opportunity for the hashes to start out different.

Most answered 25/8, 2012 at 16:38 Comment(7)
Does your point about avoiding collisions mean that the hash functions generated from MurmurHash using different seeds won't be (by default) pairwise independent? I'm only hashing integers in my case.Mindamindanao
@Mindamindanao - Oh, integers will be fine. What I mean is that if object A hashes to value x using .hashValue and object B also hashes to value x, then A and B will collide regardless of what seed you use (as you start with a seed and then mix in x). If you are hashing integers, that's not a concern: A and B have the same intrinsic hash value if and only if A == B.Most
Ah, got it, thanks! To pick the k different seeds, does running scala.util.Random.nextInt() k different times work, or do I need to do something else?Mindamindanao
@Mindamindanao - That should be fine. If you want your code to be deterministic (despite being pseudorandom) so you get the same answer every time, you'll want to create a new scala.util.Random with a seed you select. Otherwise the default nextInt is an adequately good random number generator.Most
@RexKerr I don't think it is possible to change the seed in the new MurmurHash3 implementation.Pronunciamento
@Pronunciamento - Many of the methods are overloaded with a variant that takes a seed, and you can in any case always just start with the seed value and mix in the next hash value to it.Most
Can seeds e.g. 1,2,3 be used for different hash functions, or should seeds be random numbers?Ichnite
C
2

Probably the simplest approach is to take some cryptographic hash function and "seed" it with different sequences of bytes. For most practical purposes, the results should be independent, as this is one of the key properties a cryptographic hash function should have (if you replace any part of a message, the hash should be completely different).

I'd do something like:

// for each 0 <= i < k generate a sequence of random numbers
val randomSeeds: Array[Array[Byte]] = ... ; // initialize by random sequences

def hash(i: Int, value: Array[Byte]): Array[Byte] = {
    val dg = java.security.MessageDigest.getInstance("SHA-1");
    // "seed" the digest by a random value based on the index
    dg.update(randomSeeds(i));
    return dg.digest(value);
    // if you need integer hash values, just take 4 bytes
    // of the result and convert them to an int
}

Edit: I don't know the precise requirements of the Count-Min Sketch, maybe a simple has function would suffice, but it doesn't seem to be the simplest solution.

I suggested a cryptographic hash function, because there you have quite strong guarantees that the resulting hash functions will be very different, and it's easy to implement, just use the standard libraries.

On the other hand, if you have two hash functions of the form f1(x) = ax + b (mod p) and f2(x) = cx + d (mod p), then you can compute one using another (without knowing x) using a simple linear formula f2(x) = c / a * (f1(x) - b) + d (mod p), which suggests that they aren't very independent. So you could run into unexpected problems here.

Caroncarotene answered 25/8, 2012 at 9:1 Comment(2)
Is there any advantage to using a cryptographic hash function (as opposed to f(x) = ax + b mod p) in the case of creating something like a Bloom filter or a Count-Min Sketch? AFAICT, a cryptographic hash function seems a little overkill, since I don't need the cryptographic properties, but I might be missing something.Mindamindanao
@Mindamindanao - ax+b mod p has a way of falling into cycles which can create patterns in your sampling that could be problematic, depending on the assumptions of your sampling. And then if you don't want exactly the full range you run into issues of high order vs. low order bits, etc.. It's good for a little haphazard scrambling, but there are fairly-fast alternatives that work much better.Most

© 2022 - 2024 — McMap. All rights reserved.