Obtaining a k-wise independent hash function
Asked Answered
S

5

7

I need to use a hash function which belongs to a family of k-wise independent hash functions. Any pointers on any library or toolkit in C, C++ or python which can generate a set of k-wise independent hash functions from which I can pick a function.

Background: I am trying to implement this algorithm here: http://researcher.watson.ibm.com/researcher/files/us-dpwoodru/knw10b.pdf for the Distinct Elements problem.

I have looked at this thread: Generating k pairwise independent hash functions which mentions using Murmur hash to generate a pairwise independent hash function. I was wondering if there is anything similar for k-wise independent hash functions. If there is none available, would it be possible for me to construct such a set of k-wise independent hash functions.

Thanks in advance.

Staffan answered 29/4, 2013 at 17:5 Comment(1)
You could use an encryption algorithm with k different keys. I like RC4 for stuff like this. It's not cryptographically secure. But for your purpose it ought to be close enough to avoid collisions. In addition, it's both simple to implement and fast.Heteronomy
P
2

This is one of many solutions, but you could use for example the following open-source hash algorithm: https://github.com/Cyan4973/xxHash

Then, to generate different hashes, you just have to provide different seeds.

Considering the main function declaration :

unsigned int XXH32 (const void* input, int len, unsigned int seed);

So if you need k different hash values, just re-use the same algorithm k times, with k different seeds.

Pagel answered 29/4, 2013 at 20:45 Comment(3)
:Thanks a ton. Just needed a clarification: If I generate k different hash functions with k random seeds , does it imply that these are k-wise independent ? (or atleast statistically close to k-wise independent). I am already using pyhash libary and using murmur32 hash function with different seeds.Staffan
It entirely depends on the quality of the hash algorithm. Both MurmurHash and above xxHash have been thoroughly tested with SMHasher, proving there is no bit dependency between 2 results using different seeds. Nothing is perfect in this world, but these 2 ones have close to perfect random distribution.Pagel
Note that xxHash is not k-independent, so it may fail predictably for some keysets. If you can afford it, it would be better to use the polynomial hash below, with which the algorithm is proven to work.Tench
T
13

The simplest k-wise independent hash function (mapping positive integer x < p to one of m buckets) is just

h

where p is some big random prime (261-1 will work) and ai are some random positive integers less than p, a0 > 0.

2-wise independent hash: h(x) = (ax + b) % p % m

again, p is prime, a > 0, a,b < p (i.e. a can't be zero but b can when that is a random choice)

These formulas define families of hash functions. They work (in theory) if you select a hash function randomly from corresponding family (i.e. if you generate random a's and b) each time you run your algorithm.

Terrill answered 12/11, 2016 at 17:13 Comment(2)
Thanks, this is helpful. Could you clarify: 1) It seems that one gets bijective result only if p is mersenne prime, is that really the case? 2) How difficult it is to compute preimages (original x if we know only output), provided that we know only p, m, a, b, h(x) - especially if the random polynomial used is of high degree to make its factorization frustrating. Better yet, if you know of some literature elaborating on the topic wrt cryptography - i know only of fi.muni.cz/~xbouda1/teaching/2009/IV111/… which is quite dated to say the leastSludge
Is there a link that proves the above formula is k-wise independent?Transilluminate
C
6

There is no such thing as "a k-wise independent hash function". However, there are k-wise independent families of functions.

As a reminder, a family of functions is k-wise independent when if h is picked randomly from the family and x_1 .. x_k and y_1 .. y_k are picked arbitrarily, the probability that "for all i, h(x_i) = y_i" is Y^-k, where Y is the size of the co-domain from which the y_i were selected.

There are a few families of functions that are known to be k-wise independent for small k like 2, 3, 4, and 5. For arbitrary k, you will likely need to use polynomial hashing. Note that there are two variants of this, one of which is not even 2-independent, so be careful when implementing it.

The polynomial hash family can hash from a field F to itself using k constants a_0 through a_{k-1} and is defined by the sum of a_i x^i, where x is the key you are hashing. Field arithmetic can be implemented on your computer by taking letting F be the integers modulo a prime p. That's probably not convenient, as it is often better to have the domain and range be uint32_t or the like. In that case you can use the field F_{2^32}, and you can use polynomial multiplication over Z_2 and then division by an irreducible polynomial in that field. Otherwise, you can operate in Z_p where p is larger than 2^32 (or 64) and take the result of the polynomial mod 2^32, I think. That will only be almost k-wise independent, but sometimes that's sufficient for the analysis to go through. It will not be easy to re-analyze the KNW algorithm to change its hash families.

To generate a member of a k-wise independent family, use your favorite random number generator to pick the function randomly. In the case of polynomila hashing, that means picking the as referenced above. /dev/random should suffice.

The paper you point to, "An Optimal Algorithm for the Distinct Elements Problem", is a nice one and has been cited many times. However, it is not easy to implement, and it may be slower or even take more space than HyperLogLog, due to hidden constants in the big-O notations. A number of papers have noted the complexity of this algorithm and even called it infeasible compared to HyperLogLog. If you want to implement an estimator for the number of distinct elements, you might start with an earlier algorithm. There is plenty of complexity there if your goal is education. If your goal is practicality, you also want to stay away from KNW, because it could be a lot of work just to make something less practical that HyperLogLog.

As another piece of advice, you should probably ignore the suggestions to "just use Murmur hash" or "pick k values from xxhash" if you want to learn about and understand this algorithm or other random algorithms that use hashing. Murmur/xx might be fine in practice, but they are not k-wise independent families, and some of that advice on this page is not even semantically well-formed. For instance, "if you need k different hash, just re-use the same algorithm k times, with k different seeds" isn't relevant to k-wise independent families. For this algorithm you want to implement, you'll end up apply the hash functions an arbitrary number of times. You don't "need k different hash", you need n different hash values generated by first picking randomly from a k-independent hash family and second applying the chosen function to the streaming keys that are the input to algorithms like this.

Collocutor answered 13/11, 2016 at 5:22 Comment(3)
Thanks for the nice explanation. Can you please suggest some libraries which are fast and optimized in giving hash function h [n]->[k] from k wise independent hash family where k<5. I have a critical application where I was just generating n random numbers from 1-k which seems to be slow for my purpose. ThanksBosky
Hi @Collocutor , do you have a good source for this answer?Sophrosyne
CLRS should do, I believe.Collocutor
P
2

This is one of many solutions, but you could use for example the following open-source hash algorithm: https://github.com/Cyan4973/xxHash

Then, to generate different hashes, you just have to provide different seeds.

Considering the main function declaration :

unsigned int XXH32 (const void* input, int len, unsigned int seed);

So if you need k different hash values, just re-use the same algorithm k times, with k different seeds.

Pagel answered 29/4, 2013 at 20:45 Comment(3)
:Thanks a ton. Just needed a clarification: If I generate k different hash functions with k random seeds , does it imply that these are k-wise independent ? (or atleast statistically close to k-wise independent). I am already using pyhash libary and using murmur32 hash function with different seeds.Staffan
It entirely depends on the quality of the hash algorithm. Both MurmurHash and above xxHash have been thoroughly tested with SMHasher, proving there is no bit dependency between 2 results using different seeds. Nothing is perfect in this world, but these 2 ones have close to perfect random distribution.Pagel
Note that xxHash is not k-independent, so it may fail predictably for some keysets. If you can afford it, it would be better to use the polynomial hash below, with which the algorithm is proven to work.Tench
M
1

Just use a good non-cryptographic hash function. This advice perhaps will make me unpopular with my colleagues in theoretical computer science, but consider your adversary.

  1. Nature. Yeah, maybe it'll hit the minuscule fraction inputs that cause your hash function to behave badly, but there are plenty of other ways for things to go wrong that a k-wise independent hash family won't fix (e.g., the random number generator that chose the hash function didn't do a good job, bugs, etc.), so you need to test end-to-end anyway.

  2. Oblivious adversary. This is what the theory assumes. Oblivious adversaries cannot look at your random bits. If only they were so nice in real life!

  3. Non-oblivious adversary. Randomness is pointless. Use a binary tree.

Marcellus answered 29/4, 2013 at 21:25 Comment(3)
I'm not sure what to make of this answer. The question is "I need a hash function" and the response is "Just use a hash function" ???Henry
@BenFulton The question is "I need a k-wise independent hash function" and my answer is "no you don't; k-wise independence is a property that's much more important in theory than in practice".Marcellus
My understanding is that 5-wise independence specifically is actually needed for linear probing to work well in practice, as per "On the k-independence required by linear probing and minwise independence"Coda
R
-2

I'm not 100% sure what you mean by "k-wise independent hash functions", but you can get k distinct hash functions by coming up with two hash functions, and then using linear combinations of them.

I have an example in my bloom filter module: http://stromberg.dnsalias.org/svn/bloom-filter/trunk/bloom_filter_mod.py Ignore the get_bitno_seed_rnd function, look at hash1, hash2 and get_bitno_lin_comb

Redcap answered 29/4, 2013 at 20:36 Comment(5)
Isn't this exactly what you don't want to do for a bloom filter? I thought the efficiency of a bloom filter depends on using k independent hash functions. A linear combination of hash function is by-definition not independent.Aeronautics
I believe it's pretty common to use linear combinations with a bloom filter. See #11954586 . Also, the bloom filter implementation I worked on uses linear combinations, and the unit tests confirm that the distribution of values is pretty good. Also, at least in python, using a random number generator's seed worked pretty poorly - that's what I replaced. The unit tests confirmed this.Redcap
Agreed, I've heard the same thing from others who have validated the false-positive rate in a bloom filter that combines hash functions very much like yours does. I just can't wrap my head around why this works.Aeronautics
FWIW, it's not super-different from Knuth's description of Linear Congruential Pseudo Random Numbers - which of course scatters values pretty well. For the kth hash function, you doing hash1 + k*hash2 mod n where n is the number of bits in the filter, and hash1 and hash2 are 0 <= hash_value < n. So you multiply it high and mod it back.Redcap
@ChrisWarth The intuitive hand-wavey explanation for why using two functions is okay is that it's EXTREMELY unlikely for two numbers to produce the same two "base" hash values. The more common case is that the two "base" hashes add up and (after modding them) map to the same values in the hash map. Asymptotically, the first case becomes negligibly uncommon. The second case has the same chance of happening as a regular bloom filter false positive. Read Kirsch & Mitzenmacher's paper called Less Hashing, Same Performance for the proper math on this.Strophic

© 2022 - 2024 — McMap. All rights reserved.