How many hash functions are required in a minhash algorithm
Asked Answered
C

5

34

I am keen to try and implement minhashing to find near duplicate content. http://blog.cluster-text.com/tag/minhash/ has a nice write up, but it leaves open the question of just how many hash functions you need to run across the shingles in a document to get reasonable results.

The blog post above mentioned something like 200 hashing algorithms. http://blogs.msdn.com/b/spt/archive/2008/06/10/set-similarity-and-min-hash.aspx lists 100 as a default.

Obviously there is an increase in the accuracy as the number of hashes increases, but how many hash functions is reasonable?

To quote from the blog

It is tough to get the error bar on our similarity estimate much smaller than [7%] because of the way error bars on statistically sampled values scale — to cut the error bar in half we would need four times as many samples.

Does this mean that decreasing the number of hashes to something like 12 (200 / 4 / 4) would result in an error rate of 28% (7 * 2 * 2)?

Contradiction answered 31/10, 2013 at 7:58 Comment(0)
G
19

Pretty much.. but 28% would be the "error estimate", meaning reported measurements would frequently be inaccurate by +/- 28%.

That means that a reported measurement of 78% could easily come from only 50% similarity.. Or that 50% similarity could easily be reported as 22%. Doesn't sound accurate enough for business expectations, to me.

Mathematically, if you're reporting two digits the second should be meaningful.
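
For the record, the arithmetic checks out: the error bar on a statistically sampled value scales as 1/sqrt(k), so cutting from k = 200 hashes to k = 12 multiplies the error bar by sqrt(200/12) ≈ 4.1, i.e. roughly 7% x 4.1 ≈ 29% -- in line with the 28% in the question.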

Why do you want to reduce the number of hash functions to 12? What "200 hash functions" really means is: calculate a decent-quality hashcode for each shingle/string once -- then apply 200 cheap & fast transformations to emphasise certain factors and bring certain bits to the front.

I recommend combining bitwise rotations (or shuffling) and an XOR operation. Each hash function can combine a rotation by some number of bits with an XOR against a randomly generated integer.

This both "spreads" the selectivity of the min() function around the bits, and as to what value min() ends up selecting for.

The rationale for rotation is that "min(Int)" will, 255 times out of 256, be decided within the 8 most-significant bits. Only if all the top bits are the same do the lower bits have any effect in the comparison, so spreading can be useful to avoid undue emphasis on just one or two characters in the shingle.

The rationale for XOR is that, on its own, bitwise rotation (ROTR) will 50% of the time (when a 0 bit is shifted in from the left) move values towards zero, and that would cause "separate" hash functions to display an undesirable tendency to converge towards zero together -- and thus an excessive tendency to end up selecting the same shingle, rather than independent shingles.

There's a very interesting "bitwise" quirk of signed integers, where the MSB is negative but all the following bits are positive, that makes the tendency of rotations to converge much less visible for signed integers than it would be for unsigned. XOR must still be used in either case, anyway.

Java has 32-bit hashcodes built in, and if you use the Google Guava libraries, 64-bit hashcodes are available.
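
For example (Guava's com.google.common.hash API; asLong() here takes the low 64 bits of Murmur3's 128-bit output):

import com.google.common.hash.Hashing;
import java.nio.charset.StandardCharsets;

public class Hash64 {
    // A 64-bit base hash for a shingle, via Guava's Murmur3.
    static long hash64(String shingle) {
        return Hashing.murmur3_128()
                      .hashString(shingle, StandardCharsets.UTF_8)
                      .asLong();
    }
}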

Thanks to @BillDimm for his input & persistence in pointing out that XOR was necessary.

Gobbet answered 31/10, 2013 at 8:17 Comment(6)
Following what Bill said below, I'd like to suggest supplementing 'rotation' with 'XOR' to both change which bits are 'most selective' and randomize which values the 'min' function is selecting for. Thanks @BillDimm!Gobbet
@BillDimm and I have investigated the actual behaviour of hash functions in my Java example code -- it turns out that ROTATE should always be combined with XOR. ROTATE on its own can produce excessive collisions, since 50% of single-bit rotations converge towards zero and "minimum" may tend to select the same shingles. Thanks Bill for your assistance in pointing this out!Gobbet
I don't think that the XOR method is good either. Certainly the proofs of the nice properties of minhash do not work for such hash functions, because they require that the hash functions be drawn from a 2-uniform family, which this is not.Anesthetic
Thanks @RobertObryk. According to my reading of en.wikipedia.org/wiki/Universal_hashing, an ADDITION of uniformly-distributed random numbers gives strong independence for all hash functions, while an XOR of uniformly-distributed random numbers gives strong independence only where hash usage is bitwise, ie where modulo M is a power of 2. So I would read Robert's suggestion as saying that ADDITION may be better than XOR. Any comments?Gobbet
en.wikipedia.org/wiki/Tabulation_hashing and en.wikipedia.org/wiki/Rolling_hash articles may also be worth reading, if you're building a Minhash algorithm.Gobbet
@ThomasW: Are there any criteria for picking how many bits should be rotated? Or would any constant shift amount do?Shriner
S
24

One way to generate 200 hash values is to generate one hash value using a good hash algorithm and generate 199 more cheaply by XORing the good hash value with 199 sets of random-looking bits having the same length as the good hash value (i.e. if your good hash is 32 bits, build a list of 199 32-bit pseudo-random integers and XOR the good hash with each of them).

Do not simply rotate bits to generate hash values cheaply if you are using unsigned integers (signed integers are fine) -- that will often pick the same shingle over and over.

Rotating the bits down by one is the same as dividing by 2 and copying the old low bit into the new high bit location. Roughly 50% of the good hash values will have a 1 in the low bit, so they will have huge hash values with no prayer of being the minimum hash when that low bit rotates into the high bit location. The other 50% of the good hash values will simply equal their original values divided by 2 when you shift by one bit. Dividing by 2 does not change which value is smallest.

So, if the shingle that gave the minimum hash with the good hash function happens to have a 0 in the low bit (50% chance of that) it will again give the minimum hash value when you shift by one bit. As an extreme example, if the shingle with the smallest hash value from the good hash function happens to have a hash value of 0, it will always have the minimum hash value no matter how much you rotate the bits.

This problem does not occur with signed integers because minimum hash values have extreme negative values, so they tend to have a 1 at the highest bit followed by zeros (100...). So, only hash values with a 1 in the lowest bit will have a chance at being the new lowest hash value after rotating down by one bit. If the shingle with minimum hash value has a 1 in the lowest bit, after rotating down one bit it will look like 1100..., so it will almost certainly be beat out by a different shingle that has a value like 10... after the rotation, and the problem of the same shingle being picked twice in a row with 50% probability is avoided.
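
A minimal Java sketch of the XOR approach (String.hashCode() stands in for the "good" hash algorithm; the 199 masks come from a seeded java.util.Random, purely for illustration):

import java.util.Arrays;
import java.util.Random;

public class XorMinHash {
    // One good hash per shingle, plus count-1 cheap derived hashes obtained by
    // XORing with fixed pseudo-random masks; returns the minimum per function.
    public static int[] signature(String[] shingles, int count, long seed) {
        Random rnd = new Random(seed);   // fixed seed => deterministic hash family
        int[] masks = new int[count];
        // masks[0] stays 0, so the first function is the good hash itself
        for (int i = 1; i < count; i++) {
            masks[i] = rnd.nextInt();
        }
        int[] mins = new int[count];
        Arrays.fill(mins, Integer.MAX_VALUE);
        for (String shingle : shingles) {
            int h = shingle.hashCode();  // the one good-quality hash
            for (int i = 0; i < count; i++) {
                int hi = h ^ masks[i];   // cheap derived hash
                if (hi < mins[i]) mins[i] = hi;
            }
        }
        return mins;
    }
}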

Swob answered 31/10, 2013 at 16:14 Comment(10)
My argument would be, that rotating the bits in the hash (first) is more likely to emphasise different shingles than XORing. Your comments about "roughly 50%" apply to any bitwise operation.. since any single bit has a 50% chance of being set. Rotating selects different bits to be most significant, XOR changes what value/ranges are effectively "least" but will still be primarily selecting on a function of those topmost bits. For that reason a purely XOR-based approach will always select most-significantly on those same topmost bits, and may not use other hash bits effectively.Gobbet
I think your understanding & logic as to bitwise rotation is faulty here. Rotating right by 3, for example, puts the 3 low-order bits in front of the previous MSB. Selecting for "least" of that will target completely different shingles. The argument that 50% of LSBs are 1 and 50% are 0 is spurious, as is the mention of "good" hash values -- it's kinda like the man waiting to deploy his parachute but he gets to 10 feet and says it's OK -- "I can jump from here!". Similar logical error -- he thinks 10 feet doesn't make a difference, you think 1 bit doesn't make a difference.Gobbet
However, I do like your approach to XORing & would be happy to use it in conjunction with rotations, table of pseudo-random integers chosen in conjunction with bitwise shuffles or rotations should be a good solid basic approach.Gobbet
@ThomasW, you can certainly add bit rotations to the XORing, but you're really not getting much for the extra CPU cycles. The shingle that gives the minimum hash when XORing is going to be the one that has the most high-order bits in common with the random number being XORed since the bits they have in common are going to give zeros in the result (a tie for the number of high bits in common with the random number brings lower bits into play). To the extent that your 199 random numbers are very different in the high order bits, the shingles selected will be very different (random).Swob
@ThomasW, I don't think there is any flaw in my analysis of bitwise rotations. Suppose h0() is our "good" hash function. h1() shifts down by 1 bit, h2() shifts down by 2 bits, etc. If the LSB for h0(x) is 0, h1(h0(x)) = h0(x)/2. If the LSB is 1, h1(h0(x)) = 2^31 + h0(x)/2. The 2^31 means a shingle with an LSB of 1 for h0 is NOT going to give the minimum hash for h1(h0()). So, if the i value that gives minimum h0(shingle[i]) has 0 LSB for h0, then the exact same i gives minimum h1(h0(shingle[i])) because minimizing h1(h0()) is the same as minimizing h0()/2 for the shingles that matter.Swob
@ThomasW Shifting down 1 bit puts the shingles into 2 groups: those with no chance of giving a min (LSB=1), and those that have a chance (LSB=0). For the latter the new hash vals are just the old ones divided by 2. If the shingle that gives min h0 has LSB=0 it will also give min for h1(h0()). 50% chance the LSB is 0 for the min shingle -> 50% chance the same shingle is selected by h1 as h0. You could have a million shingles that you are supposedly choosing from randomly, but rotating gives a 50% chance that the first 2 choices are the same, a 25% chance that the first 3 are the same, etc.Swob
Quick empirical exercise disagrees. Running minhash over 192 shingles (from the very text of your post above) with 32 rotates vs 32 XORs, gave 1 duplicate for rotates but an average of 3.3 duplicates for XOR (over 3 runs with different rand seeds -- 2, 4, 4 dupes). Varied text -- similar but XOR dupes are 0,1,3 while ROTR dupes still just 1. And 200 XOR functions give only 110.6 unique shingles on average (104, 112, 116 from the 3 different runs). Nominate a place if you'd like me to post the source somewhere.Gobbet
@ThomasW With true random selection you are going to get collisions (same shingle chosen multiple times) unless sqrt(num_shingles) is much larger than the number of random samples (aka the "birthday paradox"), so the results you got for XOR don't surprise me, and are not evidence of a flaw. The fact that you got so few collisions with pure rotation does surprise me, so I would like to see that. If you could email the code to me (first initial + last name + at hotneuron.com) or post it anywhere that is convenient for you, I would appreciate it. Thanks.Swob
@ThomasW Thank you for sharing your code. All of my previous comments assumed unsigned integer comparisons when looking for the minimum hash value. Using signed integers instead of unsigned avoids the problem that I described. Using rotations to generate 32 hashes gives 1 dupe when using signed comparisons, but 13 dupes when using unsigned comparisons (and the dupes are mostly adjacent values, i.e. values differing by one bit rotation, consistent with my description of the problem). I'll update my answer to make it clear that my criticism of bit rotation only applies for unsigned integers.Swob
Thanks @BillDimm, your original criticism is correct -- and probably still applies to some degree to signed ints as well. I've updated my answer to reflect that ROTR should always be combined with XOR, and credit you for that insight ;)Gobbet
F
12

What you want can be easily obtained from universal hashing. Popular textbooks like Cormen et al. have very readable information in section 11.3.3 (pp. 265-268). In short, you can generate a family of hash functions using the following simple equation:

h(x,a,b) = ((ax+b) mod p) mod m
  • x is the key you want to hash
  • a is any odd number you can choose between 1 and p-1 inclusive
  • b is any number you can choose between 0 and p-1 inclusive
  • p is a prime number greater than the max possible value of x
  • m is the max possible value you want for the hash code, plus 1

By selecting different values of a and b you can generate many hash codes that are independent of each other.
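
A Java sketch of that family (the prime 2^31 - 1 and the use of 64-bit arithmetic to keep a*x + b from overflowing are my choices; keys are assumed non-negative):

import java.util.Random;

public class UniversalHashFamily {
    private static final long P = 2147483647L;  // prime 2^31 - 1, > max int key
    private final long[] a, b;
    private final long m;

    public UniversalHashFamily(int count, long m, long seed) {
        this.m = m;
        a = new long[count];
        b = new long[count];
        Random rnd = new Random(seed);
        for (int i = 0; i < count; i++) {
            a[i] = 1 + 2L * rnd.nextInt((int) (P / 2));  // odd, in [1, p-1]
            b[i] = rnd.nextInt((int) P);                 // in [0, p-1]
        }
    }

    // The i-th hash of non-negative key x, in [0, m).
    public long hash(int i, int x) {
        return ((a[i] * x + b[i]) % P) % m;
    }
}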

An optimized version of this formula can be implemented as follows in C/C++/C#/Java:

(unsigned) (a*x+b) >> (w-M)

Here,
  • w is the size of the machine word (typically 32)
  • M is the size of the hash code you want, in bits
  • a is any odd integer that fits into a machine word
  • b is any integer less than 2^(w-M)

The above works for hashing a number. To hash a string, first get a hash code using built-in functions like GetHashCode, then use that value in the above formula.

For example, let's say you need 200 16-bit hash codes for a string s; then the following code can be written as an implementation:

public int[] GetHashCodes(string s, int count, int seed = 0)
{
    var hashCodes = new int[count];
    var machineWordSize = sizeof(int) * 8;          // 32 bits (sizeof returns bytes)
    var hashCodeSize = machineWordSize / 2;         // 16-bit hash codes
    var hashCodeSizeDiff = machineWordSize - hashCodeSize;
    var hstart = s.GetHashCode();                   // one good-quality base hash
    var bmax = 1 << hashCodeSizeDiff;               // b must be less than 2^(w-M)
    var rnd = new Random(seed);                     // fixed seed keeps hashes deterministic

    for (var i = 0; i < count; i++)
    {
        // (unsigned)(a*x + b) >> (w - M), with odd a = i*2 + 1 and random b
        var axb = unchecked((uint)(hstart * (i * 2 + 1) + rnd.Next(0, bmax)));
        hashCodes[i] = (int)(axb >> hashCodeSizeDiff);
    }
    return hashCodes;
}

Notes:

  1. I'm using a hash code size of half the machine word size, which in most cases would be 16 bits. This is not ideal and has a far greater chance of collision. It can be improved by upgrading all the arithmetic to 64-bit.
  2. Normally you want to select both a and b randomly within the ranges stated above.
Frostbitten answered 3/8, 2014 at 11:6 Comment(5)
Universal hashing is fine, but this code doesn't adhere to the rules of a hashcode. A hash should be deterministic, and this implementation uses a pseudo-random number generator.Baseboard
@Baseboard - you are correct. Changed this to use random seed. Alternatively, one can use random numbers generated by simple mod operation as well.Frostbitten
In the CLRS book, a is not required to be odd. Any reason you make it odd here?Ambros
Hi, a quick question regarding this. In case a very large number belonging in a finite field of GF(2^q) has to be hashed, which approach should be chosen? Should this number be converted to bits first to be hashed or not?Incept
Hi, a similar code in python can be found at https://mcmap.net/q/451486/-hash-functions-family-generator-in-pythonPrincipally
V
4

Just use 1 hash function! (and save the 1/(f ε^2) smallest values.)

Check out this article for state-of-the-art practical and theoretical bounds. It has this nice graph (below), explaining why you probably want to use just one 2-independent hash function and save the k smallest values.

When estimating set sizes, the paper shows that you can get a relative error of approximately ε = 1/sqrt(f k), where f is the Jaccard similarity and k is the number of values kept. So if you want error ε you need k = 1/(f ε^2); for example, if your sets have similarity around 1/3 and you want a 10% relative error, you should keep the 300 smallest values.
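
A minimal Java sketch of the bottom-k idea (String.hashCode() stands in for the single 2-independent hash function the paper assumes):

import java.util.TreeSet;

public class BottomKSketch {
    // Keep the k smallest hash values of a set of shingles.
    public static TreeSet<Integer> sketch(String[] shingles, int k) {
        TreeSet<Integer> smallest = new TreeSet<>();
        for (String s : shingles) {
            smallest.add(s.hashCode());
            if (smallest.size() > k) {
                smallest.pollLast();   // evict the current largest
            }
        }
        return smallest;
    }

    // Jaccard estimate: of the k smallest values in the union of the two
    // sketches, count the fraction that occur in both sketches.
    public static double jaccard(TreeSet<Integer> a, TreeSet<Integer> b, int k) {
        TreeSet<Integer> union = new TreeSet<>(a);
        union.addAll(b);
        int kk = Math.min(k, union.size());
        if (kk == 0) return 0.0;
        int both = 0, seen = 0;
        for (int v : union) {
            if (seen++ == kk) break;
            if (a.contains(v) && b.contains(v)) both++;
        }
        return (double) both / kk;
    }
}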

[graph: estimation error vs. number of values kept, from the cited paper]

Vilberg answered 6/10, 2016 at 12:23 Comment(2)
Thanks for focusing on the original question of how many hash functions / values are needed. But isn't it my goal to find the similarity with minhash? So how can I use the similarity to estimate the number of values I should keep? Also: shouldn't the number of shingles or rows in the document have an impact on k?Kick
You are right that it's somewhat annoying that you need a similarity estimate to find the similarity. That's because we're asking for relative error rather than additive error. Note that it suffices to have a lower bound, so you can just set it as low as you are willing to afford. You could also make a more expensive experiment first to get a rough idea. Finally, you are right that it's surprising that more values are not needed when your documents grow, but such is the magic of sketching :-)Vilberg
S
1

It seems like another way to get N good hash values would be to salt the same hash with N different salt values.

In practice, if applying the salt second, it seems you could hash the data, then "clone" the internal state of your hasher, add the first salt and get your first value. You'd reset this clone to the clean cloned state, add the second salt, and get your second value. Rinse and repeat for all N items.

Likely not as cheap as XOR against N values, but seems like there's possibility for better quality results, at a minimal extra cost, especially if the data being hashed is much larger than the salt value.
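
A minimal Java sketch of that idea (it assumes the provider's MessageDigest implementation supports clone(), which is true for the common MD5/SHA implementations but worth checking for yours):

import java.security.MessageDigest;

public class SaltedHashes {
    // Absorb the (possibly large) data once, then clone the digest state
    // and finish each clone with a different salt.
    public static byte[][] hashes(byte[] data, byte[][] salts) throws Exception {
        MessageDigest base = MessageDigest.getInstance("MD5");
        base.update(data);
        byte[][] out = new byte[salts.length][];
        for (int i = 0; i < salts.length; i++) {
            MessageDigest d = (MessageDigest) base.clone();  // copy internal state
            d.update(salts[i]);       // apply the i-th salt second
            out[i] = d.digest();
        }
        return out;
    }
}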

Slum answered 28/1, 2015 at 20:13 Comment(0)
