Is there a collision rate difference between one 32-bit hash vs two 16 bit hashes?

I am working on a system where hash collisions would be a problem. Essentially there is a system that references items in a hash-table+tree structure. However, the system in question first compiles text files containing paths in the structure into a binary file containing the hashed values instead. This is done for performance reasons. Because of this, collisions are very bad: the structure cannot store 2 items with the same hash value, and the part asking for an item would not have enough information to know which one it needs.

My initial thought is that 2 hashes, either using 2 different algorithms or the same algorithm twice with 2 salts, would be more collision resistant. Two items having the same hash under different hashing algorithms would be very unlikely.

I was hoping to keep the hash value 32-bits for space reasons, so I thought I could switch to using two 16-bit algorithms instead of one 32-bit algorithm. But that would not increase the range of possible hash values...

I know that switching to two 32-bit hashes would be more collision resistant, but I am wondering if switching to 2 16-bit hashes has at least some gain over a single 32-bit hash? I am not the most mathematically inclined person, so I do not even know how to begin checking for an answer other than to brute force it...

Some background on the system:

Items are given names by humans; they are not random strings, and will typically be made of words, letters, and numbers with no whitespace. It is a nested hash structure, so if you had something like { a => { b => { c => 'blah' }}} you would get the value 'blah' by asking for a/b/c; the compiled request would be 3 hash values in immediate sequence: the hash values of a, b, and then c.
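Roughly, the idea looks like this (a simplified sketch, not the real code; hash32 here is a stand-in djb2-style hash, not the algorithm I actually use):

    use strict;
    use warnings;

    # Stand-in 32-bit string hash (djb2-style); not the real system's algorithm.
    sub hash32 {
        my ($str) = @_;
        my $h = 5381;
        $h = ( $h * 33 + ord($_) ) % 2**32 for split //, $str;
        return $h;
    }

    # Source structure: the value 'blah' lives at path a/b/c.
    my %tree = ( a => { b => { c => 'blah' } } );

    # Compiling a path replaces each name with its hash, one per level.
    sub compile_path {
        my ($path) = @_;
        return map { hash32($_) } split m{/}, $path;
    }

    # Build the hashed structure; collisions only matter between siblings.
    sub build_level {
        my ($level) = @_;
        my %by_hash;
        for my $name (keys %$level) {
            my $h = hash32($name);
            die "collision on one level: $name" if exists $by_hash{$h};
            $by_hash{$h} = ref $level->{$name} eq 'HASH'
                         ? build_level($level->{$name})
                         : $level->{$name};
        }
        return \%by_hash;
    }

    my $compiled = build_level(\%tree);
    my $node     = $compiled;
    $node = $node->{$_} for compile_path('a/b/c');
    print "$node\n";    # prints: blah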

There is only a problem when there is a collision on a given level. A collision between an item at the top level and a lower level is fine. You can have { a => {a => {...}}}, almost guaranteeing collisions that are on different levels (not a problem).

In practice any given level will likely have less than 100 values to hash, and none will be duplicates on the same level.

To test the hashing algorithm I adopted (I forget which one, but I did not invent it), I downloaded the entire list of CPAN Perl modules, split all namespaces/modules into unique words, and finally hashed each one, searching for collisions. I encountered 0 collisions. That means the algorithm produces a different hash value for each unique word in the CPAN namespace list (or that I did it wrong). That seems good enough to me, but it's still nagging at my brain.
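The test itself was nothing fancy; it was roughly along these lines (a sketch from memory, reusing the hash32 stand-in above; cpan-modules.txt is just a placeholder name for the list I downloaded):

    use strict;
    use warnings;

    # One module name per line, e.g. "Foo::Bar::Baz"; hash32() is the stand-in above.
    open my $fh, '<', 'cpan-modules.txt' or die "cpan-modules.txt: $!";

    my %seen;                  # word => 1, so each unique word is hashed once
    my %by_hash;               # hash value => word that produced it
    my $collisions = 0;

    while (my $line = <$fh>) {
        chomp $line;
        for my $word (split /::/, $line) {
            next if $seen{$word}++;
            my $h = hash32($word);
            if (exists $by_hash{$h}) {
                warn "collision: '$word' vs '$by_hash{$h}' ($h)\n";
                $collisions++;
            }
            $by_hash{$h} = $word;
        }
    }
    printf "%d unique words, %d collisions\n", scalar keys %seen, $collisions;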

Larissa answered 6/4, 2011 at 4:58 Comment(0)

If you have two 16-bit hashes that produce uncorrelated values, then you have just written a 32-bit hash algorithm. It will not be better or worse than any other 32-bit hash algorithm.
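Put differently, gluing two independent 16-bit values together gives you exactly the same 2^32 possible outputs as computing one 32-bit value directly. Something like this (hash16a and hash16b are hypothetical 16-bit hash functions, just for illustration):

    # Two uncorrelated 16-bit hashes packed into one value are just a 32-bit hash.
    sub combined32 {
        my ($str) = @_;
        return ( hash16a($str) << 16 ) | hash16b($str);   # still only 2**32 possible values
    }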

If you are concerned about collisions, be sure that you are using a hash algorithm that does a good job of hashing your data (some are written merely to be fast to compute, which is not what you want), and increase the size of your hash until you are comfortable.

This raises the question of the probability of collisions. It turns out that if you have n things in your collection, there are n * (n-1) / 2 pairs of things that could collide. If you're using a k-bit hash, the odds of any single pair colliding are 2^-k. If you have a lot of things, then the collisions of different pairs are almost independent of each other. This is exactly the situation that the Poisson distribution describes.

Thus the number of collisions you see should approximately follow a Poisson distribution with λ = n * (n-1) * 2^(-k-1). From that, the probability of no hash collisions is about e^(-λ). With 32 bits and 100 items, the odds of a collision in one level are about 1.1525 in a million. If you do this enough times, with enough different sets of data, eventually those one-in-a-million chances will add up.
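If you want to check the arithmetic yourself, the approximation is easy to compute (a quick back-of-the-envelope script, not something from the system in question):

    use strict;
    use warnings;

    # P(at least one collision) ~ 1 - exp(-lambda), with lambda = n*(n-1)/2 / 2**k
    sub collision_odds {
        my ($n, $k) = @_;
        my $lambda = $n * ($n - 1) / 2 / 2**$k;
        return 1 - exp(-$lambda);
    }

    printf "100 items,  32-bit hash: %.3g\n", collision_odds(100,  32);   # ~1.15e-06
    printf "1000 items, 32-bit hash: %.3g\n", collision_odds(1000, 32);   # ~1.16e-04, about 1 in 10,000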

But note that if you have many normal-sized levels and a few large ones, the large ones will have a disproportionate impact on your risk of collision. That is because each thing you add to a collection can collide with any of the preceding things, so more things means a higher risk of collision. For instance, a single level with 1000 data items has about 1 chance in 10,000 of a collision, which is about the same risk as 100 levels with 100 data items each.

If the hashing algorithm is not doing its job properly, your risk of collision will go up rapidly. How rapidly depends very much on the nature of the failure.

Using those facts and your projections for what the usage of your application is, you should be able to decide whether you're comfortable with the risk from 32-bit hashes, or whether you should move up to something larger.

Ignacioignacius answered 6/4, 2011 at 5:1 Comment(6)
I'd be a bit worried about using the same 16 bit hash algorithm with 2 different salt values; the two hash values are then implicitly correlated. – Dade
@IraBaxter I said salt, but I think I was incorrect. I meant use the same algorithm, but the second time prefix a value. The algorithm slurps the string in and iterates each character, changing the hash each time, such that "ab" and "ba" will have different values. And since I don't have to worry about collisions on identical strings (the point of a hash), prefixing a value to the second run should be sufficient for 2 items with the same hash after the first run to have a different hash in the second. (Then again I might want to confirm that) – Larissa
@ira-baxter: If the hash algorithm is cryptographically secure, there should be no such correlation. However, that's an if that shouldn't be ignored. – Ignacioignacius
@Exodist: I'm not a mathematician, but if your two hash functions have an algorithmic relation, then I'd expect the bits in the two results to be correlated, and not in ways that are easy to see. Frankly, considering that building 32-bit hash functions isn't hard, I wouldn't take the risk. – Dade
@IraBaxter I think I will compromise: I will use my 32-bit hash, but the compiling phase will take the time to calculate a second 32-bit hash. When the request is made it will send twice as much information; the code around the structure will simply ignore the second hash, and for most items never compute it. However, when a collision occurs it will calculate the second hash in order to store the second item, and when the request comes in for such a colliding pair the second hash will not be ignored. I can probably optimise this by having the second hashes all trail the request in proper order. – Larissa
If there is an algorithmic relation between differently seeded hashes, it would probably indicate a flaw in the algorithm. A good hash function should have no correlation between different seeds. If there is, it is not a true seeded algorithm. SMHasher does a good job of testing for this. – Pine
