Are there circumstances where a hash algorithm can be guaranteed unique?

Asked 19/2, 2010 at 21:50 Answered 22/2, 2010 at 2:44

If I'm hashing size-constrained similar data (social security numbers, for example) using a hash algorithm with a larger byte size than the data (sha-256, for example), will the hash guarantee the same level of uniqueness as the original data?

Nava answered 19/2, 2010 at 21:50 Comment(0)

The probability of a hash collision has nothing to do with the size of the input string (except to the extent that it indicates how many inputs you need to keep uniqueness among). It's possible to have a hash collision when you hash 0 and 1 using a perfect hash algorithm, although the possibility is 1/(2^bit-length). Which in the case of SHA-256 is effectively zero.

Hash collisions are a birthday paradox problem. In the case of a 256 bit hash, the probability of a collision among two inputs is purely dependent on the count of inputs and is:

1 - (2^256)! / ((2^256^inputcount) * (2^256-inputcount)!) or as others have said -- basically zero for reasonable numbers of inputs.

Micronesia answered 19/2, 2010 at 22:36 Comment(4)

True. I'm not questioning the security implications, though. I'm asking for probability of uniqueness from a hash when the size of the data is less than the size of the hash. (I need the resulting value to be deterministic/repeatable, so performing a random salt of x bytes doesn't work for me. I might "salt" by adding constant characters per implementation - for instance, I might append characters like '593jra' to the ssn before hashing). – Nava 19/2, 2010 at 22:49

Isn't the birthday paradox based on the pigeonhole principle? If so, in theory I don't have a pigeonhole scenario. – Nava 20/2, 2010 at 1:14

The pigeonhole principle is the simple notion that when you have more items than pigeonholes, you're guaranteed to have a collision. The birthday paradox just says you're really really likely to get a collision if your ratio of items to pigeonholes is "high". Where "high" is defined by the above formula. – Micronesia 27/2, 2010 at 1:37

I believe you don't need the 1 - part in your formula - unless you're trying to express the probability of no collisions. By the way, can you give us a source for that formula? – Myrnamyrobalan 30/5, 2016 at 15:10

You can always create a customized hash that guarantees uniqueness. For data in a known domain (like SSN's), the exercise is relatively simple.

If your target hash value actually has more bits available than what you're hashing, the hash simply maps input values to one of the available output values. This will be a simple linear mapping from input value as a multi-byte integer to the output as a multi-byte integer.

When your target hash value has fewer bits than what's being hashed, then uniqueness cannot ever be guaranteed.

Portwin answered 19/2, 2010 at 22:1 Comment(2)

Thanks. I am considering hashing ssn's and an "account" identifier which may vary with each implementation. So if I can use a hash function instead of a pre-generated, it would be preferable. – Nava 19/2, 2010 at 22:31

If masking social security numbers is the goal, then implementing a one to one linear mapping function would not suffice, as it would be rather easy to calculate the original input from some samples of the output. Besides, the length of the input string definitely does not affect the effectiveness of a cryptographically secure hash function, so using a known hash algorithm is the way to go – Suicidal 19/2, 2010 at 23:21

Others have pointed out that collisions should not be a concern; that is the whole point of cryptographically secure hash functions. I would just like to add the following:

If your input set is small enough (e.g. data is SSN -- there are less than a billion of them), then the absence of collision is amenable to verification: just test it exhaustively.
If the input set is too big to be exhaustively scanned, then it is expected that the absence of collision cannot be proven. Good hash functions are expected to act as random oracles, and on a random oracle you cannot prove such a property without trying exhaustively. Being able to prove the absence of collision would suspiciously look like a weakness of the function.

Transvalue answered 22/2, 2010 at 2:44 Comment(0)

If you're using a cryptographic hash like SHA, then the short answer is yes.

Kirbie answered 19/2, 2010 at 21:53 Comment(5)

Thanks. I was thinking so, but I couldn't find a reference to back it up and I'm not smart enough to dig into the math and conclude one way or the other! – Nava 19/2, 2010 at 21:58

As noted above, a cryptographic hash just says that collisions are extraordinarily unlikely, not impossible. – Panelboard 19/2, 2010 at 22:25

@Novelcrat, The short answer to the original question is yes. While in theory a collision is possible, the mean time to finding a collision is considerably longer than the time it will take the sun to evolve into a red giant and destroy the earth. – Kirbie 21/2, 2010 at 15:52

@Novelcrat. P.S. If you can post two 10-digit SSNs that produce the same SHA-256 hashes, I will pay you $1,000 US. – Kirbie 21/2, 2010 at 16:2

@DieinSente I found them! Pay me and I'll let you know. :P – Pendulum 26/10, 2017 at 17:38

One key feature of a cryptographically secure hash function is that you are safe from collisions beyond reasonable doubt, regardless of the input. This is also valid for input shorter than the output's size, which is the same of a longer message with little entropy. So you can use SHA-2 without worrying about collisions.

Suicidal answered 19/2, 2010 at 22:29 Comment(0)

Recommended topics

Hot tags