Chance of a duplicate hash when using first 8 characters of SHA1
Asked Answered
S

2

15

If I have an index of URLs, and ID them by the first 8 characters of a SHA1 hash, what is the probability of two different URLs having identical IDs?

Springer answered 31/5, 2015 at 18:29 Comment(2)
These guys got hit by a collision with truncation to 7 hex digits in their public website. 8 is a little better, but maybe not worth the risk. blog.getsolid.io/birthday-paradox-coding-solidDear
The above URL is not in here: getsolid.io/blog/birthday-paradox-coding-solidPedal
C
33

@Teepeemm has correctly answered the related question ‘given a particular sequence of 8 hex digits, what is the chance of another SHA-1 hash appearing with the same 8 digits?’ It's a very small number.

What's at stake in this question, though, is a different question: ‘given a large number of 8-hex-digit sequences, what's the chance of any two of them being the same?’ As the first comment to the question points out, this is related to the birthday paradox, which is not ‘what's the chance of someone in the room having the same birthday as me?’, but instead ‘what's the chance of any two people in this room having the same birthday?’ As is reasonably well known, the chance of that is 50% with only 23 people.

The hash-collision problem is essentially the same problem, but generalised from N=365 days to N=16^8 8-byte sequences, which is about 4.30e9. That's the ‘generalised birthday problem’. Using the expression quoted there (n=sqrt(2*d*ln(1/(1-p))), with d=4.30e9 and p=0.5, we find a 50% chance of a collision with only 77000 trials. If you plot the corresponding function, you see the probability increases quite rapidly as the number of trials increases.

Even with 16 bytes of hash (so d=16^16) there's a 50% chance of a collision after only 5 billion trials.

Happy birthday!

Claribel answered 1/6, 2015 at 3:11 Comment(1)
Good point; I hadn't thought of that. I guess it comes down to why OP is doing this. If it's to conveniently hash a few urls, then a collision isn't a big deal. If it's vital to avoid collisions, that's another story.Kellar
K
4

A SHA-1 hash has 40 base-16 digits. If you’re only looking at the first 8 of them, then the chance that a second url has the same 8 digits is (1/16)^8 ~ 2.32e-10. Actually, this doesn't depend on there being 40 digits to begin with, or even that it's SHA-1. The only assumption you need is that SHA-1 has the first 8 digits independent and identically distributed.

Kellar answered 1/6, 2015 at 1:4 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.