How many random elements before MD5 produces collisions?
O

8

168

I've got an image library on Amazon S3. For each image, I md5 the source URL on my server plus a timestamp to get a unique filename. Since S3 can't have subdirectories, I need to store all of these images in a single flat folder.

Do I need to worry about collisions in the MD5 hash value that gets produced?

Bonus: How many files could I have before I'd start seeing collisions in the hash value that MD5 produces?

Ongoing answered 14/10, 2008 at 15:43 Comment(2)
Related: Are there two known strings which have the same MD5 hash value?Ecclesiolatry
The literal answer is that the second file could have the same MD5 as the first. However the odds are extremely small.Weatherford
E
317

The probability of just two hashes accidentally colliding is 1/2^128, which is 1 in 340 undecillion 282 decillion 366 nonillion 920 octillion 938 septillion 463 sextillion 463 quintillion 374 quadrillion 607 trillion 431 billion 768 million 211 thousand 456.

However, if you keep all the hashes, then the probability is a bit higher thanks to the birthday paradox. To have roughly a 50% chance of any hash colliding with any other hash you need about 2^64 hashes. This means that to get a collision, on average, you'll need to hash 6 billion files per second for 100 years.
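The birthday estimate above can be checked numerically. This is a minimal sketch, assuming MD5 output behaves like a uniformly random 128-bit value (an idealization) and using the standard approximation p ≈ 1 - exp(-n^2 / (2N)). The function name `collision_probability` is illustrative, not from the original answer.

```python
import math

def collision_probability(n_items: int, bits: int = 128) -> float:
    """Approximate chance that at least two of n_items uniformly random
    `bits`-bit hashes coincide: p ~= 1 - exp(-n^2 / (2 * 2^bits))."""
    space = 2 ** bits
    # Integer / integer true division yields a float without overflow here.
    exponent = -(n_items * n_items) / (2 * space)
    # expm1 keeps precision when the probability is extremely small.
    return -math.expm1(exponent)

if __name__ == "__main__":
    for n in (10**6, 10**9, 10**12, 2**64):
        print(f"{n:.3e} hashes -> p ~= {collision_probability(n):.3e}")
```

Under this approximation, 2^64 hashes give a probability of about 0.39, so the 50% point sits slightly above 2^64.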

Earldom answered 13/11, 2008 at 22:6 Comment(15)
Not strictly true. The probability of a collision is much higher than this as a new URL could potentially collide with any existing item in the table. See This posting (disclaimer, I wrote it) for a run-down on the maths, and a small python script that can be adapted to compute the probability for a particular number of URLs.Mascle
@ConcernedOfTunbridgeWells: I did take correction for birthday paradox, which is why answer is in billions, not quintillions. I was unable to verify probability with your script PV=2**128; SS=2**64: OverflowError: long int too large to convert to intEarldom
"probability of collision is 1/2^64" - what? The probability of collision is dependent on the number of items already hashed, it's not a fixed number. In fact, it's equal to exactly 1 - sPn/s^n, where s is the size of the search space (2^128 in this case), and n is the number of items hashed. What you are probably thinking of is 2^64, which is the approximate number of items you'd need to MD5 hash to have a 50% chance of collision.Superimpose
@BlueRaja-DannyPflughoeft that's what I had in mind indeed. Thanks for the correction.Earldom
Unfortunately, you are still not correct. You are assuming that the hash function is truly random. It is not. This means that the collision probability is higher.Medulla
JørgenFogh: And all laws of physics are "not correct" either. Such a level of pedantry is unnecessary because it doesn't change the answer in any meaningful way.Earldom
(This means that to get a collision, on average, you'll need to hash 6 billion files per second for 100 years.); incorrect. this means that by the time you've been hashing 6 billion files per second for 100 years, 50% of the hashes you are generating would collide with previously-generated hashes.Drool
@yaauie No, that's ridiculously impossible. I'm talking about generating 2^64 hashes out of 2^128 possible ones. That's one quadrillionth of a percent of all possible hashes generated.Earldom
Intuitively if we ignore the birthday paradox and just look at an approximate solution: Add 2^64 hashes into a list. Now add one more hash to that list. That one more hash has 1 / 2^128 times 2^64 chance of a collision, i.e. that one more hash has a 1 / 2^64 chance of a collision. Now add another 2^64 hashes to the list and you should get a collision. Do the same calculation for 2^63 (and note 2^63 + 2^63 = 2^64).Cormack
So you’re saying there’s a chance!Valeta
Can I use this hash algorithm for filenames? Like hash the contents of files, set the name of those files to their respective hashes and store them in a directory? Maximum number of files in the directory at the same time is around 3000.Telegony
@AmirhoseinAl yes, for all practical purposes it will be as unique as the filenames.Earldom
Does this mean "Don't worry"? My DB primary keys are MD5 hashes!Pushover
@AnuragVohra Yes, you don't have to worry. The most probable collision there is an asteroid hitting earth.Earldom
If we take 2^64 random hashes out of 2^128, then according to the approximate formula from the Birthday attack article there is a 0.39 chance that at least one value is chosen more than once, whereas with 2.2 * 10^19 hashes there is a 50% chance of at least one collision (see the table in the article).Putman
S
28

S3 can have subdirectories. Just put a "/" in the key name, and you can access the files as if they were in separate directories. I use this to store user files in separate folders based on their user ID in S3.

For example: "mybucket/users/1234/somefile.jpg". It's not exactly the same as a directory in a file system, but the S3 API has some features that let it work almost the same. I can ask it to list all files that begin with "users/1234/" and it will show me all the files in that "directory".
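The prefix idea can be illustrated without touching S3 at all. The sketch below only mimics a prefix listing over a local list of key strings. The helper names `make_key` and `list_prefix` are hypothetical. Against a real bucket, the S3 list operation accepts a prefix parameter that plays the same role (for example `Prefix` in boto3's `list_objects_v2`).

```python
def make_key(user_id: int, filename: str) -> str:
    """Build an S3-style object key with a per-user 'directory' prefix."""
    return f"users/{user_id}/{filename}"

def list_prefix(keys, prefix):
    """Local stand-in for a prefix listing: keys that start with prefix."""
    return sorted(k for k in keys if k.startswith(prefix))

# Hypothetical keys, all living in one flat namespace.
keys = [
    make_key(1234, "somefile.jpg"),
    make_key(1234, "other.png"),
    make_key(99, "x.jpg"),
]

if __name__ == "__main__":
    print(list_prefix(keys, "users/1234/"))
```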

Spurlock answered 14/10, 2008 at 15:46 Comment(0)
A
19

So wait, is it:

md5(filename) + timestamp

or:

md5(filename + timestamp)

If the former, you are most of the way to a GUID, and I wouldn't worry about it. If the latter, then see Karg's post about how you will run into collisions eventually.
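To make the two variants concrete, here is a small sketch using Python's `hashlib`. The function names, the `-` separator, and the integer timestamp are assumptions for illustration; the question does not say how the values are joined.

```python
import hashlib

def name_hash_then_stamp(source_url: str, timestamp: int) -> str:
    """md5(filename) + timestamp: digest of the URL, timestamp appended in clear.
    A clash needs the same digest AND the same timestamp."""
    digest = hashlib.md5(source_url.encode("utf-8")).hexdigest()
    return f"{digest}-{timestamp}"

def name_hash_of_both(source_url: str, timestamp: int) -> str:
    """md5(filename + timestamp): one digest over both values.
    The result is still just a single 128-bit value."""
    return hashlib.md5(f"{source_url}{timestamp}".encode("utf-8")).hexdigest()

if __name__ == "__main__":
    url = "http://example.com/cat.jpg"  # placeholder URL
    print(name_hash_then_stamp(url, 1223998980))
    print(name_hash_of_both(url, 1223998980))
```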

Angola answered 14/10, 2008 at 15:47 Comment(4)
Please elaborate on how including the timestamp increases the chance of collisionAmorette
@BradThomas : It does not. The MD5 risk of collision is the same whether it is on the filename or the combination of filename+timestamp. But in the first scenario, you would need to have both an MD5 collision and a timestamp collision.Classical
This still leaves a 2^(128^60) chance of a collision with two users per minute. Literally unusable.Guesswarp
@BradThomas To be clearer: md5(filename) + timestamp reduces the collision risk massively because you would need to have an md5 collision for exactly the same timestamp to have a collision overall. md5(filename + timestamp) is the same as md5(filename), assuming that filename is random to start with (because adding more randomness to something random only changes the individual md5 result and the birthday problem still exists across all the md5 hashes).Cormack
J
10

A rough rule of thumb for collisions is the square-root of the range of values. Your MD5 sig is presumably 128 bits long, so you're going to be likely to see collisions above and beyond 2^64 images.
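The square-root rule can be turned around to ask how many hashes are needed for a chosen collision probability. This is a sketch under the assumption of uniformly random outputs, using n ≈ sqrt(2 · 2^bits · ln(1 / (1 - p))). The function name is illustrative.

```python
import math

def items_for_probability(p: float, bits: int = 128) -> float:
    """Approximate number of uniformly random `bits`-bit hashes needed to
    reach collision probability p: n ~= sqrt(2 * 2^bits * ln(1 / (1 - p)))."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    # -log1p(-p) equals ln(1 / (1 - p)) and stays accurate for tiny p.
    return math.sqrt(2.0 * (2.0 ** bits) * -math.log1p(-p))

if __name__ == "__main__":
    for p in (1e-12, 1e-9, 0.5):
        print(f"p = {p:g}: about {items_for_probability(p):.3e} hashes")
```

For p = 0.5 this gives a little over 2^64, in line with the rule of thumb.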

Jewelljewelle answered 14/10, 2008 at 15:45 Comment(1)
en.wikipedia.org/wiki/Birthday_Problem Some more information about the problem.What
A
7

Although random MD5 collisions are exceedingly rare, if your users can provide files (that will be stored verbatim) then they can engineer collisions to occur. That is, they can deliberately create two files with the same MD5sum but different data. Make sure your application can handle this case in a sensible way, or perhaps use a stronger hash like SHA-256.
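As a rough illustration of the stronger-hash suggestion, the sketch below derives a name from file contents with SHA-256, and alternatively with a keyed HMAC so that an uploader who does not know the key cannot precompute colliding inputs. The function names and the `secret_key` parameter are assumptions, not part of the original answer.

```python
import hashlib
import hmac

def content_name_sha256(data: bytes) -> str:
    """Name derived from the file contents using SHA-256 (64 hex chars)."""
    return hashlib.sha256(data).hexdigest()

def content_name_hmac(data: bytes, secret_key: bytes) -> str:
    """Name derived with HMAC-SHA-256; secret_key must stay server-side."""
    return hmac.new(secret_key, data, hashlib.sha256).hexdigest()

if __name__ == "__main__":
    payload = b"example image bytes"  # placeholder content
    print(content_name_sha256(payload))
    print(content_name_hmac(payload, b"server-side-secret"))
```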

Anthracosis answered 5/5, 2009 at 0:45 Comment(3)
using a salt would take care of the user engineering problem, no?Clywd
It depends on how the salt is applied. It would need to be a prefix of the user-supplied data, or better yet the key for an HMAC. It's still probably a good idea to practice defense in depth though.Anthracosis
Note although SHA256 is 256 bits long, you can trade off the risk of collisions with the length of the key you are storing by truncating the SHA256 to fewer bits e.g. use SHA256 but truncate it to 128 bits (which is more secure than using MD5 even though they have the same number of bits).Cormack
B
5

While there have been well publicized problems with MD5 due to collisions, UNINTENTIONAL collisions among random data are exceedingly rare. On the other hand, if you are hashing on the file name, that's not random data, and I would expect collisions quickly.

Bontebok answered 14/10, 2008 at 15:48 Comment(2)
The only problem I have with taylors example is that if someone gets a copy of your database they could probably figure out the credit card numbers using a rainbow table ...Irresolvable
While I wouldn't choose to use MD5 for credit cards, a Rainbow table of all valid credit card numbers between 10,000,000 (8 digits being the smallest length credit card I've seen) and 9,999,999,999,999,999 (largest 16 digit number) is still a big table to generate. There are probably easier ways to steal those numbers.Bontebok
F
2

Doesn't really matter how likely it is; it is possible. It could happen on the first two things you hash (very unlikely, but possible), so you'll need to support collisions from the beginning.
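One way to "support collisions" is to check for an existing key before writing and pick a different name on a clash. The sketch below uses an in-memory dict as a stand-in for the object store. The function name and the counter-suffix scheme are assumptions for illustration only.

```python
import hashlib

def store_with_collision_check(store: dict, source: str, data: bytes) -> str:
    """Store data under an MD5-derived key. If the key is already taken by
    different data, append a counter suffix until a free key is found."""
    base = hashlib.md5(source.encode("utf-8")).hexdigest()
    key, attempt = base, 0
    while key in store and store[key] != data:
        attempt += 1
        key = f"{base}-{attempt}"
    store[key] = data  # identical data under the same key is simply rewritten
    return key

if __name__ == "__main__":
    bucket = {}
    first = store_with_collision_check(bucket, "a.jpg", b"one")
    # Same source string with different data simulates a key clash.
    second = store_with_collision_check(bucket, "a.jpg", b"two")
    print(first, second)
```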

Fantast answered 14/10, 2008 at 15:45 Comment(7)
There may of course be many other bad things which can happen with a probability of 1/2^128. You might not want to single-out this one to worry about.Jewelljewelle
The worst thing that can happen here is you can get a photo. For a relatively small number I would not worry. Now if your software is controlling an autopilot landing an aircraft, that's another story.Sponger
You can't be serious. You'll need to hash 6 billion files per second, every second for 100 years to get good chance of collision. Even if you're very very unlucky, it would probably take more than entire capacity of S3 used for longer than a human lifetime.Earldom
It's billions of times more likely that your database and its backups will all fail. Collisions are not worth worrying about.Outride
Spend the collision-prevention time building a bunker for your server! Those pesky meteors can hit you (very unlikely, but possible), so you'll need to support a meteor shelter from the beginning.Unschooled
It would take 100 years to get a 50% chance of collision at 6G files / sec. You have a good chance of collision decades earlier.Cellulosic
The bad thing is that someone could upload colliding files ON PURPOSE, which may lead to bugs or, even worse, a security breach; for example, it could allow one file to be overwritten by another. avira.com/en/blog/md5-the-broken-algorithmProthalamium
W
1

MD5 collision is extremely unlikely. If you have 9 trillion MD5s, there is only one chance in 9 trillion that there will be a collision.

Weatherford answered 12/7, 2016 at 0:12 Comment(3)
Many of the other Answers talk about the probability of a collision when adding one more item. I think my Answer is more useful because it talks about the probability of the entire table having a dup.Weatherford
This has nothing to do with MD5 and is not correct. It's like saying that if you have 9 trillion cats there is a 1 in 9 trillion chance that someone else has an identical cat. The key problem here is that you can get the same hash with more than one value.Communicate
@JoonasAlhonen - Yes, that is true. And a lot of poor people use that as an excuse to buy yet another Lottery ticket they cannot afford.Weatherford
