For each of our binary assets we generate a MD5 hash. This is used to check whether a certain binary asset is already in our application. But is it possible that two different binary assets generate the same MD5 hash. So is it possible that two different strings generate the same MD5 hash?
For a set of even billions of assets, the chances of random collisions are negligibly small -- nothing that you should worry about. Considering the birthday paradox, given a set of 2^64 (or 18,446,744,073,709,551,616) assets, the probability of a single MD5 collision within this set is 50%. At this scale, you'd probably beat Google in terms of storage capacity.
However, because the MD5 hash function has been broken (it's vulnerable to a collision attack), any determined attacker can produce 2 colliding assets in a matter of seconds worth of CPU power. So if you want to use MD5, make sure that such an attacker would not compromise the security of your application!
Also, consider the ramifications if an attacker could forge a collision to an existing asset in your database. While there are no such known attacks (preimage attacks) against MD5 (as of 2011), it could become possible by extending the current research on collision attacks.
If these turn out to be a problem, I suggest looking at the SHA-2 series of hash functions (SHA-256, SHA-384 and SHA-512). The downside is that it's slightly slower and has longer hash output.
MD5 is a hash function – so yes, two different strings can absolutely generate colliding MD5 codes.
In particular, note that MD5 codes have a fixed length so the possible number of MD5 codes is limited. The number of strings (of any length), however, is definitely unlimited so it logically follows that there must be collisions.
Yes, it is possible. This is in fact a Birthday problem. However the probability of two randomly chosen strings having the same MD5 hash is very low.
Yes, it is possible that two different strings can generate the same MD5 hash code.
Here is a simple test using very similar binary message in hex string:
$ echo '4dc968ff0ee35c209572d4777b721587d36fa7b21bdc56b74a3dc0783e7b9518afbfa200a8284bf36e8e4b55b35f427593d849676da0d1555d8360fb5f07fea2' | xxd -r -p | tee >/dev/null >(md5) >(sha1sum)
c6b384c4968b28812b676b49d40c09f8af4ed4cc -
008ee33a9d58b51cfeb425b0959121c9
$ echo '4dc968ff0ee35c209572d4777b721587d36fa7b21bdc56b74a3dc0783e7b9518afbfa202a8284bf36e8e4b55b35f427593d849676da0d1d55d8360fb5f07fea2' | xxd -r -p | tee >/dev/null >(md5) >(sha1sum)
c728d8d93091e9c7b87b43d9e33829379231d7ca -
008ee33a9d58b51cfeb425b0959121c9
They generate different SHA-1 sum, but the same MD5 hash value. Secondly the strings are very similar, so it's difficult to find the difference between them.
The difference can be found by the following command:
$ diff -u <(echo 4dc968ff0ee35c209572d4777b721587d36fa7b21bdc56b74a3dc0783e7b9518afbfa200a8284bf36e8e4b55b35f427593d849676da0d1555d8360fb5f07fea2 | fold -w2) <(echo 4dc968ff0ee35c209572d4777b721587d36fa7b21bdc56b74a3dc0783e7b9518afbfa202a8284bf36e8e4b55b35f427593d849676da0d1d55d8360fb5f07fea2 | fold -w2)
--- /dev/fd/63 2016-02-05 12:55:04.000000000 +0000
+++ /dev/fd/62 2016-02-05 12:55:04.000000000 +0000
@@ -33,7 +33,7 @@
af
bf
a2
-00
+02
a8
28
4b
@@ -53,7 +53,7 @@
6d
a0
d1
-55
+d5
5d
83
60
Above collision example is taken from Marc Stevens: Single-block collision for MD5, 2012; he explains his method, with source code (alternate link to the paper).
Another test:
$ echo '0e306561559aa787d00bc6f70bbdfe3404cf03659e704f8534c00ffb659c4c8740cc942feb2da115a3f4155cbb8607497386656d7d1f34a42059d78f5a8dd1ef' | xxd -r -p | tee >/dev/null >(md5) >(sha1sum)
756f3044edf52611a51a8fa7ec8f95e273f21f82 -
cee9a457e790cf20d4bdaa6d69f01e41
$ echo '0e306561559aa787d00bc6f70bbdfe3404cf03659e744f8534c00ffb659c4c8740cc942feb2da115a3f415dcbb8607497386656d7d1f34a42059d78f5a8dd1ef' | xxd -r -p | tee >/dev/null >(md5) >(sha1sum)
6d5294e385f50c12745a4d901285ddbffd3842cb -
cee9a457e790cf20d4bdaa6d69f01e41
Different SHA-1 sum, the same MD5 hash.
Difference is in one byte:
$ diff -u <(echo 0e306561559aa787d00bc6f70bbdfe3404cf03659e704f8534c00ffb659c4c8740cc942feb2da115a3f4155cbb8607497386656d7d1f34a42059d78f5a8dd1ef | fold -w2) <(echo 0e306561559aa787d00bc6f70bbdfe3404cf03659e744f8534c00ffb659c4c8740cc942feb2da115a3f415dcbb8607497386656d7d1f34a42059d78f5a8dd1ef | fold -w2)
--- /dev/fd/63 2016-02-05 12:56:43.000000000 +0000
+++ /dev/fd/62 2016-02-05 12:56:43.000000000 +0000
@@ -19,7 +19,7 @@
03
65
9e
-70
+74
4f
85
34
@@ -41,7 +41,7 @@
a3
f4
15
-5c
+dc
bb
86
07
Above example is adapted from Tao Xie and Dengguo Feng: Construct MD5 Collisions Using Just A Single Block Of Message, 2010.
Related:
Yes, of course: MD5 hashes have a finite length, but there are an infinite number of possible character strings that can be MD5-hashed.
Yes, it is possible. It is called a Hash collision.
Having said that, algorithms such as MD5 are designed to minimize the probability of a collision.
The Wikipedia entry on MD5 explains some vulnerabilities in MD5, which you should be aware of.
Yes, it is! Collision will be a possibility (although, the risk is very small). If not, you would have a pretty effective compression method!
EDIT: As Konrad Rudolph says: A potentially unlimited set of input converted to a finite set of output (32 hex chars) will results in an endless number of collisions.
Just to be more informative.
From a math point of view, Hash functions are not injective.
It means that there is not a 1 to 1 (but one way) relationship between the starting set and the resulting one.
EDIT: to be complete injective hash functions exist: it's called Perfect hashing.
I think we need to be careful choosing the hashing algorithm as per our requirement, as hash collisions are not as rare as I expected. I recently found a very simple case of hash collision in my project. I am using Python wrapper of xxhash for hashing. Link: https://github.com/ewencp/pyhashxx
s1 = 'mdsAnalysisResult105588'
s2 = 'mdsAlertCompleteResult360224'
pyhashxx.hashxx(s1) # Out: 2535747266
pyhashxx.hashxx(s2) # Out: 2535747266
It caused a very tricky caching issue in the system, then I finally found that it's a hash collision.
As other people have said, yes, there can be collisions between two different inputs. However, in your use case, I don't see that being a problem. I highly doubt that you will run into collisions - I've used MD5 for fingerprinting hundreds of thousands of image files of a number of image (JPG, bitmap, PNG, raw) formats at a previous job and I didn't have a collision.
However, if you are trying to fingerprint some kind of data, perhaps you could use two hash algorithms - the odds of one input resulting in the same output of two different algorithms is near impossible.
I realize this is old, but thought I would contribute my solution. There are a 2^128 possible hash combinations. And thus a 2^64 probability of a birthday paradox. While the solution below won't eliminate possibility of collisions, it surely will reduce the risk by a very substantial amount.
2^64 = 18,446,744,073,709,500,000 possible combinations
What I have done is I put a few hashes together based on the input string to get a much longer resulting string that you consider your hash...
So my pseudo-code for this is:
Result = Hash(string) & Hash(Reverse(string)) & Hash(Length(string))
That is to practical improbability of a collision. But if you want to be super paranoid and can't have it happen, and storage space is not an issue (nor is computing cycles)...
Result = Hash(string) & Hash(Reverse(string)) & Hash(Length(string))
& Hash(Reverse(SpellOutLengthWithWords(Length(string))))
& Hash(Rotate13(string)) Hash(Hash(string)) & Hash(Reverse(Hash(string)))
Okay, not the cleanest solution, but this now gets you a lot more play with how infrequently you will run into a collision. To the point I might assume impossibility in all realistic senses of the term.
For my sake, I think the possibility of a collision is infrequent enough that I will consider this not "surefire" but so unlikely to happen that it suits the need.
Now the possible combinations goes up significantly. While you could spend a long time on how many combinations this could get you, I will say in theory it lands you SIGNIFICANTLY more than the quoted number above of
2^64 (or 18,446,744,073,709,551,616)
Likely by a hundred more digits or so. The theoretical max this could give you would be
Possible number of resulting strings:
528294531135665246352339784916516606518847326036121522127960709026673902556724859474417255887657187894674394993257128678882347559502685537250538978462939576908386683999005084168731517676426441053024232908211188404148028292751561738838396898767036476489538580897737998336
It looks like theory understanding doesn't help when talking about theory in practice and need to know what means only 2 numbers 1 and 0 it means 1111111111, so 100 means 10 times of that.
To have at all hashes used you need on one filesystem or one birthday system every person in world would need to have 18446744073709551616/8000000000=2305843009.21 files for every person and if its in 1mb size then its 2305843009 MB or 2305843 GB or 2305 TB or 153722 Google drives free 15 GB per each person.
If we make files bigger, then more space used and less file count means less hashes. So we still wont have smaller size files but only bigger.
Calculate someone how big files needs to be so that we could have MD5 all hashes filled.
If average file size is in 2002 3.22 MB in 2005 8.92 and we can assume we still use same quality of file size. so even google filesystem would never have so many files on one system since if 15gb free google drive full with average a lot small 3 mb files for every 8 milliard people in world would make 40000000000000 that's from all MD5 hashes 0.0000021684% of possible of all hashable file sizes.
Talking about non related things like birthdays of 100 birth year day of 2 people would be comparing 2 days or 0.02 and in 365 of 2 people would be comparing 0.00547% MD5 files 2/18446744073709551616=0.0000000000000000000108420217% of all files if so many would exist at all.
It like asking in world of adam and eve if they have the same hash birthday when there no 365 people in world or in file system files or so many password at all.
So collisions of trying to hack are so many that are impossible in real life secured server.
If MD5 full limit is 18,446,744,073,709,551,616 then you will never have so much files in whole world.
MD5 is example of having all world strings counted into hashes, which will never exist so long, so its just a problem of MD5 being short, but do we will have trillion amount of strings of huge length having really the same hash?
Actually it would be like comparing 365 different day babies with 366 baby to find out which birthday is the same.
As you see all answers are theoretically answering yes, but fail to prove real life examples. If its password, then only very long string might be same as short one.
If its file identification hashing then use different hashing or combination of them.
Birthday problem is as one person is word "abcd" a 4 letter word while other person DNA could be the same only if its "abcdfijdfj".
If you read wikipedia about birthday problem, its not only like birthday date but birthday birth date, hour, second, ms and more like DNA problem.
With hash you can have same DNA and birthday with twins? Nope. With someone else even sometimes.
Birthday paradox is certainly probability confidence math trick result possibility of 365 options or days, while hash is from how much? Much more. So if you have 2 different matching string, its just because MD5 hash is too short for too many files, so use something longer then MD5.
Its not comparing 50 babies in 365 days, its comparing 2 hashes if they are the same from multiple length strings been hashed like abcd same as 25 letter abcdef...zdgdege and 150 letter sadiasdjsadijfsdf.sdaidjsad.dfijsdf.
So if its password, then its birthday sibling will be much longer that doesn't even exist, since no one makes birth of 25 letter password.
For file size comparing, I'm not sure how big the probability is but its not 97% and not even 0.0000001%.
Ok let's be more specific.
If its file then can occur of huge system since files will be different but needs to be not a problem in practice since 5 quadrillion or 5 000 000 000 000 000 files should be on same system for UUID and for MD5.
And if it is a password, then 10 years to try every second, but could try every millisecond, but then in 3 wrong guesses blocking ip for 1minute would make guessing millions of years.
When I see something wrong, then I know it's wrong. Theory promises vs reality.
© 2022 - 2024 — McMap. All rights reserved.