What's the shortest pair of strings that causes an MD5 collision?
Asked Answered
S

3

60

Up to what string length is it possible to use MD5 as a hash without having to worry about the possibility of a collision?

This would presumably be calculated by generating an MD5 hash for every possible string in a particular character set, in increasing length, until a hash appears for a second time (a collision). The maximum possible length of a string without a collision would then be one character less than the longest of the colliding pair.

Has this already been tested for MD5, SHA1, etc?

Skipper answered 4/1, 2010 at 14:27 Comment(4)
It is unfortunate that both MD5 and SHA1 are considered as nearly cracked, because the general answer for a cryptographic hash function of good reputation is: "Do not worry about collisions. Act as if they never happen. Even someone committed to find a collision will not find one by brute-force enumeration before the end of the world".Sundberg
you overemphasize weaknesses. For MD5, there are known collision attacks but not yet any known useful preimage attacks. cs.cmu.edu/~perspectives/md5.html Anyone using an off-the-shelf tool or algorithm should know its strengths and weaknesses.Gob
If you need a hash function, the SHA-2 series hash functions (SHA-224, SHA-256, SHA-384, SHA-512) are still secure against collision and preimage attacks. SHA-1 and MD5 should only be used for legacy applications, not new ones.Wallenstein
mscs.dal.ca/~selinger/md5collisionUnequivocal
W
75

Update

Ironically, a few weeks after I posted the previous answer, two Chinese researchers, Tao Xie and Dengguo Feng, published a new single-block collision for MD5. I was unaware of that paper until now. A single MD5 block means that the input size is 64 bytes or 512 bits. Note that the inputs are mostly the same, differing only in 2 bits.

Their methodology won't be published until January 2013, but their collision can be verified now, using numbers from the paper:

>>> from array import array
>>> from hashlib import md5
>>> input1 = array('I',  [0x6165300e,0x87a79a55,0xf7c60bd0,0x34febd0b,0x6503cf04,
    0x854f709e,0xfb0fc034,0x874c9c65,0x2f94cc40,0x15a12deb,0x5c15f4a3,0x490786bb,
    0x6d658673,0xa4341f7d,0x8fd75920,0xefd18d5a])
>>> input2 = array('I', [x^y for x,y in zip(input1,
    [0, 0, 0, 0, 0, 1<<10, 0, 0, 0, 0, 1<<31, 0, 0, 0, 0, 0])])
>>> input1 == input2
False
>>> md5(input1).hexdigest()
'cee9a457e790cf20d4bdaa6d69f01e41'
>>> md5(input2).hexdigest()
'cee9a457e790cf20d4bdaa6d69f01e41'

Update: The paper has been published in March 2013: Tao Xie and Fanbao Liu and Dengguo Feng - Fast Collision Attack on MD5

However, if you have more room to play with, collisions of a few kilobytes are MUCH faster to calculate -- they can be calculated within hours on ANY regular computer.

Old answer

The previous shortest collision used at least two MD5 blocks worth of input -- that's 128 bytes, 1024 bits. A prefix in the first block can be chosen arbitrarily by the attacker, the rest would be computed and appear as gibberish.

Here's an example of two different colliding inputs, you can try it yourself in Python:

>>> from binascii import unhexlify
>>> from hashlib import md5
>>> input1 = 'Oded Goldreich\nOded Goldreich\nOded Goldreich\nOded Go' + unhexlify(
... 'd8050d0019bb9318924caa96dce35cb835b349e144e98c50c22cf461244a4064bf1afaecc582'
... '0d428ad38d6bec89a5ad51e29063dd79b16cf67c12978647f5af123de3acf844085cd025b956')
>>> len(input1)
128
>>> md5(input1).hexdigest()
'd320b6433d8ebc1ac65711705721c2e1'
>>> input2 = 'Neal Koblitz\nNeal Koblitz\nNeal Koblitz\nNeal Koblitz\n' + unhexlify(
... '75b80e0035f3d2c909af1baddce35cb835b349e144e88c50c22cf461244a40e4bf1afaecc582'
... '0d428ad38d6bec89a5ad51e29063dd79b16cf6fc11978647f5af123de3acf84408dcd025b956')
>>> md5(input2).hexdigest()
'd320b6433d8ebc1ac65711705721c2e1'

Generating these two particular inputs took 2 days on a 215-node Playstation 3 cluster, by Mark Stevens :)

Wallenstein answered 6/12, 2010 at 23:1 Comment(4)
215 PS3 running for 2 days, very interesting fact!Neumark
January 2013 has passed — could you edit this excellent answer adding a link to Tao Xie and Dengguo Feng’s methodology?Swearword
Fwiw, I just verified these example colliding inputs using the Java JRE 7 standard MD5 implementation.Jessalin
This is interesting info, but I don't understand how this answers the question at all. As Jason S. answered, it's likely that there are collisions within 9 bytes (~11 printable characters). Those strings would, of course, be completely unrelated. That's quite different than the research that shows how two very similar 512-bit blocks produce the same hash (which, as I said, is still interesting).Garaway
G
10

The mathematics of the birthday paradox make the inflection point of probability of collision roughly around sqrt(N), where N is the number of distinct bins in the hash function, so for a 128-bit hash, as you get around 64 bits you are moderately likely to have 1 collision. So my guess is for the complete set of 8 byte strings it's somewhat likely to have a collision, and for 9 byte strings it's extremely likely.

edit: this assumes that the MD5 hash algorithm causes a mapping from input bytestring to output hash that is close to "random". (vs. one that distributes strings more evenly among the set of possible hashes, in which case it would be more close to 16 bytes.)

Also for a more specific numerical answer, if you look at one of the approximations for calculating collision probability, you get

p(k) ≈ 1 - e-k(k-1)/(2*2128) where k = the size of the space of possible inputs = 2m where the input bytestring is m bits long.

the set of 8 byte strings: p(264) ≈ 1 - e-0.5 ≈ 0.3935

the set of 9 byte strings: p(272) ≈ 1 - e-2144/(2*2128) = 1 - e-215 = 1 - e-32768 ≈ 1

Also note that these assume the complete set of m/8 byte strings. If you only use alphanumeric characters, you'd need more bytes to get a probable collision.

Gob answered 4/1, 2010 at 14:55 Comment(5)
Collisions are just a mathematical fact when you map an infinite set onto the set of 128 bit numbers. Developers assuming hash uniqueness is a great source of WTF bugs. CCP blogged about a bug (though they used a 32 bit hash) eveonline.com/devblog.asp?a=blog&bid=371Snob
I like this explanation. It seems like the first collision's probably going to occur within 8 or 9 bytes and - as others have commented - if the strings are shorter than this then it's probably not worth hashing them anyway.Skipper
@KenFox Assuming a hash function has no collision is perfectly fine. They obviously exist, but for good cryptographic hashes (say SHA-256) the chances of ever running into one is smaller than the probability of random hardware errors.Methodology
around sqrt(N), where N is the number of distinct bins in the hash function, so for a 128-bit hash, as you get around 64 bits you are moderately likely to have 1 collision I'm confused.. where did the 64 come from??Method
2^64 is the square root of 2^128.Gob
A
1

I doubt if there is any useful length where you're not going to have possible collisions. Those algorithms are not really used for that purpose. It's meant to try to be unique for slight changes in the data (like corrupted files) rather than unique over all possible sets of data.

Albertinaalbertine answered 4/1, 2010 at 14:50 Comment(2)
Quite wrong, MD5 is a cryptographic hash function. Cryptographic hash functions are designed to be collision-resistant. MD5 was considered collision-resistant for some time, until weaknesses were discovered in 2004.Wallenstein
@Wallenstein It's actually correct to say "rather than unique over all possible sets of data." A SHA-256 hash has, by nature, 2^256 possible values. It is represented by a 64 digit hex string. Which means it takes at most 65 hex digits to find hashes duplicated in the set of all possible 64 digit hex strings. It can also be represented with 43 alphanumeric (there's 62 of them) characters (256 / log2(62)), which means that all permutations of 43-char alphanumeric strings will hash to all possible SHA-256 hashes, including one for every length string that is longer.Garaway

© 2022 - 2024 — McMap. All rights reserved.