substr md5 collision
Asked Answered
G

3

6

I need a 4-character hash. At the moment I am taking the first 4 characters of a md5() hash. I am hashing a string which is 80 characters long or less. Will this lead to collision? or, what is the chance of collision, assuming I'll hash less than 65,536 (164) different elements?

Grating answered 13/1, 2011 at 15:42 Comment(0)
D
1

Surprisingly high indeed. As you can see from this graph of an approximate collision probability (formula from the wikipedia page), with just a few hundred elements your probability of having a collision is over 50%.

Note, of course, if you're facing the possibility of an attacker providing the string, you can probably assume that it's 100% - scanning to find a collision in a 16-bit search space can be done almost instantaneously on any modern PC. Or even any modern cell phone, for that matter.

Diplomatics answered 13/1, 2011 at 15:59 Comment(0)
A
7

Well, each character of md5 is a hex bit. That means it can have one of 16 possible values. So if you're only using the first 4 "hex-bits", that means you can have 16 * 16 * 16 * 16 or 16^4 or 65536 or 2^16 possibilities.

So, that means that the total available "space" for results is only 16 bits wide. Now, according to the Birthday Attack/Problem, there are the following chances for collision:

  • 50% chance -> 300 entries
  • 1% chance -> 36 entries
  • 0.0000001% chance -> 2 entries.

So there is quite a high chance for collisions.

Now, you say you need a 4 character hash. Depending on the exact requirements, you can do:

  • 4 hex-bits for 16^4 (65,536) possible values
  • 4 alpha bits for 26^4 (456,976) possible values
  • 4 alpha numeric bits for 36^4 (1,679,616) possible values
  • 4 ascii printable bits for about 93^4 (74,805,201) possible values (assuming ASCII 33 -> 126)
  • 4 full bytes for 256^4 (4,294,967,296) possible values.

Now, which you choose will depend on the actual use case. Does the hash need to be transmitted to a browser? How are you storing it, etc.

I'll give an example of each (In PHP, but should be easy to translate / see what's going on):

4 Hex-Bits:

$hash = substr(md5($data), 0, 4);

4 Alpha bits:

$hash = substr(base_convert(md5($data), 16, 26)0, 4);
$hash = str_replace(range(0, 9), range('S', 'Z'), $hash);

4 Alpha Numeric bits:

$hash = substr(base_convert(md5($data), 16, 36), 0, 4);

4 Printable Assci Bits:

$hash = hash('md5', $data, true); // We want the raw bytes
$out = '';
for ($i = 0; $i < 4; $i++) {
    $out .= chr((ord($hash[$i]) % 93) + 33);
}

4 full bytes:

$hash = substr(hash('md5', $data, true), 0, 4); // We want the raw bytes
Austrasia answered 24/1, 2011 at 18:8 Comment(2)
Just a quick bug fix, for your '4 Alpha bits' solution, I think the second line should be: $hash = str_replace(range(0, 9), range('Q', 'Z'), $hash);Farinose
Come to think of it, the second one gives a-z + Q-Z as a range = 36 ^ 4 possible values. .. and the whole should read: $hash = substr(base_convert(md5($data), 16, 26),0, 4); $hash = str_replace(range(0, 9), range('q', 'z'), $hash);Farinose
D
1

Surprisingly high indeed. As you can see from this graph of an approximate collision probability (formula from the wikipedia page), with just a few hundred elements your probability of having a collision is over 50%.

Note, of course, if you're facing the possibility of an attacker providing the string, you can probably assume that it's 100% - scanning to find a collision in a 16-bit search space can be done almost instantaneously on any modern PC. Or even any modern cell phone, for that matter.

Diplomatics answered 13/1, 2011 at 15:59 Comment(0)
H
0

4 first characters contains 4*4 = 16 bits of data, so collision will be definitely at 65536 elements, and, due to birthday attack, it will be found much faster. You should use more bits of hash.

Holophrastic answered 13/1, 2011 at 15:53 Comment(2)
shouldn't you do 16^4 instead of 4*4 ? cause each characters have 16 variations and md5 uses Hex characters onlyGrating
He's counting bits, not the count of possible values there.Diplomatics

© 2022 - 2024 — McMap. All rights reserved.