Recursive MD5 and probability of collision

I wonder whether it is 'safe' to hash a bunch of MD5 hash values together to create a new hash, or whether this will in any way increase the probability of collisions.

The background: I have a couple of files with dependencies. Each file has an associated hash value which is calculated based on its content. Let's call this the 'single-file' hash value. In addition to this, each file should also have a hash value which includes all the dependent files, the 'multi-file' hash value.

So the question is: can I just take all the single-file MD5 hash values of the dependent files, concatenate them, and then calculate an MD5 over the concatenated values to get the multi-file hash value? Or will this result in an MD5 hash that is more likely to collide than if I concatenated the content of all dependent files together?
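
For illustration, here is a minimal sketch of what I mean (Python's hashlib; the helper names are just made up):

```python
import hashlib

def single_file_hash(path):
    # MD5 of the raw content of one file (the 'single-file' hash).
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()

def multi_file_hash(paths):
    # Concatenate the single-file hashes in a fixed order and hash the result.
    combined = "".join(single_file_hash(p) for p in sorted(paths))
    return hashlib.md5(combined.encode("ascii")).hexdigest()
```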

Alternatively, could I XOR the single-file hash values together to generate a multi-file hash value, or would this be likely to result in more collisions?
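
Roughly what I mean by the XOR variant (again just a sketch):

```python
import hashlib

def multi_file_hash_xor(paths):
    # XOR the 16-byte MD5 digests together; note that ordering no longer matters.
    combined = bytes(16)
    for path in paths:
        with open(path, "rb") as f:
            digest = hashlib.md5(f.read()).digest()
        combined = bytes(a ^ b for a, b in zip(combined, digest))
    return combined.hex()
```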

Walley asked 18/9, 2011 at 12:39 Comment(0)

Sounds like you need a Merkle tree.
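
Roughly the idea, as a toy sketch in Python (MD5 only to match the question; real Merkle-tree implementations usually pick a stronger hash and distinguish leaf nodes from internal nodes):

```python
import hashlib

def merkle_root(leaf_digests):
    # Combine child digests pairwise, level by level, until one root digest remains.
    level = list(leaf_digests)
    while len(level) > 1:
        if len(level) % 2:              # odd number of nodes: duplicate the last one
            level.append(level[-1])
        level = [hashlib.md5(level[i] + level[i + 1]).digest()
                 for i in range(0, len(level), 2)]
    return level[0]

# Leaves are the single-file hashes; the root plays the role of the multi-file hash.
leaves = [hashlib.md5(data).digest() for data in (b"file A", b"file B", b"file C")]
print(merkle_root(leaves).hex())
```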

Engrain answered 18/9, 2011 at 14:13 Comment(2)
Thanks, that one looks interesting :) – Walley
Even though it didn't really answer my question about MD5 in particular, this does solve my problem, as I'm gonna use Tiger hashes now, which seem to be perfect for my purpose :) – Walley

MD5 has known collision problems; see the MD5 entry on Wikipedia.

However, if you use MD5 not for security but as a unique marker to check dependencies, even hashing concatenated hashes should be pretty safe.

Or, if it's not too late, switch to SHA-1.
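
If switching is an option, with a library like Python's hashlib it is usually a one-line change (a sketch of the concatenate-then-hash idea from the question, not a drop-in for your code):

```python
import hashlib

def multi_file_hash_sha1(single_file_digests):
    # Same concatenate-then-hash idea, just with 160-bit SHA-1 instead of MD5.
    return hashlib.sha1(b"".join(single_file_digests)).hexdigest()
```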

Aggie answered 18/9, 2011 at 12:50 Comment(4)
From my understanding, the collision problems are mainly relevant under the assumption of an active attacker whose sole purpose is to provoke a collision; they don't mean that random collisions are considerably more likely than with SHA-1. Since it's solely to generate a unique id and not for security purposes, active attackers are a non-issue and performance is the main concern. – Walley
For checksums (and it's pretty obvious from the question that this is what OP is doing), forged collisions don't matter. Only the last part of the second sentence addresses the question at all, and it definitely needs elaboration: how did you arrive at this conclusion? – Revkah
I believe that you should be pretty safe then. You're mapping files to points in 2^128 space. Then you're mapping another small file (the concatenation) to another point in 2^128 space. If you trust MD5 to hash your files uniformly, you should trust it to hash your concatenations just as uniformly. – Aggie
@squadette: It could be (especially given that it's easy to artificially construct collisions) that an MD5 hash itself has some property that makes it a poor basis for hashing again with MD5. For instance, it could theoretically be that infinitely reapplying MD5 converges to a specific hash value. – Walley

I think the risk of a collision is about the same for hashing the concatenated files as for hashing the concatenated file hashes.
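
For a rough sense of scale, the usual birthday-bound approximation for an ideal 128-bit hash (generic math, not anything specific to MD5's internals):

```python
def approx_collision_probability(n, bits=128):
    # Birthday approximation: P(collision among n hashes) ~ n*(n-1) / 2^(bits+1).
    return n * (n - 1) / 2 ** (bits + 1)

print(approx_collision_probability(10**9))   # roughly 1.5e-21 for a billion hashes
```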

Yerkovich answered 18/9, 2011 at 13:45 Comment(5)
I think so, too. :) However, I'd rather be safe than sorry and not end up with tons of collisions all the time :) – Walley
Hashing is all about chance. If you want to be 100% sure, you shouldn't use hashes. Just know that hashes are designed to be as randomly distributed as possible, so the chance of a collision is very near 1/hashsize (1/2^128 for MD5). You won't get tons of collisions, but your code should not break if a collision occasionally occurs. – Yerkovich
I'm not afraid of a 1-in-2^128 collision chance, but, as I wrote to squadette, theoretically applying MD5 recursively could converge to a specific hash value, who am I to know. Hence I'd like to get some information from people who know the algorithm on whether it has such problems or not. – Walley
First, you are not hashing a hash but hashing a string of concatenated hashes. Second, you are only going 2 levels deep, not rehashing 1000 times. Convergence will not occur after just 2 rounds. – Yerkovich
True, I'm only going 1 level deep in my example; maybe I should generalize the question more. But if it did converge, going one level deep would definitely increase the probability of collisions; it's just a question of by how much. As I said, I don't believe it to be an issue, but I would still like some mathematical facts, or at least some insight from people who know the inner workings of MD5, that it's safe to use it this way. – Walley
