Are hash collisions with different file sizes just as likely as same file size?

I'm hashing a large number of files, and to guard against hash collisions I'm also storing each file's original size - that way, even if there is a hash collision, it's extremely unlikely that the file sizes will also be identical. Is this sound (i.e. a collision is equally likely between files of any sizes), or do I need another piece of information (i.e. a colliding file is more likely to be the same length as the original)?

Or, more generally: Is every file just as likely to produce a particular hash, regardless of original file size?
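To make the idea concrete, here is a rough sketch of the kind of (hash, size) composite key I have in mind (Python with hashlib; the SHA1 choice and the file names are just placeholders):

    import hashlib
    import os

    def file_key(path):
        """Return a (sha1_hex, size_in_bytes) pair to use as a composite key."""
        h = hashlib.sha1()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):  # read in 1 MiB chunks
                h.update(chunk)
        return h.hexdigest(), os.path.getsize(path)

    # Two files are treated as identical only if both the hash and the size match.
    seen = {}
    for path in ["a.bin", "b.bin"]:  # placeholder file names
        key = file_key(path)
        if key in seen:
            print(path, "appears to duplicate", seen[key])
        else:
            seen[key] = path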

Dromedary asked 14/3, 2010 at 15:31 Comment(3)
@bmargulies: I suppose I'm asking generally, but I'm currently using SHA1, considering switching to something like SHA256. I'm just wondering how long a hash is necessary if I'm also keying on file size.Dromedary
I had the exact same idea. We need to hash files, but we need maximum speed (i.e. MD5) and the files vary wildly in sizes. If it is possible to get the same MD5 hash on two different file sizes, then it may be worth storing both the MD5 + size for an extra layer of safety. We are hashing through millions (maybe even a billion) files, so in our case it may be worth including the file size.Soubriquet
If you're comparing a large number of files for identical matches, then it's only worth computing the hash of files that have the same size (because it takes a long time to compute hashes). For example, if most of your files are up to 100MB in size, but only one of them is 800TB, then it's not worth computing the hash of that single 800TB file, because it's obvious that no other file will match.Funny
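Below is a rough sketch of the size-first filtering described in the comment above (Python; the directory walk, the SHA1 choice, and the whole-file read are assumptions for illustration, not anything the commenters specified):

    import hashlib
    import os
    from collections import defaultdict

    def find_duplicates(root):
        """Group files by size first, then hash only the groups with more than one member."""
        by_size = defaultdict(list)
        for dirpath, _, names in os.walk(root):
            for name in names:
                path = os.path.join(dirpath, name)
                by_size[os.path.getsize(path)].append(path)

        groups = defaultdict(list)
        for size, paths in by_size.items():
            if len(paths) < 2:
                continue  # a file with a unique size cannot match anything, so skip hashing it
            for path in paths:
                with open(path, "rb") as f:
                    digest = hashlib.sha1(f.read()).hexdigest()
                groups[(size, digest)].append(path)
        return {key: paths for key, paths in groups.items() if len(paths) > 1}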

It depends on your hash function, but in general, files that are of the same size but different content are less likely to produce the same hash than files that are of different sizes. Still, it would probably be cleaner to simply use a time-tested hash with a larger space (e.g. MD5 instead of CRC32, or SHA1 instead of MD5) than to bet on your own scheme of storing the file size.
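As a rough illustration of what "a larger space" means in practice (Python's hashlib and zlib; the algorithms shown are just examples), the stronger hash is simply a longer digest over the same data:

    import hashlib
    import zlib

    data = b"the same file contents"

    print(format(zlib.crc32(data), "08x"))    # CRC32:   32 bits
    print(hashlib.md5(data).hexdigest())      # MD5:    128 bits
    print(hashlib.sha1(data).hexdigest())     # SHA1:   160 bits
    print(hashlib.sha256(data).hexdigest())   # SHA256: 256 bits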

Elvaelvah answered 14/3, 2010 at 15:39 Comment(4)
I was considering using a hash in combination with the file size - that way, in the unlikely event of a collision, I'd check the file size as an additional key to see if it was really the same file.Dromedary
I understand what you are aiming for, but my point is that instead of taking an extra N bits to store a file size, you should simply take a hash function whose output is N bits larger than your current one. You are much more likely to get fewer collisions that way: file sizes are arbitrarily distributed, while hash functions are specifically designed to avoid collisions, so those extra bits are better utilized by the hash.Elvaelvah
Ah - that makes sense. I figured I'd be better off picking a "larger" hash function anyway, so maybe that's what I'll end up doing.Dromedary
@MaxShawabkeh do you have a source for the statement "files of the same size but different content are less likely to produce the same hash than files of different sizes"? I'm curious whether hashes are weighted like this.Thyratron

Hash functions are generally written to evenly distribute the data across all result buckets.

Assume your files are evenly distributed over a fixed range of available sizes - let's say there are only 1024 (2^10) evenly distributed, distinct sizes for your files. Storing the file size at best reduces the chance of a collision by a factor equal to the number of distinct file sizes.

Note: we could assume 2^32 evenly distributed, distinct sizes instead and it still wouldn't change the rest of the math.

It is commonly accepted that the probability of a collision between two arbitrary inputs under MD5 (for example) is 1/(2^128).

Unless something is specifically built into the hash function that says otherwise, the collision probability does not depend on how similar the inputs are: for any valid input X, the probability that MD5(X) == MD5(X+1) is the same as for any two independent random values Y and Z. That is, P(MD5(Y) == MD5(Z)) = P(MD5(X) == MD5(X+1)) = 1/(2^128) for any values of X, Y and Z.

Combining this with the 2^10 distinct file sizes means that by storing the file size you are at most gaining an additional 10 bits that signify whether items are different or not (again, assuming your files are evenly distributed across all sizes).

So at the very best all you are doing is adding another N bytes of storage for <= N bytes' worth of distinguishing value (it can never be > N). You are therefore much better off increasing the number of bytes returned by your hash function, by using something such as SHA-1/2 instead, as that is more likely to give you evenly distributed hash values than storing the file size is.
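As a rough back-of-the-envelope check of the argument above (Python; the 2^10 distinct-sizes figure is the same assumption used earlier, and the file count is just an example), the birthday-bound collision chance improves far more from extra hash bits than from the size bits:

    # Approximate birthday bound: P(at least one collision) ≈ n*(n-1) / (2 * 2**bits)
    def collision_probability(n, bits):
        return n * (n - 1) / (2 * 2**bits)

    n = 10**9  # one billion files, as an example
    print(collision_probability(n, 128))       # MD5 alone
    print(collision_probability(n, 128 + 10))  # MD5 plus 2^10 evenly distributed file sizes
    print(collision_probability(n, 160))       # SHA-1 alone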

In short: if MD5 isn't good enough against collisions, use a stronger hash. If the stronger hashes are too slow, use a fast hash with a low chance of collisions (such as MD5) as a first pass, and then a slower hash (such as SHA-1 or SHA-256) to further reduce the chance of a collision. But if SHA-256 is fast enough and the doubled space isn't a problem, you should probably just be using SHA-256.

Thyratron answered 7/3, 2013 at 12:18 Comment(1)
I understand that using 128 bits for md5 + 32 bits for file size is less effective at preventing collisions than using the same 160 bits for sha1. But suppose you need to know the file size for an unrelated purpose, so that you have to use those 32 bits to store the file size regardless of any hashing-related concerns. In that case, isn't reducing the odds of collisions by a factor of ~1,000 at the trivial cost of a single integer comparison still a good idea?Stinker

Hash functions are designed so that it is very difficult to get a collision; otherwise they wouldn't be effective.
If you do get a hash collision, it is astronomically unlikely (roughly a 1 : number_of_possible_hashes chance), and that probability has nothing to do with file size.

If you really want to be double-sure about hash collisions, you can calculate two different hashes for the same file - it will be less error-prone than saving hash + file size.
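A minimal sketch of the two-hash idea (Python; computing MD5 and SHA1 in a single pass so the file is only read once - the chunk size is arbitrary):

    import hashlib

    def double_hash(path, chunk_size=1 << 20):
        """Return (md5_hex, sha1_hex) for a file, reading it only once."""
        md5 = hashlib.md5()
        sha1 = hashlib.sha1()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                md5.update(chunk)
                sha1.update(chunk)
        return md5.hexdigest(), sha1.hexdigest()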

Coessential answered 14/3, 2010 at 15:39 Comment(2)
I was actually considering doing this - see my other question, #2437845. I figured that saving two hashes (like SHA1 and MD5), as well as the filesize, will make collisions so astronomically unlikely that I never have to worry about it.Dromedary
Suppose you are using SHA-256, which gives you 2^256 possible hash values, and you have a billion files with a million versions each - that is 1 000 000 000 * 1 000 000 = 10^15, roughly 2^50 files, leaving you on the order of 2^200 possible hash values per file without any collision threat. Isn't that immense? To be more precise, you can evaluate the probability of a hash collision as 1 - ((2^256)! / ((2^256) - 10^15)!) / ((2^256)^(10^15)), or, less accurately, 1 - (1 - (10^15)/(2*2^256))^(10^15), which gives about a 4e-48 chance of collision.Coessential
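For what it's worth, the simpler approximation in the comment above works out like this (Python; k = 10^15 hashed files and a 256-bit hash, as assumed there):

    k = 10**15   # a billion files times a million versions each
    N = 2**256   # number of possible SHA-256 values
    # birthday-bound approximation, essentially 1 - (1 - k/(2N))**k for k much smaller than N
    p = k * (k - 1) / (2 * N)
    print(p)     # ≈ 4.3e-48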

The size of the hash is the same regardless of the size of the original data. As there is only a limited number of possible hashes it is theoretically possible that two files with different sizes may have the same hash. However, this means that it is also possible that two files with the same size may have the same hash.
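For example (Python's hashlib), the digest length is fixed no matter how much data goes in:

    import hashlib

    small = hashlib.sha1(b"x").hexdigest()
    large = hashlib.sha1(b"x" * 10_000_000).hexdigest()
    print(len(small), len(large))  # both 40 hex characters (160 bits)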

Salmon answered 14/3, 2010 at 15:41 Comment(0)

The whole point of the family of cryptographic hashes (MD5, SHA-x, etc) is to make collisions vanishingly unlikely. The notion is that official legal processes are prepared to depend on it being impractical to manufacture a collision on purpose. So, really, it's a bad use of space and CPU time to add a belt to the suspenders of these hashes.

Showy answered 14/3, 2010 at 19:14 Comment(0)
