Hash functions are generally written to evenly distribute the data across all result buckets.
If you assume that your files are evenly distributed over a fixed range of available sizes, let's say there are only 1024 (2^10) evenly distributed distinct sizes for your files, then storing the file size at best reduces the chance of a collision by a factor of the number of distinct file sizes.
Note: we could assume there are 2^32 evenly distributed and distinct sizes and it still wouldn't change the rest of the math.
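As a rough sketch of that arithmetic (the 2^10 figure and the idealized 1/(2^128) collision chance are just the assumptions from above, not measured values):

```python
from fractions import Fraction

# Assumed figures from the discussion above:
# an idealized 128-bit hash and 2^10 evenly distributed distinct file sizes.
p_hash_collision = Fraction(1, 2**128)   # chance two random inputs share an MD5 digest
p_size_collision = Fraction(1, 2**10)    # chance two random files share a size

# Storing the size alongside the hash means a false match needs both to collide,
# so at best the combined collision chance shrinks by the number of distinct sizes.
p_combined = p_hash_collision * p_size_collision
print(p_combined == Fraction(1, 2**138))  # True: only ~10 extra bits of protection
```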
It is commonly accepted that the general probability of a collision for MD5 (for example) is 1/(2^128).

Unless a hash function is specifically designed to behave otherwise, for any valid X the probability P(MD5(X) == MD5(X+1)) is the same as for any two random values {Y, Z}. That is to say, P(MD5(Y) == MD5(Z)) = P(MD5(X) == MD5(X+1)) = 1/(2^128) for any values of X, Y and Z.
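A quick illustration (not a proof) of that independence assumption: the digests of adjacent inputs X and X+1 look completely unrelated, which is the behaviour the 1/(2^128) argument relies on. The value of x here is arbitrary.

```python
import hashlib

# Digests of adjacent inputs X and X+1 bear no visible resemblance.
x = 1234567890
d1 = hashlib.md5(str(x).encode()).hexdigest()
d2 = hashlib.md5(str(x + 1).encode()).hexdigest()

print(d1)
print(d2)
# Count how many hex characters happen to match position-by-position;
# for two unrelated 128-bit values you'd expect only ~2 of the 32 to agree by chance.
print(sum(a == b for a, b in zip(d1, d2)))
```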
Combining this with the 2^10 distinct file sizes means that by storing the file size you are at most getting an additional 10 bits that signify whether two items are different (again, this assumes your file sizes are evenly distributed across all possible values).
So at the very best, all you are doing is adding another N bytes of storage for at most N bytes' worth of additional uniqueness (it can never be more than N). You're therefore much better off increasing the number of bytes returned by your hash function by using something such as SHA-1/2 instead, as that is more likely to give you evenly distributed hash values than storing the file size.
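A minimal sketch of that storage trade-off, with hypothetical helper names and a placeholder file path; both keys cost a comparable number of bytes to store, but the SHA-256 key spends all of them on well-distributed hash output:

```python
import hashlib
import os

def key_md5_plus_size(path):
    """A (file size, MD5 digest) pair: ~8 bytes of size + 16 bytes of hash."""
    with open(path, "rb") as f:
        data = f.read()
    return (os.path.getsize(path), hashlib.md5(data).digest())

def key_sha256(path):
    """A single SHA-256 digest: 32 bytes, all of them evenly distributed."""
    with open(path, "rb") as f:
        data = f.read()
    return hashlib.sha256(data).digest()
```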
In short: if MD5 isn't good enough against collisions, use a stronger hash. If the stronger hashes are too slow, then use a fast hash with a low chance of collisions (such as MD5) as a first pass, and follow it with a slower hash such as SHA-1 or SHA-256 to reduce the chance of a collision. But if SHA-256 is fast enough and the doubled space isn't a problem, you should probably just be using SHA-256.
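A sketch of that two-pass approach for finding duplicate files, assuming a hypothetical find_duplicates helper; the fast MD5 pass only produces candidates, and the slower SHA-256 pass confirms them:

```python
import hashlib
from collections import defaultdict

def find_duplicates(paths):
    """Group candidate duplicates with a fast MD5 pass, then confirm with SHA-256."""
    # First pass: cheap MD5 grouping; matches here are only *candidates*.
    by_md5 = defaultdict(list)
    for path in paths:
        with open(path, "rb") as f:
            by_md5[hashlib.md5(f.read()).hexdigest()].append(path)

    # Second pass: re-hash only groups with more than one member using SHA-256.
    duplicates = []
    for group in by_md5.values():
        if len(group) < 2:
            continue
        by_sha = defaultdict(list)
        for path in group:
            with open(path, "rb") as f:
                by_sha[hashlib.sha256(f.read()).hexdigest()].append(path)
        duplicates.extend(g for g in by_sha.values() if len(g) > 1)
    return duplicates
```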