Is the uniqueness of CRC-32-hashes sufficient to uniquely identify strings containing filenames?
I have sorted lists of filenames concatenated to strings and want to identify each such string by a unique checksum.

The strings are at least 100 bytes, at most 4000 bytes, and about 1000 bytes on average. The total number of strings could be anything, but will more likely be in the range of roughly 10,000.

Is CRC-32 suited for this purpose?

E.g. I need each of the following strings to have a different fixed-length (preferably short) checksum:

"/some/path/to/something/some/other/path"
"/some/path/to/something/another/path"
"/some/path"
...
# these strings can get __very__ long (very long strings are the norm)

Is the uniqueness of CRC-32 hashes increased by input length?

Is there a better choice of checksum for this purpose?

Varicelloid asked 14/4, 2016 at 19:9 Comment(7)
If you already have a unique checksum, then what is the issue? – Luther
Min 100 bytes, max 4000 bytes, average 1000 bytes. The issue is that I shorten these strings, recalculate the checksum, and then check whether that checksum has been seen before. I want to be sure CRC-32 is suited for this, because I don't know much about hash functions and want to minimize the collision probability. – Varicelloid
How many entries are you expecting in total? A guess? 1,000, 10,000 (1e4), one million (1e6), more? – Luther
Total entries could be any number, but will more likely be in the range of 10,000. – Varicelloid
Thank you! (I really do not know much about cryptography.) (If you write an answer, I'll accept it.) – Varicelloid
Would you mind adding the values from the comments to your question? IMO it will help others answer, as they'll know the size of the problem. Thanks for the information – it helps us. – Luther
IMO, I would look at an MD5 hash for your application: fast and unlikely to give collisions. It should not be used for anything security-related, but here it would just be a quick lookup of the filename. – Luther
No.

Unless your filenames are all four bytes or fewer (CRC-32 maps 32-bit inputs one-to-one), there is no assurance that the CRCs will be unique. With 10,000 names, the probability that at least two of them share a CRC is about 1%.

This would be true for any 32-bit hash value.
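The roughly 1% figure comes from the standard birthday bound. A quick sketch in Python (the formula itself, not anything from the answer's own code):

```python
import math

def collision_probability(n, bits=32):
    """Approximate birthday-bound probability that at least two of
    n uniformly distributed hashes of the given width collide."""
    m = 2 ** bits
    return 1.0 - math.exp(-n * (n - 1) / (2.0 * m))

p = collision_probability(10_000)
print(f"{p:.4f}")  # roughly 0.0116, i.e. about 1%
```

The same formula shows why a 64-bit hash helps: with `bits=64`, 10,000 names give a collision probability on the order of 10^-12, which is small but still not zero.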

The best way to assign a unique code to each name is to simply start a counter at zero for the first name and increment it for each new name, assigning the counter's value as that name's code. However, that won't help you compute the code given just the name.
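The counter scheme can be sketched in a few lines of Python (the `code_for` helper is hypothetical, not from the answer):

```python
codes = {}  # name -> assigned integer code

def code_for(name):
    # Assign the next integer the first time a name is seen;
    # return the previously assigned code on every later call.
    return codes.setdefault(name, len(codes))
```

Codes assigned this way are guaranteed unique, but as the answer notes, recovering a code requires the table: you cannot recompute it from the name alone.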

You can use a hash, such as a CRC or some other hash function, but then you will need to deal with collisions. There are several common approaches in the literature. One: keep a table of codes already assigned to names, and on a collision simply increment the hash until you find an unused code, then assign that one. To look up a name, start at its computed hash and probe linearly until you find either the name or an unused slot.
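A minimal sketch of that open-addressing scheme in Python, using `zlib.crc32` as the hash (the class and method names here are invented for illustration):

```python
import zlib

class Crc32Table:
    """CRC-32 as the starting slot; increment past occupied slots
    on collision (linear probing), as the answer describes."""

    def __init__(self):
        self.slots = {}  # code -> name

    def assign(self, name):
        """Return a unique 32-bit code for name, probing linearly."""
        code = zlib.crc32(name.encode())
        while True:
            occupant = self.slots.get(code)
            if occupant is None:
                self.slots[code] = name
                return code
            if occupant == name:
                return code          # already assigned
            code = (code + 1) & 0xFFFFFFFF  # collision: try next slot

    def lookup(self, name):
        """Find the code previously assigned to name, or None."""
        code = zlib.crc32(name.encode())
        while True:
            occupant = self.slots.get(code)
            if occupant is None:
                return None
            if occupant == name:
                return code
            code = (code + 1) & 0xFFFFFFFF
```

The resulting codes are unique by construction, at the cost of keeping the table around and an occasional short probe sequence after a collision.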

As for the hash, I would recommend XXH64. It is a very fast 64-bit hash. You do not need a cryptographic hash for this application, which would be unnecessarily slow.
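In Python, XXH64 requires a third-party package (`xxhash`), so the sketch below uses a stand-in from the standard library: BLAKE2b truncated to an 8-byte digest, which likewise yields a 64-bit value. The `hash64` name is invented for illustration, and this is not the specific hash the answer recommends:

```python
import hashlib

def hash64(s):
    """64-bit digest via blake2b with digest_size=8 — a standard-library
    stand-in for a fast 64-bit hash such as XXH64."""
    digest = hashlib.blake2b(s.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")

h = hash64("/some/path/to/something/some/other/path")
```

A 64-bit code makes accidental collisions among 10,000 names vanishingly rare, but as noted above the probability is still nonzero, so the collision handling is not optional.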

Cortico answered 14/4, 2016 at 19:47 Comment(6)
Gotta trust Adler on this :D – Autumnautumnal
Thank you! Sadly, I cannot just use a counter in this case. So, to minimize collisions and maximize speed, which hash function would you recommend? – Varicelloid
You can minimize collisions all you like with longer hashes, but the probability of a collision will never be zero. Unless you are happy with a program that will only probably work, you will need to deal with collisions. – Cortico
@Varicelloid Git has been working pretty well with 160-bit SHA-1, but it is still vulnerable to collisions. Not that you're ever likely to come across one by accident. – Autumnautumnal
This really makes me think. I mean, hashes are used to verify the integrity of arbitrary-length files... Of course I do not want a program that just 'probably' works, but I do not see a way around a hash function here :( @Antti Haapala: Thank you! – Varicelloid
@Mark Adler: Thank you for explaining how to handle the 'same hash for different inputs' problem! – Varicelloid
