Fast hash function with collision probability near SHA-1

I'm using SHA-1 to detect duplicates in a program handling files. It does not need to be cryptographically strong and may be reversible. I found this list of fast hash functions: https://code.google.com/p/xxhash/ (the list has since moved to https://github.com/Cyan4973/xxHash)

What should I choose if I want a faster function whose collision probability on random data is close to SHA-1's?

Maybe a 128-bit hash is good enough for file deduplication? (vs. 160-bit SHA-1)

In my program the hash is calculated on chunks of 0-512 KB.
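
For reference, a simplified sketch of what the program does today (the class name is just illustrative):

    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.security.MessageDigest;

    // Simplified baseline: SHA-1 over one chunk read into memory.
    public class Sha1Baseline {
        public static void main(String[] args) throws Exception {
            byte[] chunk = Files.readAllBytes(Paths.get(args[0])); // 0-512 KB
            MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
            byte[] digest = sha1.digest(chunk); // 20 bytes = 160 bits

            StringBuilder hex = new StringBuilder();
            for (byte b : digest) hex.append(String.format("%02x", b));
            System.out.println(hex);
        }
    }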

Bashibazouk answered 22/2, 2015 at 16:46 Comment(9)
Use the one that git uses. If it is good enough for git, it is good enough for you!Soakage
Git uses SHA-1, and the 'hot loop' of the Git workflow is clearly not git commit. The OP and I are interested in hash functions that are sensible to use in the ~hot loop (think an in-mem database, for example) and offer very strong collision guarantees, bit independence, etc.Coffelt
CPU "Fast" is probably irrelevant -- The I/O is likely to be nearly all the elapsed time.Myrlemyrlene
Granted, but consider rehashing an in-mem k/v store. But you do have a point.Coffelt
@Bashibazouk Have you looked at Blake2B? blake2.net The C version is as fast as MD5 and is cryptographic. I wrote a Java version but I cannot get it to go faster than SHA-1. github.com/alphazero/blake2bCoffelt
@Coffelt I didn't know it before now, but it has the cryptographic property that I do not need. Actually I'm testing Murmur3 128-bit at the moment and it looks pretty fast. Also, there are several Java implementations of Murmur3.Bashibazouk
While looking at options I found github.com/gpnuma/fsbench, a benchmark that you can run specifically on your machines and files and compare the performance of different hash algorithms.Bouillabaisse
How much faster do you need? What's your target hardware platform? SHA-1 is I/O bound on any modern PC. Remember that you can use any symmetric cipher, which may give you useful options if you're in a CPU/RAM constrained environment.Bigoted
"SHA-1 is I/O bound on any modern PC" - do you have any source on this?Dulcia

Maybe this will help you: https://softwareengineering.stackexchange.com/questions/49550/which-hashing-algorithm-is-best-for-uniqueness-and-speed

collisions rare: FNV-1, FNV-1a, DJB2, DJB2a, SDBM & MurmurHash

I don't know much about xxHash, but it also looks promising.

MurmurHash is very fast, and version 3 supports a 128-bit output, so I would choose that one. (There are implementations in Java and Scala.)
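
As a minimal sketch, assuming Guava is on the classpath (Guava ships a Murmur3 implementation as Hashing.murmur3_128()):

    import com.google.common.hash.HashCode;
    import com.google.common.hash.HashFunction;
    import com.google.common.hash.Hashing;

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class Murmur3Demo {
        public static void main(String[] args) throws IOException {
            // 128-bit Murmur3 (x64 variant), default seed 0.
            HashFunction murmur3 = Hashing.murmur3_128();

            // Hash a whole chunk read into memory (fine for 0-512 KB chunks).
            byte[] chunk = Files.readAllBytes(Paths.get(args[0]));
            HashCode hash = murmur3.hashBytes(chunk);

            System.out.println(hash); // 32 hex chars = 128 bits
        }
    }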

Honorific answered 8/4, 2015 at 13:48 Comment(2)
Thank you. The accepted answer is empirical in nature and the sample set is 2^20, which is tiny :)Coffelt
There is no general best hash function. It always depends on what you want to achieve and what your real-life data is. Do your own tests for your use case ;)Honorific

Since the only relevant property of hash algorithms in your case is the collision probability, you should estimate it and choose the fastest algorithm which fulfills your requirements.

If we suppose your algorithm has absolute uniformity, the probability of a hash collision among n files using hashes with d possible values will be approximately:

p(n, d) ≈ 1 - e^(-n(n-1) / (2d)) ≈ n(n-1) / (2d)

For example, if you need a collision probability lower than one in a million among one million files, you will need more than 5*10^17 distinct hash values, which means your hashes need to have at least 59 bits. Let's round up to 64 to account for possibly bad uniformity.

So I'd say any decent 64-bit hash should be sufficient for you. Longer hashes will further reduce the collision probability, at the price of heavier computation and increased hash storage volume. Shorter hashes like CRC32 will require you to write some explicit collision-handling code.
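
A quick sanity check of that arithmetic in Java (the class and variable names are just illustrative):

    // For a target collision probability p among n files, the approximation
    // p ≈ n(n-1)/(2d) gives d >= n(n-1)/(2p); the hash needs ceil(log2(d)) bits.
    public class HashSizeEstimate {
        public static void main(String[] args) {
            double n = 1e6;  // number of files
            double p = 1e-6; // acceptable collision probability

            double d = n * (n - 1) / (2 * p); // required hash-value space
            double bits = Math.ceil(Math.log(d) / Math.log(2));

            System.out.printf("d >= %.3e distinct values -> at least %.0f bits%n", d, bits);
            // Prints approximately: d >= 5.000e+17 distinct values -> at least 59 bits
        }
    }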

Dulcia answered 14/4, 2015 at 13:1 Comment(5)
didn't realize bounty would default. If up to me would have given it to you.Coffelt
That's very flattering, thanks! Bear in mind that auto-awarded bounties lose half of their value, so it's always best to award them manually, even if you pick the answer with the most upvotes.Dulcia
Supposing those algorithms have absolute uniformity is like saying the earth is a perfect sphere. This might be fine for some cases, but useless if you care about the details. It's just good for an estimation of your needed hash length.Honorific
Besides that, you are most likely I/O bound anyway when doing a lot of hashing.Honorific
@A.Binzxxxxxx Well, that's exactly what I do here: estimating the lower bound of the hash length given a target collision probability.Dulcia

Google developed and uses (I think) FarmHash for performance-critical hashing. From the project page:

FarmHash is a successor to CityHash, and includes many of the same tricks and techniques, several of them taken from Austin Appleby’s MurmurHash.

...

On CPUs with all the necessary machine instructions, about six different hash functions can contribute to FarmHash's lineup. In some cases we've made significant performance gains over CityHash by using newer instructions that are now commonly available. However, we've also squeezed out some more speed in other ways, so the vast majority of programs using CityHash should gain at least a bit when switching to FarmHash.

(CityHash was already a performance-optimized hash function family by Google.)

It was released a year ago, at which point it was almost certainly the state of the art, at least among the published algorithms. (Or else Google would have used something better.) There's a good chance it's still the best option.
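
On the JVM, recent Guava versions expose FarmHash's 64-bit fingerprint variant, so a minimal sketch (assuming Guava is available) could look like:

    import com.google.common.hash.HashFunction;
    import com.google.common.hash.Hashing;

    import java.nio.charset.StandardCharsets;

    public class FarmHashDemo {
        public static void main(String[] args) {
            // FarmHash Fingerprint64, as wrapped by Guava.
            HashFunction farmHash = Hashing.farmHashFingerprint64();
            long fp = farmHash.hashString("hello, world", StandardCharsets.UTF_8).asLong();
            System.out.printf("%016x%n", fp);
        }
    }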

Bouillabaisse answered 9/4, 2015 at 10:11 Comment(4)
Actually, I was just looking at an HN discussion where they ripped CityHash. news.ycombinator.com/item?id=4600425 :(Coffelt
Do you know if it also applies to FarmHash? Either way, we're talking non-cryptographic hashes, so all bets are off against malicious input.Bouillabaisse
(possibly dup) No I don't. I hear you but that may be only a semantic distinction. Why is it so easy to fish out a collision? Also, let's remember: bugs .. :)Coffelt
FYI I'm going to go with SipHash. No word on FarmHash but check this out for published attacks on various functions, including Murmur3: 131002.net/siphash (paper is worth the read.)Coffelt

The facts:

  1. Good hash functions, especially the cryptographic ones (like SHA-1), require considerable CPU time because they have to honor a number of properties that won't be very useful for you in this case;
  2. Any hash function will give you only one certainty: if the hash values of two files are different, the files are surely different. If, however, their hash values are equal, chances are that the files are also equal, but the only way to tell for sure whether this "equality" is not just a hash collision is to fall back to a binary comparison of the two files.

The conclusion:
In your case I would try a much faster algorithm like CRC32, which has pretty much all the properties you need and would be capable of handling more than 99.9% of the cases, resorting to a slower comparison method (like binary comparison) only to rule out the false positives. Being a lot faster in the great majority of comparisons would probably compensate for not having "awesome" uniformity (possibly generating a few more collisions).
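
A minimal sketch of this two-stage scheme, using the JDK's built-in CRC32 (the class and method names are just illustrative):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.Arrays;
    import java.util.zip.CRC32;

    // A cheap CRC32 pass weeds out most non-duplicates; only CRC-equal pairs
    // pay for the byte-for-byte comparison that rules out false positives.
    public class CrcDedup {
        static long crc32(Path file) throws IOException {
            CRC32 crc = new CRC32();
            crc.update(Files.readAllBytes(file)); // chunks here are at most 512 KB
            return crc.getValue();
        }

        static boolean sameContent(Path a, Path b) throws IOException {
            if (crc32(a) != crc32(b)) {
                return false; // different CRCs: surely different content
            }
            // Equal CRCs may still be a collision; confirm byte-for-byte.
            return Arrays.equals(Files.readAllBytes(a), Files.readAllBytes(b));
        }
    }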

Rancid answered 11/4, 2015 at 5:16 Comment(3)
A large enough probability can be considered certain for practical use. If you consider writing code that has a probability of 1/1000000 to be executed in a program's lifetime, you might as well avoid going outside because lightning accidents are twice as probable!Dulcia
Odds of collisions in this case are probably not as low as you might think if you have a few million files to test (see this: preshing.com/20110504/hash-collision-probabilities). Example: the chance is as high as 1/2 when using a 32 bit hash function with just 77k files! While the odds drop dramatically with 160 or even 64 bit functions, my point is that it's probably faster to use CRC32 to eliminate 99.99% of the cases, even assuming you'd need a slower hash function to deal with a small number of cases, than doing all in one shot by calculating a 160 bit hash function for every file.Rancid
CRC-32 is for very small data with short burst errors or single-bit errors. It is only 32 bits wide and slower than non-cryptographic hashes such as MurmurHash3 (128-bit), SpookyHash (128-bit), xxHash (64-bit), etc. There are much better, faster and more robust choices out there.Sumbawa

128 bits is indeed good enough to detect different files or chunks. The risk of collision is infinitesimal, at least as long as no intentional collision is being attempted.

64 bits can also prove good enough if the number of files or chunks you want to track remains "small enough" (i.e. no more than a few million).

Once the size of the hash is settled, you need one with some very good distribution properties, such as the ones listed with Q.Score=10 in your link.

Saladin answered 11/4, 2015 at 8:33 Comment(0)

It depends on how many hashes you are going to compute in an iteration. E.g., a 64-bit hash reaches a collision probability of 1 in 1,000,000 with 6 million hashes computed.

Refer to: Hash collision probabilities
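
A quick check of that figure (a throwaway snippet; names are illustrative):

    // Birthday-bound approximation for n 64-bit hashes: p ≈ n(n-1)/(2d).
    public class CollisionCheck {
        public static void main(String[] args) {
            double n = 6e6;              // hashes computed
            double d = Math.pow(2, 64);  // 64-bit hash space
            double p = n * (n - 1) / (2 * d);
            System.out.println(p);       // ~9.76e-7, about 1 in a million
        }
    }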

Pretoria answered 14/4, 2015 at 13:45 Comment(3)
"probability of 1 in 100000 with 6 million hashes computed" I think you missed a zero there, the actual probability would be about 10 times less.Dulcia
An interesting case is when the hash output is larger than 64 bits (e.g. 128 bits), the bits are independent, and you mask off 64 bits to use as keys. Consider that a collision may be in the bits that were masked off (i.e. not part of the key obtained). Intuitively it seems we would have better probabilities in that case. (Haven't done the math.)Coffelt
A collision is when ALL bits of the hashes match, so if you mask off some of the bits that's still a collision. You can introduce new collisions when masking though. Not sure what practical use such masking could have - you end up spending CPU time computing bits you won't use.Dulcia

Check out MurmurHash2_160. It's a modification of MurmurHash2 which produces 160-bit output.

It computes 5 unique results of MurmurHash2 in parallel and mixes them thoroughly. Since the digest size is the same, the collision probability is equivalent to SHA-1's.

It's still fast, but MurmurHash3_128, SpookyHash128 and MetroHash128 are probably faster, albeit with a higher (but still very unlikely) collision probability. There's also CityHash256 which produces a 256-bit output which should be faster than SHA-1 as well.

Sumbawa answered 16/4, 2018 at 15:31 Comment(0)
