"Fastest" hash function implemented in Java, comparing part of file

I need to compare two different files, given as Java "File" instances, and want to do this with a fast hash function.

Idea:

  • Hash the first 20 lines of File 1
  • Hash the first 20 lines of File 2
  • Compare the two hashes and return true if they are equal.

I want to use the "fastest" hash function ever implemented in Java. Which one would you choose?

Paternal answered 12/4, 2011 at 8:48 Comment(2)
Sorry, but this is just a horrible idea. It would be trivial to produce collisions, regardless of what hash function you used. Might as well take the first 10 characters of a file for its "hash".Bangup
What do you know about the files you'll be comparing? One of the first things you can do is use the file size as part of your hash. Out of the tens of thousands (or hundreds of thousands or more) files on your filesystem, very few pairs will have the same file size...Abbreviate

If you want speed, do not hash! Especially not a cryptographic hash like MD5. These hashes are designed to be hard to reverse, not fast to calculate. What you should use is a checksum - see java.util.zip.Checksum and its two concrete implementations, Adler32 and CRC32. Adler32 is extremely fast to compute.
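As a minimal sketch of that API (the classes are real JDK classes; the input string is just an example):

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.Adler32;
import java.util.zip.CRC32;

public class ChecksumDemo {
    public static void main(String[] args) {
        byte[] data = "hello world".getBytes(StandardCharsets.UTF_8);

        // Adler32 and CRC32 both implement the java.util.zip.Checksum interface
        Adler32 adler = new Adler32();
        adler.update(data, 0, data.length);
        System.out.println("Adler-32: " + Long.toHexString(adler.getValue()));

        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        System.out.println("CRC-32:   " + Long.toHexString(crc.getValue()));
    }
}
```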

Any method based on checksums or hashes is vulnerable to collisions, but you can minimise the risk by using two different methods in the way RSYNC does.

The algorithm is basically:

  • Check file sizes are equal
  • Break the files into chunks of size N bytes
  • Compute checksum on each pair of matching blocks and compare. Any differences prove files are not the same.

This allows for early detection of a difference. You can improve it by computing two checksums at once with different algorithms, or different block sizes.
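The steps above could be sketched in Java roughly like this; the method name and the 8 KiB block size are illustrative choices, not a reference implementation:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.Adler32;

public class ChunkedCompare {

    static final int BLOCK_SIZE = 8192; // illustrative N; tune for your workload

    /** Returns false as soon as a size or block-checksum mismatch is found. */
    public static boolean probablyEqual(Path a, Path b) throws IOException {
        // Step 1: cheapest check first - different sizes prove different files.
        if (Files.size(a) != Files.size(b)) {
            return false;
        }
        try (InputStream inA = Files.newInputStream(a);
             InputStream inB = Files.newInputStream(b)) {
            byte[] bufA = new byte[BLOCK_SIZE];
            byte[] bufB = new byte[BLOCK_SIZE];
            Adler32 sumA = new Adler32();
            Adler32 sumB = new Adler32();
            int n;
            // Steps 2 and 3: checksum matching blocks, bail out on a mismatch.
            while ((n = inA.readNBytes(bufA, 0, BLOCK_SIZE)) > 0) {
                inB.readNBytes(bufB, 0, n); // sizes are equal, so n bytes remain
                sumA.reset();
                sumB.reset();
                sumA.update(bufA, 0, n);
                sumB.update(bufB, 0, n);
                if (sumA.getValue() != sumB.getValue()) {
                    return false; // a differing block proves the files differ
                }
            }
            return true; // no block differed - files are *probably* equal
        }
    }
}
```

Note the asymmetry: a mismatch is proof of difference, while matching checksums are only strong evidence of equality.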

More bits in the result mean a lower chance of a collision, but as soon as you go over 64 bits you are outside what Java (and the computer's CPU) can handle natively and hence things get slow, so FNV-1024 is less likely to give you a false negative but is much slower.

If it is all about speed, just use Adler32 and accept that very rarely a difference will not be detected. It really is rare. Checksums like these are used to ensure the internet can spot transmission errors, and how often do you get the wrong data turning up?

If it is all about accuracy, you will have to compare every byte. Nothing else will work.

If you can compromise between speed and accuracy, there is a wealth of options out there.

Chinchin answered 12/4, 2011 at 12:40 Comment(2)
Awesome answer! I tried MD5 and SHA-1 for hashing files, but they were way too slow for the read speed of an SSD. Using the CRCs proved A LOT faster. You gave a nice overview of strategies too. Thumbs up man.Downbeat
Oh well, and an extension to this awesome answer: to make collisions very, very unlikely, simply run the RSYNC algorithm twice with different 'N's. Where before your collision probability was somewhere around 1 in 2^64, it will now be around 1 in 2^128. Rough estimate.Downbeat

If you're comparing two files at the same time on the same system there's no need to hash either of them. Just compare the bytes of both files as you read them. If you're looking to compare them at different times or in different places, then MD5 would be both fast and adequate. There's not much reason to need a faster one unless you're dealing with really large files. Even my laptop can hash hundreds of megabytes per second.

You also need to hash the whole file if you want to verify they're identical. Otherwise you might as well just check the size and last modified time if you want a really quick check. You could also check the beginning and end of the file if they're just really large and you trust that the middle won't be changing. If you're not dealing with hundreds of megabytes though, you may as well check every byte of each file.
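For the exact byte-for-byte case, a minimal sketch (assuming Java 12+, where java.nio.file.Files.mismatch compares file sizes and then contents, stopping at the first difference):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class ExactCompare {
    /** True iff both files have identical contents.
     *  Files.mismatch returns -1 when no differing byte position exists. */
    public static boolean sameContent(Path a, Path b) throws IOException {
        return Files.mismatch(a, b) == -1L;
    }
}
```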

Pronty answered 12/4, 2011 at 8:54 Comment(2)
I need to compare those files both at different time and places so I guess hashing is the best choice here. I was thinking of MD5 but wanted to do some research if there were any faster. Thanks for your answer!Paternal
Ah, ok. Yeah, MD5 is likely going to be fine. If you are really dealing with large files there is this Fast MD5 Implementation in Java.Pronty

© 2022 - 2024 — McMap. All rights reserved.