I built a file hashing method in Java that takes a string representation of a file path (path + filename) and calculates the hash of that file. The hash can be any of the natively supported Java hashing algorithms, from MD2 through SHA-512.

I am trying to eke out every last drop of performance, since this method is an integral part of a project I'm working on. I was advised to try using a FileChannel instead of a regular FileInputStream.
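As background, the digest algorithms actually available at runtime can be listed with a few lines (a minimal sketch, separate from my project code):

import java.security.Security;

public class ListDigests {
    public static void main(String[] args) {
        // Prints every MessageDigest algorithm registered with the installed
        // security providers (MD2, MD5, the SHA variants, ...).
        for (String algo : Security.getAlgorithms("MessageDigest")) {
            System.out.println(algo);
        }
    }
}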
My original method:
/**
 * Gets Hash of file.
 *
 * @param file String path + filename of file to get hash.
 * @param hashAlgo Hash algorithm to use. <br/>
 *     Supported algorithms are: <br/>
 *     MD2, MD5 <br/>
 *     SHA-1 <br/>
 *     SHA-256, SHA-384, SHA-512
 * @return String value of hash. (Variable length dependent on hash algorithm used)
 * @throws IOException If file is invalid.
 * @throws HashTypeException If no supported or valid hash algorithm was found.
 */
public String getHash(String file, String hashAlgo) throws IOException, HashTypeException {
    StringBuffer hexString = null;
    try {
        MessageDigest md = MessageDigest.getInstance(validateHashType(hashAlgo));
        FileInputStream fis = new FileInputStream(file);

        // Read the file in 1 KB chunks and feed each chunk to the digest.
        byte[] dataBytes = new byte[1024];
        int nread = 0;
        while ((nread = fis.read(dataBytes)) != -1) {
            md.update(dataBytes, 0, nread);
        }
        fis.close();

        // Convert the digest bytes to a hex string.
        byte[] mdbytes = md.digest();
        hexString = new StringBuffer();
        for (int i = 0; i < mdbytes.length; i++) {
            hexString.append(Integer.toHexString((0xFF & mdbytes[i])));
        }

        return hexString.toString();
    } catch (NoSuchAlgorithmException | HashTypeException e) {
        throw new HashTypeException("Unsupported Hash Algorithm.", e);
    }
}
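For context, a call looks something like this (the path here is just for illustration):

// Illustrative usage only; the file path is made up.
String sha512 = getHash("/path/to/somefile.bin", "SHA-512");
String md5    = getHash("/path/to/somefile.bin", "MD5");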
Refactored method:
/**
 * Gets Hash of file.
 *
 * @param fileStr String path + filename of file to get hash.
 * @param hashAlgo Hash algorithm to use. <br/>
 *     Supported algorithms are: <br/>
 *     MD2, MD5 <br/>
 *     SHA-1 <br/>
 *     SHA-256, SHA-384, SHA-512
 * @return String value of hash. (Variable length dependent on hash algorithm used)
 * @throws IOException If file is invalid.
 * @throws HasherException If no supported or valid hash algorithm was found.
 */
public String getHash(String fileStr, String hashAlgo) throws IOException, HasherException {
    File file = new File(fileStr);
    MessageDigest md = null;
    FileInputStream fis = null;
    FileChannel fc = null;
    ByteBuffer bbf = null;
    StringBuilder hexString = null;
    try {
        md = MessageDigest.getInstance(hashAlgo);
        fis = new FileInputStream(file);
        fc = fis.getChannel();
        bbf = ByteBuffer.allocate(1024); // allocation in bytes

        int bytes;
        while ((bytes = fc.read(bbf)) != -1) {
            md.update(bbf.array(), 0, bytes);
        }

        fc.close();
        fis.close();

        byte[] mdbytes = md.digest();
        hexString = new StringBuilder();
        for (int i = 0; i < mdbytes.length; i++) {
            hexString.append(Integer.toHexString((0xFF & mdbytes[i])));
        }

        return hexString.toString();
    } catch (NoSuchAlgorithmException e) {
        throw new HasherException("Unsupported Hash Algorithm.", e);
    }
}
Both return a correct hash; however, the refactored method only seems to cooperate with small files. When I pass in a large file, it completely chokes, and I can't figure out why. I'm new to NIO, so please advise.

EDIT: Forgot to mention that I'm throwing SHA-512s through it for testing.
UPDATE:
Here is my current method.
/**
 * Gets Hash of file.
 *
 * @param fileStr String path + filename of file to get hash.
 * @param hashAlgo Hash algorithm to use. <br/>
 *     Supported algorithms are: <br/>
 *     MD2, MD5 <br/>
 *     SHA-1 <br/>
 *     SHA-256, SHA-384, SHA-512
 * @return String value of hash. (Variable length dependent on hash algorithm used)
 * @throws IOException If file is invalid.
 * @throws HasherException If no supported or valid hash algorithm was found.
 */
public String getHash(String fileStr, String hashAlgo) throws IOException, HasherException {
    File file = new File(fileStr);
    MessageDigest md = null;
    FileInputStream fis = null;
    FileChannel fc = null;
    ByteBuffer bbf = null;
    StringBuilder hexString = null;
    try {
        md = MessageDigest.getInstance(hashAlgo);
        fis = new FileInputStream(file);
        fc = fis.getChannel();
        bbf = ByteBuffer.allocateDirect(8192); // allocation in bytes - 1024, 2048, 4096, 8192

        int b;
        b = fc.read(bbf);
        while ((b != -1) && (b != 0)) {
            bbf.flip();                // switch the buffer from filling to draining
            byte[] bytes = new byte[b];
            bbf.get(bytes);            // copy what was just read into a plain array
            md.update(bytes, 0, b);
            bbf.clear();               // make the buffer ready to be filled again
            b = fc.read(bbf);
        }

        fis.close();                   // closing the stream also closes its channel

        byte[] mdbytes = md.digest();
        hexString = new StringBuilder();
        for (int i = 0; i < mdbytes.length; i++) {
            hexString.append(Integer.toHexString((0xFF & mdbytes[i])));
        }

        return hexString.toString();
    } catch (NoSuchAlgorithmException e) {
        throw new HasherException("Unsupported Hash Algorithm.", e);
    }
}
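In case it saves someone else the head-scratching: as far as I can tell, the refactored version above choked because the buffer was never flipped or cleared. Once the buffer fills, fc.read(bbf) starts returning 0 (not -1), so the loop spins forever without reaching end-of-file on anything larger than the buffer. Annotated, the broken loop was effectively:

// BROKEN loop from the refactored method (for illustration only):
bbf = ByteBuffer.allocate(1024);
int bytes;
while ((bytes = fc.read(bbf)) != -1) { // returns 0 once bbf is full, never -1
    md.update(bbf.array(), 0, bytes);  // from then on this updates with 0 bytes
}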
So I attempted a benchmark, hashing the MD5 of a 2.92 GB file with both my original method and my latest update. Of course, any benchmark is relative, since OS and disk caching and other "magic" will skew repeated reads of the same file... but here's a shot at some numbers. I compiled each method fresh and fired it off 5 times, taking the measurement from the last (5th) run, since that should be the "hottest" run for that method, with all the "magic" warmed up (in my theory, anyway).

Here are the benchmarks so far:
Original Method - 14.987909 (s)
Latest Method - 11.236802 (s)
That is a 25.03% decrease in time taken to hash the same 2.92 GB file. Pretty good.
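For anyone who wants to reproduce the numbers, a minimal timing harness could look like the following. This is illustrative only, not the exact code I ran; the file path is made up and Hasher is just a stand-in for whatever class holds getHash():

import java.io.IOException;

public class HashBenchmark {
    public static void main(String[] args) throws IOException, HasherException {
        String path = "/path/to/2.92GB-file.bin"; // made-up path
        Hasher hasher = new Hasher();              // hypothetical class holding getHash()

        // 5 runs; the last one is the "hottest" (caches warmed).
        for (int run = 1; run <= 5; run++) {
            long start = System.nanoTime();
            hasher.getHash(path, "MD5");
            double seconds = (System.nanoTime() - start) / 1_000_000_000.0;
            System.out.printf("Run %d: %.6f s%n", run, seconds);
        }
    }
}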
COMMENTS:

Why not pass the ByteBuffer directly to MessageDigest.update() instead of using the backing array? – Pavonine

Note that if you use ByteBuffer.allocateDirect() there is no backing array, and ByteBuffer.array() will fail. Instead, switch over to using MessageDigest.update(ByteBuffer), per @Pavonine's advice. This is not only more efficient, it is cleaner than reading the buffer into some array and then passing that array into MessageDigest.update(). – Proudman

Integer.toHexString(0xFF & mdbytes[i]) drops leading zeros, so any byte below 0x10 contributes only one hex digit; String.format("%02x", mdbytes[i]) should be used instead to avoid this. – Ichor

MessageDigest.update(ByteBuffer) simply fetches the buffer's backing array() and then passes that array to its own update(). And if the ByteBuffer doesn't have an array, it creates a temporary array of its own and then repeatedly calls buffer.get(temp, 0, tempsize) and update(temp, 0, tempsize). So there is no magical benefit to passing the ByteBuffer when calculating digests; it's certainly cleaner to read, but it's not inherently faster and doesn't avoid bulk copying. – Saintpierre

That's true of the default implementation (MessageDigest.update(ByteBuffer) delegates to engineUpdate(ByteBuffer)), but subclasses can and do override engineUpdate() with a more efficient implementation that uses the buffer directly, for example P11Digest.engineUpdate(). – Pavonine

Fair enough (and array() is just a direct pointer, not a copy, so in the base case it's still okay), but it's not going to have a huge impact for most files. The whole point, after all, is to get the data into memory to perform the hashing on it, so the "fast I/O copying" aspect of memory-mapped files only helps special hardware that doesn't need an in-memory representation, like the PKCS#11 case you mentioned. – Saintpierre
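Putting the comment suggestions together (update(ByteBuffer) instead of copying into an array, String.format("%02x", ...) so leading zeros aren't dropped, and try-with-resources for cleanup), the method could be sketched as below. This is a sketch under those assumptions, not a drop-in replacement; it still needs the java.io, java.nio, and java.security imports and whatever exception wrapping you prefer.

// Sketch only: combines the suggestions from the comments above.
public String getHashSketch(String fileStr, String hashAlgo)
        throws IOException, NoSuchAlgorithmException {
    MessageDigest md = MessageDigest.getInstance(hashAlgo);
    try (FileInputStream fis = new FileInputStream(fileStr);
         FileChannel fc = fis.getChannel()) {
        ByteBuffer bbf = ByteBuffer.allocateDirect(8192);
        while (fc.read(bbf) != -1) {
            bbf.flip();
            md.update(bbf);   // digest the buffer directly; no backing array needed
            bbf.clear();
        }
    }
    StringBuilder hexString = new StringBuilder();
    for (byte b : md.digest()) {
        hexString.append(String.format("%02x", b)); // %02x keeps leading zeros
    }
    return hexString.toString();
}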