Possible to calculate MD5 (or other) hash with buffered reads?

5

35

I need to calculate checksums of quite large files (gigabytes). This can be accomplished using the following method:

    private byte[] calcHash(string file)
    {
        // using blocks release the algorithm and the file handle even if ComputeHash throws.
        using (System.Security.Cryptography.HashAlgorithm ha = System.Security.Cryptography.MD5.Create())
        using (FileStream fs = new FileStream(file, FileMode.Open, FileAccess.Read))
        {
            return ha.ComputeHash(fs);
        }
    }

However, the files are normally written just beforehand in a buffered manner (say, 32 MB at a time). I am fairly sure I once saw an overload of a hash function that would let me calculate an MD5 (or other) hash while writing, i.e. calculating the hash of one buffer, then feeding the resulting hash into the next iteration.

Something like this: (pseudocode-ish)

byte [] hash = new byte [] { 0,0,0,0,0,0,0,0 };
while(!eof)
{
   buffer = readFromSourceFile();
   writefile(buffer);
   hash = calchash(buffer, hash);
}

hash would now be the same as the result of running the calcHash function on the entire file.

Now, I can't find any overloads like that in the .NET 3.5 Framework. Am I dreaming? Has it never existed, or am I just lousy at searching? I want to do the writing and the checksum calculation in one pass because the files are so large that reading them a second time is expensive.

Marelda answered 23/1, 2010 at 19:51 Comment(0)
53

You use the TransformBlock and TransformFinalBlock methods to process the data in chunks.

// Init
MD5 md5 = MD5.Create();
int offset = 0;

// For each block:
offset += md5.TransformBlock(block, 0, block.Length, block, 0);

// For last block:
md5.TransformFinalBlock(block, 0, block.Length);

// Get the hash code
byte[] hash = md5.Hash;

Note: It works (at least with the MD5 provider) to send all blocks to TransformBlock and then send an empty block to TransformFinalBlock to finalise the process.
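
Applied to the question's copy-while-hashing scenario, the calls above could be combined roughly as follows. This is only a sketch: the class name HashingCopy, the method CopyAndHash, and its parameters are made up for illustration, and it assumes the data is read and written synchronously on one thread.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

static class HashingCopy
{
    // Hypothetical helper: copies sourcePath to destinationPath and
    // returns the MD5 hash of the bytes that were copied.
    public static byte[] CopyAndHash(string sourcePath, string destinationPath, int bufferSize)
    {
        byte[] buffer = new byte[bufferSize];
        using (MD5 md5 = MD5.Create())
        using (FileStream source = new FileStream(sourcePath, FileMode.Open, FileAccess.Read))
        using (FileStream destination = new FileStream(destinationPath, FileMode.Create, FileAccess.Write))
        {
            int read;
            while ((read = source.Read(buffer, 0, buffer.Length)) > 0)
            {
                destination.Write(buffer, 0, read);
                // Hash only the bytes actually read; no output buffer is needed.
                md5.TransformBlock(buffer, 0, read, null, 0);
            }
            // A zero-length final block finalizes the hash (the array must not be null).
            md5.TransformFinalBlock(buffer, 0, 0);
            return md5.Hash;
        }
    }
}
```

The file is read only once, so the hash is ready as soon as the copy finishes.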

Bareheaded answered 23/1, 2010 at 20:2 Comment(9)
Ok, but +1 for also providing a reference! - Lace
Ay caramba! There it is! That was the function I was searching for. Good to know I wasn't making it all up. Thanks to Guffa and Rubens for providing the correct answer so promptly. +1 to you both, I will accept this answer because of the included code example. - Marelda
Note that you can equivalently replace the second instance of block with null in the call to TransformBlock; you don't actually want any copying to occur, and the output parameter plays no part in the hashing. - Trenna
Also, TransformFinalBlock can take zero for the length. - Bissonnette
Is it possible to transform the first X blocks of data, dump the state, and then continue with the next blocks after restoring the state in a new calculation? With a 100 GB file in a cloud solution, it would be nice not to have to go over the whole file in one go; machines could recycle, etc. - Nonnah
@pksorensen: I don't think so, I don't see any methods or properties for getting or setting the computational state of the MD5 object. In theory it's of course possible, but you might need to use a separate implementation of the algorithm so that you can add methods for handling the state. - Bareheaded
TransformFinalBlock can take zero for the length, BUT the input block can't be null; pass an empty array such as Array.Empty<byte>() instead. Because .NET loves its pointless ArgumentNullExceptions. - Comet
@EmperorEto: I think that the reasoning here is that it's more consistent to always throw an ArgumentNullException when the argument is null, than to exclude the case where the length is zero and the array isn't actually needed. - Bareheaded
@Bareheaded definitely not opening that can of worms 🤐 - Comet
49

I like the answer above but for the sake of completeness, and being a more general solution, refer to the CryptoStream class. If you are already handling streams, it is easy to wrap your stream in a CryptoStream, passing a HashAlgorithm as the ICryptoTransform parameter.

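var md5 = MD5.Create();
// FileMode.Create (rather than Open) so the target file is created or truncated.
using (var file = new FileStream("foo.txt", FileMode.Create, FileAccess.Write))
using (var cs = new CryptoStream(file, md5, CryptoStreamMode.Write))
{
    while (notDoneYet)               // placeholder loop condition
    {
        byte[] buffer = Get32MB();   // placeholder: returns the next chunk
        cs.Write(buffer, 0, buffer.Length);
    }
    // Finalize the hash; md5.Hash is not available before this call.
    cs.FlushFinalBlock();
}
System.Console.WriteLine(BitConverter.ToString(md5.Hash));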

You have to close the CryptoStream (or call its FlushFinalBlock method) before getting the hash, so the HashAlgorithm knows it's done.
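
A self-contained sketch of the CryptoStream approach that can be run without a real file. The class name CryptoStreamHashDemo and the method WriteAndHash are invented for illustration. It relies on HashAlgorithm acting as a pass-through ICryptoTransform, so the destination receives the bytes unchanged while the hash is computed.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

static class CryptoStreamHashDemo
{
    // Hypothetical helper: writes every chunk to destination through a
    // CryptoStream and returns the MD5 hash of everything written.
    public static byte[] WriteAndHash(Stream destination, byte[][] chunks)
    {
        using (MD5 md5 = MD5.Create())
        {
            // The CryptoStream is deliberately not disposed here, so the
            // caller keeps ownership of the destination stream.
            CryptoStream cs = new CryptoStream(destination, md5, CryptoStreamMode.Write);
            foreach (byte[] chunk in chunks)
            {
                cs.Write(chunk, 0, chunk.Length);
            }
            // Pushes any buffered bytes through and finalizes the hash.
            cs.FlushFinalBlock();
            return md5.Hash;
        }
    }
}
```

Passing a MemoryStream as the destination makes it easy to check that the written bytes and the hash both match what a plain ComputeHash over the same data produces.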

Izettaizhevsk answered 14/2, 2011 at 22:25 Comment(0)
5

It seems you can use TransformBlock / TransformFinalBlock, as shown in this sample: Displaying progress updates when hashing large files

Substance answered 23/1, 2010 at 20:2 Comment(1)
That link is dead, try this instead: infinitec.de/post/2007/06/09/… - Kaifeng
3

Hash algorithms are expected to handle this situation and are typically implemented with 3 functions:

hash_init() - Called to allocate resources and begin the hash.
hash_update() - Called with new data as it arrives.
hash_final() - Complete the calculation and free resources.

Look at http://www.openssl.org/docs/crypto/md5.html or http://www.openssl.org/docs/crypto/sha.html for good, standard examples in C; I'm sure there are similar libraries for your platform.
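
In .NET, the same three phases appear to map onto MD5.Create() (init), TransformBlock (update), and TransformFinalBlock followed by the Hash property (final). A minimal sketch, using a hypothetical class IncrementalHashDemo and an arbitrary chunk size, that lets you check that chunked hashing gives the same digest as a single ComputeHash call:

```csharp
using System;
using System.Security.Cryptography;

static class IncrementalHashDemo
{
    // Hypothetical helper: hashes data in pieces of at most chunkSize bytes.
    public static byte[] HashInChunks(byte[] data, int chunkSize)
    {
        using (MD5 md5 = MD5.Create())                              // init
        {
            for (int offset = 0; offset < data.Length; offset += chunkSize)
            {
                int count = Math.Min(chunkSize, data.Length - offset);
                md5.TransformBlock(data, offset, count, null, 0);   // update
            }
            md5.TransformFinalBlock(data, 0, 0);                    // final
            return md5.Hash;
        }
    }

    // One-shot hash of the same data, for comparison.
    public static byte[] HashAtOnce(byte[] data)
    {
        using (MD5 md5 = MD5.Create())
        {
            return md5.ComputeHash(data);
        }
    }
}
```

For any input and any positive chunk size, BitConverter.ToString(HashInChunks(data, n)) should equal BitConverter.ToString(HashAtOnce(data)).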

Lace answered 23/1, 2010 at 19:55 Comment(2)
Good answer, but the "where is it in .net?" part of the question remains open. - Gumbo
@Pascal: See the 2 good answers below, both of which had been posted before your comment. - Lace
3

I've just had to do something similar, but wanted to read the file asynchronously. It's using TransformBlock and TransformFinalBlock and is giving me answers consistent with Azure, so I think it is correct!

private static async Task<string> CalculateMD5Async(string fullFileName)
{
    var block = ArrayPool<byte>.Shared.Rent(8192);
    try
    {
        using (var md5 = MD5.Create())
        {
            using (var stream = new FileStream(fullFileName, FileMode.Open, FileAccess.Read, FileShare.Read, 8192, true))
            {
                int length;
                while ((length = await stream.ReadAsync(block, 0, block.Length).ConfigureAwait(false)) > 0)
                {
                    md5.TransformBlock(block, 0, length, null, 0);
                }
                md5.TransformFinalBlock(block, 0, 0);
            }
            var hash = md5.Hash;
            return Convert.ToBase64String(hash);
        }
    }
    finally
    {
        ArrayPool<byte>.Shared.Return(block);
    }
}
Bomber answered 8/8, 2017 at 16:8 Comment(3)
What's ArrayPool? - Economic
OK got it: ArrayPool, need to install package System.Buffers. - Economic
This is useful, but not a .NET 3.5 solution. - Muumuu
