Calculate MD5 checksum for a file
O

7

418

I'm using iTextSharp to read the text from a PDF file. However, there are times I cannot extract text, because the PDF file contains only images. I download the same PDF files every day, and I want to see if a PDF has been modified. If the text and modification date cannot be obtained, is an MD5 checksum the most reliable way to tell if the file has changed?

If it is, some code samples would be appreciated, because I don't have much experience with cryptography.

Ommiad answered 9/5, 2012 at 16:16 Comment(1)
msdn.microsoft.com/en-us/library/…Hemiterpene
D
952

It's very simple using System.Security.Cryptography.MD5:

using (var md5 = MD5.Create())
{
    using (var stream = File.OpenRead(filename))
    {
        return md5.ComputeHash(stream);
    }
}

(I believe that actually the MD5 implementation used doesn't need to be disposed, but I'd probably still do so anyway.)

How you compare the results afterwards is up to you; you can convert the byte array to base64 for example, or compare the bytes directly. (Just be aware that arrays don't override Equals. Using base64 is simpler to get right, but slightly less efficient if you're really only interested in comparing the hashes.)
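
The two comparison options mentioned above can be sketched roughly as follows. This is a minimal illustration, not part of the original answer: the class name HashComparison and the helpers ComputeMd5, SameBytes and SameBase64 are hypothetical names chosen for the example.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Security.Cryptography;

static class HashComparison
{
    // Hypothetical helper: streams the file through MD5 and returns the raw 16-byte hash.
    public static byte[] ComputeMd5(string path)
    {
        using (var md5 = MD5.Create())
        using (var stream = File.OpenRead(path))
        {
            return md5.ComputeHash(stream);
        }
    }

    // Compares the bytes directly, element by element.
    // hashA.Equals(hashB) would only compare references, because arrays don't override Equals.
    public static bool SameBytes(byte[] hashA, byte[] hashB)
    {
        return hashA.SequenceEqual(hashB);
    }

    // Alternative: compare the base64 text of each hash. Simpler to get right, slightly more work.
    public static bool SameBase64(byte[] hashA, byte[] hashB)
    {
        return Convert.ToBase64String(hashA) == Convert.ToBase64String(hashB);
    }
}
```

Either helper returns true for two files with identical contents, for example HashComparison.SameBytes(HashComparison.ComputeMd5(pathA), HashComparison.ComputeMd5(pathB)), where pathA and pathB are placeholders.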

If you need to represent the hash as a string, you could convert it to hex using BitConverter:

static string CalculateMD5(string filename)
{
    using (var md5 = MD5.Create())
    {
        using (var stream = File.OpenRead(filename))
        {
            var hash = md5.ComputeHash(stream);
            return BitConverter.ToString(hash).Replace("-", "").ToLowerInvariant();
        }
    }
}
Doble answered 9/5, 2012 at 16:19 Comment(23)
If you want the "standard" looking md5, you can do: return BitConverter.ToString(md5.ComputeHash(stream)).Replace("-","").ToLower();Lail
@Lail What would be the preferred format when inserting into a database?.Ommiad
That would be the format I would use. It will give you a format like this: 837a6f4fad381c2a7b909032133ddaf6, which is almost always how you'll see MD5 hashes formatted.Lail
MD5 is in System.Security.Cryptography - just to surface the info more.Cypress
What about CRC32 instead of MD5?Cheloid
I am wondering, if I had a database file and I wanted to make sure it wasn't corrupted, can I run a CRC32 checksum on it to check integrity in a similar way you illustrated above?Cheloid
@KalaJ: Yes, absolutely.Doble
@JonSkeet, for database integrity, would the type of checksum matter in terms of security? Would a CRC32 checksum be appropriate or should I use something like SHA? Does .NET have a built in CRC32 algorithm? Thank you!Cheloid
@KalaJ: If you're trying to spot deliberate tampering, then CRC32 is entirely inappropriate. If you're only talking about spotting data transfer failures, it's fine. Personally I'd probably use SHA-256 just out of habit :) I don't know about support for CRC32 in .NET offhand, but you can probably search for it as quickly as I can :)Doble
@JonSkeet, What could be causing something like this? #25041412Cheloid
FYI: If you are comparing 2 streams, the read position must be the same on both streams for the MD5 hash to compute the same for identical files. Just ran into this issue.Boz
It's not quite so simple with text files -- it is all too easy to end up with the "same" file with different line endings on different computers (e.g. from a perforce sync or git pull of a text file with client-specific line ending conversion). This can result in that "same" file having different checksums, which can cause issues, depending on your application. If this is an issue you may need to use TransformBlock and friends to accumulate the hash over the non-end-of-line portion of the file.Secondguess
@Lail I think .Replace("-", String.Empty) is a better approach. I went through a one hour debug session because I get wrong results when comparing a user input to the file hash.Strophe
@wuethrich44 Are you just objecting to the use of "" instead of string.Empty?Doble
@JonSkeet Yes, when I use "" and compare the hash to another string (user input), it is not equal. I compare the strings with ordinal equals. Do you know why this is happening?Strophe
@wuethrich44: No, but it wouldn't be due to the use of "" instead of string.Empty. It's absolutely fine to use "". I suggest you ask another question with details, if you can still reproduce the problem.Doble
@JonSkeet Ok then I will open a separate question and send you the link.Strophe
@wuethrich44, I think the problem you're having is if you copy/paste the code in aquinas comment verbatim; I happened to notice the same thing. There are two invisible characters--a "zero-width non-joiner" and a Unicode "zero width space"--between the "empty" quotes in the raw HTML. I don't know if it was in the original comment or if SO is to blame here.Increase
If you want the string in the format used by Azure Blobs, then the code in this answer might be helpful: https://mcmap.net/q/87286/-compute-a-hash-from-a-stream-of-unknown-length-in-c Simplistic
To speed it up for large files, it's better to use a larger buffer size, e.g.: using (var stream = new BufferedStream(File.OpenRead(filename), 1048576))Obediah
How could you do this if I wanted to only hash the first page of the PDF?Vasques
@Adjit: You'd need to use an entirely different approach - nothing in my answer is PDF-specific.Doble
Right, it's a simple byte-stream. I just saw Beetee's comment with the BufferedStream, and I could just hash the first x number of bytes instead of the "first page"Vasques
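
For readers following the SHA-256 suggestion in the comments, the same streaming pattern appears to carry over directly. This is a sketch under that assumption: the class name Sha256Example and the method CalculateSha256 are hypothetical, and only SHA256.Create is swapped in for MD5.Create.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

static class Sha256Example
{
    // Same shape as CalculateMD5 in the answer, with SHA-256 as the algorithm.
    // Returns 64 lowercase hex characters (32 bytes) instead of MD5's 32.
    public static string CalculateSha256(string filename)
    {
        using (var sha256 = SHA256.Create())
        using (var stream = File.OpenRead(filename))
        {
            byte[] hash = sha256.ComputeHash(stream);
            return BitConverter.ToString(hash).Replace("-", "").ToLowerInvariant();
        }
    }
}
```

If the hash is stored in a database, the column needs room for 64 characters rather than 32.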
R
79

This is how I do it:

using System.IO;
using System.Security.Cryptography;

public string checkMD5(string filename)
{
    using (var md5 = MD5.Create())
    {
        using (var stream = File.OpenRead(filename))
        {
            return Encoding.Default.GetString(md5.ComputeHash(stream));
        }
    }
}
Riata answered 8/1, 2016 at 0:9 Comment(8)
I upvoted you because more people need to do things like this.Harvey
I think swapping the using blocks would be useful, because opening a file is more probably going to fail. Fail early/fast approach saves you the resources needed to create (and destroy) the MD5 instance in such scenarios. Also you can omit the braces of the first using and save a level of indentation without losing readability.Pissarro
This converts the 16 bytes long result to a string of 16 chars, not the expected 32 chars hex value.Essieessinger
This code does not produce the expected result (assumed expectation). Agreeing with @EssieessingerTabby
also a reference is missing: using System.Text;Ranket
@Palec, do you realise you just optimised your failure case? "When our program errors it returns that error .0000000000001s quicker to the user than before!". Unless it's a box processing a metric crap ton of requests where something like this might matter, it's a really, really low value optimisation.Vagina
@Quibblesome, I was just trying to promote the general idea that the order of nesting of using statements matters. Elsewhere, the difference might be significant. Why not practice the habit of detecting failure early? I agree, though, that in this specific snippet, the habit brings almost no benefit.Pissarro
Unlike Jon Skeet's answer with BitConverter, Encoding.Default.GetString returns nonascii character gibberish for me (running within Unity).Graduated
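
As the comments point out, Encoding.Default.GetString interprets the 16 hash bytes as text rather than formatting them as hex. One possible correction is sketched below, assuming .NET 5 or later, where Convert.ToHexString is available; on older frameworks the BitConverter approach from the accepted answer does the same job. The names Md5Hex and CheckMd5Hex are hypothetical.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

static class Md5Hex
{
    // Returns the conventional 32-character lowercase hex form of the MD5 hash.
    // The file is opened first, so a missing file fails before the MD5 instance is created.
    public static string CheckMd5Hex(string filename)
    {
        using (var stream = File.OpenRead(filename))
        using (var md5 = MD5.Create())
        {
            byte[] hash = md5.ComputeHash(stream);
            // Convert.ToHexString (assumes .NET 5+) produces uppercase hex, so lowercase it.
            return Convert.ToHexString(hash).ToLowerInvariant();
        }
    }
}
```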
B
12

I know this question was already answered, but this is what I use:

using (FileStream fStream = File.OpenRead(filename)) {
    return GetHash<MD5>(fStream);
}

Where GetHash:

public static String GetHash<T>(Stream stream) where T : HashAlgorithm {
    StringBuilder sb = new StringBuilder();

    MethodInfo create = typeof(T).GetMethod("Create", new Type[] {});
    using (T crypt = (T) create.Invoke(null, null)) {
        byte[] hashBytes = crypt.ComputeHash(stream);
        foreach (byte bt in hashBytes) {
            sb.Append(bt.ToString("x2"));
        }
    }
    return sb.ToString();
}

Probably not the best way, but it can be handy.

Breathy answered 21/12, 2016 at 19:16 Comment(5)
I have made a small change to your GetHash function. I've turned it into an extension method and removed the reflection code.Pulque
public static String GetHash<T>(this Stream stream) where T : HashAlgorithm, new() { StringBuilder sb = new StringBuilder(); using (T crypt = new T()) { byte[] hashBytes = crypt.ComputeHash(stream); foreach (byte bt in hashBytes) { sb.Append(bt.ToString("x2")); } } return sb.ToString(); }Pulque
This actually worked... thank you! I spent far too long looking online for something that would produce a normal 32 char md5 string. This is a little more complicated than I would prefer, but it definitely works.Persnickety
@LeslieMarshall if you are going to use it as an extension method then you should reset the stream location rather than leaving it at the end positionPhilippopolis
I had better luck with @LeslieMarshall's method using (T) HashAlgorithm.Create(typeof(T).Name) and removing the new() constraint. For my implementation, I also changed it so the parameter is this byte[] resource and putting the stream in the method with using var stream = new MemoryStream(resource). You'll then only need to tell the compiler that crypt isn't null.Infectious
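
The suggestions in these comments (no reflection, an extension method, and not leaving the stream at the end) could be combined along the following lines. This is only a sketch: the createAlgorithm factory parameter is a hypothetical replacement for the generic type argument, and rewinding to the start before hashing is a design choice of this example, not something the answer specifies.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;
using System.Text;

static class StreamHashExtensions
{
    // createAlgorithm is a hypothetical factory parameter, e.g. () => MD5.Create().
    public static string GetHash(this Stream stream, Func<HashAlgorithm> createAlgorithm)
    {
        if (!stream.CanSeek)
            throw new NotSupportedException("This sketch needs a seekable stream.");

        long originalPosition = stream.Position;
        stream.Position = 0; // hash the whole stream regardless of the current read position
        try
        {
            using (HashAlgorithm algorithm = createAlgorithm())
            {
                byte[] hashBytes = algorithm.ComputeHash(stream);
                var sb = new StringBuilder(hashBytes.Length * 2);
                foreach (byte b in hashBytes)
                    sb.Append(b.ToString("x2")); // two lowercase hex digits per byte
                return sb.ToString();
            }
        }
        finally
        {
            stream.Position = originalPosition; // put the caller's read position back
        }
    }
}
```

A call would look like fStream.GetHash(() => MD5.Create()). Passing a factory avoids both the reflection lookup and the new() constraint, which abstract types such as MD5 cannot satisfy.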
F
5

Here is a slightly simpler version that I found. It reads the entire file in one go and only requires a single using directive.

byte[] ComputeHash(string filePath)
{
    using (var md5 = MD5.Create())
    {
        return md5.ComputeHash(File.ReadAllBytes(filePath));
    }
}
Furthermore answered 15/12, 2014 at 10:3 Comment(4)
The downside of using ReadAllBytes is that it loads the whole file into a single array. That doesn't work at all for files larger than 2 GiB and puts a lot of pressure on the GC even for medium sized files. Jon's answer is only slightly more complex, but doesn't suffer from these problems. So I prefer his answer over yours.Tensity
Put the usings one after another, without the first pair of curly braces: using (var md5 = MD5.Create()) using (var stream = File.OpenRead(filename)). That gives you one using per line without unnecessary indentation.Essieessinger
@Essieessinger You can put an entire program on one line and eliminate ALL indentation. You can even use XYZ as variable names! What is the benefit to others?Helli
@DerekJohnson the point I was trying to make was probably that "and only requires a single using directive." was not really a good reason to read everything into memory. The more effective approach is to stream in the data into ComputeHash, and if possible using should only be used, but I can totally understand if you want to avoid the extra level of indentation.Essieessinger
S
5

I know that I am late to the party, but I ran a test before actually implementing the solution.

I tested the built-in MD5 class against md5sum.exe. In my case the built-in class took 13 seconds, whereas md5sum.exe took around 16-18 seconds in every run.

    DateTime current = DateTime.Now;
    string file = @"C:\text.iso"; // a 2.5 GB file
    string output;
    using (var md5 = MD5.Create())
    {
        using (var stream = File.OpenRead(file))
        {
            byte[] checksum = md5.ComputeHash(stream);
            output = BitConverter.ToString(checksum).Replace("-", String.Empty).ToLower();
            Console.WriteLine("Total seconds : " + (DateTime.Now - current).TotalSeconds.ToString() + " " + output);
        }
    }
Specific answered 16/3, 2019 at 13:45 Comment(0)
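
For anyone repeating this measurement, System.Diagnostics.Stopwatch is generally a better fit for timing than subtracting DateTime.Now values. A minimal sketch, with Md5Timing and TimeMd5 as hypothetical names; it makes no claim about what numbers you will see.

```csharp
using System;
using System.Diagnostics;
using System.IO;
using System.Security.Cryptography;

static class Md5Timing
{
    // Hashes the file and reports how long opening and hashing took.
    public static (string Hash, TimeSpan Elapsed) TimeMd5(string file)
    {
        var stopwatch = Stopwatch.StartNew();
        using (var md5 = MD5.Create())
        using (var stream = File.OpenRead(file))
        {
            byte[] checksum = md5.ComputeHash(stream);
            stopwatch.Stop(); // stop before string formatting so only I/O and hashing are timed
            string hash = BitConverter.ToString(checksum).Replace("-", string.Empty).ToLowerInvariant();
            return (hash, stopwatch.Elapsed);
        }
    }
}
```

Results depend heavily on disk caching, so the first run over a large file is usually not comparable to later runs.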
T
1

For dynamically generated PDFs, the creation and modification dates will always be different, so you have to remove them or set them to a constant value before generating the MD5 hashes to compare.

You can use PdfStamper to remove or update the dates.

Tout answered 1/4, 2021 at 14:24 Comment(0)
B
0

In addition to the methods in the answers above: if you're comparing PDFs, you need to fix the creation and modified dates, or the hashes won't match.

For PDFs generated with QuestPDF, you'll need to override the CreationDate and ModifiedDate in the document metadata.

public class PdfDocument : IDocument
{
    ...

    public DocumentMetadata GetMetadata()
    {
        return new()
        {
            CreationDate = DateTime.MinValue,
            ModifiedDate = DateTime.MinValue,
        };
    }
    
    ...
}

https://www.questpdf.com/concepts/document-metadata.html

Brendanbrenden answered 4/9, 2022 at 19:41 Comment(0)
