Generating an MD5 checksum of a file
Is there any simple way of generating (and checking) MD5 checksums of a list of files in Python? (I have a small program I'm working on, and I'd like to confirm the checksums of the files).

Gynecoid answered 7/8, 2010 at 19:50 Comment(10)
Why not just use md5sum?Useless
Keeping it in Python makes it easier to manage the cross-platform compatibility.Gynecoid
If you want a solution with a "progress bar" or similar (for very big files), consider this solution: #1131720Fultz
@Useless The link you provided says this in the second paragraph: "The underlying MD5 algorithm is no longer deemed secure" while describing md5sum. That is why security-conscious programmers should not use it in my opinion.Wideman
@Wideman Good and valid point. Both md5sum and the technique described in this SO question should be avoided - it's better to use SHA-2 or SHA-3, if possible: en.wikipedia.org/wiki/Secure_Hash_AlgorithmsApologize
@PerLundberg or the newer hashlib.blake2b which is both faster than md5 and secure.Dior
@Boris Thanks. Is BLAKE2b/BLAKE2s as widely available cross-platform as the SHA algorithms? (I hadn't heard about them before you mentioned them here)Apologize
@PerLundberg modern languages should implement them (I know Python, Go and Rust do). There's a b2sum command available on Ubuntu.Dior
OK, nice. For reference: crypto.stackexchange.com/questions/45127/…Apologize
Might be worth mentioning there are still valid reasons to use md5 that are not affected by its brokenness for security purposes. (e.g. checking for bit rot in a system that uses baked-in md5 creation during archival)Projector
663

You can use hashlib.md5()

Note that sometimes you won't be able to fit the whole file in memory. In that case, you'll have to read chunks of 4096 bytes sequentially and feed them to the md5 object's update() method:

import hashlib
def md5(fname):
    hash_md5 = hashlib.md5()
    with open(fname, "rb") as f:
        for chunk in iter(lambda: f.read(4096), b""):
            hash_md5.update(chunk)
    return hash_md5.hexdigest()

Note: hash_md5.hexdigest() returns the hex string representation of the digest. If you just need the packed bytes, use return hash_md5.digest() instead, so you don't have to convert back.
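As a quick sanity check of the function above (the temp-file setup here is just for illustration), the chunked digest should match a one-shot hashlib.md5 over the same bytes:

```python
import hashlib
import tempfile

def md5(fname):
    # Same chunked approach as above: read 4096-byte blocks until EOF.
    hash_md5 = hashlib.md5()
    with open(fname, "rb") as f:
        for chunk in iter(lambda: f.read(4096), b""):
            hash_md5.update(chunk)
    return hash_md5.hexdigest()

data = b"hello world" * 1000          # spans multiple 4096-byte chunks
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(data)
    path = tmp.name

assert md5(path) == hashlib.md5(data).hexdigest()
```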

Mctyre answered 7/8, 2010 at 19:53 Comment(4)
How could I decode the hex string? It differs from the output of what md5sum returnsMesognathous
@Mesognathous no it doesn't -- sorry to put it so flippantly-sounding, but there is no way that md5 differs for the same input -- if you're reading binary (not line-ending-agnostic) input, then this algorithm is deterministic -- md5's famous problem is that it might FAIL TO DIFFER for two different inputsPittel
@Pittel As I understand it, the md5 formula may end up generating the same output for two different inputs?Mesognathous
yes: crypto.stackexchange.com/questions/1434/…Pittel
336

There is a way that's pretty memory inefficient.

single file:

import hashlib
def file_as_bytes(file):
    with file:
        return file.read()

print(hashlib.md5(file_as_bytes(open(full_path, 'rb'))).hexdigest())

list of files:

[(fname, hashlib.md5(file_as_bytes(open(fname, 'rb'))).digest()) for fname in fnamelst]

Recall though, that MD5 is known broken and should not be used for any purpose since vulnerability analysis can be really tricky, and analyzing any possible future use your code might be put to for security issues is impossible. IMHO, it should be flat out removed from the library so everybody who uses it is forced to update. So, here's what you should do instead:

[(fname, hashlib.sha256(file_as_bytes(open(fname, 'rb'))).digest()) for fname in fnamelst]

If you only want 128 bits worth of digest you can do .digest()[:16].

This will give you a list of tuples, each tuple containing the name of its file and its hash.
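For reference, truncating a digest to 128 bits is just a byte slice over the packed digest:

```python
import hashlib

digest = hashlib.sha256(b"example").digest()   # 32 bytes (256 bits)
truncated = digest[:16]                        # keep the first 128 bits
assert len(digest) == 32 and len(truncated) == 16
```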

Again I strongly question your use of MD5. You should be at least using SHA1, and given recent flaws discovered in SHA1, probably not even that. Some people think that as long as you're not using MD5 for 'cryptographic' purposes, you're fine. But stuff has a tendency to end up being broader in scope than you initially expect, and your casual vulnerability analysis may prove completely flawed. It's best to just get in the habit of using the right algorithm out of the gate. It's just typing a different bunch of letters is all. It's not that hard.

Here is a way that is more complex, but memory efficient:

import hashlib

def hash_bytestr_iter(bytesiter, hasher, ashexstr=False):
    for block in bytesiter:
        hasher.update(block)
    return hasher.hexdigest() if ashexstr else hasher.digest()

def file_as_blockiter(afile, blocksize=65536):
    with afile:
        block = afile.read(blocksize)
        while len(block) > 0:
            yield block
            block = afile.read(blocksize)


[(fname, hash_bytestr_iter(file_as_blockiter(open(fname, 'rb')), hashlib.md5()))
    for fname in fnamelst]

And, again, since MD5 is broken and should not really ever be used anymore:

[(fname, hash_bytestr_iter(file_as_blockiter(open(fname, 'rb')), hashlib.sha256()))
    for fname in fnamelst]

Again, you can put [:16] after the call to hash_bytestr_iter(...) if you only want 128 bits worth of digest.

Jospeh answered 7/8, 2010 at 19:53 Comment(36)
I'm only using MD5 to confirm the file isn't corrupted. I'm not so concerned about it being broken.Gynecoid
@TheLifelessOne: And despite @Jospeh scary warnings, that is perfectly good use of MD5.Pall
@GregS, @TheLifelessOne - Yeah, and next thing you know someone finds a way to use this fact about your application to cause a file to be accepted as uncorrupted when it isn't the file you're expecting at all. No, I stand by my scary warnings. I think MD5 should be removed or come with deprecation warnings.Jospeh
While @Mctyre has a viable answer, I believe this one should be selected as the proper method for retrieving a file's md5 checksum. However, it could be simplified to "hashlib.md5(open(fname, 'r').read()).digest()". You should note that the "file" function was changed to "open" for use with python 2.7+Bulldoze
@AustinS.: nod Yeah. I fixed it to say open. I believe that's worked ever since hashlib was introduced, and possible has always worked. Old habits die hard.Jospeh
I'd probably use .hexdigest() instead of .digest() - it's easier for humans to read - which is the purpose of OP.Ensample
@Zotov: I would remove hexdigest from the standard hashlib hash function interface. I feel that it's an unnecessary wart. And I like making even small functions widely applicable. There are many cases in which the hex of the hash is quite unnecessarily verbose and making that the easiest to use version is encouraging people to be verbose when they don't have to be. But yes, in this case, for this specific purpose it is likely the better choice. I would still just use binascii.hexlify instead. :-)Jospeh
I used this solution but it incorrectly gave the same hash for two different pdf files. The solution was to open the files by specifying binary mode, that is: [(fname, hashlib.md5(open(fname, 'rb').read()).hexdigest()) for fname in fnamelst] This is more related to the open function than md5 but I thought it might be useful to report it given the requirement for cross-platform compatibility stated above (see also: docs.python.org/2/tutorial/…).Grimy
@BlueCoder: Oh, you're right. I should've done that. I'm so used to Unix where the two are synonymous. I'll fix it now.Jospeh
@Jospeh Saying "remove MD5 from the Python library" or even just saying "add deprecation warning to Python library" is like saying "Python should not be used, if existing stuff requires MD5, please use something else". Explain security implications in docs, sure, but removal or even just deprecation is an insane suggestion.Latinity
@hyde: Something has to be done to get people to stop using that stupid algorithm. I've had jobs where they persisted in using it even after I demonstrated that it created security holes (admittedly rather obscure ones) and that SHA had a faster implementation in OpenSSL, which was the library we were using. It's insane.Jospeh
Any way for this to be at most one order of magnitude slower than md5sum on the command line?Thiourea
For people using the def hashfile function above multiple times on the same file handle remember to reset the afile pointer when done reading each file. eg. afile.seek(0)Catechumen
Reminder: the known weaknesses for MD5 are collision attacks, and not preimage attacks, so it is suitable for some cryptographic applications but not others. If you don't know the difference you shouldn't be using it, but don't discard it altogether. See vpnc.org/hash.html.Yemane
is it ok to not close opened files in those list comprehensions?Multinuclear
Yes, I wanted to ask the same thing. Isn't a close() missing here?Groveman
No, it is not okay. The files will be closed on garbage collection, likely in the end of the enclosing function. If, for example, the number of elements in fnamelist is greater than the limit set by your OS, it will fail. But that is irrelevant to the question asked. We should use SO to get the gist, not copy the snippets blindly. :)Checkerwork
@Grimy How did it happen that two different pdf files had the same hash, even if opened without mode=rb? Shouldn't rt simply convert newlines and otherwise be identical to rb? (I assume this is python 2, since in python 3 hashlib.md5 requires bytes, and will simply refuse to accept a string.)Skyeskyhigh
@RomanShapovalov - I was relying on the reference counted nature of Python objects. After each element of the list comprehension is evaluated, there are no more references to it. I do agree that's rather tenuous and relying overly much on implementation. :-/ I like the interface for hashfile though, it's more flexible because it handles anything that has read.Jospeh
@RomanShapovalov - I fixed it so that it no longer has a potential resource leak, even though the current CPython implementation doesn't. I agree that it should avoid leaking even on Jython or future possible implementations of CPython.Jospeh
@JasonS - I can stick my hand in liquid nitrogen briefly and it won't be harmed. That doesn't mean I should do it. There are lots of alternatives to MD5 that are widely available. There is no more reason for anybody to use it than there is for me to stick my hand in liquid nitrogen.Jospeh
Nope. Sorry. Bad analogy.Yemane
@JasonS - Can you give a rational reason anybody should use MD5 that's not one of these two: "Well, I think I can get away with it in this circumstance." or "I have to interoperate with something else that uses MD5."?Jospeh
The entirety of life is about "I think I can get away with it in this circumstance" --- or more objectively stated, risk management, which applies to all cryptographic systems, MD5 and SHA1 included. Read up on the state-of-the-art on MD5 preimage attacks. I don't put bars on all my windows at home, and I use MD5 when I am doing garden-variety integrity checks where a malicious adversary is not present (e.g. copying files from one PC to another)Yemane
web.archive.org/web/20150901084550/http://www.vpnc.org/… -- "The difference between a collision attack and either of the two preimage attacks is crucial. At the time of this writing, there are no practical preimage attacks, meaning that if your use of hashes is only susceptible to preimage attacks, even MD5 is just fine because at attacker would have to make 2^128 guesses, which will be infeasable for many decades (if ever)."Yemane
@JasonS - And in so doing, you are perpetuating the use and very existence of an algorithm that is broken for a wide variety of other uses. Using a proper algorithm isn't like putting bars on your windows. Using the right algorithm is a matter of typing a few letters differently. There is no good reason to use MD5 at all for anything. It has no quality that recommends it over SHA256 in any reasonable situation.Jospeh
I'm not continuing this discussion, you're just being ideological about your rejection of MD5.Yemane
@JasonS - I would argue that you are being ideological in your refusal to reject an algorithm that has perfectly viable replacements that there is no good reason whatsoever to not use. "I learned to type MD5 darn it, and nobody is going to tell me I can't. Those other letters, they're weird and my fingers can't type them!"Jospeh
I just need to compare the same image, thus, using hashlib.md5(open(full_path, 'rb').read()).hexdigest() is good enough. Thanks!Interlocutor
@LittleZero - Is md5 that much easier to type than sha256? I'm just poking at this, because it's better to just forget the broken algorithm ever existed, no matter how safe it is to use in certain contexts. Retrain yourself to never even think of using the broken algorithm, and then you won't end up using it when it matters.Jospeh
We should release resources. Open file with with statement or write code to close file.Goosander
@RohitTaneja - Resources are being released. The file object is immediately associated with a with statement inside file_as_blockiter.Jospeh
@Jospeh I am talking about the first 3 code snippets. EX import hashlib [(fname, hashlib.md5(open(fname, 'rb').read()).digest()) for fname in fnamelst]Goosander
@RohitTaneja - Ahh, the ones I mean as bad examples. :-) Yes, I suppose I should fix that. They aren't supposed to be that kind of bad example.Jospeh
@ChadLowe - That makes no sense. I just tested it, and it works fine on a zero length file. What problem did you have? Or did it just look wrong, and so you had to fix it? There is no reason the iterator has to yield at least once. It will just never call update, and that's the exact same result as if you feed update a single empty string.Jospeh
You are correct. I'm not sure what I was doing before, but your code works as expected now. Just goes to show, I should always look at my own code for the problem first ;)Echoism
36

I'm clearly not adding anything fundamentally new, but added this answer before I was up to commenting status, plus the code regions make things more clear -- anyway, specifically to answer @Nemo's question from Omnifarious's answer:

I happened to be thinking about checksums a bit (came here looking for suggestions on block sizes, specifically), and have found that this method may be faster than you'd expect. Taking the fastest (but pretty typical) timeit.timeit or /usr/bin/time result from each of several methods of checksumming a file of approx. 11MB:

$ ./sum_methods.py
crc32_mmap(filename) 0.0241742134094
crc32_read(filename) 0.0219960212708
subprocess.check_output(['cksum', filename]) 0.0553209781647
md5sum_mmap(filename) 0.0286180973053
md5sum_read(filename) 0.0311000347137
subprocess.check_output(['md5sum', filename]) 0.0332629680634
$ time md5sum /tmp/test.data.300k
d3fe3d5d4c2460b5daacc30c6efbc77f  /tmp/test.data.300k

real    0m0.043s
user    0m0.032s
sys     0m0.010s
$ stat -c '%s' /tmp/test.data.300k
11890400

So, looks like both Python and /usr/bin/md5sum take about 30ms for an 11MB file. The relevant md5sum function (md5sum_read in the above listing) is pretty similar to Omnifarious's:

import hashlib
def md5sum(filename, blocksize=65536):
    hash = hashlib.md5()
    with open(filename, "rb") as f:
        for block in iter(lambda: f.read(blocksize), b""):
            hash.update(block)
    return hash.hexdigest()

Granted, these are from single runs (the mmap ones are always a smidge faster when at least a few dozen runs are made), and mine's usually got an extra f.read(blocksize) after the buffer is exhausted, but it's reasonably repeatable and shows that md5sum on the command line is not necessarily faster than a Python implementation...

EDIT: Sorry for the long delay, haven't looked at this in some time, but to answer @EdRandall's question, I'll write down an Adler32 implementation. However, I haven't run the benchmarks for it. It's basically the same as the CRC32 would have been: instead of the init, update, and digest calls, everything is a zlib.adler32() call:

import zlib
def adler32sum(filename, blocksize=65536):
    checksum = zlib.adler32(b"")
    with open(filename, "rb") as f:
        for block in iter(lambda: f.read(blocksize), b""):
            checksum = zlib.adler32(block, checksum)
    return checksum & 0xffffffff

Note that this must start off with the Adler-32 checksum of the empty string, which is 1, not 0; starting from zero would give different results (CRC can start with 0 instead). The AND-ing makes the result a 32-bit unsigned integer, which ensures it returns the same value across Python versions.
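A quick check of those starting values, and of the fact that seeding with adler32(b"") makes incremental updates match a one-shot call:

```python
import zlib

# Seed values differ: CRC-32 of empty input is 0, Adler-32 of empty input is 1.
assert zlib.crc32(b"") == 0
assert zlib.adler32(b"") == 1

# Hashing in two chunks with the running checksum as the seed
# matches hashing everything at once.
data = b"some test data"
partial = zlib.adler32(data[:7], zlib.adler32(b""))
incremental = zlib.adler32(data[7:], partial)
assert incremental & 0xffffffff == zlib.adler32(data) & 0xffffffff
```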

Pittel answered 4/2, 2014 at 23:45 Comment(2)
Could you possibly add a couple of lines comparing SHA1, and also zlib.adler32 maybe?Grosberg
@EdRandall: adler32 is really not worth bothering with, eg. leviathansecurity.com/blog/analysis-of-adler32Bouley
36

In Python 3.8+, you can use the assignment expression (walrus) operator := (along with hashlib) like this:

import hashlib
with open("your_filename.txt", "rb") as f:
    file_hash = hashlib.md5()
    while chunk := f.read(8192):
        file_hash.update(chunk)

print(file_hash.digest())
print(file_hash.hexdigest())  # to get a printable str instead of bytes

Consider using hashlib.blake2b instead of md5 (just replace md5 with blake2b in the above snippet). It's cryptographically secure and faster than MD5.
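For instance, the same chunked loop with blake2b (blake2b_file is just an illustrative helper name, not a standard function):

```python
import hashlib

def blake2b_file(path, chunk_size=8192):
    # Same pattern as the snippet above, just a different hash constructor.
    file_hash = hashlib.blake2b()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            file_hash.update(chunk)
    return file_hash.hexdigest()
```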

Dior answered 26/11, 2019 at 17:53 Comment(0)
24
hashlib.md5(pathlib.Path('path/to/file').read_bytes()).hexdigest()
Stavanger answered 24/4, 2019 at 13:43 Comment(3)
Hi! Please add some explanation to your code as to why this is a solution to the problem. Furthermore, this post is pretty old, so you should also add some information as to why your solution adds something that the others have not already addressed.Osteopath
It's another memory inefficient wayPregnable
One-line solution. Perfect for a couple of tests!Aurea
12

In Python 3.11+, there's a new readable and memory-efficient method:

import hashlib
with open(path, "rb") as f:
    digest = hashlib.file_digest(f, "md5")
print(digest.hexdigest())
Rill answered 1/11, 2022 at 14:49 Comment(0)
3

You could use simple-file-checksum1, which just uses subprocess to call openssl for macOS/Linux and CertUtil for Windows and extracts only the digest from the output:

Installation:

pip install simple-file-checksum

Usage:

>>> from simple_file_checksum import get_checksum
>>> get_checksum("path/to/file.txt")
'9e107d9d372bb6826bd81d3542a419d6'
>>> get_checksum("path/to/file.txt", algorithm="MD5")
'9e107d9d372bb6826bd81d3542a419d6'

The SHA1, SHA256, SHA384, and SHA512 algorithms are also supported.


1 Disclosure: I am the author of simple-file-checksum.

Carlo answered 30/7, 2022 at 15:15 Comment(0)
0

You can make use of the shell here.

from subprocess import check_output

# for Linux (md5sum is also available on Windows via e.g. Git Bash)
hash = check_output('md5sum imp_file.txt', shell=True).decode().split()[0]

# for macOS (BSD md5 prints "MD5 (imp_file.txt) = <hash>")
hash = check_output('md5 imp_file.txt', shell=True).decode().split('= ')[1].strip()
Mancy answered 4/10, 2022 at 15:32 Comment(0)
-1

Change file_path to your file. Note that this reads the entire file into memory at once:

import hashlib
def getMd5(file_path):
    m = hashlib.md5()
    with open(file_path, 'rb') as f:
        data = f.read()  # reads the whole file; fine for small files
        m.update(data)
    return m.hexdigest()
Somatotype answered 19/2, 2021 at 7:26 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.