Is there any simple way of generating (and checking) MD5 checksums of a list of files in Python? (I have a small program I'm working on, and I'd like to confirm the checksums of the files).
You can use hashlib.md5().
Note that sometimes you won't be able to fit the whole file in memory. In that case, you'll have to read chunks of 4096 bytes sequentially and feed them to the MD5 object's update method:
import hashlib

def md5(fname):
    hash_md5 = hashlib.md5()
    with open(fname, "rb") as f:
        for chunk in iter(lambda: f.read(4096), b""):
            hash_md5.update(chunk)
    return hash_md5.hexdigest()
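Since the question also mentions checking checksums: a minimal sketch of verifying a file against a known digest with the md5() helper above (the filename and expected value are placeholders):

# Placeholder filename and expected digest, for illustration only
expected = "5d41402abc4b2a76b9719d911017c592"
if md5("your_file.txt") == expected:
    print("checksum OK")
else:
    print("checksum MISMATCH")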
Note: hash_md5.hexdigest() will return the hex string representation of the digest. If you just need the packed bytes, use return hash_md5.digest() instead, so you don't have to convert back.
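A quick sketch of the difference (binascii.hexlify converts the packed bytes back to hex if you ever need both forms):

import binascii
import hashlib

h = hashlib.md5(b"hello")
print(h.digest())     # raw packed bytes
print(h.hexdigest())  # hex string representation
print(binascii.hexlify(h.digest()).decode())  # same as hexdigest()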
There is a way that's pretty memory inefficient.
single file:
import hashlib

def file_as_bytes(file):
    with file:
        return file.read()

print(hashlib.md5(file_as_bytes(open(full_path, 'rb'))).hexdigest())
list of files:
[(fname, hashlib.md5(file_as_bytes(open(fname, 'rb'))).digest()) for fname in fnamelst]
Recall, though, that MD5 is known to be broken and should not be used for any purpose, since vulnerability analysis can be really tricky, and analyzing any possible future use your code might be put to for security issues is impossible. IMHO, it should be flat out removed from the library so everybody who uses it is forced to update. So, here's what you should do instead:
[(fname, hashlib.sha256(file_as_bytes(open(fname, 'rb'))).digest()) for fname in fnamelst]
If you only want 128 bits' worth of digest, you can take .digest()[:16].
This will give you a list of tuples, each tuple containing the name of its file and its hash.
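For example, to truncate each SHA-256 digest to 128 bits (a sketch, reusing file_as_bytes and fnamelst from above):

# Truncate each SHA-256 digest to 16 bytes (128 bits)
[(fname, hashlib.sha256(file_as_bytes(open(fname, 'rb'))).digest()[:16]) for fname in fnamelst]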
Again I strongly question your use of MD5. You should be at least using SHA1, and given recent flaws discovered in SHA1, probably not even that. Some people think that as long as you're not using MD5 for 'cryptographic' purposes, you're fine. But stuff has a tendency to end up being broader in scope than you initially expect, and your casual vulnerability analysis may prove completely flawed. It's best to just get in the habit of using the right algorithm out of the gate. It's just typing a different bunch of letters is all. It's not that hard.
Here is a way that is more complex, but memory efficient:
import hashlib

def hash_bytestr_iter(bytesiter, hasher, ashexstr=False):
    for block in bytesiter:
        hasher.update(block)
    return hasher.hexdigest() if ashexstr else hasher.digest()

def file_as_blockiter(afile, blocksize=65536):
    with afile:
        block = afile.read(blocksize)
        while len(block) > 0:
            yield block
            block = afile.read(blocksize)
[(fname, hash_bytestr_iter(file_as_blockiter(open(fname, 'rb')), hashlib.md5()))
for fname in fnamelst]
And, again, since MD5 is broken and should not really ever be used anymore:
[(fname, hash_bytestr_iter(file_as_blockiter(open(fname, 'rb')), hashlib.sha256()))
for fname in fnamelst]
Again, you can put [:16] after the call to hash_bytestr_iter(...) if you only want 128 bits' worth of digest.
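For a single file, the same helpers work directly; passing ashexstr=True returns the hex string instead of packed bytes (a small usage sketch with a placeholder filename):

# Hex digest of one file via the block-iterator helpers above
hash_bytestr_iter(file_as_blockiter(open('your_file.txt', 'rb')), hashlib.sha256(), ashexstr=True)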
open. I believe that's worked ever since hashlib was introduced, and possibly has always worked. Old habits die hard. – Jospeh

hexdigest from the standard hashlib hash function interface. I feel that it's an unnecessary wart. And I like making even small functions widely applicable. There are many cases in which the hex of the hash is quite unnecessarily verbose, and making that the easiest-to-use version encourages people to be verbose when they don't have to be. But yes, in this case, for this specific purpose it is likely the better choice. I would still just use binascii.hexlify instead. :-) – Jospeh

If you call the hashfile function above multiple times on the same file handle, remember to reset the afile pointer when done reading each file, e.g. afile.seek(0). – Catechumen

Why mode=rb? Shouldn't rt simply convert newlines and otherwise be identical to rb? (I assume this is Python 2, since in Python 3 hashlib.md5 requires bytes and will simply refuse to accept a string.) – Skyeskyhigh

hashfile though, it's more flexible because it handles anything that has read. – Jospeh

hashlib.md5(open(full_path, 'rb').read()).hexdigest() is good enough. Thanks! – Interlocutor

with statement inside file_as_blockiter. – Jospeh

import hashlib; [(fname, hashlib.md5(open(fname, 'rb').read()).digest()) for fname in fnamelst] – Goosander

update, and that's the exact same result as if you feed update a single empty string. – Jospeh
I'm clearly not adding anything fundamentally new, but added this answer before I was up to commenting status, plus the code regions make things more clear -- anyway, specifically to answer @Nemo's question from Omnifarious's answer:
I happened to be thinking about checksums a bit (came here looking for suggestions on block sizes, specifically), and have found that this method may be faster than you'd expect. Taking the fastest (but pretty typical) timeit.timeit
or /usr/bin/time
result from each of several methods of checksumming a file of approx. 11MB:
$ ./sum_methods.py
crc32_mmap(filename) 0.0241742134094
crc32_read(filename) 0.0219960212708
subprocess.check_output(['cksum', filename]) 0.0553209781647
md5sum_mmap(filename) 0.0286180973053
md5sum_read(filename) 0.0311000347137
subprocess.check_output(['md5sum', filename]) 0.0332629680634
$ time md5sum /tmp/test.data.300k
d3fe3d5d4c2460b5daacc30c6efbc77f /tmp/test.data.300k
real 0m0.043s
user 0m0.032s
sys 0m0.010s
$ stat -c '%s' /tmp/test.data.300k
11890400
So, looks like both Python and /usr/bin/md5sum take about 30ms for an 11MB file. The relevant md5sum
function (md5sum_read
in the above listing) is pretty similar to Omnifarious's:
import hashlib

def md5sum(filename, blocksize=65536):
    hash = hashlib.md5()
    with open(filename, "rb") as f:
        for block in iter(lambda: f.read(blocksize), b""):
            hash.update(block)
    return hash.hexdigest()
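For reference, a timing like the ones above could be collected along these lines (a sketch, not the original benchmark script; the path matches the test file shown earlier):

import timeit

# Best-of-10 single runs, mirroring the "fastest result" approach described above
best = min(timeit.repeat(lambda: md5sum('/tmp/test.data.300k'), number=1, repeat=10))
print(best)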
Granted, these are from single runs (the mmap
ones are always a smidge faster when at least a few dozen runs are made), and mine's usually got an extra f.read(blocksize)
after the buffer is exhausted, but it's reasonably repeatable and shows that md5sum
on the command line is not necessarily faster than a Python implementation...
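The mmap variants aren't listed above; a minimal sketch of what md5sum_mmap might look like (my reconstruction, not the original benchmark code):

import hashlib
import mmap

def md5sum_mmap(filename):
    # Map the whole file into memory and hash the buffer in one call
    with open(filename, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return hashlib.md5(mm).hexdigest()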
EDIT: Sorry for the long delay, haven't looked at this in some time, but to answer @EdRandall's question, I'll write down an Adler32 implementation. However, I haven't run the benchmarks for it. It's basically the same as the CRC32 would have been: instead of the init, update, and digest calls, everything is a zlib.adler32()
call:
import zlib

def adler32sum(filename, blocksize=65536):
    checksum = zlib.adler32(b"")
    with open(filename, "rb") as f:
        for block in iter(lambda: f.read(blocksize), b""):
            checksum = zlib.adler32(block, checksum)
    return checksum & 0xffffffff
Note that this must start off with the empty (byte) string, as Adler sums do indeed differ when starting from zero versus their sum for b"", which is 1; CRC can start with 0 instead. The AND-ing is needed to make it a 32-bit unsigned integer, which ensures it returns the same value across Python versions.
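A quick check of those starting values (a sketch):

import zlib

print(zlib.adler32(b""))  # 1, the Adler-32 seed
print(zlib.crc32(b""))    # 0, so CRC-32 can start from zero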
In Python 3.8+, you can use an assignment expression (the "walrus" operator :=) along with hashlib like this:
import hashlib

with open("your_filename.txt", "rb") as f:
    file_hash = hashlib.md5()
    while chunk := f.read(8192):
        file_hash.update(chunk)

print(file_hash.digest())
print(file_hash.hexdigest())  # to get a printable str instead of bytes
Consider using hashlib.blake2b instead of md5 (just replace md5 with blake2b in the above snippet). It's cryptographically secure and faster than MD5.
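For completeness, the same snippet with that swap applied:

import hashlib

with open("your_filename.txt", "rb") as f:
    file_hash = hashlib.blake2b()
    while chunk := f.read(8192):
        file_hash.update(chunk)

print(file_hash.hexdigest())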
A one-liner using pathlib (this reads the whole file into memory):

import hashlib
import pathlib

hashlib.md5(pathlib.Path('path/to/file').read_bytes()).hexdigest()
In Python 3.11+, there's a new readable and memory-efficient method:

import hashlib

with open(path, "rb") as f:
    digest = hashlib.file_digest(f, "md5")

print(digest.hexdigest())
You could use simple-file-checksum 1, which just uses subprocess to call openssl for macOS/Linux and CertUtil for Windows, and extracts only the digest from the output:
Installation:
pip install simple-file-checksum
Usage:
>>> from simple_file_checksum import get_checksum
>>> get_checksum("path/to/file.txt")
'9e107d9d372bb6826bd81d3542a419d6'
>>> get_checksum("path/to/file.txt", algorithm="MD5")
'9e107d9d372bb6826bd81d3542a419d6'
The SHA1, SHA256, SHA384, and SHA512 algorithms are also supported.
1 Disclosure: I am the author of simple-file-checksum.
You can make use of the shell here:
from subprocess import check_output

# for Linux (and Windows, where md5sum is available)
hash = check_output(args='md5sum imp_file.txt', shell=True).decode().split(' ')[0]

# for macOS: `md5` prints "MD5 (file) = <digest>", so take the part after '='
hash = check_output(args='md5 imp_file.txt', shell=True).decode().split('=')[1].strip()
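A variant that avoids shell=True, which is generally safer with untrusted filenames (the filename is a placeholder):

from subprocess import check_output

# Passing args as a list avoids invoking a shell
hash = check_output(['md5sum', 'imp_file.txt']).decode().split()[0]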
Change file_path to your file:
import hashlib

def getMd5(file_path):
    m = hashlib.md5()
    with open(file_path, 'rb') as f:
        lines = f.read()
        m.update(lines)
    md5code = m.hexdigest()
    return md5code
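Usage (the path is a placeholder):

print(getMd5('path/to/file'))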
md5sum? – Useless

md5sum. That is why security-conscious programmers should not use it in my opinion. – Wideman

md5sum and the technique described in this SO question should be avoided - it's better to use SHA-2 or SHA-3, if possible: en.wikipedia.org/wiki/Secure_Hash_Algorithms – Apologize

hashlib.blake2b, which is both faster than md5 and secure. – Dior

b2sum command available on Ubuntu. – Dior