How do I calculate the MD5 checksum of a file in Python? [duplicate]

I have written some code in Python that checks for an MD5 hash in a file and makes sure the hash matches that of the original.

Here is what I have developed:

# Defines filename
filename = "file.exe"

# Gets MD5 from file 
def getmd5(filename):
    return m.hexdigest()

md5 = dict()

for fname in filename:
    md5[fname] = getmd5(fname)

# If statement for alerting the user whether the checksum passed or failed

if md5 == '>md5 will go here<': 
    print("MD5 Checksum passed. You may now close this window")
    input ("press enter")
else:
    print("MD5 Checksum failed. Incorrect MD5 in file 'filename'. Please download a new copy")
    input("press enter") 
exit

But whenever I run the code, I get the following error:

Traceback (most recent call last):
File "C:\Users\Username\md5check.py", line 13, in <module>
 md5[fname] = getmd5(fname)
File "C:\Users\Username\md5check.py", line 9, in getmd5
  return m.hexdigest()
NameError: global name 'm' is not defined

Is there anything I am missing in my code?

Valvate answered 1/6, 2013 at 16:5 Comment(0)

As for your error and what is missing in your code: m is a name that is never defined inside the getmd5() function, so Python raises a NameError when the function tries to call m.hexdigest().

No offence, I know you are a beginner, but your code is all over the place. Let's look at your issues one by one :)

First, you are not using the hashlib.md5() and hexdigest() methods correctly. Please refer to the explanation of the hashlib functions in the Python documentation. The correct way to get the MD5 of a given string is something like this (the b prefix makes the input a bytes object, which Python 3 requires):

>>> import hashlib
>>> hashlib.md5(b"example string").hexdigest()
'2a53375ff139d9837e93a38a279d63e5'

However, you have a bigger problem here. You are calculating the MD5 of a file name string, whereas an MD5 checksum is calculated from the file's contents. You will need to read the file contents and pipe them through MD5. My next example is not very efficient, but it would be something like this:

>>> import hashlib
>>> hashlib.md5(open('filename.exe','rb').read()).hexdigest()
'd41d8cd98f00b204e9800998ecf8427e'

As you can see, the second MD5 hash is completely different from the first one. That is because we are now hashing the contents of the file, not just the file name.

A simple solution could be something like this:

# Import hashlib library (md5 method is part of it)
import hashlib

# File to check
file_name = 'filename.exe'

# Correct original md5 goes here
original_md5 = '5d41402abc4b2a76b9719d911017c592'  

# Open the file, read it, and calculate the MD5 of its contents
# (the with statement closes the file automatically)
with open(file_name, 'rb') as file_to_check:
    # read contents of the file
    data = file_to_check.read()
    # pipe contents of the file through
    md5_returned = hashlib.md5(data).hexdigest()

# Finally compare original MD5 with freshly calculated
if original_md5 == md5_returned:
    print("MD5 verified.")
else:
    print("MD5 verification failed!")
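Tying this back to the question, a minimal sketch of a working getmd5() might look like the following. It assumes Python 3; the verify() helper and the shape of its `expected` dictionary are hypothetical names introduced only for illustration. Note that the loop in the question, `for fname in filename`, iterates over the characters of a single string, so the sketch takes a dictionary of file names instead.

```python
import hashlib


def getmd5(filename):
    """Return the hex MD5 digest of the file's contents."""
    # The hash object has to be created inside the function; in the question
    # it was never created, hence the NameError on `m`.
    m = hashlib.md5()
    with open(filename, "rb") as f:
        m.update(f.read())
    return m.hexdigest()


def verify(expected):
    """Map each file name to True if its MD5 matches the expected digest.

    `expected` is a hypothetical dict of {file name: expected hex digest}.
    """
    return {
        fname: getmd5(fname) == digest.lower()
        for fname, digest in expected.items()
    }
```

Calling verify() with a dictionary such as {file_name: original_md5} would return {file_name: True} when the checksum matches and {file_name: False} otherwise.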

Please see the post "Python: Generating a MD5 checksum of a file", which explains in detail a couple of ways to do this efficiently.

Best of luck.

Fidellas answered 1/6, 2013 at 19:25 Comment(4)
Wow. I feel so embarrassed. I guess I put the wrong code for what I was doing, and added a lot of mistakes along with it. Thanks for your help. I am more used to batch and Lua, though, so Python feels picky to me. - Valvate
You should also open the file in binary mode with open(file_name, 'rb'), otherwise you might get problems when the OS does newline/carriage return conversions. See mail.python.org/pipermail/tutor/2004-January/027634.html and #3432325 - Rafe
If you are working on a binary file, make sure you read it correctly with 'b' mode. I finally got it working as expected with this: hashlib.sha512(open(fn,'rb').read()).hexdigest() - Oversight
Generating the MD5 hash of a larger file needs a different approach. You'll have to read it in chunks and update the hash for each chunk. Reference - Ifill

In Python 3.8+ you can do

import hashlib

with open("your_filename.png", "rb") as f:
    file_hash = hashlib.md5()
    while chunk := f.read(8192):
        file_hash.update(chunk)

print(file_hash.digest())
print(file_hash.hexdigest())  # to get a printable str instead of bytes

On Python 3.7 and below:

import hashlib

with open("your_filename.png", "rb") as f:
    file_hash = hashlib.md5()
    chunk = f.read(8192)
    while chunk:
        file_hash.update(chunk)
        chunk = f.read(8192)

print(file_hash.hexdigest())

This reads the file 8192 (or 2¹³) bytes at a time instead of reading it all at once with f.read(), so it uses less memory.
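The same loop can be wrapped in a reusable function. The sketch below is one possible way to do it and should work on any Python 3 version; the name md5_of_file and the chunk_size parameter are assumptions made for illustration, not part of the original answer. It uses the two-argument form of iter(), which keeps calling f.read(chunk_size) until it returns the empty bytes object b"".

```python
import hashlib
from functools import partial


def md5_of_file(path, chunk_size=8192):
    """Return the hex MD5 digest of a file, reading it chunk by chunk."""
    file_hash = hashlib.md5()
    with open(path, "rb") as f:
        # iter(callable, sentinel) stops once read() returns b"" at end of file.
        for chunk in iter(partial(f.read, chunk_size), b""):
            file_hash.update(chunk)
    return file_hash.hexdigest()
```

On Python 3.11 and later, hashlib.file_digest(f, "md5") does the chunked reading for you when given a file opened in binary mode.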


Consider using hashlib.blake2b instead of md5 (just replace md5 with blake2b in the above snippets). It's cryptographically secure and faster than MD5.
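Switching algorithms does not require changing the loop. As a rough sketch (the function name hash_file and its parameters are hypothetical), hashlib.new() accepts the algorithm name as a string, so the same code can produce either an MD5 or a BLAKE2b digest:

```python
import hashlib


def hash_file(path, algorithm="blake2b", chunk_size=8192):
    """Return the hex digest of a file using the named hashlib algorithm."""
    file_hash = hashlib.new(algorithm)  # e.g. "md5", "sha256", "blake2b"
    with open(path, "rb") as f:
        chunk = f.read(chunk_size)
        while chunk:
            file_hash.update(chunk)
            chunk = f.read(chunk_size)
    return file_hash.hexdigest()
```

Keep in mind that a BLAKE2b digest is 64 bytes by default, so its hex string is 128 characters long rather than MD5's 32. The two are not interchangeable when you compare against a checksum published by someone else.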

Pitanga answered 26/11, 2019 at 17:56 Comment(3)
Thanks! This way is consistently 11% faster than the inline "hashlib.md5(open('filename.exe','rb').read()).hexdigest()" when testing on rather large files (about 4MB). - Cestar
@RyanLoggerythm that is surprising, are you sure you are profiling correctly? I would've expected that if you have enough RAM to read the entire file into it in one go, that would be faster. Also, in 2021 (hell, even in 2005), 4MB is not a "rather large" file. We load websites bigger than that just to check the weather. - Pitanga
Hey Boris, yeah, I'm running on 32 GB of RAM. Granted, it was a couple of quick tests and 4MB was the largest file in my folder :). I just generated a 1GB file and PSS's one-line hash averaged 1.695 seconds across 5 runs, whereas your answer averaged 1.454 seconds, a 14% reduction in time! - Cestar

hashlib functions also accept mmap objects (from the mmap module), so I often use

from hashlib import md5
from mmap import mmap, ACCESS_READ

path = ...
with open(path, 'rb') as file, mmap(file.fileno(), 0, access=ACCESS_READ) as mm:
    print(md5(mm).hexdigest())

where path is the path to your file.

Ref: https://docs.python.org/library/mmap.html#mmap.mmap
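One caveat is that an empty file cannot be memory-mapped, so a robust version needs a fallback. The sketch below is one way to handle that under Python 3; md5_mmap is a hypothetical helper name, and checking the size with os.fstat() is an assumption about how you might detect the empty case.

```python
import os
from hashlib import md5
from mmap import ACCESS_READ, mmap


def md5_mmap(path):
    """Return the hex MD5 digest of a file, using mmap when possible."""
    with open(path, "rb") as file:
        # mmap refuses zero-length files, so hash empty input directly.
        if os.fstat(file.fileno()).st_size == 0:
            return md5(b"").hexdigest()
        with mmap(file.fileno(), 0, access=ACCESS_READ) as mm:
            return md5(mm).hexdigest()
```

For an empty file this returns the MD5 of zero bytes, which is the same value a plain read() of that file would produce.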

Edit: Comparison with the plain-read method.

[Figure: plot of time and memory usage]

from hashlib import md5
from mmap import ACCESS_READ, mmap

from matplotlib.pyplot import grid, legend, plot, show, tight_layout, xlabel, ylabel
from memory_profiler import memory_usage
from numpy import arange

def MemoryMap():
    with open(path, 'rb') as file, mmap(file.fileno(), 0, access=ACCESS_READ) as mm:
        print(md5(mm).hexdigest())

def PlainRead():
    with open(path, 'rb') as file:
        print(md5(file.read()).hexdigest())

if __name__ == '__main__':
    path = ...
    y = memory_usage(MemoryMap, interval=0.01)
    plot(arange(len(y)) / 100, y, label='mmap')
    y = memory_usage(PlainRead, interval=0.01)
    plot(arange(len(y)) / 100, y, label='read')
    ylabel('Memory Usage (MiB)')
    xlabel('Time (s)')
    legend()
    grid()
    tight_layout()
    show()

path is the path to a 3.77GiB csv file.

Kim answered 6/5, 2021 at 5:23 Comment(4)
Does this read the entire file into memory? Then why not just do hashlib.md5(file.read()).hexdigest()? - Pitanga
@Boris From the figure, I think the mmap method alternately reads and processes, but it seems memory is not released after use. - Kim
LOTS of respect to liurui39660 for this! I look at this and think that both implementations use the same amount of memory in the end, so the difference in execution time will be more significant depending on whether you are getting the hash of 1 file or 100 million files. - Partan
The only problem with mmap is an empty file. It breaks the flow and forces the code to use an alternative method for obtaining the MD5 only in the empty-file case. - Amund

You can calculate the checksum of a file by reading the binary data and using hashlib.md5().hexdigest(). A function to do this would look like the following:

import hashlib
import os

def File_Checksum_Dis(dirname):
    # Stop early if the directory does not exist
    if not os.path.exists(dirname):
        print(dirname + " directory does not exist")
        return

    for fname in os.listdir(dirname):
        # Skip editor backup files
        if not fname.endswith('~'):
            fnaav = os.path.join(dirname, fname)
            # Read the whole file in binary mode
            with open(fnaav, 'rb') as fd:
                data = fd.read()

            print("-" * 70)
            print("File Name is: ", fname)
            print(hashlib.md5(data).hexdigest())
            print("-" * 70)
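For directories that may contain large files, a variant that hashes each file in chunks and returns the digests rather than printing them could look like this. It is only a sketch: the name directory_checksums, the chunk_size parameter, and the choice to skip subdirectories are assumptions, not part of the function above.

```python
import hashlib
import os


def directory_checksums(dirname, chunk_size=65536):
    """Return {file name: hex MD5} for the regular files directly in dirname."""
    checksums = {}
    for fname in sorted(os.listdir(dirname)):
        path = os.path.join(dirname, fname)
        # Skip editor backup files and anything that is not a regular
        # file, such as subdirectories.
        if fname.endswith("~") or not os.path.isfile(path):
            continue
        file_hash = hashlib.md5()
        with open(path, "rb") as fd:
            # Read a chunk at a time so large files never sit fully in memory.
            for chunk in iter(lambda: fd.read(chunk_size), b""):
                file_hash.update(chunk)
        checksums[fname] = file_hash.hexdigest()
    return checksums
```

Returning a dictionary keeps the hashing separate from the output, so the caller can print the results, compare them against known checksums, or write them to a file.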
                
Grass answered 14/7, 2020 at 19:33 Comment(1)
This approach doesn't work well for large files. - Deepfreeze
