How to create a checksum of a file in python
I am trying to create a checksum of a file and save the checksum to a file. Then I monitor the file, and if the checksum changes, I do something.

Here is the checksum

For test.txt

contents: a
checksum: dd18bf3a8e0a2a3e53e2661c7fb53534

I edit the file:

contents: aa
checksum: dd18bf3a8e0a2a3e53e2661c7fb53534

Here is my code:

python -c 'import hashlib;print hashlib.md5("test.txt").hexdigest()'

Why are the checksums the same?

Straley answered 20/7, 2014 at 5:13 Comment(0)

Pass the file contents (not the filename) to hashlib.md5():

import hashlib
[(fname, hashlib.md5(open(fname, 'rb').read()).digest()) for fname in fnamelst]
Psychologism answered 20/7, 2014 at 5:16 Comment(4)
This will not scale for large files.Avra
Don't need it for large files. But I used hexdigest() instead, and needed as close to a one-liner as I could get.Straley
md5 is broken and should not be used.Guidepost
@Guidepost you're overgeneralizing. Yes, it has weaknesses in a cryptographic context and should not be used there. But in contexts where there's no potential attacker that could try to craft a collision, MD5 is still perfectly fine.Grackle
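On the hexdigest() point from the comments: digest() returns the raw 16 bytes of the MD5 hash, while hexdigest() returns the same value as a 32-character printable hex string. A tiny illustration of the relationship:

```python
import hashlib

h = hashlib.md5(b"a")
raw = h.digest()       # 16 raw bytes
hexed = h.hexdigest()  # the same value as a 32-character hex string

# The two forms encode the same hash value.
assert len(raw) == 16
assert len(hexed) == 32
assert hexed == raw.hex()
```

hexdigest() is the form you want when writing the checksum to a text file for later comparison.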

Why are the checksums the same?

Because you are computing a hash of the same content both times: the string "test.txt" itself, not the contents of the file.
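A quick way to see the difference (the demo file below is created just for illustration):

```python
import hashlib

# Create a small demo file (hypothetical, for illustration only).
with open("test.txt", "wb") as f:
    f.write(b"a")

# Hashing the file *name*: the digest never changes, whatever the file holds.
name_digest = hashlib.md5(b"test.txt").hexdigest()

# Hashing the file *contents*: the digest tracks what is in the file.
with open("test.txt", "rb") as f:
    content_digest = hashlib.md5(f.read()).hexdigest()

print(name_digest)
print(content_digest)
```

Edit the file and recompute: name_digest stays the same, content_digest changes.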

Here is a general purpose tool (a clone of the widely available md5sum CLI tool available on many Linux and UNIX platforms) that scales well with large files.

md5sum.py:

#!/usr/bin/env python

"""Tool to compuete md5 sums of files"""

import sys
from hashlib import md5


def md5sum(filename):
    digest = md5()
    with open(filename, "rb") as f:
        # Read in multiples of the hash block size so memory use stays bounded.
        for chunk in iter(lambda: f.read(128 * digest.block_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def main():
    if len(sys.argv) < 2:
        print "Usage: md5sum <filename>"
        raise SystemExit(1)

    print md5sum(sys.argv[1])


if __name__ == "__main__":
    main()

Liberally borrowed from: https://bitbucket.org/prologic/tools/src/tip/md5sum
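For the asker's actual goal of reacting when the file changes, md5sum() can be wrapped in a simple polling loop. A sketch under the assumption that periodic polling is acceptable; watch(), its interval, and the iterations cap are illustrative additions, not part of the original tool:

```python
import hashlib
import time


def md5sum(filename):
    # Same streaming approach as the answer above: hash in bounded chunks.
    digest = hashlib.md5()
    with open(filename, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()


def watch(filename, on_change, interval=1.0, iterations=None):
    # Poll the file; call on_change(new_digest) whenever its checksum changes.
    # iterations=None polls forever; a number caps the loop (handy for tests).
    last = md5sum(filename)
    count = 0
    while iterations is None or count < iterations:
        time.sleep(interval)
        current = md5sum(filename)
        if current != last:
            on_change(current)
            last = current
        count += 1
```

For change detection only (no interoperability with md5sum output needed), any hashlib algorithm works the same way here.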

Avra answered 20/7, 2014 at 5:14 Comment(0)

The core hasher functions in hashlib accept the byte string to be hashed, not a filename to open and read, so, as the other answers say, you're hashing the same value "test.txt" in both cases.

Python 3.11+

If you're only targeting Python 3.11+, a new option is available: hashlib.file_digest(). It takes a file object and a hash function or the name of a hashlib hash function. The equivalent to what you tried would be like this:

import hashlib
with open('test.txt', 'rb') as file:
    print(hashlib.file_digest(file, 'md5').hexdigest())

Python 2.7 to 3.10

While hashlib.file_digest() will not be available on all supported Python versions until October 2026, we can still take a look inside to get an idea of how we could make an even better version of md5sum(), using bytearray and memoryview instead of iter(lambda: f.read()).

import hashlib

def md5sum(filename, _bufsize=2**18):
    digest = hashlib.md5()
    
    buf = bytearray(_bufsize)
    view = memoryview(buf)
    with open(filename, 'rb') as file:
        while True:
            size = file.readinto(buf)
            if size == 0:
                break  # EOF
            digest.update(view[:size])
    
    return digest.hexdigest()
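To sanity-check the buffered version, it can be compared against a one-shot hash of the same bytes. A self-contained sketch (the demo file name and size are made up; the size is chosen larger than one buffer so several readinto() calls occur):

```python
import hashlib
import os


def md5sum(filename, _bufsize=2**18):
    # Buffered variant from the answer above: reuse one bytearray and hash
    # slices of it through a memoryview, avoiding a new bytes object per chunk.
    digest = hashlib.md5()
    buf = bytearray(_bufsize)
    view = memoryview(buf)
    with open(filename, 'rb') as file:
        while True:
            size = file.readinto(buf)
            if size == 0:
                break  # EOF
            digest.update(view[:size])
    return digest.hexdigest()


# Demo: a file larger than one buffer, so the read loop runs more than once.
data = b"x" * 300_000
with open("demo.bin", "wb") as f:
    f.write(data)

assert md5sum("demo.bin") == hashlib.md5(data).hexdigest()
os.remove("demo.bin")
```

The buffered and one-shot digests agree; the buffered form just keeps memory use at one buffer regardless of file size.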
Bine answered 20/8, 2023 at 1:47 Comment(0)
