In Python, is there a concise way of comparing whether the contents of two text files are the same?
E

10

78

I don't care what the differences are. I just want to know whether the contents are different.

Encroach answered 31/10, 2008 at 17:47 Comment(0)
E
96

The low level way:

from __future__ import with_statement
with open(filename1) as f1:
   with open(filename2) as f2:
      if f1.read() == f2.read():
         ...

The high level way:

import filecmp
if filecmp.cmp(filename1, filename2, shallow=False):
   ...
Emden answered 31/10, 2008 at 17:50 Comment(12)
I corrected your filecmp.cmp call, because without a non-true shallow argument, it doesn't do what the question asks for.Spirograph
You're right. python.org/doc/2.5.2/lib/module-filecmp.html . Thank you very much.Emden
btw, one should open the files in binary mode to be sure, since the files can differ in line separators.Pentacle
This can have problems if the files are huge. You can possibly save some effort by the computer if the first thing you do is compare file sizes. If the sizes are different, obviously the files are different. You only need to read the files if the sizes are the same.Patten
I just found out that filecmp.cmp() also compares metadata as well, such as inode number and ctime and other stats. This was undesirable in my use-case. If you just want to compare contents without comparing metadata, f1.read() == f2.read() is probably a better way.Controversy
Is this working when files are the same but the order is changed?Rancor
@Spirograph shallow=True will still compare the files byte-by-byte in most cases, right? It only skips the check if the files are hardlinks to the same inode, which is a perfectly safe thing to skip?Dunsinane
@Controversy Why is comparing metadata undesirable?Dunsinane
I just realized that filecmp.cmp follows links, so it will say that file.txt and Link to file.txt are equal. I guess cmp does this too, though?Dunsinane
@Dunsinane just do an import filecmp; help(filecmp.cmp) in a Python console. shallow=True won't compare contents, only metadata.Spirograph
@Spirograph It compares contents except in the rare case that the metadata matches exactly (which means the two files are hardlinks)Dunsinane
What effect does shallow=False have?Liquorice
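To make the shallow discussion above concrete: according to the filecmp documentation, shallow=True treats two regular files as equal when their os.stat() signatures (file type, size, modification time) match, and only reads the contents when the signatures differ. The sketch below is an illustration under that reading, not a statement about every platform; the helper name shallow_demo and the file names are made up for the example. It forces two different files to share a signature and then compares them both ways:

```python
import filecmp
import os
import tempfile


def shallow_demo():
    """Return (shallow_result, deep_result) for two same-size files with different bytes."""
    with tempfile.TemporaryDirectory() as tmp:
        path_a = os.path.join(tmp, "a.txt")  # hypothetical file names
        path_b = os.path.join(tmp, "b.txt")
        with open(path_a, "wb") as fh:
            fh.write(b"aaaa")
        with open(path_b, "wb") as fh:
            fh.write(b"bbbb")

        # Give both files the same access and modification times, so that
        # type, size and mtime (the signature filecmp looks at) all match.
        os.utime(path_a, (1000000000, 1000000000))
        os.utime(path_b, (1000000000, 1000000000))

        shallow = filecmp.cmp(path_a, path_b, shallow=True)   # signature only
        deep = filecmp.cmp(path_a, path_b, shallow=False)     # reads the contents
        return shallow, deep


if __name__ == "__main__":
    print(shallow_demo())
```

On a typical filesystem this should report the files as equal under shallow=True and different under shallow=False, which is why shallow=False is the safer choice when only the contents matter.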
P
36

If you're going for even basic efficiency, you probably want to check the file size first:

import os

if os.path.getsize(filename1) == os.path.getsize(filename2):
  if open(filename1,'r').read() == open(filename2,'r').read():
    pass  # Files are the same.

This saves you reading every line of two files that aren't even the same size, and thus can't be the same.

(Even further than that, you could call out to a fast MD5sum of each file and compare those, but that's not "in Python", so I'll stop here.)

Pyrimidine answered 31/10, 2008 at 17:56 Comment(4)
The md5sum approach will be slower with just 2 files (You still need to read the file to compute the sum) It only pays off when you're looking for duplicates among several files.Happ
@Brian: you're assuming that md5sum's file reading is no faster than Python's, and that there's no overhead from reading the entire file into the Python environment as a string! Try this with 2GB files...Pyrimidine
There's no reason to expect md5sum's file reading would be faster than Python's - IO is pretty independent of language. The large file problem is a reason to iterate in chunks (or use filecmp), not to use md5 where you're needlessly paying an extra CPU penalty.Happ
This is especially true when you consider the case when the files are not identical. Comparing by blocks can bail out early, but md5sum must carry on reading the entire file.Happ
S
15

This is a functional-style file comparison function. It immediately returns False if the files have different sizes; otherwise, it reads both files in 4 KiB blocks and returns False as soon as it finds a difference:

from __future__ import with_statement
import os
import itertools, functools, operator
try:
    izip= itertools.izip  # Python 2
except AttributeError:
    izip= zip  # Python 3

def filecmp(filename1, filename2):
    "Do the two files have exactly the same contents?"
    with open(filename1, "rb") as fp1, open(filename2, "rb") as fp2:
        if os.fstat(fp1.fileno()).st_size != os.fstat(fp2.fileno()).st_size:
            return False # different sizes ∴ not equal

        # set up one 4k-reader for each file
        fp1_reader= functools.partial(fp1.read, 4096)
        fp2_reader= functools.partial(fp2.read, 4096)

        # pair each 4k-chunk from the two readers while they do not return b'' (EOF)
        cmp_pairs= izip(iter(fp1_reader, b''), iter(fp2_reader, b''))

        # return True for all pairs that are not equal
        inequalities= itertools.starmap(operator.ne, cmp_pairs)

        # voilà; any() stops at first True value
        return not any(inequalities)

if __name__ == "__main__":
    import sys
    print(filecmp(sys.argv[1], sys.argv[2]))

Just a different take :)

Spirograph answered 31/10, 2008 at 23:3 Comment(3)
Quite hacky, using all shortcuts, itertools and partial - kudos, this is the best solution!Romanfleuve
I had to make a slight change in Python 3, otherwise the function never returned: cmp_pairs= izip(iter(fp1_reader, b''), iter(fp2_reader, b''))Gleesome
@TedStriker you are correct! thanks for helping improve this answer :)Spirograph
I
6

Since I can't comment on the answers of others I'll write my own.

If you use MD5, you should not simply call md5.update(f.read()), since that loads the whole file into memory at once. Read and hash the file in chunks instead:

import hashlib

def get_file_md5(f, chunk_size=8192):
    # f must be a file object opened in binary mode ("rb")
    h = hashlib.md5()
    while True:
        chunk = f.read(chunk_size)
        if not chunk:
            break
        h.update(chunk)
    return h.hexdigest()
Impetus answered 31/10, 2008 at 19:6 Comment(4)
I believe that any hashing operation is overkill for this question's purposes; direct piece-by-piece comparison is faster and more straight.Spirograph
I was just clearing up the actual hashing part someone suggested.Impetus
+1 I like your version better. Also, I don't think using a hash is overkill. There's really no good reason not to if all you want to know is whether or not they're different.Oarfish
@Jeremy Cantrell: one computes hashes when they are to be cached/stored, or compared to cached/stored ones. Otherwise, just compare strings. Whatever the hardware, str1 != str2 is faster than md5.new(str1).digest() != md5.new(str2).digest(). Hashes also have collisions (unlikely but not impossible).Spirograph
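The comments above suggest that hashing pays off mainly when digests are stored or when more than two files are involved. As a rough sketch of that case (not something any answer here proposes verbatim), the code below groups a list of paths by size first and hashes only the files that share a size. The names file_digest_hex and group_identical are hypothetical, and SHA-256 is an arbitrary choice of digest:

```python
import hashlib
import os
from collections import defaultdict


def file_digest_hex(path, chunk_size=8192):
    """Hash a file in fixed-size chunks so memory use stays bounded."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def group_identical(paths):
    """Return lists of paths (two or more each) whose contents hash identically."""
    by_size = defaultdict(list)
    for path in paths:
        by_size[os.path.getsize(path)].append(path)

    groups = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue  # a unique size cannot have a duplicate, so skip hashing
        by_digest = defaultdict(list)
        for path in same_size:
            by_digest[file_digest_hex(path)].append(path)
        groups.extend(group for group in by_digest.values() if len(group) > 1)
    return groups
```

Each file is read at most once, however many candidates there are. Because digests can collide in principle, a final byte-by-byte check within each group would be needed if certainty matters.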
O
4

I would use a hash of the file's contents using MD5.

import hashlib

def checksum(f):
    md5 = hashlib.md5()
    # Binary mode: hashlib needs bytes, and the with block closes the file.
    with open(f, 'rb') as fh:
        md5.update(fh.read())
    return md5.hexdigest()

def is_contents_same(f1, f2):
    return checksum(f1) == checksum(f2)

if not is_contents_same('foo.txt', 'bar.txt'):
    print('The contents are not the same!')
Oarfish answered 31/10, 2008 at 18:53 Comment(0)
M
2

f = open(filename1, "r").read()
f2 = open(filename2,"r").read()
print f == f2


Marotta answered 31/10, 2008 at 17:52 Comment(2)
“Well, I have this 8 GiB file and that 32 GiB file that I want to compare…”Spirograph
This is not a good way to do this. A big issue is the files are never closed after opening. Less critically, there is no optimization, for example a file size comparison, before opening and reading the files.Vermiform
C
1

For larger files you could compute a MD5 or SHA hash of the files.

Chilcote answered 31/10, 2008 at 17:56 Comment(2)
So what about two 32 GiB files differing in the first byte only? Why spend CPU time and wait too long for an answer?Spirograph
See my solution, for larger files it is better to do buffered readsRosaline
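If a digest is wanted anyway, for example to store it and compare against later, Python 3.11 added hashlib.file_digest, which reads the file object in chunks internally. The following is a hedged sketch with a manual chunked fallback for older interpreters; the helper name sha256_of and the 64 KiB chunk size are assumptions made for illustration:

```python
import hashlib


def sha256_of(path, chunk_size=1 << 16):
    """Return the SHA-256 hex digest of a file without loading it all at once."""
    with open(path, "rb") as fh:
        if hasattr(hashlib, "file_digest"):  # available from Python 3.11
            return hashlib.file_digest(fh, "sha256").hexdigest()
        # Fallback for older versions: feed the hash fixed-size chunks.
        digest = hashlib.sha256()
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
        return digest.hexdigest()
```

As the comments point out, this still reads every byte of both files, so for a one-off comparison of two files a chunked direct comparison can stop earlier.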
A
1
from __future__ import with_statement

filename1 = "G:\\test1.TXT"
filename2 = "G:\\test2.TXT"

with open(filename1) as f1:
    with open(filename2) as f2:
        file1list = f1.read().splitlines()
        file2list = f2.read().splitlines()
        list1length = len(file1list)
        list2length = len(file2list)
        if list1length == list2length:
            # Same number of lines: compare them pairwise
            for index in range(len(file1list)):
                if file1list[index] == file2list[index]:
                    print(file1list[index] + "==" + file2list[index])
                else:
                    print(file1list[index] + "!=" + file2list[index] + " Not-Equal")
        else:
            print("difference in the size of the file and number of lines")
Armored answered 15/12, 2016 at 17:10 Comment(0)
R
0

Simple and efficient solution:

import os


def is_file_content_equal(
    file_path_1: str, file_path_2: str, buffer_size: int = 1024 * 8
) -> bool:
    """Checks if two files content is equal
    Arguments:
        file_path_1 (str): Path to the first file
        file_path_2 (str): Path to the second file
        buffer_size (int): Size of the buffer to read the file
    Returns:
        bool that indicates if the file contents are equal
    Example:
        >>> is_file_content_equal("filecomp.py", "filecomp copy.py")
            True
        >>> is_file_content_equal("filecomp.py", "diagram.dio")
            False
    """
    # First check sizes
    s1, s2 = os.path.getsize(file_path_1), os.path.getsize(file_path_2)
    if s1 != s2:
        return False
    # If the sizes are the same check the content
    with open(file_path_1, "rb") as fp1, open(file_path_2, "rb") as fp2:
        while True:
            b1 = fp1.read(buffer_size)
            b2 = fp2.read(buffer_size)
            if b1 != b2:
                return False
            # if the content is the same and they are both empty bytes
            # the file is the same
            if not b1:
                return True
Rosaline answered 31/7, 2021 at 11:10 Comment(0)
M
0

filecmp is great for easy comparison of files, but doesn't allow you to print the line number or difference in the files:

import filecmp

def compare_files(filename1, filename2):
    return filecmp.cmp(filename1, filename2, shallow=False)

Here's a simple and efficient solution that is a bit more flexible in that you can print status of comparison, line numbers, and the line values of where there is a difference in the files:

def compare_with_line_diff(filename1, filename2):
    with open(filename1, "r") as file1, open(filename2, "r") as file2:

        # Loop for all lines in first file (keep only 2 lines in memory)
        for line_num, f1_line in enumerate(file1, start=1):

            # Only print status for range of lines
            if (line_num == 1 or line_num % 1000 == 0):
                print(f"comparing lines {line_num} to {line_num + 1000}")

            # Compare with next line of file2
            f2_line = file2.readline()
            if (f1_line != f2_line):
                print(f"Difference on line: {line_num}")
                print(f"f1_line: '{f1_line}'")
                print(f"f2_line: '{f2_line}'")
                return False

        # Check if file2 has more lines than file1
        for extra_line in file2:
            print(f"Difference on file2: {extra_line}")
            return False

    # Files are equal
    return True
Munniks answered 15/9, 2023 at 2:51 Comment(0)
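When the differences themselves are of interest, the standard library's difflib module can produce them directly. This is a minimal sketch rather than a replacement for the function above: the name print_unified_diff is hypothetical, and unlike the line-by-line loop it reads both files fully into memory, so it suits small to medium text files.

```python
import difflib


def print_unified_diff(filename1, filename2):
    """Print a unified diff of two text files; return True if they are identical."""
    with open(filename1, "r") as file1, open(filename2, "r") as file2:
        lines1 = file1.readlines()
        lines2 = file2.readlines()

    # unified_diff yields nothing at all when the two line lists are equal.
    diff = list(
        difflib.unified_diff(lines1, lines2, fromfile=filename1, tofile=filename2)
    )
    for line in diff:
        print(line, end="")
    return not diff
```

The return value follows the same convention as compare_with_line_diff: True means no differences were found.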
