Compare binary files in C#
Asked Answered
J

8

34

I want to compare two binary files. One of them is already stored on the server with a pre-calculated CRC32 in the database from when I stored it originally.

I know that if the CRC is different, then the files are definitely different. However, if the CRC is the same, I don't know that the files are. So, I'm looking for a nice efficient way of comparing the two streams: one from the posted file and one from the file system.

I'm not an expert on streams, but I'm well aware that I could easily shoot myself in the foot here as far as memory usage is concerned.

Jumada answered 9/6, 2009 at 8:50 Comment(0)
B
43
static bool FileEquals(string fileName1, string fileName2)
{
    // Check the file size and CRC equality here.. if they are equal...    
    using (var file1 = new FileStream(fileName1, FileMode.Open))
        using (var file2 = new FileStream(fileName2, FileMode.Open))
            return FileStreamEquals(file1, file2);
}

static bool FileStreamEquals(Stream stream1, Stream stream2)
{
    const int bufferSize = 2048;
    byte[] buffer1 = new byte[bufferSize]; //buffer size
    byte[] buffer2 = new byte[bufferSize];
    while (true) {
        int count1 = stream1.Read(buffer1, 0, bufferSize);
        int count2 = stream2.Read(buffer2, 0, bufferSize);

        if (count1 != count2)
            return false;

        if (count1 == 0)
            return true;

        // You might replace the following with an efficient "memcmp"
        if (!buffer1.Take(count1).SequenceEqual(buffer2.Take(count2)))
            return false;
    }
}
B answered 9/6, 2009 at 9:5 Comment(11)
Require conunt1 == count2 could be inaccurate, as Stream.Read is free to return a block that has a length less than requested byte count. see msdn.microsoft.com/en-us/library/vstudio/…Penult
Thanks for the solution Mehrdad. Do you need the Take calls? I tried only if (!buffer1.SequenceEqual(buffer2)) and it seems to work.Chiquia
@Ozgur it works but it is less efficient and not very principled IMO.B
I agree with to comment by @Karata. You should make sure that the buffer is filled as much as possible by either repeatedly calling Stream.Read(), or use StreamReader.ReadBlock().Katzman
@Katzman I agree. For a FileStream that's generally not a problem though. I will rename the function to reflect that.B
The docs say it may be a problem even for FileStream. Do you mean it is usually not a problem? Or are the docs misleading?Katzman
@Katzman Well, I stand corrected then--if the docs say that for FileStream, that's the contract, even if it doesn't fail for local files. Will update the code.B
@Penult is absolutely right, and this faulty code should not stand uncorrected (somebody may use it on a piece of software I use later). At the very least FileStreamEquals should take two properly typed FileStream arguments; a weak case can probably be made that usually a Read request for n bytes from a file indeed reads n bytes if nothing went wrong. But would you bet your life (or your company) on every contingency? What about network mapped drives? Named pipes?Persas
Comment in February: "Will update the code" It's now September. When do you plan to fix the error in the above?Progressionist
As other said the (count1 != count2) is wrong. it works for FileStream in many devices but its so deceptive and dangerous. I recommended down voteNitty
@T.J.Crowder I'm still waitingThorley
S
22

I sped up the "memcmp" by using a Int64 compare in a loop over the read stream chunks. This reduced time to about 1/4.

    private static bool StreamsContentsAreEqual(Stream stream1, Stream stream2)
    {
        const int bufferSize = 2048 * 2;
        var buffer1 = new byte[bufferSize];
        var buffer2 = new byte[bufferSize];

        while (true)
        {
            int count1 = stream1.Read(buffer1, 0, bufferSize);
            int count2 = stream2.Read(buffer2, 0, bufferSize);

            if (count1 != count2)
            {
                return false;
            }

            if (count1 == 0)
            {
                return true;
            }

            int iterations = (int)Math.Ceiling((double)count1 / sizeof(Int64));
            for (int i = 0; i < iterations; i++)
            {
                if (BitConverter.ToInt64(buffer1, i * sizeof(Int64)) != BitConverter.ToInt64(buffer2, i * sizeof(Int64)))
                {
                    return false;
                }
            }
        }
    }
Sverdlovsk answered 14/4, 2010 at 12:30 Comment(4)
Is this advantageous only for 64-bit CPUs or will this help on 32-bit CPUs as well?Demisec
It should not matter if you have a 32-bit or 64-bit operating system running. But I never tried it on a pure 32-bit CPU. You have to try it and maybe just change Int64 to int32. But aren't most of more or less modern CPUs capable of 64-bit operations (x86 since 2004)? Go ahead and try it!Sverdlovsk
See comments on this answer. Relying on count1 equalling count2 is not reliable.Progressionist
Does this handle the last 0-7 bytes correctly? Did you test it on two files where fileSize % sizeof(Int64) > 0 and only the last byte is different?Postmortem
F
9

This is how I would do it if you didn't want to rely on crc:

    /// <summary>
    /// Binary comparison of two files
    /// </summary>
    /// <param name="fileName1">the file to compare</param>
    /// <param name="fileName2">the other file to compare</param>
    /// <returns>a value indicateing weather the file are identical</returns>
    public static bool CompareFiles(string fileName1, string fileName2)
    {
        FileInfo info1 = new FileInfo(fileName1);
        FileInfo info2 = new FileInfo(fileName2);
        bool same = info1.Length == info2.Length;
        if (same)
        {
            using (FileStream fs1 = info1.OpenRead())
            using (FileStream fs2 = info2.OpenRead())
            using (BufferedStream bs1 = new BufferedStream(fs1))
            using (BufferedStream bs2 = new BufferedStream(fs2))
            {
                for (long i = 0; i < info1.Length; i++)
                {
                    if (bs1.ReadByte() != bs2.ReadByte())
                    {
                        same = false;
                        break;
                    }
                }
            }
        }

        return same;
    }
Febrifugal answered 23/8, 2013 at 14:27 Comment(1)
info2 should be taking fileName2 as argument instead of fileName1. Otherwise, nice solution :-).Termagant
T
6

The accepted answer had an error that was pointed out, but never corrected: stream read calls are not guaranteed to return all bytes requested.

BinaryReader ReadBytes calls are guaranteed to return as many bytes as requested unless the end of the stream is reached first.

The following code takes advantage of BinaryReader to do the comparison:

    static private bool FileEquals(string file1, string file2)
    {
        using (FileStream s1 = new FileStream(file1, FileMode.Open, FileAccess.Read, FileShare.Read))
        using (FileStream s2 = new FileStream(file2, FileMode.Open, FileAccess.Read, FileShare.Read))
        using (BinaryReader b1 = new BinaryReader(s1))
        using (BinaryReader b2 = new BinaryReader(s2))
        {
            while (true)
            {
                byte[] data1 = b1.ReadBytes(64 * 1024);
                byte[] data2 = b2.ReadBytes(64 * 1024);
                if (data1.Length != data2.Length)
                    return false;
                if (data1.Length == 0)
                    return true;
                if (!data1.SequenceEqual(data2))
                    return false;
            }
        }
    }
Tallulah answered 11/11, 2017 at 16:5 Comment(4)
But, the downside of this is that it allocates both files in memory. Much more efficient is using multiple Read on FileStream to get missing bytesThorley
@Thorley No, it does not read entire files in memory as you seem to suggest. Yes there is some buffering but I think you'll find the code is very efficient, reading both files sequentially in 64k chunks.Tallulah
@I've done some benchmarking: pastebin.com/raw/ky9D8yndThorley
@Thorley That is useful. My post was not intended to show an absolutely optimal method, just a simple one that averted the serious bug in a previous posting. If I interpret your post correctly, the ForceRead approach (which was not posted previously) is a bit more efficient.Tallulah
W
3

if you change that crc to a sha1 signature the chances of it being different but with the same signature are astronomicly small

Watereddown answered 9/6, 2009 at 8:58 Comment(6)
You should never rely on that in most serious apps. It's like just checking the hash in a hashtable lookup without comparing the actual keys!B
unfortunately you can guarantee that the one time it messes up will be absolutely critical, probably that one big pitch.Jumada
@Simon - hehe very true. @Mehrdad - No probably not but it would greatly reduce the times you'd have to check to be super uber sure.Watereddown
Take the CRC and say file size and the changes are ever smaller.Vibrator
@MehrdadAfshari a rather serious app like git relies on exactly this. To quote Linus Torvalds we will "quite likely never ever see it in [collision of two files by comparing sha's] the full history of the universe". Cf. #9392865.Limiting
@Limiting Git does not rely on SHA1s to verify single file equality (which is a much more likely scenario) though, but yeah, sometimes it makes sense.B
G
2

You can check the length and dates of the two files even before checking the CRC to possibly avoid the CRC check.

But if you have to compare the entire file contents, one neat trick I've seen is reading the bytes in strides equal to the bitness of the CPU. For example, on a 32 bit PC, read 4 bytes at a time and compare them as int32's. On a 64 bit PC you can read 8 bytes at a time. This is roughly 4 or 8 times as fast as doing it byte by byte. You also would probably wanna use an unsafe code block so that you could use pointers instead of doing a bunch of bit shifting and OR'ing to get the bytes into the native int sizes.

You can use IntPtr.Size to determine the ideal size for the current processor architecture.

Guipure answered 9/6, 2009 at 9:4 Comment(0)
Z
1

I took the previous answers, and added the logic from the source code of BinaryReader.ReadBytes to get a solution that does not recreate buffer in every loop and does not suffer from unexpected return values from FileStream.Read:

public static bool AreSame(string path1, string path2) {
    int BUFFER_SIZE = 64 * 1024;
    byte[] buffer1 = new byte[BUFFER_SIZE];
    byte[] buffer2 = new byte[BUFFER_SIZE];

    int ReadBytes(FileStream fs, byte[] buffer) {
        int totalBytes = 0;
        int count = buffer.Length;
        while (count > 0) {
            int readBytes = fs.Read(buffer, totalBytes, count);
            if (readBytes == 0)
                break;

            totalBytes += readBytes;
            count -= readBytes;
        }

        return totalBytes;
    }

    using (FileStream fs1 = new FileStream(path1, FileMode.Open, FileAccess.Read, FileShare.Read))
    using (FileStream fs2 = new FileStream(path2, FileMode.Open, FileAccess.Read, FileShare.Read)) {
        while (true) {
            int count1 = ReadBytes(fs1, buffer1);
            int count2 = ReadBytes(fs2, buffer2);

            if (count1 != count2)
                return false;

            if (count1 == 0)
                return true;

            if (count1 == BUFFER_SIZE) {
                if (!buffer1.SequenceEqual(buffer2))
                    return false;
            } else {
                if (!buffer1.Take(count1).SequenceEqual(buffer2.Take(count2)))
                    return false;
            }
        }
    }
}
Zygophyllaceous answered 26/7, 2023 at 16:28 Comment(0)
D
0

This is how I do it today with no loops. Hope this helps provide an alternative option.

public class FileCompare
{
    public bool IsFileSame(string filePath1, string filePath2) => 
        IsFileSame(new FileInfo(filePath1), new FileInfo(filePath2));

    public bool IsFileSame(FileInfo filePath1, FileInfo filePath2)
    {
        var retVal = false;

        if (filePath1.Exists && 
            filePath2.Exists && 
            filePath1.Length == filePath2.Length)
        {
            using (FileStream inputStream1 = File.OpenRead(filePath1.FullName))
            {
                using (FileStream inputStream2 = File.OpenRead(filePath2.FullName))
                {
                    using (MD5 mD = MD5.Create())
                    {
                        retVal = BitConverter.ToString(mD.ComputeHash(inputStream1))
                            .Equals(BitConverter.ToString(mD.ComputeHash(inputStream2)));
                    }
                }
            }
        }

        return retVal;
    }
}
Denial answered 8/4 at 22:0 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.