How to compare 2 files fast using .NET?

21

162

Typical approaches recommend reading the binary via FileStream and comparing it byte-by-byte.

  • Would a checksum comparison such as CRC be faster?
  • Are there any .NET libraries that can generate a checksum for a file?
Decagram answered 31/8, 2009 at 17:38 Comment(1)
dupe: https://mcmap.net/q/151837/-c-file-management (Microphysics)
134

A checksum comparison will most likely be slower than a byte-by-byte comparison.

In order to generate a checksum, you'll need to load each byte of the file, and perform processing on it. You'll then have to do this on the second file. The processing will almost definitely be slower than the comparison check.

As for generating a checksum: You can do this easily with the cryptography classes. Here's a short example of generating an MD5 checksum with C#.
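A minimal sketch of that idea, assuming the System.Security.Cryptography MD5 class and LINQ's SequenceEqual (the helper name is illustrative):

    using System.IO;
    using System.Linq;
    using System.Security.Cryptography;

    // Illustrative helper: hash both files and compare the digests.
    // ComputeHash resets the algorithm's state, so one instance can be reused.
    static bool FilesHaveSameMD5(string path1, string path2)
    {
        using (var md5 = MD5.Create())
        {
            byte[] hash1, hash2;
            using (var s1 = File.OpenRead(path1)) hash1 = md5.ComputeHash(s1);
            using (var s2 = File.OpenRead(path2)) hash2 = md5.ComputeHash(s2);
            return hash1.SequenceEqual(hash2);
        }
    }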

However, a checksum may be faster and make more sense if you can pre-compute the checksum of the "test" or "base" case. If you have an existing file and you're checking whether a new file is the same as the existing one, pre-computing the checksum of your "existing" file means you only need to do the disk I/O once, on the new file. This would likely be faster than a byte-by-byte comparison.

Twylatwyman answered 31/8, 2009 at 17:41 Comment(4)
Make sure to take into account where your files are located. If you're comparing local files to a back-up half-way across the world (or over a network with horrible bandwidth) you may be better off to hash first and send a checksum over the network instead of sending a stream of bytes to compare.Thirtieth
@ReedCopsey: I'm having a similar problem, since I need to store input/output files produced by several processing runs that are expected to contain a lot of duplicates. I thought of using precomputed hashes, but do you think I can reasonably assume that if two (e.g. MD5) hashes are equal, the two files are equal, and avoid a further byte-by-byte comparison? As far as I know, MD5/SHA1 etc. collisions are really unlikely...Luu
@Luu Collision chance is low - you can always do a stronger hash, though - i.e. use SHA256 instead of SHA1, which will reduce the likelihood of collisions further.Twylatwyman
Thanks for your answer - I'm just getting into .NET. I'm assuming that if one is using the hash code/checksum technique, the hashes of the main folder will be stored persistently somewhere? Out of curiosity, how would you store it for a WPF application - what would you do? (I'm currently looking at XML, text files, or databases.)Hubie
161

The slowest possible method is to compare two files byte by byte. The fastest I've been able to come up with is a similar comparison, but instead of one byte at a time, you would use an array of bytes sized to Int64, and then compare the resulting numbers.

Here's what I came up with:

    const int BYTES_TO_READ = sizeof(Int64);

    static bool FilesAreEqual(FileInfo first, FileInfo second)
    {
        if (first.Length != second.Length)
            return false;

        if (string.Equals(first.FullName, second.FullName, StringComparison.OrdinalIgnoreCase))
            return true;

        int iterations = (int)Math.Ceiling((double)first.Length / BYTES_TO_READ);

        using (FileStream fs1 = first.OpenRead())
        using (FileStream fs2 = second.OpenRead())
        {
            byte[] one = new byte[BYTES_TO_READ];
            byte[] two = new byte[BYTES_TO_READ];

            for (int i = 0; i < iterations; i++)
            {
            fs1.Read(one, 0, BYTES_TO_READ);
            fs2.Read(two, 0, BYTES_TO_READ);

            if (BitConverter.ToInt64(one, 0) != BitConverter.ToInt64(two, 0))
                    return false;
            }
        }

        return true;
    }

In my testing, I was able to see this outperform a straightforward ReadByte() scenario by almost 3:1. Averaged over 1000 runs, I got this method at 1063ms, and the method below (straightforward byte by byte comparison) at 3031ms. Hashing always came back sub-second at around an average of 865ms. This testing was with an ~100MB video file.

Here are the ReadByte and hashing methods I used, for comparison purposes:

    static bool FilesAreEqual_OneByte(FileInfo first, FileInfo second)
    {
        if (first.Length != second.Length)
            return false;

        if (string.Equals(first.FullName, second.FullName, StringComparison.OrdinalIgnoreCase))
            return true;

        using (FileStream fs1 = first.OpenRead())
        using (FileStream fs2 = second.OpenRead())
        {
            for (int i = 0; i < first.Length; i++)
            {
                if (fs1.ReadByte() != fs2.ReadByte())
                    return false;
            }
        }

        return true;
    }

    static bool FilesAreEqual_Hash(FileInfo first, FileInfo second)
    {
        byte[] firstHash = MD5.Create().ComputeHash(first.OpenRead());
        byte[] secondHash = MD5.Create().ComputeHash(second.OpenRead());

        for (int i=0; i<firstHash.Length; i++)
        {
            if (firstHash[i] != secondHash[i])
                return false;
        }
        return true;
    }
Ecospecies answered 31/8, 2009 at 23:55 Comment(8)
@anindis: For completeness, you may want to read both @Lars' answer and @RandomInsano's answer. Glad it helped so many years on though! :)Ecospecies
The FilesAreEqual_Hash method should have a using on both file streams too like the ReadByte method otherwise it will hang on to both files.Detruncate
Note that FileStream.Read() may actually read fewer bytes than the requested number. You should use StreamReader.ReadBlock() instead.Zalucki
In the Int64 version when the stream length is not a multiple of Int64 then the last iteration is comparing the unfilled bytes using previous iteration's fill (which should also be equal so it's fine). Also if the stream length is less than sizeof(Int64) then the unfilled bytes are 0 since C# initializes arrays. IMO, the code should probably comment these oddities.Deucalion
@Ecospecies I modified your code slightly to add a short-circuit case for when the two files being compared are the same file (in case the caller forgets to check). This shouldn't affect the comparative performance stats in any measurable way for a 100MB file.Beslobber
I modified the code above to allow for case insensitivity to not throw off the filename fast-path. But there is still a bug as @palec called out. @Deucalion suggested this is fine, but it really isn't. It's not just when the stream ends that the Read method may return fewer bytes than requested. The contract is that Read must not return 0 bytes unless the end of the stream is read, but mid-stream, it can also return (0,max] bytes which buffering streams may actually do. So it really is important to make sure to consider that case.Levity
As noted by @Zalucki in a previous comment, the int64 version may fail to give correct results since FileStream.Read() can read less than the requested number of bytes.Remount
This method is ultra fast. I am able to fully utilize my M.2 SSD with just 4 threads.Nipa
69

If you *do* decide you truly need a full byte-by-byte comparison (see other answers for discussion of hashing), then the easiest solution is:


• for `System.String` path names:

      public static bool AreFileContentsEqual(String path1, String path2) =>
          File.ReadAllBytes(path1).SequenceEqual(File.ReadAllBytes(path2));

• for `System.IO.FileInfo` instances:

      public static bool AreFileContentsEqual(FileInfo fi1, FileInfo fi2) =>
          fi1.Length == fi2.Length &&
          (fi1.Length == 0L || File.ReadAllBytes(fi1.FullName).SequenceEqual(
                               File.ReadAllBytes(fi2.FullName)));

Unlike some other posted answers, this is conclusively correct for any kind of file: binary, text, media, executable, etc., but as a full binary comparison, files that differ only in "unimportant" ways (such as BOM, line endings, character encoding, media metadata, whitespace, padding, source code comments, etc.; see note 1) will always be considered not-equal.

This code loads both files into memory entirely, so it should not be used for comparing truly gigantic files. Beyond that important caveat, full loading isn't really a penalty given the design of the .NET GC (because it's fundamentally optimized to keep small, short-lived allocations extremely cheap), and in fact could even be optimal when file sizes are expected to be less than 85K, because using a minimum of user code (as shown here) implies maximally delegating file performance issues to the CLR, BCL, and JIT to benefit from (e.g.) the latest design technology, system code, and adaptive runtime optimizations.

Furthermore, for such workaday scenarios, concerns about the performance of byte-by-byte comparison via LINQ enumerators (as shown here) are moot, since hitting the disk *at all* for file I/O will dwarf, by several orders of magnitude, the benefits of the various memory-comparing alternatives. For example, even though SequenceEqual does in fact give us the "optimization" of abandoning on first mismatch, this hardly matters after having already fetched the files' contents, each fully necessary for any true-positive cases.



1. An obscure exception: NTFS alternate data streams are not examined by any of the answers discussed on this page, so such streams may be different for files otherwise reported as the "same."
Superheat answered 20/3, 2016 at 0:21 Comment(2)
This one does not look good for big files, nor for memory usage, since it will read both files up to the end before starting to compare the byte arrays. That is why I would rather go for a stream reader with a buffer.Kaminsky
@Kaminsky I discussed these factors and the appropriate use in the text of my answer.Superheat
34

In addition to Reed Copsey's answer:

  • The worst case is where the two files are identical. In this case it's best to compare the files byte-by-byte.

  • If the two files are not identical, you can speed things up a bit by detecting sooner that they're not identical.

For example, if the two files are of different length then you know they cannot be identical, and you don't even have to compare their actual content.
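A minimal sketch combining both short-circuits (the helper name is mine; the byte loop is deliberately the naive version):

    using System.IO;

    // Illustrative: cheap metadata check first, then stop at the first
    // differing byte. Equal lengths guarantee both streams hit EOF together.
    static bool QuickEquals(FileInfo a, FileInfo b)
    {
        if (a.Length != b.Length)
            return false;                  // different length => cannot be identical

        using (FileStream s1 = a.OpenRead())
        using (FileStream s2 = b.OpenRead())
        {
            int b1, b2;
            do
            {
                b1 = s1.ReadByte();
                b2 = s2.ReadByte();
                if (b1 != b2)
                    return false;          // first difference ends the comparison
            } while (b1 != -1);            // ReadByte returns -1 at end of stream
        }
        return true;
    }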

Demurrage answered 31/8, 2009 at 17:47 Comment(3)
To be complete: the other big gain is stopping as soon as the bytes at one position are different.Nobles
@Henk: I thought this was too obvious :-)Demurrage
Good point on adding this. It was obvious to me, so I didn't include it, but it's good to mention.Twylatwyman
21

It gets even faster if you don't read in small 8-byte chunks but instead put a loop around it, reading a larger chunk at a time. I reduced the average comparison time to about a quarter.

    public static bool FilesContentsAreEqual(FileInfo fileInfo1, FileInfo fileInfo2)
    {
        bool result;

        if (fileInfo1.Length != fileInfo2.Length)
        {
            result = false;
        }
        else
        {
            using (var file1 = fileInfo1.OpenRead())
            {
                using (var file2 = fileInfo2.OpenRead())
                {
                    result = StreamsContentsAreEqual(file1, file2);
                }
            }
        }

        return result;
    }

    private static bool StreamsContentsAreEqual(Stream stream1, Stream stream2)
    {
        const int bufferSize = 1024 * sizeof(Int64);
        var buffer1 = new byte[bufferSize];
        var buffer2 = new byte[bufferSize];

        while (true)
        {
            int count1 = stream1.Read(buffer1, 0, bufferSize);
            int count2 = stream2.Read(buffer2, 0, bufferSize);

            if (count1 != count2)
            {
                return false;
            }

            if (count1 == 0)
            {
                return true;
            }

            int iterations = (int)Math.Ceiling((double)count1 / sizeof(Int64));
            for (int i = 0; i < iterations; i++)
            {
                if (BitConverter.ToInt64(buffer1, i * sizeof(Int64)) != BitConverter.ToInt64(buffer2, i * sizeof(Int64)))
                {
                    return false;
                }
            }
        }
    }
Ean answered 14/4, 2010 at 12:38 Comment(1)
In general the check count1 != count2 isn't correct. Stream.Read() can return fewer bytes than the count you have provided, for various reasons.Remsen
15

Edit: This method would not work for comparing binary files!

In .NET 4.0, the File class has the following two new methods:

public static IEnumerable<string> ReadLines(string path)
public static IEnumerable<string> ReadLines(string path, Encoding encoding)

Which means you could use:

bool same = File.ReadLines(path1).SequenceEqual(File.ReadLines(path2));
Nutrition answered 31/8, 2009 at 17:45 Comment(3)
Wouldn't you also need to store both files in memory?Conall
Note that File also has the method ReadAllBytes, whose result can be compared with SequenceEqual as well, so use that instead as it works on all files. And as @Conall said, this is stored in memory, so while it's perfectly fine for small files, I would be careful using it with large files.Galang
@Galang It returns an enumerable, so the lines will be loaded on-demand and not stored in memory the whole time. ReadAllBytes, on the other hand, does return the whole file as an array.Aplite
14

The only thing that might make a checksum comparison slightly faster than a byte-by-byte comparison is the fact that you are reading one file at a time, somewhat reducing the seek time for the disk head. That slight gain may however very well be eaten up by the added time of calculating the hash.

Also, a checksum comparison of course only has any chance of being faster if the files are identical. If they are not, a byte-by-byte comparison would end at the first difference, making it a lot faster.

You should also consider that a hash code comparison only tells you that it's very likely that the files are identical. To be 100% certain you need to do a byte-by-byte comparison.

If the hash code for example is 32 bits, you are about 99.99999998% certain that the files are identical if the hash codes match. That is close to 100%, but if you truly need 100% certainty, that's not it.

Fresno answered 31/8, 2009 at 18:30 Comment(4)
Use a larger hash and you can get the odds of a false positive to well below the odds the computer erred while doing the test.Disrobe
I disagree about the hash time vs seek time. You can do a lot of calculations during a single head seek. If the odds are high that the files match I would use a hash with a lot of bits. If there's a reasonable chance of a match I would compare them a block at a time, for say 1MB blocks. (Pick a block size that 4k divides evenly to ensure you never split sectors.)Disrobe
To explain @Guffa's figure 99.99999998%, it comes from computing 1 - (1 / (2^32)), which is the probability that any single file will have some given 32-bit hash. The probability of two different files having the same hash is the same, because the first file provides the "given" hash value, and we only need to consider whether or not the other file matches that value. The chances with 64- and 128-bit hashing decrease to 99.999999999999999994% and 99.9999999999999999999999999999999999997% (respectively), as if that matters with such unfathomable numbers.Superheat
...Indeed, the fact that these numbers are harder for most people to grasp than the putatively simple notion, albeit true, of "infinitely many files colliding into same hash code" may explain why humans are unreasonably suspicious of accepting hash-as-equality.Superheat
9

My answer is a derivative of @lars but fixes the bug in the call to Stream.Read. I also add some fast path checking that other answers had, and input validation. In short, this should be the answer:

using System;
using System.IO;

namespace ConsoleApp4
{
    class Program
    {
        static void Main(string[] args)
        {
            var fi1 = new FileInfo(args[0]);
            var fi2 = new FileInfo(args[1]);
            Console.WriteLine(FilesContentsAreEqual(fi1, fi2));
        }

        public static bool FilesContentsAreEqual(FileInfo fileInfo1, FileInfo fileInfo2)
        {
            if (fileInfo1 == null)
            {
                throw new ArgumentNullException(nameof(fileInfo1));
            }

            if (fileInfo2 == null)
            {
                throw new ArgumentNullException(nameof(fileInfo2));
            }

            if (string.Equals(fileInfo1.FullName, fileInfo2.FullName, StringComparison.OrdinalIgnoreCase))
            {
                return true;
            }

            if (fileInfo1.Length != fileInfo2.Length)
            {
                return false;
            }
            else
            {
                using (var file1 = fileInfo1.OpenRead())
                {
                    using (var file2 = fileInfo2.OpenRead())
                    {
                        return StreamsContentsAreEqual(file1, file2);
                    }
                }
            }
        }

        private static int ReadFullBuffer(Stream stream, byte[] buffer)
        {
            int bytesRead = 0;
            while (bytesRead < buffer.Length)
            {
                int read = stream.Read(buffer, bytesRead, buffer.Length - bytesRead);
                if (read == 0)
                {
                    // Reached end of stream.
                    return bytesRead;
                }

                bytesRead += read;
            }

            return bytesRead;
        }

        private static bool StreamsContentsAreEqual(Stream stream1, Stream stream2)
        {
            const int bufferSize = 1024 * sizeof(Int64);
            var buffer1 = new byte[bufferSize];
            var buffer2 = new byte[bufferSize];

            while (true)
            {
                int count1 = ReadFullBuffer(stream1, buffer1);
                int count2 = ReadFullBuffer(stream2, buffer2);

                if (count1 != count2)
                {
                    return false;
                }

                if (count1 == 0)
                {
                    return true;
                }

                int iterations = (int)Math.Ceiling((double)count1 / sizeof(Int64));
                for (int i = 0; i < iterations; i++)
                {
                    if (BitConverter.ToInt64(buffer1, i * sizeof(Int64)) != BitConverter.ToInt64(buffer2, i * sizeof(Int64)))
                    {
                        return false;
                    }
                }
            }
        }
    }
}

Or if you want to be super-awesome, you can use the async variant:

using System;
using System.IO;
using System.Threading.Tasks;

namespace ConsoleApp4
{
    class Program
    {
        static void Main(string[] args)
        {
            var fi1 = new FileInfo(args[0]);
            var fi2 = new FileInfo(args[1]);
            Console.WriteLine(FilesContentsAreEqualAsync(fi1, fi2).GetAwaiter().GetResult());
        }

        public static async Task<bool> FilesContentsAreEqualAsync(FileInfo fileInfo1, FileInfo fileInfo2)
        {
            if (fileInfo1 == null)
            {
                throw new ArgumentNullException(nameof(fileInfo1));
            }

            if (fileInfo2 == null)
            {
                throw new ArgumentNullException(nameof(fileInfo2));
            }

            if (string.Equals(fileInfo1.FullName, fileInfo2.FullName, StringComparison.OrdinalIgnoreCase))
            {
                return true;
            }

            if (fileInfo1.Length != fileInfo2.Length)
            {
                return false;
            }
            else
            {
                using (var file1 = fileInfo1.OpenRead())
                {
                    using (var file2 = fileInfo2.OpenRead())
                    {
                        return await StreamsContentsAreEqualAsync(file1, file2).ConfigureAwait(false);
                    }
                }
            }
        }

        private static async Task<int> ReadFullBufferAsync(Stream stream, byte[] buffer)
        {
            int bytesRead = 0;
            while (bytesRead < buffer.Length)
            {
                int read = await stream.ReadAsync(buffer, bytesRead, buffer.Length - bytesRead).ConfigureAwait(false);
                if (read == 0)
                {
                    // Reached end of stream.
                    return bytesRead;
                }

                bytesRead += read;
            }

            return bytesRead;
        }

        private static async Task<bool> StreamsContentsAreEqualAsync(Stream stream1, Stream stream2)
        {
            const int bufferSize = 1024 * sizeof(Int64);
            var buffer1 = new byte[bufferSize];
            var buffer2 = new byte[bufferSize];

            while (true)
            {
                int count1 = await ReadFullBufferAsync(stream1, buffer1).ConfigureAwait(false);
                int count2 = await ReadFullBufferAsync(stream2, buffer2).ConfigureAwait(false);

                if (count1 != count2)
                {
                    return false;
                }

                if (count1 == 0)
                {
                    return true;
                }

                int iterations = (int)Math.Ceiling((double)count1 / sizeof(Int64));
                for (int i = 0; i < iterations; i++)
                {
                    if (BitConverter.ToInt64(buffer1, i * sizeof(Int64)) != BitConverter.ToInt64(buffer2, i * sizeof(Int64)))
                    {
                        return false;
                    }
                }
            }
        }
    }
}
Levity answered 21/11, 2017 at 20:31 Comment(3)
Wouldn't the BitConverter bit be better as ``` for (var i = 0; i < count; i += sizeof(long)) { if (BitConverter.ToInt64(buffer1, i) != BitConverter.ToInt64(buffer2, i)) { return false; } } ```Union
You need to open the file with the isAsync parameter set to true in order to take advantage of truly-async i/o: learn.microsoft.com/en-us/dotnet/api/…Eatmon
Works fantastically.Nipa
7

Honestly, I think you need to prune your search tree down as much as possible.

Things to check before going byte-by-byte:

  1. Are sizes the same?
  2. Is the last byte of file A different from the last byte of file B?

Also, reading large blocks at a time will be more efficient since drives read sequential bytes more quickly. Going byte-by-byte causes not only far more system calls, but it causes the read head of a traditional hard drive to seek back and forth more often if both files are on the same drive.

Read chunk A and chunk B into a byte buffer, and compare them (do NOT use Array.Equals, see comments). Tune the size of the blocks until you hit what you feel is a good trade off between memory and performance. You could also multi-thread the comparison, but don't multi-thread the disk reads.
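A sketch of that block loop, assuming a modern .NET with Span<byte> (names are mine): the fill helper guards against the short reads discussed elsewhere on this page, and the span comparison looks only at bytes actually read, sidestepping the Array.Equals pitfall from the comments below.

    using System;
    using System.IO;

    static bool ChunksEqual(string pathA, string pathB, int blockSize = 64 * 1024)
    {
        using (FileStream fa = File.OpenRead(pathA))
        using (FileStream fb = File.OpenRead(pathB))
        {
            if (fa.Length != fb.Length)
                return false;

            var bufA = new byte[blockSize];
            var bufB = new byte[blockSize];
            while (true)
            {
                // Fill each buffer completely; Read may return short counts mid-stream.
                int readA = Fill(fa, bufA);
                int readB = Fill(fb, bufB);
                if (readA != readB) return false;
                if (readA == 0) return true;   // both streams ended together

                // Compare only the bytes actually read, not the whole arrays.
                if (!bufA.AsSpan(0, readA).SequenceEqual(bufB.AsSpan(0, readB)))
                    return false;
            }
        }
    }

    static int Fill(Stream s, byte[] buf)
    {
        int total = 0;
        for (int n; total < buf.Length && (n = s.Read(buf, total, buf.Length - total)) != 0; )
            total += n;
        return total;
    }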

Conall answered 27/1, 2012 at 18:21 Comment(6)
Using Array.Equals is a bad idea because it compares the whole array. It is likely that at least one block read will not fill the whole array.Venous
Why is comparing the whole array a bad idea? Why would a block read not fill the array? There's definitely a good tuning point, but that's why you play with the sizes. Extra points for doing the comparison in a separate thread.Conall
When you define a byte array, it will have a fixed length. (e.g. - var buffer = new byte[4096]) When you read a block from the file, it may or may not return the full 4096 bytes. For instance, if the file is only 3000 bytes long.Venous
Ah, now I understand! Good news is the read will return the number of bytes loaded into the array, so if the array can't be filled, there will be data. Since we're testing for equality, old buffer data won't matter. Docs: msdn.microsoft.com/en-us/library/9kstw824(v=vs.110).aspxConall
Also important, my recommendation to use the Equals() method is a bad idea. In Mono, they do a memory compare since the elements are contiguous in memory. Microsoft however doesn't override it, instead only doing a reference comparison which here would always be false.Conall
It's worth noting that as long as you don't override it in the constructor, the default size of FileStream's internal buffer is already a large block (4096 bytes). While there's a bit of overhead calling Read(), it's not actually hitting the disk every time. referencesource.microsoft.com/#mscorlib/system/io/…Beslobber
4

Inspired by https://dev.to/emrahsungu/how-to-compare-two-files-using-net-really-really-fast-2pd9

Here is a proposal to do it with AVX2 SIMD instructions:

using System.Buffers;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

namespace FileCompare;

public static class FastFileCompare
{
    public static bool AreFilesEqual(FileInfo fileInfo1, FileInfo fileInfo2, int bufferSize = 4096 * 32)
    {
        if (fileInfo1.Exists == false)
        {
            throw new FileNotFoundException(nameof(fileInfo1), fileInfo1.FullName);
        }

        if (fileInfo2.Exists == false)
        {
            throw new FileNotFoundException(nameof(fileInfo2), fileInfo2.FullName);
        }

        if (fileInfo1.Length != fileInfo2.Length)
        {
            return false;
        }

        if (string.Equals(fileInfo1.FullName, fileInfo2.FullName, StringComparison.OrdinalIgnoreCase))
        {
            return true;
        }
 
        using FileStream fileStream01 = fileInfo1.OpenRead();
        using FileStream fileStream02 = fileInfo2.OpenRead();
        ArrayPool<byte> sharedArrayPool = ArrayPool<byte>.Shared;
        byte[] buffer1 = sharedArrayPool.Rent(bufferSize);
        byte[] buffer2 = sharedArrayPool.Rent(bufferSize);
        Array.Fill<byte>(buffer1, 0);
        Array.Fill<byte>(buffer2, 0);
        try
        {
            while (true)
            {
                int len1 = 0;
                for (int read;
                     len1 < buffer1.Length &&
                     (read = fileStream01.Read(buffer1, len1, buffer1.Length - len1)) != 0;
                     len1 += read)
                {
                }

                int len2 = 0;
                for (int read;
                     len2 < buffer2.Length &&
                     (read = fileStream02.Read(buffer2, len2, buffer2.Length - len2)) != 0;
                     len2 += read)
                {
                }

                if (len1 != len2)
                {
                    return false;
                }

                if (len1 == 0)
                {
                    return true;
                }

                unsafe
                {
                    fixed (byte* pb1 = buffer1)
                    {
                        fixed (byte* pb2 = buffer2)
                        {
                            int vectorSize = Vector256<byte>.Count;
                            for (int processed = 0; processed < len1; processed += vectorSize)
                            {
                                Vector256<byte> result = Avx2.CompareEqual(Avx.LoadVector256(pb1 + processed), Avx.LoadVector256(pb2 + processed));
                                if (Avx2.MoveMask(result) != -1)
                                {
                                    return false;
                                }
                            }
                        }
                    }
                }
            }
        }
        finally
        {
            sharedArrayPool.Return(buffer1);
            sharedArrayPool.Return(buffer2);
        }
    }
}
Nailbiting answered 6/8, 2022 at 20:41 Comment(0)
2

If the files are not too big, you can use:

public static byte[] ComputeFileHash(string fileName)
{
    using (var stream = File.OpenRead(fileName))
        return System.Security.Cryptography.MD5.Create().ComputeHash(stream);
}

Comparing hashes is only worthwhile if you can store the hashes for reuse.

(Edited the code to something much cleaner.)

Saccharide answered 31/8, 2009 at 17:46 Comment(0)
2

My experiments show that it definitely helps to call Stream.ReadByte() fewer times, but using BitConverter to package bytes does not make much difference against comparing bytes in a byte array.

So it is possible to replace the "Math.Ceiling and iterations" loop from the answers above with the simplest one:

            for (int i = 0; i < count1; i++)
            {
                if (buffer1[i] != buffer2[i])
                    return false;
            }

I guess it has to do with the fact that BitConverter.ToInt64 needs to do a bit of work (check arguments and then perform the bit shifting) before you compare, and that ends up being the same amount of work as comparing 8 bytes in two arrays.

Omni answered 16/9, 2011 at 1:49 Comment(2)
Array.Equals goes deeper into the system, so it will likely be a lot faster than going byte by byte in C#. I can't speak for Microsoft, but deep down, Mono uses C's memcpy() command for array equality. Can't get much faster than that.Conall
@Conall guess you mean memcmp(), not memcpy()Extravaganza
1

Another improvement on large files with identical length might be to not read the files sequentially, but rather to compare more or less random blocks.

You can use multiple threads, starting on different positions in the file and comparing either forward or backwards.

This way you can detect changes at the middle/end of the file, faster than you would get there using a sequential approach.
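A toy, single-threaded illustration of the sampling idea, assuming a modern .NET with Span<byte> (helper name mine); matching probes are not proof of equality, so a full comparison must still follow:

    using System;
    using System.IO;

    static bool ProbesMatch(FileInfo a, FileInfo b, int blockSize = 4096)
    {
        if (a.Length != b.Length)
            return false;

        using (FileStream fa = a.OpenRead())
        using (FileStream fb = b.OpenRead())
        {
            var bufA = new byte[blockSize];
            var bufB = new byte[blockSize];

            // Probe start, middle, and end; any mismatch proves the files differ.
            foreach (long offset in new[] { 0L, a.Length / 2, Math.Max(0L, a.Length - blockSize) })
            {
                fa.Seek(offset, SeekOrigin.Begin);
                fb.Seek(offset, SeekOrigin.Begin);
                int n = Math.Min(fa.Read(bufA, 0, blockSize), fb.Read(bufB, 0, blockSize));
                if (!bufA.AsSpan(0, n).SequenceEqual(bufB.AsSpan(0, n)))
                    return false;          // definite difference found early
            }
        }
        return true;                       // probes matched; run a full comparison for certainty
    }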

Chapeau answered 14/4, 2010 at 12:49 Comment(2)
Would disk thrashing cause problems here?Conall
Physical disk drives yes; SSDs would handle this.Zeitler
1

If you only need to compare two files, I guess the fastest way would be (in C, I don't know if it's applicable to .NET)

  1. open both files f1, f2
  2. get the respective file length l1, l2
  3. if l1 != l2 the files are different; stop
  4. mmap() both files
  5. use memcmp() on the mmap()ed files

OTOH, if you need to find if there are duplicate files in a set of N files, then the fastest way is undoubtedly using a hash to avoid N-way bit-by-bit comparisons.
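In .NET, a rough translation of those five steps is possible with memory-mapped files; a sketch (method name mine; the APIs are the real System.IO.MemoryMappedFiles ones, and since managed code has no direct memcmp(), the mapped views are compared in chunks):

    using System;
    using System.IO;
    using System.IO.MemoryMappedFiles;

    static bool FilesEqualMapped(string path1, string path2)
    {
        var info1 = new FileInfo(path1);
        var info2 = new FileInfo(path2);
        if (info1.Length != info2.Length)
            return false;                  // step 3: different lengths => different files
        if (info1.Length == 0)
            return true;                   // mapping a zero-length file throws

        // Step 4: map both files read-only.
        using (var mmf1 = MemoryMappedFile.CreateFromFile(path1, FileMode.Open, null, 0, MemoryMappedFileAccess.Read))
        using (var mmf2 = MemoryMappedFile.CreateFromFile(path2, FileMode.Open, null, 0, MemoryMappedFileAccess.Read))
        using (var s1 = mmf1.CreateViewStream(0, 0, MemoryMappedFileAccess.Read))
        using (var s2 = mmf2.CreateViewStream(0, 0, MemoryMappedFileAccess.Read))
        {
            // Step 5: compare the mapped views chunk by chunk.
            var buf1 = new byte[81920];
            var buf2 = new byte[81920];
            while (true)
            {
                int n1 = s1.Read(buf1, 0, buf1.Length);
                int n2 = s2.Read(buf2, 0, buf2.Length);
                if (n1 != n2) return false;
                if (n1 == 0) return true;
                if (!buf1.AsSpan(0, n1).SequenceEqual(buf2.AsSpan(0, n2)))
                    return false;
            }
        }
    }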

Dunford answered 12/12, 2011 at 9:17 Comment(0)
0

Something (hopefully) reasonably efficient:

public class FileCompare
{
    public static bool FilesEqual(string fileName1, string fileName2)
    {
        return FilesEqual(new FileInfo(fileName1), new FileInfo(fileName2));
    }

    /// <summary>
    /// Compares the contents of two files block by block.
    /// </summary>
    /// <param name="file1"></param>
    /// <param name="file2"></param>
    /// <param name="bufferSize">8kb seemed like a good default</param>
    /// <returns></returns>
    public static bool FilesEqual(FileInfo file1, FileInfo file2, int bufferSize = 8192)
    {
        if (!file1.Exists || !file2.Exists || file1.Length != file2.Length) return false;

        var buffer1 = new byte[bufferSize];
        var buffer2 = new byte[bufferSize];

        using var stream1 = file1.Open(FileMode.Open, FileAccess.Read, FileShare.Read);
        using var stream2 = file2.Open(FileMode.Open, FileAccess.Read, FileShare.Read);

        while (true)
        {
            var bytesRead1 = ReallyRead(stream1, buffer1, 0, bufferSize);
            var bytesRead2 = ReallyRead(stream2, buffer2, 0, bufferSize);

            if (bytesRead1 != bytesRead2) return false;
            if (bytesRead1 == 0) return true;
            if (!ArraysEqual(buffer1, buffer2, bytesRead1)) return false;
        }
    }

    /// <summary>
    /// Compares the arrays in 8-byte chunks, then byte by byte for the tail.
    /// </summary>
    /// <param name="array1"></param>
    /// <param name="array2"></param>
    /// <param name="bytesToCompare"> 0 means compare entire arrays</param>
    /// <returns></returns>
    public static bool ArraysEqual(byte[] array1, byte[] array2, int bytesToCompare = 0)
    {
        if (array1.Length != array2.Length) return false;

        var length = (bytesToCompare == 0) ? array1.Length : bytesToCompare;
        var tailIdx = length - length % sizeof(Int64);

        //check in 8 byte chunks
        for (var i = 0; i < tailIdx; i += sizeof(Int64))
        {
            if (BitConverter.ToInt64(array1, i) != BitConverter.ToInt64(array2, i)) return false;
        }

        //check the remainder of the array, always shorter than 8 bytes
        for (var i = tailIdx; i < length; i++)
        {
            if (array1[i] != array2[i]) return false;
        }

        return true;
    }
    
    private static int ReallyRead(FileStream src, byte[] buffer, int offset, int count){
        int bytesRead = 0;
        do{
            var currentBytesRead = src.Read(buffer, bytesRead, count);
            if(currentBytesRead == 0){
                return Math.Max(0, bytesRead);
            }
            count -= currentBytesRead;
            bytesRead += currentBytesRead;
        }while(count > 0);
        return bytesRead;
    }
}
Parmesan answered 29/3, 2016 at 17:36 Comment(5)
This is wrong, since Read() can return less bytes than requested: learn.microsoft.com/en-us/dotnet/api/…Bronnie
@Bronnie hence the bytesRead* variables. I suggest you step through the code under debug before making such statements :)Parmesan
You are wrong. stream1.Read(buffer1, 0, bufferSize); can return different value than stream2.Read(buffer2, 0, bufferSize);. According to documentation return value from Read() is "The total number of bytes read into the buffer. This might be less than the number of bytes requested if that number of bytes are not currently available, or zero if the end of the stream is reached.". Sometimes you might get less bytes from Read(), so with this check: if (bytesRead1 != bytesRead2) return false; you can get wrong answer from FilesEqual() method.Bronnie
@check #1 this, or #2 thisBronnie
okay, fixed (I think)Parmesan
0

I think there are applications where a hash is faster than comparing byte by byte - for example, if you need to compare one file against many others, or compare a thumbnail of a photo that can change. It depends on where and how it is being used.

private bool CompareFilesByte(string file1, string file2)
{
    using (var fs1 = new FileStream(file1, FileMode.Open))
    using (var fs2 = new FileStream(file2, FileMode.Open))
    {
        if (fs1.Length != fs2.Length) return false;
        int b1, b2;
        do
        {
            b1 = fs1.ReadByte();
            b2 = fs2.ReadByte();
            if (b1 != b2) return false; // both hit EOF (-1) together, since the lengths match
        }
        while (b1 >= 0);
    }
    return true;
}

private string HashFile(string file)
{
    using (var fs = new FileStream(file, FileMode.Open))
    using (var reader = new BinaryReader(fs))
    {
        var hash = new SHA512CryptoServiceProvider();
        hash.ComputeHash(reader.ReadBytes((int)fs.Length)); // fs.Length (the stream), not file.Length (the path string)
        return Convert.ToBase64String(hash.Hash);
    }
}

private bool CompareFilesWithHash(string file1, string file2)
{
    var str1 = HashFile(file1);
    var str2 = HashFile(file2);
    return str1 == str2;
}

Here is how you can measure which one is the fastest:

var sw = new Stopwatch();
sw.Start();
var compare1 = CompareFilesWithHash(receiveLogPath, logPath);
sw.Stop();
Debug.WriteLine(string.Format("Compare using Hash {0}", sw.ElapsedTicks));
sw.Reset();
sw.Start();
var compare2 = CompareFilesByte(receiveLogPath, logPath);
sw.Stop();
Debug.WriteLine(string.Format("Compare byte-byte {0}", sw.ElapsedTicks));

Optionally, we can save the hash in a database.

Hope this can help

Cinquefoil answered 26/4, 2016 at 4:33 Comment(0)
0

Here are some utility functions that allow you to determine if two files (or two streams) contain identical data.

I have provided a "fast" version that is multi-threaded as it compares byte arrays (each buffer filled from what's been read in each file) in different threads using Tasks.

As expected, it's much faster (around 3x faster), but it consumes more CPU (because it's multi-threaded) and more memory (because it needs two byte array buffers per comparison thread).

    public static bool AreFilesIdenticalFast(string path1, string path2)
    {
        return AreFilesIdentical(path1, path2, AreStreamsIdenticalFast);
    }

    public static bool AreFilesIdentical(string path1, string path2)
    {
        return AreFilesIdentical(path1, path2, AreStreamsIdentical);
    }

    public static bool AreFilesIdentical(string path1, string path2, Func<Stream, Stream, bool> areStreamsIdentical)
    {
        if (path1 == null)
            throw new ArgumentNullException(nameof(path1));

        if (path2 == null)
            throw new ArgumentNullException(nameof(path2));

        if (areStreamsIdentical == null)
            throw new ArgumentNullException(nameof(areStreamsIdentical));

        if (!File.Exists(path1) || !File.Exists(path2))
            return false;

        using (var thisFile = new FileStream(path1, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
        {
            using (var valueFile = new FileStream(path2, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
            {
                if (valueFile.Length != thisFile.Length)
                    return false;

                if (!areStreamsIdentical(thisFile, valueFile))
                    return false;
            }
        }
        return true;
    }

    public static bool AreStreamsIdenticalFast(Stream stream1, Stream stream2)
    {
        if (stream1 == null)
            throw new ArgumentNullException(nameof(stream1));

        if (stream2 == null)
            throw new ArgumentNullException(nameof(stream2));

        const int bufsize = 80000; // 80000 is below LOH (85000)

        var tasks = new List<Task<bool>>();
        do
        {
            // consumes more memory (two buffers for each tasks)
            var buffer1 = new byte[bufsize];
            var buffer2 = new byte[bufsize];

            int read1 = stream1.Read(buffer1, 0, buffer1.Length);
            if (read1 == 0)
            {
                int read3 = stream2.Read(buffer2, 0, 1);
                if (read3 != 0) // not eof
                    return false;

                break;
            }

            // both stream read could return different counts
            int read2 = 0;
            do
            {
                int read3 = stream2.Read(buffer2, read2, read1 - read2);
                if (read3 == 0)
                    return false;

                read2 += read3;
            }
            while (read2 < read1);

            // consumes more cpu
            var task = Task.Run(() =>
            {
                return IsSame(buffer1, buffer2);
            });
            tasks.Add(task);
        }
        while (true);

        Task.WaitAll(tasks.ToArray());
        return !tasks.Any(t => !t.Result);
    }

    public static bool AreStreamsIdentical(Stream stream1, Stream stream2)
    {
        if (stream1 == null)
            throw new ArgumentNullException(nameof(stream1));

        if (stream2 == null)
            throw new ArgumentNullException(nameof(stream2));

        const int bufsize = 80000; // 80000 is below LOH (85000)
        var buffer1 = new byte[bufsize];
        var buffer2 = new byte[bufsize];

        do
        {
            int read1 = stream1.Read(buffer1, 0, buffer1.Length);
            if (read1 == 0)
                return stream2.Read(buffer2, 0, 1) == 0; // check not eof

            // both stream read could return different counts
            int read2 = 0;
            do
            {
                int read3 = stream2.Read(buffer2, read2, read1 - read2);
                if (read3 == 0)
                    return false;

                read2 += read3;
            }
            while (read2 < read1);

            if (!IsSame(buffer1, buffer2))
                return false;
        }
        while (true);
    }

    public static bool IsSame(byte[] bytes1, byte[] bytes2)
    {
        if (bytes1 == null)
            throw new ArgumentNullException(nameof(bytes1));

        if (bytes2 == null)
            throw new ArgumentNullException(nameof(bytes2));

        if (bytes1.Length != bytes2.Length)
            return false;

        for (int i = 0; i < bytes1.Length; i++)
        {
            if (bytes1[i] != bytes2[i])
                return false;
        }
        return true;
    }
Once answered 4/10, 2016 at 13:32 Comment(1)
This is wrong, since Read can return less bytes than requested: learn.microsoft.com/en-us/dotnet/api/…Bronnie
0

I have found that this works well: compare the lengths first, without reading any data, and then compare the byte sequences:

private static bool IsFileIdentical(string a, string b)
{            
   if (new FileInfo(a).Length != new FileInfo(b).Length) return false;
   return (File.ReadAllBytes(a).SequenceEqual(File.ReadAllBytes(b)));
}
Simonsimona answered 25/10, 2016 at 9:15 Comment(0)
0

Yet another answer, derived from @chsh: MD5 with usings and short-circuits for identical paths, non-existent files, and differing lengths:

/// <summary>
/// Performs an md5 on the content of both files and returns true if
/// they match
/// </summary>
/// <param name="file1">first file</param>
/// <param name="file2">second file</param>
/// <returns>true if the contents of the two files is the same, false otherwise</returns>
public static bool IsSameContent(string file1, string file2)
{
    if (file1 == file2)
        return true;

    FileInfo file1Info = new FileInfo(file1);
    FileInfo file2Info = new FileInfo(file2);

    if (!file1Info.Exists && !file2Info.Exists)
        return true;
    if (!file1Info.Exists && file2Info.Exists)
        return false;
    if (file1Info.Exists && !file2Info.Exists)
        return false;
    if (file1Info.Length != file2Info.Length)
        return false;

    using (FileStream file1Stream = file1Info.OpenRead())
    using (FileStream file2Stream = file2Info.OpenRead())
    { 
        byte[] firstHash = MD5.Create().ComputeHash(file1Stream);
        byte[] secondHash = MD5.Create().ComputeHash(file2Stream);
        for (int i = 0; i < firstHash.Length; i++)
        {
            if (i>=secondHash.Length||firstHash[i] != secondHash[i])
                return false;
        }
        return true;
    }
}
Unadvised answered 15/11, 2017 at 17:16 Comment(1)
You say if (i>=secondHash.Length ... Under what circumstances would two MD5 hashes be different lengths?Ruddy
0

Not really an answer, but kinda funny.
This is what GitHub's Copilot (AI) suggested :-)

public static void CompareFiles(FileInfo actualFile, FileInfo expectedFile) {
    if (actualFile.Length != expectedFile.Length) {
        throw new Exception($"File {actualFile.Name} has different length in actual and expected directories.");
    }

    // compare the files on a byte level
    using var actualStream   = actualFile.OpenRead();
    using var expectedStream = expectedFile.OpenRead();
    var       actualBuffer   = new byte[1024];
    var       expectedBuffer = new byte[1024];
    int       actualBytesRead;
    int       expectedBytesRead;
    do {
        actualBytesRead   = actualStream.Read(actualBuffer, 0, actualBuffer.Length);
        expectedBytesRead = expectedStream.Read(expectedBuffer, 0, expectedBuffer.Length);
        if (actualBytesRead != expectedBytesRead) {
            throw new Exception($"File {actualFile.Name} has different content in actual and expected directories.");
        }

        if (!actualBuffer.SequenceEqual(expectedBuffer)) {
            throw new Exception($"File {actualFile.Name} has different content in actual and expected directories.");
        }
    } while (actualBytesRead > 0);
}

I find the usage of SequenceEqual particularly interesting.

Teal answered 3/10, 2022 at 4:8 Comment(0)
0

I liked the SequenceEqual answers above, but the hash comparison answers looked very messy. I prefer a hash comparison more like this:

    public bool AreFilesEqual(string file1Path, string file2Path)
    {
        string file1Hash = "", file2Hash = "";
        SHA1 sha = new SHA1CryptoServiceProvider();

        using (FileStream fs = File.OpenRead(file1Path))
        {
            byte[] hash;
            hash = sha.ComputeHash(fs);
            file1Hash = Convert.ToBase64String(hash);
        }

        using (FileStream fs = File.OpenRead(file2Path))
        {
            byte[] hash;
            hash = sha.ComputeHash(fs);
            file2Hash = Convert.ToBase64String(hash);
        }

        return (file1Hash == file2Hash);
    }
Isothermal answered 16/8, 2023 at 23:3 Comment(0)
