Speed of binary file comparisons in PowerShell
There is a lot of discussion on the Internet about how to compare files in PowerShell. However, nothing that I've found discusses the speed differences between the different ways of doing a comparison.

(An article by Kees Bakker, and his answer to a 2013 SO question, present the function FilesAreEqual. The article's title claims it's faster, but doesn't say faster than what, and the article offers no data to back up the claim. In my answer below, the function bFilesCompareBinary is adapted from his, and you'll see that my data agree with his claim.)

This question is a follow-up to my question, PowerShell: Why are these file comparison times so different?. In more research on that, I've compiled data on the speed of various methods of comparing binary files in PowerShell. I'm posting this question in order to provide those data in an answer. So the question is:

How fast are different methods of comparing binary files in PowerShell?

Bilbao answered 14/8, 2023 at 2:40 Comment(0)

Contents

  • Data
  • Errors and mistakes
  • Conclusion
  • Environment
  • Size dependence of speed of buffered method
  • Code

Data

The table below gives, for four files and their identical copies, the speed at which each pair was compared by seven different methods. The four files were chosen simply because they sit together in one path on my computer and in a corresponding path on an external SSD. They all happen to be gvi video files, which should not have mattered, but a quirk of their structure turned out to have an interesting effect on one of the methods. Speeds are in Mb/sec (megabytes per second; "Mb" and "Gb" denote bytes throughout, not bits) and were calculated by dividing the size of the file by the elapsed time of the comparison. Code for the scripts that performed and timed the comparisons is given further below.
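The speed calculation described above can be sketched as follows. This is only an illustration, not the measurement script itself: the function name Measure-ComparisonSpeed and its parameters are made up for this example, and it assumes that 1 Mb means 1,048,576 bytes (PowerShell's 1MB).

```powershell
# Hypothetical helper (not part of the scripts in this post): time one comparison
# and convert the result to Mb/sec, taking 1 Mb as 1MB = 1,048,576 bytes.
function Measure-ComparisonSpeed {
    param(
        [Parameter(Mandatory)] [long] $SizeInBytes,        # size of one file of the pair
        [Parameter(Mandatory)] [scriptblock] $Comparison   # code that performs the comparison
    )
    $oWatch = [System.Diagnostics.Stopwatch]::StartNew()
    & $Comparison | Out-Null
    $oWatch.Stop()
    $nSeconds = $oWatch.Elapsed.TotalSeconds
    [PSCustomObject]@{
        Seconds  = $nSeconds
        MbPerSec = if ($nSeconds -gt 0) { ($SizeInBytes / 1MB) / $nSeconds } else { [double]::NaN }
    }
}

# Usage with a dummy workload standing in for a real comparison:
$oSpeed = Measure-ComparisonSpeed -SizeInBytes (74 * 1MB) -Comparison { Start-Sleep -Milliseconds 200 }
"{0:N1} Mb/sec over {1:N2} s" -f $oSpeed.MbPerSec, $oSpeed.Seconds
```

A Stopwatch is used here instead of subtracting get-date values; for comparisons that take seconds, either approach should give essentially the same figure.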

The columns of the table are:

  • Size: The size of the files in Mb.
  • PS: PowerShell version in which the process was run. A hyphen ("-") indicates the process was run in a Windows batch file, not in PowerShell. The scripts were run directly in Windows, not in a scripting environment (ISE or VS Code).
  • The methods:
    • Comp: The Windows command comp.
    • FC: The Windows command FC.
    • Compare-object: The PowerShell command compare-object acting on a get-content of each file to be compared.
    • Compare raw: The PowerShell command compare-object in which the get-contents have the parameter -raw.
    • (not included:) Compare as byte: I attempted to include the PowerShell command compare-object in which the get-contents have just the parameter -encoding byte (PS 5) or -AsByteStream (PS 7), but this sat for over a half-hour in both PS 5 and 7, so either the process hung or it took so long that it might as well have hung.
    • Compare as byte raw: The PowerShell command compare-object in which the get-contents have the parameters -encoding byte (PS 5) or -AsByteStream (PS 7) plus -raw.
    • Compare as byte read 0: The PowerShell command compare-object in which the get-contents have the parameters -encoding byte (PS 5) or -AsByteStream (PS 7) plus -ReadCount 0.
    • Buffered: The PowerShell custom function bFilesCompareBinary, based on code written by Kees Bakker, which performs a buffered comparison (code included in script below).
    • (not included:) Hash comparisons. All the methods tested do direct byte-by-byte comparisons of the file contents.

Since the pairs tested were all identical, all the measurements had to compare all bytes in the files. For pairs that are not necessarily identical, the Windows commands and the buffered method have the ability to abort after detecting a difference, and so could run even faster. The compare-object methods compare the entire files, even if the first bytes are different.

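| Size (Mb) | PS | Comp | FC   | Compare-object | Compare raw | Compare as byte raw | Compare as byte read 0 | Buffered |
|-----------|----|------|------|----------------|-------------|---------------------|------------------------|----------|
| 74        | -  | 29.0 | 30.3 |                |             |                     |                        |          |
| 74        | 5  | 29.2 | 30.2 | 4.1            | 18.7        | 0.5                 | 0.5                    | 35.2     |
| 74        | 7  | 29.2 | 30.9 | 3.4            | 20.7        | 1.2                 | 0.9                    | 36.5     |
| 66        | -  | 25.3 | 26.2 |                |             |                     |                        |          |
| 66        | 5  | 25.5 | 26.1 | 5.6            | 20.4        | 0.5                 | 0.5                    | 35.4     |
| 66        | 7  | 25.4 | 26.3 | 2.8            | 22.0        | 1.2                 | 1.0                    | 37.1     |
| 162       | -  | 25.6 | 26.1 |                |             |                     |                        |          |
| 162       | 5  | 25.5 | 26.5 | 15.0           | 18.7        | 0.5                 | Error                  | 35.8     |
| 162       | 7  | 25.8 | 26.8 | 17.8           | 24.6        | 1.2                 | 1.0                    | 36.8     |
| 56        | -  | 25.5 | 25.8 |                |             |                     |                        |          |
| 56        | 5  | 25.5 | 26.0 | 21.6           | 3.0         | 0.5                 | 0.5                    | 35.2     |
| 56        | 7  | 26.0 | 26.5 | 17.6           | 25.1        | 1.3                 | 1.1                    | 36.0     |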

Table: Speed, in Mb/sec, of comparing four identical pairs of files (identified by their size in Mb) by seven methods running in Windows batch, in Windows PowerShell 5.1, and in PowerShell 7.

Note that with the method "Compare-object", the third and fourth files run much faster than the first two. This was the mystery that my original question asked about, and is explained in its answers.

Errors and mistakes

In the case indicated as "Error" (method "Compare as byte read 0" in PS 5 on the largest file), the process crashed PowerShell with the message, "get-content : Array dimensions exceeded supported range."

As I've pointed out elsewhere, the "compare raw" method crashed with an OutOfMemoryException when presented with a pair of files of 3.7 Gb.

Warning: In initial testing, results appeared to indicate that the Windows command FC was about seven times faster than the buffered method. I had already performed a comparison of a 1 Tb folder with its backup that took about 10 hours using the buffered method. I was excited that FC could work so much faster, so I rewrote my script to repeat that comparison using FC instead, and was then confused to find that it took 14 hours. Then I realized that the initial results had been skewed by Windows caching the files when I ran the comparisons with comp, so they ran much faster when doing it again with FC. In the results reported above, the measurements were made with an empty cache. I have not found a way to remove a file from cache, so each measurement was made immediately after rebooting the computer (and with nothing else running).
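The caching effect described above can be observed by timing the same read twice in a row. The following is a minimal sketch under stated assumptions: the function name Measure-ReadTwice is invented for this example, the path is a placeholder, and the file should be one that has not been read since the last reboot (pass a full path, since .NET resolves relative paths against the process directory rather than the PowerShell location).

```powershell
# Hypothetical helper: read the same file twice and report both timings.
# On a file that has not been touched since boot, the second pass is usually
# served from the Windows file cache and so tends to be much faster.
function Measure-ReadTwice {
    param([Parameter(Mandatory)] [string] $Path)
    $oBuffer = New-Object byte[] 524288
    foreach ($nPass in 1, 2) {
        $oWatch  = [System.Diagnostics.Stopwatch]::StartNew()
        $oStream = [System.IO.File]::OpenRead($Path)
        try     { while ($oStream.Read($oBuffer, 0, $oBuffer.Length) -gt 0) { } }
        finally { $oStream.Close() }
        $oWatch.Stop()
        [PSCustomObject]@{ Pass = $nPass; Seconds = $oWatch.Elapsed.TotalSeconds }
    }
}

# Usage (placeholder path, as elsewhere in this post):
# Measure-ReadTwice -Path "<path 1><file 1>"
```

If the two passes differ widely, any second measurement on the same file is measuring the cache rather than the disk.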

Conclusion

  • In all cases, the buffered method ran the fastest.
  • compare-object is essentially useless on binary files. It gave a reasonable speed only when called with get-content ... -raw and without the "as byte" parameters, but in that form it crashed on files over a few Gb.
  • If one doesn't want to run a custom function, the best method appears to be good old FC.

Environment

The data above were collected on an AMD Ryzen 7 Pro 6850H Processor with 32 Gb of RAM, running Windows 11 Pro 64. The files in each pair are on an internal SSD and an external USB SSD.

Later, I repeated the tests of just methods "FC" and "buffered" with the external storage being a USB spinning hard drive instead of the SSD. I was surprised to see a dramatic speed improvement with that change:

| Size (Mb) | PS | FC   | Buffered |
|-----------|----|------|----------|
| 74        | 7  | 48.2 | 57.7     |
| 66        | 7  | 52.1 | 63.3     |
| 162       | 7  | 54.5 | 59.7     |
| 56        | 7  | 50.6 | 55.9     |

Table: Speed, in Mb/sec, of comparing four identical pairs of files (identified by their size in Mb) by two methods running in PowerShell 7. Difference from previous table is that one file in each pair was on a spinning HD instead of an SSD.

I don't know whether this means my low-cost SSD has poor performance, or whether it's because I don't have the right cable for it. It's not a big problem for me because I don't run these comparisons often, but it does show how much such a process depends on the hardware.

Size dependence of speed of buffered method

I used the buffered method to run a comparison of a 1.1 Tb folder with its backup. This took 10.1 hours, of which 9.8 hr was the sum of the elapsed times of the comparisons (i.e., overhead of 0.3 hr from scanning of folders). Thus the average speed of the comparisons was 116 Gb/hr or 33 Mb/sec. The size of the files ranged from 1 byte to 32 Gb.
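As a rough check of that arithmetic, here is the same calculation from the rounded figures quoted above. It assumes binary units (PowerShell's 1TB, 1GB and 1MB multipliers); the small gap to the reported 116 Gb/hr and 33 Mb/sec is presumably due to the rounding of the inputs.

```powershell
# Rough check of the average speed, using the rounded figures from the text
# and PowerShell's binary multipliers (1TB = 1024GB, 1GB = 1024MB).
$nBytes = 1.1TB
$nHours = 9.8
$nGbPerHour = $nBytes / 1GB / $nHours            # about 115
$nMbPerSec  = $nBytes / 1MB / ($nHours * 3600)   # about 32.7
"{0:N0} Gb/hr, {1:N1} Mb/sec" -f $nGbPerHour, $nMbPerSec
```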

To learn about the factors that affect the speed of the comparisons, I used Excel on the results for the 222,000 files: I ranked the 1,355 files with comparison times over 1 second by size and by comparison time, and the 2,009 files with times over 1/2 second by comparison speed.

There was a rough, but far from perfect, correlation between file size and comparison speed. The 25 largest files, ranging from 4 to 32 Gb, had speeds from 34.8 to 36.8 Mb/sec. These were close to, but not quite, the fastest speeds: the top 25 speeds ranged from 36.7 to 36.9 Mb/sec, for files ranging in size from 61 Mb to 28 Gb.

At the lower end, the 25 smallest ranked files, ranging from 22 kb to 33 Mb, had speeds from 13 kb/sec to 31 Mb/sec. The 25 slowest ranked speeds ranged from 2 kb/sec to 13 Mb/sec, for files ranging in size from 2 kb to 55 Mb.

It's helpful that in general, larger files compare faster. Definitely better than the other way around!
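The ranking was done in Excel, but the same kind of ranking could be sketched in PowerShell from the CSV file that the script in the Code section writes. This is an illustration only: the function name Get-ComparisonSpeedRanking is invented, it relies on the Size (bytes), tElapsed (seconds) and Filespec columns of that CSV, and it assumes the numbers in the file use "." as the decimal separator.

```powershell
# Sketch: rank files by comparison speed from the script's CSV output
# (columns Size in bytes, tElapsed in seconds, Filespec).
function Get-ComparisonSpeedRanking {
    param(
        [Parameter(Mandatory)] [object[]] $Rows,   # e.g. the output of Import-Csv
        [double] $MinSeconds = 0.5                 # ignore comparisons shorter than this
    )
    $Rows |
        Where-Object { [double] $_.tElapsed -gt $MinSeconds } |
        ForEach-Object {
            [PSCustomObject]@{
                Filespec = $_.Filespec
                SizeMb   = [double] $_.Size / 1MB
                Seconds  = [double] $_.tElapsed
                MbPerSec = ([double] $_.Size / 1MB) / [double] $_.tElapsed
            }
        } |
        Sort-Object MbPerSec -Descending
}

# Usage with the output file of the measurement script (placeholder filespec):
# Get-ComparisonSpeedRanking -Rows (Import-Csv "<filespec of output csv file>") | Select-Object -First 25
```

The time threshold matters because very short comparisons are dominated by overhead and timer resolution, which is why only files above a minimum comparison time were ranked.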

Code

I'm interested in feedback on improving these scripts, with two provisos. First, I know the batch script is pretty lame; it was just laid out quickly to get the job done. More attention was paid to the design of the PowerShell script. In that, I know that my coding style is unconventional, but I've developed it over many years, and I can only apologize if you don't like it. However, please do say something if you see ways to improve the functionality of the script.

It would also be interesting to hear if other people run the script and get results that are consistent with mine or different.

Windows batch scripts for comp and FC:

rem Script: "measure speed - comp.bat"
rem Measure the time taken to compare two files using "comp" running in a Windows batch script.
rem To ensure that none of the files is in cache, run this immediately after booting the computer.

time < nul
comp /m "<path 1><file 1>" "<path 2><file 1>"
time < nul
comp /m "<path 1><file 2>" "<path 2><file 2>"
time < nul
comp /m "<path 1><file 3>" "<path 2><file 3>"
time < nul
comp /m "<path 1><file 4>" "<path 2><file 4>"
time < nul

The console output was copied and pasted into Excel, which then subtracted consecutive times to get the elapsed time of each process. The batch script for FC was the same, with comp /m replaced by FC /b.
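The subtraction done in Excel could also be sketched in PowerShell. The sample lines below are invented for illustration, and their format is an assumption based on an English-language Windows; the output of time varies with regional settings, and this sketch does not handle a run that crosses midnight.

```powershell
# Sketch: compute elapsed seconds between consecutive "time" outputs.
# The sample lines are made up; real output depends on regional settings.
$aLines = @(
    'The current time is: 14:02:31.45'
    'The current time is: 14:02:34.01'
    'The current time is: 14:02:36.61'
)
$oInvariant = [System.Globalization.CultureInfo]::InvariantCulture

# Extract the timestamp from each line and parse it as a time of day.
$aTimes = foreach ($sLine in $aLines) {
    if ($sLine -match '(\d{1,2}:\d{2}:\d{2}\.\d+)') {
        [timespan]::Parse($Matches[1], $oInvariant)
    }
}

# Each elapsed time is the difference between one timestamp and the previous one.
$aElapsed = for ($i = 1; $i -lt $aTimes.Count; $i++) {
    ($aTimes[$i] - $aTimes[$i - 1]).TotalSeconds
}
$aElapsed
```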

PowerShell script, including function bFilesCompareBinary:

# measure-speed-of-file-comparisons.ps1

# Set the $sFolder_n to a pair of folders with identical content. This script will measure and record, 
#     by one of eight different methods, the time taken to verify that all the files are identical.
# To ensure that none of the files is in cache, run this immediately after booting the computer.

# On use of get-content parameters "-encoding byte", "-AsByteStream", "-raw", and "-ReadCount 0":
#     www.jonathanmedd.net/2017/12/powershell-core-does-not-have-encoding-byte.-replaced-with-new-parameter-asbytestream.html/
#     www.powershellmagazine.com/2014/03/17/pstip-reading-file-content-as-a-byte-array/
#     www.github.com/PowerShell/PowerShell/issues/11266
#     www.github.com/MicrosoftDocs/PowerShell-Docs/issues/3215

# Calls to get-content with as-byte parameters are wrapped in an array ("@(, )") per instructions in
#     https://mcmap.net/q/911265/-powershell-why-are-these-file-comparison-times-so-different/#76843506

# =========================================================================
# Manually set these paths before running:
# =========================================================================
$sFolder_1 = "<path to first folder, including final '\'>"
$sFolder_2 = "<path to second folder, including final '\'>"
$sOutputFilespec = "<filespec of output csv file>"

# =========================================================================
# Function bFilesCompareBinary()
# =========================================================================
function bFilesCompareBinary ([System.IO.FileInfo] $oFile_1, [System.IO.FileInfo] $oFile_2, `
                              [uint32] $nBufferSize = 524288, $sRetIfSame = "Same", $sRetIfDif = "Dif")
   {# Return message for whether two given files are identical by binary comparison, or error description.
    #    Assumes the files are the same size, else error.
    
    # From "https://mcmap.net/q/842014/-powershell-binary-file-comparison#22800663"
    #    But comment by @mclayton on "https://mcmap.net/q/911265/-powershell-why-are-these-file-comparison-times-so-different/#76843506"
    #        warns that .read() does not always get all the bytes requested, so I've added a test for that.
    # FileInfo Class:   "https://learn.microsoft.com/en-us/dotnet/api/system.io.fileinfo"
    # FileStream Class: "https://learn.microsoft.com/en-us/dotnet/api/system.io.filestream"

    if ($nBufferSize -eq 0) {$nBufferSize = 524288}

    try{$oStream_1 = $oFile_1.OpenRead()
        $oStream_2 = $oFile_2.OpenRead()

        $oBuffer_1 = New-Object byte[] $nBufferSize
        $oBuffer_2 = New-Object byte[] $nBufferSize

        # Subexpressions are needed so that the lengths, not the file names, appear in the message.
        if ($oFile_1.Length -ne $oFile_2.Length) {throw "Files are different sizes: $($oFile_1.Length) , $($oFile_2.Length)"}
        $nBytesLeft = $oFile_1.Length
        $bDifferenceFound = $false
        $sError = ""

        do {$nBytesToGet = [math]::Min($nBytesLeft, $nBufferSize)
            $nBytesRead_1 = $oStream_1.read($oBuffer_1, 0, $nBytesToGet)
            $nBytesRead_2 = $oStream_2.read($oBuffer_2, 0, $nBytesToGet)
            if ($nBytesRead_1 -ne $nBytesRead_2) {throw "Different byte count each file: $nBytesRead_1 , $nBytesRead_2"}
            if ($nBytesRead_1 -ne $nBytesToGet) {throw "Byte count different from requested: $nBytesRead_1 , $nBytesToGet"}
            $nBytesLeft -= $nBytesRead_1
            if (-not [System.Linq.Enumerable]::SequenceEqual($oBuffer_1, $oBuffer_2)) {$bDifferenceFound = $true}
            } while ((-not $bDifferenceFound) -and $nBytesLeft -gt 0)
        }

    catch {$sError = "Error: $_"}

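    # Close only the streams that were actually opened (OpenRead may have failed).
    finally {if ($oStream_1) {$oStream_1.Close()} ; if ($oStream_2) {$oStream_2.Close()}}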

    if ($sError -ne "") {return $sError}
      elseif ($bDifferenceFound) {return $sRetIfDif}
      else {return ($sRetIfSame)}
    }

# =========================================================================
# User interaction
# =========================================================================
$bBooted = (read-host ("Did you boot the computer immediately before running this? (Enter ""Y"" or ""N"".)")).ToUpper()
$sPSenv = (read-host ("PowerShell environment: Enter ""D"" if running directly in Windows or ""S"" if in scripting environment (ISE or VS Code)")).ToUpper()
$nMethod = read-host ("Comparison method: Enter 1 for comp, 2 for FC, 3 for compare-object, 4 for compare raw, " + `
                            # "5 for compare as byte, " + `
                            "6 for compare as byte raw, 7 for compare as byte read 0, or 8 for buffered")
switch ($nMethod) {1 {$sMethod = "comp"}                   2 {$sMethod = "FC"}
                   3 {$sMethod = "compare-object"}         4 {$sMethod = "compare raw"}
                   5 {$sMethod = "compare as byte"}        6 {$sMethod = "compare as byte raw"}
                   7 {$sMethod = "compare as byte read 0"} 8 {$sMethod = "buffered"}}

# =========================================================================
# Scan the folders and compare files.
# =========================================================================
$nLen_1 = $sFolder_1.Length
$PSversion = $PSVersionTable.PSVersion.Major
get-ChildItem -path $sFolder_1 -Recurse | ForEach-Object `
   {$oItem_1 = $_
    $sItem_1 = $oItem_1.FullName

    # If it's a file, compare in both folders:
    if (Test-Path -Type Leaf $sItem_1) `
       {$nSize_1   = $oItem_1.Length
        $sItem_rel = $sItem_1.Substring($nLen_1)
        $sItem_2   = join-path $sFolder_2 $sItem_rel
        $oItem_2   = get-item $sItem_2
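        # Native commands set the global variable; a plain assignment here would create a local copy that hides it.
        $global:LastExitCode = 99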
        $nMid = ""
        write-output "Check $sItem_rel"
        $dStart = $(get-date)
        switch ($nMethod)
           {{$_ -in 1, 2}
                {switch ($nMethod)
                   {1 {comp /m "$sItem_1" "$sItem_2"}
                    2 {FC.exe /b "$sItem_1" "$sItem_2"}}
                 switch ($LastExitCode) {0 {$sResult = "Same"} 1 {$sResult = "Dif"} default {$sResult = "Error: $LastExitCode"}}}
            {$_ -in 3, 4, 5, 6, 7}
                {switch ($nMethod)
                   {3 {$oContent_1 = (get-content $sItem_1)
                       $oContent_2 = (get-content $sItem_2)}
                    4 {$oContent_1 = (get-content $sItem_1 -raw)
                       $oContent_2 = (get-content $sItem_2 -raw)}
                    {$_ -in 5, 6, 7}
                        {switch ($PSversion)
                            {5 {switch ($nMethod)
                                   {5 {$oContent_1 = @(, (get-content $sItem_1 -encoding byte))
                                       $oContent_2 = @(, (get-content $sItem_2 -encoding byte))}
                                    6 {$oContent_1 = @(, (get-content $sItem_1 -encoding byte -raw))
                                       $oContent_2 = @(, (get-content $sItem_2 -encoding byte -raw))}
                                    7 {$oContent_1 = @(, (get-content $sItem_1 -encoding byte -ReadCount 0))
                                       $oContent_2 = @(, (get-content $sItem_2 -encoding byte -ReadCount 0))}
                                }   }
                             7 {switch ($nMethod)
                                   {5 {$oContent_1 = @(, (get-content $sItem_1 -AsByteStream))
                                       $oContent_2 = @(, (get-content $sItem_2 -AsByteStream))}
                                    6 {$oContent_1 = @(, (get-content $sItem_1 -AsByteStream -raw))
                                       $oContent_2 = @(, (get-content $sItem_2 -AsByteStream -raw))}
                                    7 {$oContent_1 = @(, (get-content $sItem_1 -AsByteStream -ReadCount 0))
                                       $oContent_2 = @(, (get-content $sItem_2 -AsByteStream -ReadCount 0))}
                                }   }
                             default {$sResult = "Error: PowerShell version is $PSversion"}
                    }    }   }
                 $nMid = ($(get-date) - $dStart).Ticks / 1e7
                 if (compare-object $oContent_1 $oContent_2) `
                    {$sResult = "Dif"} else {$sResult = "Same"}}
            8 {$sResult = bFilesCompareBinary $oItem_1 $oItem_2}
            }
        $nElapsed = ($(get-date) - $dStart).Ticks / 1e7
        $nItem++   # Running count of files compared, reported in the Item column.
        $oOutput = [PSCustomObject]@{Booted = $bBooted ; PSversion = $PSversion ; PSenv = $sPSenv ; Method = $sMethod    ; Item = $nItem         ; Result = $sResult
                                     Size = $nSize_1   ; tStart = $dStart       ; tMid = $nMid    ; tElapsed = $nElapsed ; Filespec = $sItem_rel}
        Export-Csv -InputObject $oOutput -Path $sOutputFilespec -Append -NoTypeInformation
    }   }

# =========================================================================
# End of script
# =========================================================================
Bilbao answered 14/8, 2023 at 2:40 Comment(4)
Nice. You’ve put a lot of work into this. I think ultimately if you want raw performance, PowerShell isn’t your best choice in general - one option you didn’t explore was compiling a C# assembly with Add-Type that uses the “buffered” approach - you might squeeze a few extra Mb/s out of it with that approach, but that’s not to take away from what you’ve already done… - Wed
@iRon How would a binary search help here? The data is not known to be well-ordered, and for a full comparison you'd need to inspect (or, in the case of hash-based comparison, envelop) every byte in both files at least once. - Wilda
@MathiasR.Jessen, apparently I missed the point that a full compare was required (and I still fail to see the use case for this on a binary file: why go to the hassle of a full compare of gvi files when you already know they are different by looking at the first bytes or the file size?) - Dionedionis
@Dionedionis That's obviously a useful heuristic (returning false early when file size differs), but it still won't help you when files are actually identical - you still need to read both file streams to the end in the worst case (worst case being identical files here) - Wilda
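The Add-Type idea from the first comment above could be sketched roughly as follows. This is an untimed illustration, so it makes no claim about being faster than bFilesCompareBinary: the class name BufferedFileComparer and its method AreEqual are invented for this example, and the C# is kept simple so that it should compile in both Windows PowerShell 5.1 and PowerShell 7.

```powershell
# Sketch: compile a small C# class that does a buffered comparison.
# BufferedFileComparer and AreEqual are hypothetical names for this example.
Add-Type -TypeDefinition @'
using System.IO;

public static class BufferedFileComparer
{
    // Fill the buffer as far as possible; Stream.Read may return fewer bytes than requested.
    private static int ReadFull(Stream stream, byte[] buffer)
    {
        int total = 0;
        while (total < buffer.Length)
        {
            int n = stream.Read(buffer, total, buffer.Length - total);
            if (n == 0) break;
            total += n;
        }
        return total;
    }

    public static bool AreEqual(string path1, string path2, int bufferSize)
    {
        if (new FileInfo(path1).Length != new FileInfo(path2).Length) return false;
        using (var s1 = File.OpenRead(path1))
        using (var s2 = File.OpenRead(path2))
        {
            var b1 = new byte[bufferSize];
            var b2 = new byte[bufferSize];
            while (true)
            {
                int n1 = ReadFull(s1, b1);
                int n2 = ReadFull(s2, b2);
                if (n1 != n2) return false;
                if (n1 == 0) return true;          // both streams ended with no difference found
                for (int i = 0; i < n1; i++)
                    if (b1[i] != b2[i]) return false;   // stop at the first difference
            }
        }
    }
}
'@

# Usage (placeholder paths; pass full paths):
# [BufferedFileComparer]::AreEqual("<path 1><file 1>", "<path 2><file 1>", 524288)
```

Unlike the PowerShell function, this sketch loops until each buffer is full instead of treating a short read as an error, and it returns a plain true or false rather than a message.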
