Contents
- Data
- Errors and mistakes
- Conclusion
- Environment
- Size dependence of speed of buffered method
- Code
Data
The table below gives, for four files and their identical copies, data on the speed of comparison of the pairs by seven different methods. These four files were selected for convenience of being located together in a path on my computer and in a corresponding path on an external SSD. They all happen to be gvi video files, which should not have been relevant, but a quirk of their structure turned out to have an interesting effect on one of the methods. The table gives the speed, in Mb/sec, of the comparison process for each method and for each file. The speed was calculated by dividing the size of the file by the elapsed time of the process. Code is given further below for the scripts that performed and timed the comparisons.
The columns of the table are:
- Size: The size of the files in Mb.
- PS: PowerShell version in which the process was run. A hyphen ("-") indicates the process was run in a Windows batch file, not in PowerShell. The scripts were run directly in Windows, not in a scripting environment (ISE or VS Code).
- The methods:
  - Comp: The Windows command `comp`.
  - FC: The Windows command `FC`.
  - Compare-object: The PowerShell command `compare-object` acting on a `get-content` of each file to be compared.
  - Compare raw: The PowerShell command `compare-object` in which the `get-content`s have the parameter `-raw`.
  - (not included:) Compare as byte: I attempted to include the PowerShell command `compare-object` in which the `get-content`s have just the parameter `-encoding byte` (PS 5) or `-AsByteStream` (PS 7), but this sat for over a half-hour in both PS 5 and 7, so either the process hung or it took so long that it might as well have hung.
  - Compare as byte raw: The PowerShell command `compare-object` in which the `get-content`s have the parameters `-encoding byte` (PS 5) or `-AsByteStream` (PS 7) plus `-raw`.
  - Compare as byte read 0: The PowerShell command `compare-object` in which the `get-content`s have the parameters `-encoding byte` (PS 5) or `-AsByteStream` (PS 7) plus `-ReadCount 0`.
  - Buffered: The PowerShell custom function `bFilesCompareBinary`, based on code written by Kees Bakker, which performs a buffered comparison (code included in the script below).
  - (not included:) Hash comparisons. All the methods tested do direct byte-by-byte comparisons of the file contents.
Since the pairs tested were all identical, all the measurements had to compare all bytes in the files. For pairs that are not necessarily identical, the Windows commands and the buffered method have the ability to abort after detecting a difference, and so could run even faster. The `compare-object` methods compare the entire files, even if the first bytes are different.
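For concreteness, the `compare-object` variants boil down to calls of roughly this shape. This is a minimal sketch with placeholder paths `$f1` and `$f2`; the full script further below adds the timing, the PS 5/7 parameter switch, and the `@(, )` wrapping for the as-byte cases.

```powershell
# Sketch of the compare-object variants; $f1 and $f2 are placeholder file paths.
compare-object (get-content $f1) (get-content $f2)                    # "Compare-object" (line by line)
compare-object (get-content $f1 -raw) (get-content $f2 -raw)          # "Compare raw" (one string per file)
compare-object (get-content $f1 -AsByteStream -ReadCount 0) `
               (get-content $f2 -AsByteStream -ReadCount 0)           # "Compare as byte read 0" (PS 7 syntax)
```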
| Size | PS | Comp | FC | Compare-object | Compare raw | Compare as byte raw | Compare as byte read 0 | Buffered |
|------|----|------|------|----------------|-------------|---------------------|------------------------|----------|
| 74   | -  | 29.0 | 30.3 |      |      |     |       |      |
| ""   | 5  | 29.2 | 30.2 | 4.1  | 18.7 | 0.5 | 0.5   | 35.2 |
| ""   | 7  | 29.2 | 30.9 | 3.4  | 20.7 | 1.2 | 0.9   | 36.5 |
| 66   | -  | 25.3 | 26.2 |      |      |     |       |      |
| ""   | 5  | 25.5 | 26.1 | 5.6  | 20.4 | 0.5 | 0.5   | 35.4 |
| ""   | 7  | 25.4 | 26.3 | 2.8  | 22.0 | 1.2 | 1.0   | 37.1 |
| 162  | -  | 25.6 | 26.1 |      |      |     |       |      |
| ""   | 5  | 25.5 | 26.5 | 15.0 | 18.7 | 0.5 | Error | 35.8 |
| ""   | 7  | 25.8 | 26.8 | 17.8 | 24.6 | 1.2 | 1.0   | 36.8 |
| 56   | -  | 25.5 | 25.8 |      |      |     |       |      |
| ""   | 5  | 25.5 | 26.0 | 21.6 | 3.0  | 0.5 | 0.5   | 35.2 |
| ""   | 7  | 26.0 | 26.5 | 17.6 | 25.1 | 1.3 | 1.1   | 36.0 |
Table: Speed, in Mb/sec, of comparing four identical pairs of files (identified by their size in Mb) by seven methods running in Windows batch, in Windows PowerShell 5.1, and in PowerShell 7.
Note that with the method "Compare-object", the third and fourth files run much faster than the first two. This was the mystery that my original question asked about, and is explained in its answers.
Errors and mistakes
In the case indicated as "Error" (method "Compare as byte read 0" in PS 5 on the largest file), the process crashed PowerShell with the message, "get-content : Array dimensions exceeded supported range."
As I've pointed out elsewhere, the "compare raw" method crashed with an `OutOfMemoryException` when presented with a pair of files of 3.7 Gb.
Warning: In initial testing, the results appeared to indicate that the Windows command `FC` was about seven times faster than the buffered method. I had already performed a comparison of a 1 Tb folder with its backup that took about 10 hours using the buffered method. Excited that `FC` could work so much faster, I rewrote my script to repeat that comparison using `FC` instead, and was then confused to find that it took 14 hours. Then I realized that the initial results had been skewed by Windows caching the files when I ran the comparisons with `comp`, so the repeat runs with `FC` went much faster. In the results reported above, the measurements were made with an empty cache. I have not found a way to remove a file from the cache, so each measurement was made immediately after rebooting the computer (and with nothing else running).
Conclusion
- In all cases, the buffered method ran the fastest.
- `compare-object` is essentially useless on binary files. It only gave any reasonable speed when called with `get-content ... -raw` rather than "as byte", but when doing that it crashed on files over a few Gb.
- If one doesn't want to run a custom function, the best method appears to be good old `FC`.
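For example, a single `FC` comparison from PowerShell can look like the sketch below (placeholder paths; `FC.exe` is spelled out because `fc` is an alias for `Format-Custom` in Windows PowerShell). The exit-code mapping is the same one the script below uses.

```powershell
# Minimal sketch (placeholder paths): binary comparison with FC, using the exit code for the result.
FC.exe /b "C:\source\file.bin" "D:\backup\file.bin" > $null
switch ($LASTEXITCODE) { 0 {"Same"} 1 {"Dif"} default {"Error: $LASTEXITCODE"} }
```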
Environment
The data above were collected on an AMD Ryzen 7 Pro 6850H Processor with 32 Gb of RAM, running Windows 11 Pro 64. The files in each pair are on an internal SSD and an external USB SSD.
Later, I repeated the tests of just methods "FC" and "buffered" with the external storage being a USB spinning hard drive instead of the SSD. I was surprised to see a dramatic speed improvement with that change:
| Size | PS | FC | Buffered |
|------|----|------|----------|
| 74   | 7  | 48.2 | 57.7 |
| 66   | 7  | 52.1 | 63.3 |
| 162  | 7  | 54.5 | 59.7 |
| 56   | 7  | 50.6 | 55.9 |
Table: Speed, in Mb/sec, of comparing four identical pairs of files (identified by their size in Mb) by two methods running in PowerShell 7. The difference from the previous table is that one file in each pair was on a spinning HD instead of an SSD.
I don't know if this means my low-cost SSD has poor performance, or if it's because I don't have the right cable for it. It's not a big problem for me because I don't run these comparisons often, but it does show the hardware dependence of such a process.
Size dependence of speed of buffered method
I used the buffered method to run a comparison of a 1.1 Tb folder with its backup. This took 10.1 hours, of which 9.8 hr was the sum of the elapsed times of the comparisons (i.e., overhead of 0.3 hr from scanning of folders). Thus the average speed of the comparisons was 116 Gb/hr or 33 Mb/sec. The size of the files ranged from 1 byte to 32 Gb.
To learn about the factors that affect the speed of the comparisons, I used Excel to rank, out of the 222,000 files, the 1,355 files with comparison times over 1 second by size and comparison time, and the 2,009 files with times over 1/2 second by comparison speed.
There was a rough, but far from perfect, correlation between file size and comparison speed. The 25 largest files, ranging from 4 to 32 Gb, had speeds ranging from 34.8 to 36.8 Mb/sec. These were close to, but not, the fastest speeds: the top 25 speeds ranged from 36.7 to 36.9 Mb/sec, for files ranging from 61 Mb to 28 Gb.
At the lower end, the 25 smallest ranked files, ranging from 22 kb to 33 Mb, had speeds ranging from 13 kb/sec to 31 Mb/sec. The slowest 25 ranked speeds ranged from 2 kb/sec to 13 Mb/sec, for files ranging from 2 kb to 55 Mb.
It's helpful that in general, larger files compare faster. Definitely better than the other way around!
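I did the ranking in Excel, but the same ranking could be produced directly from the CSV that the script in the Code section writes. Here is a rough sketch, assuming that file's columns (`Size` in bytes, `tElapsed` in seconds, `Filespec`) and the same placeholder path for the CSV as in the script:

```powershell
# Sketch: rank files with comparison times over 1 second by speed (Mb/sec), from the script's output CSV.
# "<filespec of output csv file>" is the same placeholder used in the script below.
Import-Csv "<filespec of output csv file>" |
    Where-Object { [double]$_.tElapsed -gt 1 } |
    Select-Object Filespec, Size, tElapsed, @{ n = 'MbPerSec'; e = { [math]::Round([double]$_.Size / 1MB / [double]$_.tElapsed, 1) } } |
    Sort-Object MbPerSec -Descending
```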
Code
I'm interested in feedback on improving these scripts, with two provisos. First, I know the batch script is pretty lame; it was just laid out quickly to get the job done. More attention was paid to the design of the PowerShell script. Second, I know that my coding style in the PowerShell script is unconventional, but I've developed it over many years, and I can only apologize if you don't like it. However, please do say something if you see ways to improve the functionality of the script.
It would also be interesting to hear if other people run the script and get results that are consistent with mine or different.
Windows batch scripts for `comp` and `FC`:
rem Script: "measure speed - comp.bat"
rem Measure the time taken to compare two files using "comp" running in a Windows batch script.
rem To ensure that none of the files is in cache, run this immediately after booting the computer.
time < nul
comp /m "<path 1><file 1>" "<path 2><file 1>"
time < nul
comp /m "<path 1><file 2>" "<path 2><file 2>"
time < nul
comp /m "<path 1><file 3>" "<path 2><file 3>"
time < nul
comp /m "<path 1><file 4>" "<path 2><file 4>"
time < nul
The console output was copied and pasted into Excel, which then subtracted the times to get the elapsed time of each process. The batch script for `FC` was the same, with `comp /m` replaced by `FC /b`.
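The timestamps could instead be taken inside PowerShell, for example with `Measure-Command`; a minimal sketch, using the same placeholder paths as the batch script. I used the batch approach above so that the Windows commands could also be timed outside of PowerShell.

```powershell
# Sketch (placeholder paths): time one comp run directly instead of subtracting console timestamps.
$t = Measure-Command { comp /m "<path 1><file 1>" "<path 2><file 1>" | Out-Host }
"{0:N1} sec" -f $t.TotalSeconds
```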
PowerShell script, including the function `bFilesCompareBinary`:
# measure-speed-of-file-comparisons.ps1
# Set the $sFolder_n to a pair of folders with identical content. This script will measure and record,
# by one of eight different methods, the time taken to verify that all the files are identical.
# To ensure that none of the files is in cache, run this immediately after booting the computer.
# On use of get-content parameters "-encoding byte", "-AsByteStream", "-raw", and "-ReadCount 0":
# www.jonathanmedd.net/2017/12/powershell-core-does-not-have-encoding-byte.-replaced-with-new-parameter-asbytestream.html/
# www.powershellmagazine.com/2014/03/17/pstip-reading-file-content-as-a-byte-array/
# www.github.com/PowerShell/PowerShell/issues/11266
# www.github.com/MicrosoftDocs/PowerShell-Docs/issues/3215
# Calls to get-content with as-byte parameters are wrapped in an array ("@(, )") per instructions in
# https://mcmap.net/q/911265/-powershell-why-are-these-file-comparison-times-so-different/#76843506
# =========================================================================
# Manually set these paths before running:
# =========================================================================
$sFolder_1 = "<path to first folder, including final '\'>"
$sFolder_2 = "<path to second folder, including final '\'>"
$sOutputFilespec = "<filespec of output csv file>"
# =========================================================================
# Function bFilesCompareBinary()
# =========================================================================
function bFilesCompareBinary ([System.IO.FileInfo] $oFile_1, [System.IO.FileInfo] $oFile_2, `
[uint32] $nBufferSize = 524288, $sRetIfSame = "Same", $sRetIfDif = "Dif")
{# Return message for whether two given files are identical by binary comparison, or error description.
# Assumes the files are the same size, else error.
# From "https://mcmap.net/q/842014/-powershell-binary-file-comparison#22800663"
# But comment by @mclayton on "https://mcmap.net/q/911265/-powershell-why-are-these-file-comparison-times-so-different/#76843506"
# warns that .read() does not always get all the bytes requested, so I've added a test for that.
# FileInfo Class: "https://learn.microsoft.com/en-us/dotnet/api/system.io.fileinfo"
# FileStream Class: "https://learn.microsoft.com/en-us/dotnet/api/system.io.filestream"
if ($nBufferSize -eq 0) {$nBufferSize = 524288}
try{$oStream_1 = $oFile_1.OpenRead()
$oStream_2 = $oFile_2.OpenRead()
$oBuffer_1 = New-Object byte[] $nBufferSize
$oBuffer_2 = New-Object byte[] $nBufferSize
if ($oFile_1.Length -ne $oFile_2.Length) {throw "Files are different sizes: $($oFile_1.Length) , $($oFile_2.Length)"}
$nBytesLeft = $oFile_1.Length
$bDifferenceFound = $false
$sError = ""
do {$nBytesToGet = [math]::Min($nBytesLeft, $nBufferSize)
$nBytesRead_1 = $oStream_1.read($oBuffer_1, 0, $nBytesToGet)
$nBytesRead_2 = $oStream_2.read($oBuffer_2, 0, $nBytesToGet)
if ($nBytesRead_1 -ne $nBytesRead_2) {throw "Different byte count each file: $nBytesRead_1 , $nBytesRead_2"}
if ($nBytesRead_1 -ne $nBytesToGet) {throw "Byte count different from requested: $nBytesRead_1 , $nBytesToGet"}
$nBytesLeft -= $nBytesRead_1
if (-not [System.Linq.Enumerable]::SequenceEqual($oBuffer_1, $oBuffer_2)) {$bDifferenceFound = $true}
} while ((-not $bDifferenceFound) -and $nBytesLeft -gt 0)
}
catch {$sError = "Error: $_"}
finally {if ($oStream_1) {$oStream_1.Close()} ; if ($oStream_2) {$oStream_2.Close()}}
if ($sError -ne "") {return $sError}
elseif ($bDifferenceFound) {return $sRetIfDif}
else {return ($sRetIfSame)}
}
# =========================================================================
# User interaction
# =========================================================================
$bBooted = (read-host ("Did you boot the computer immediately before running this? (Enter ""Y"" or ""N"".)")).ToUpper()
$sPSenv = (read-host ("PowerShell environment: Enter ""D"" if running directly in Windows or ""S"" if in scripting environment (ISE or VS Code)")).ToUpper()
$nMethod = read-host ("Comparison method: Enter 1 for comp, 2 for FC, 3 for compare-object, 4 for compare raw, " + `
# "5 for compare as byte, " + `
"6 for compare as byte raw, 7 for compare as byte read 0, or 8 for buffered")
switch ($nMethod) {1 {$sMethod = "comp"} 2 {$sMethod = "FC"}
3 {$sMethod = "compare-object"} 4 {$sMethod = "compare raw"}
5 {$sMethod = "compare as byte"} 6 {$sMethod = "compare as byte raw"}
7 {$sMethod = "compare as byte read 0"} 8 {$sMethod = "buffered"}}
# =========================================================================
# Scan the folders and compare files.
# =========================================================================
$nLen_1 = $sFolder_1.Length
$PSversion = $PSVersionTable.PSVersion.Major
get-ChildItem -path $sFolder_1 -Recurse | ForEach-Object `
{$oItem_1 = $_
$sItem_1 = $oItem_1.FullName
# If it's a file, compare in both folders:
if (Test-Path -Type Leaf $sItem_1) `
{$nSize_1 = $oItem_1.Length
$sItem_rel = $sItem_1.Substring($nLen_1)
$sItem_2 = join-path $sFolder_2 $sItem_rel
$oItem_2 = get-item $sItem_2
$LastExitCode = 99
$nMid = ""
write-output "Check $sItem_rel"
$dStart = $(get-date)
switch ($nMethod)
{{$_ -in 1, 2}
{switch ($nMethod)
{1 {comp /m "$sItem_1" "$sItem_2"}
2 {FC.exe /b "$sItem_1" "$sItem_2"}}
switch ($LastExitCode) {0 {$sResult = "Same"} 1 {$sResult = "Dif"} default {$sResult = "Error: $LastExitCode"}}}
{$_ -in 3, 4, 5, 6, 7}
{switch ($nMethod)
{3 {$oContent_1 = (get-content $sItem_1)
$oContent_2 = (get-content $sItem_2)}
4 {$oContent_1 = (get-content $sItem_1 -raw)
$oContent_2 = (get-content $sItem_2 -raw)}
{$_ -in 5, 6, 7}
{switch ($PSversion)
{5 {switch ($nMethod)
{5 {$oContent_1 = @(, (get-content $sItem_1 -encoding byte))
$oContent_2 = @(, (get-content $sItem_2 -encoding byte))}
6 {$oContent_1 = @(, (get-content $sItem_1 -encoding byte -raw))
$oContent_2 = @(, (get-content $sItem_2 -encoding byte -raw))}
7 {$oContent_1 = @(, (get-content $sItem_1 -encoding byte -ReadCount 0))
$oContent_2 = @(, (get-content $sItem_2 -encoding byte -ReadCount 0))}
} }
7 {switch ($nMethod)
{5 {$oContent_1 = @(, (get-content $sItem_1 -AsByteStream))
$oContent_2 = @(, (get-content $sItem_2 -AsByteStream))}
6 {$oContent_1 = @(, (get-content $sItem_1 -AsByteStream -raw))
$oContent_2 = @(, (get-content $sItem_2 -AsByteStream -raw))}
7 {$oContent_1 = @(, (get-content $sItem_1 -AsByteStream -ReadCount 0))
$oContent_2 = @(, (get-content $sItem_2 -AsByteStream -ReadCount 0))}
} }
default {$sResult = "Error: PowerShell version is $PSversion"}
} } }
$nMid = ($(get-date) - $dStart).Ticks / 1e7
if (compare-object $oContent_1 $oContent_2) `
{$sResult = "Dif"} else {$sResult = "Same"}}
8 {$sResult = bFilesCompareBinary $oItem_1 $oItem_2}
}
$nElapsed = ($(get-date) - $dStart).Ticks / 1e7
$oOutput = [PSCustomObject]@{Booted = $bBooted ; PSversion = $PSversion ; PSenv = $sPSenv ; Method = $sMethod ; Item = $nItem ; Result = $sResult
Size = $nSize_1 ; tStart = $dStart ; tMid = $nMid ; tElapsed = $nElapsed ; Filespec = $sItem_rel}
Export-Csv -InputObject $oOutput -Path $sOutputFilespec -Append -NoTypeInformation
} }
# =========================================================================
# End of script
# =========================================================================
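For completeness, `bFilesCompareBinary` can also be called on its own once the function has been loaded into a session; a minimal sketch with placeholder paths, timed the same way the script does it:

```powershell
# Sketch (placeholder paths): one standalone buffered comparison, reporting the speed in Mb/sec.
$oFile_1 = get-item "<path 1><file 1>"
$oFile_2 = get-item "<path 2><file 1>"
$dStart = get-date
$sResult = bFilesCompareBinary $oFile_1 $oFile_2
$nSec = ((get-date) - $dStart).TotalSeconds
"{0}: {1:N1} Mb/sec" -f $sResult, ($oFile_1.Length / 1MB / $nSec)
```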