Edit: I have changed the title of this question from "PowerShell: Why is this timing not working?" I originally thought the times reported had to be wrong, but I was wrong about that. The times reported were correct, and what I learned from the discussion of the question was why the times were so different. So the new title better describes what can be learned from this Q&A.
I'm writing a script to compare the contents of two folders, including a binary comparison if the size and timestamp are the same. I want to monitor how quickly it does the comparisons, but the timing results are way out of whack.

Here is an excerpt of my code that just tests the monitoring of the comparison speed.
```powershell
$sFolder_1 = "<path of folder 1, including final \>"
$sFolder_2 = "<path of folder 2, including final \>"
$nLen_1 = $sFolder_1.Length   # length of the folder-1 prefix, used below to derive relative paths

Get-ChildItem -Path $sFolder_1 -Recurse | ForEach-Object {
    $oItem_1 = $_
    $sItem_1 = $oItem_1.FullName
    $sItem_rel = $sItem_1.Substring($nLen_1)
    $sItem_2 = Join-Path $sFolder_2 $sItem_rel
    if (Test-Path -PathType Container $sItem_1) { $sFile = "" } else { $sFile = "F" }

    # Check for a corresponding item in folder 2:
    if (-not (Test-Path $sItem_2)) {
        $sResult = "Not in 2"
    }
    else {
        # If it's a file, compare the contents in both folders:
        if ($sFile -eq "") {
            $sResult = "Found"
        }
        else {
            $nSize_1 = $oItem_1.Length
            $dTimeStart = Get-Date
            $nKb = $nSize_1 / 1024
            Write-Output "$dTimeStart : Checking file ($nKb kb)"
            if (Compare-Object (Get-Content $sItem_1) (Get-Content $sItem_2)) {
                $sResult = "Dif content"
            }
            else {
                $sResult = "Same"
            }
            $nTimeElapsed = ((Get-Date) - $dTimeStart).Ticks / 1e7
            $nSpeed = $nKb / $nTimeElapsed
            Write-Output "$nKb kb in $nTimeElapsed seconds, speed $nSpeed kb/sec."
        }
    }
    Write-Output $sResult
}
```
Here is the output from running that on a particular pair of folders. The four files in the two folders are all "gvi" files, which is a type of video file.
```
08/05/2023 08:58:41 : Checking file (75402.453125 kb)
75402.453125 kb in 37.389018 seconds, speed 2016.70054894194 kb/sec.
Same
08/05/2023 08:59:18 : Checking file (67386.28515625 kb)
67386.28515625 kb in 22.6866484 seconds, speed 2970.30588071573 kb/sec.
Same
08/05/2023 08:59:41 : Checking file (165559.28125 kb)
165559.28125 kb in 5.6360258 seconds, speed 29375.1815774158 kb/sec.
Same
08/05/2023 08:59:47 : Checking file (57776.244140625 kb)
57776.244140625 kb in 2.059942 seconds, speed 28047.5101437929 kb/sec.
Same
```
This says that the comparison ran ten times faster on the third and fourth files than on the first two. That doesn't make sense. I'm guessing that there's something about the way PowerShell is optimizing the process that is causing the difference. Is there a way to find out the actual time spent doing each comparison?
If we rephrase your question to:

> why doesn't the comparison time scale with the size of the files?

Then the answer is easy:

You're expecting the time taken to vary with the size of the file, but a significant part of the performance actually depends on the number of line-break character sequences in the files!

So there are a couple of issues with your approach:
Problem 1: With binary data files, the expressions `(Get-Content $sItem_1)` and `(Get-Content $sItem_2)` basically retrieve mangled arrays of stringified binary data, where the number of items in each array is determined by the number of "line-break-like" sequences in the binary content.

`Get-Content` is primarily meant for use with text-based files - by default it will decode a file and split it into lines of text based on the line-break sequences it finds. See the Get-Content documentation:

> the content is read one line at a time and returns a collection of objects, each representing a line of content.

This means that any byte sequences in the binary file that happen to look like line breaks will be treated as line breaks, regardless of their meaning in the binary file's native format. The number of strings in the array returned by `Get-Content` will correlate with the number of accidental line-break sequences in the binary file.

In your sample data this ranges from hundreds of items through to hundreds of thousands of items, and doesn't really relate to the size of the file.
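To see this effect directly, here's a small sketch (the file name and byte values are arbitrary, just for illustration) that writes a few raw bytes containing line-break-like values and counts how many "lines" `Get-Content` thinks the file has:

```powershell
# Write 6 raw bytes to a scratch file; 0x0A (LF) and 0x0D 0x0A (CRLF) happen
# to be embedded in the "binary" data.
$bytes = [byte[]](0x41, 0x0A, 0x42, 0x0D, 0x0A, 0x43)   # "A", LF, "B", CR, LF, "C"
[System.IO.File]::WriteAllBytes("$PWD\demo.bin", $bytes)

# Get-Content treats those byte sequences as line breaks and splits on them:
$lines = Get-Content "$PWD\demo.bin"
$lines.Count   # 3 - one string per accidental "line"
```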
Problem 2: The performance of `Compare-Object` correlates loosely with the number of items in the input collections.

`Compare-Object` attempts to pair up equal values in the left and right sides and returns any items that it fails to find a partner for. As the number of input items increases, so does the time taken, and the increase can be combinatorially explosive with the wrong data. From your own test data you can see that the number of "line breaks" correlates with the processing time:
| File   | Size   | CR Count | LF Count | Time | KB/s   |
|--------|--------|----------|----------|------|--------|
| File 1 | 75 MB  | 272,652  | 291,178  | 37 s | 2,016  |
| File 2 | 67 MB  | 189,941  | 197,111  | 22 s | 2,970  |
| File 3 | 165 MB | 398      | 721      | 5 s  | 29,375 |
| File 4 | 57 MB  | 3,130    | 28,847   | 2 s  | 28,047 |
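You can also see the item-count scaling in isolation with a quick test - a rough sketch, with arbitrary item counts, comparing two equal collections so that only the number of items changes:

```powershell
# Compare-Object has to pair up every item, so the time grows with the item
# count even though the values themselves are trivial.
foreach ($n in 1000, 10000, 100000) {
    $left  = 1..$n
    $right = 1..$n
    $time = Measure-Command { Compare-Object $left $right | Out-Null }
    "{0,7} items: {1:N0} ms" -f $n, $time.TotalMilliseconds
}
```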
Possible fixes:

One option is to use the `-Raw` switch on `Get-Content`, which forces it to read the entire file contents into a single string and ignore line breaks. You don't really need `Compare-Object` if you do this - you can just do a simple string comparison:

```powershell
if ((Get-Content $sItem_1 -Raw) -eq (Get-Content $sItem_2 -Raw))
```

However, you're still creating mangled stringified representations of the binary data, which isn't ideal, and you're processing the whole file even if the first byte is different.
Another option is to use the `-AsByteStream` switch on `Get-Content` - this returns an array of bytes instead of a string, but you'll need to modify the call to `Compare-Object` as well:

```powershell
Compare-Object @(,(Get-Content $sItem_1 -AsByteStream)) @(,(Get-Content $sItem_2 -AsByteStream))
```
Note that the return value from `Get-Content` is wrapped in an outer array `@(, ... )` - this forces `Compare-Object` to compare the two arrays as ordered lists, rather than as sets of values. See the two examples below:
```powershell
# nothing returned because the arrays are treated as *sets* with 2 matching items, not as ordered lists
PS> Compare-Object @(0, 1) @(1, 0)

# inputs are treated as containing a single ordered-list item, and the lists are not the same
PS> Compare-Object @(,@(0, 1)) @(,@(1, 0))

InputObject SideIndicator
----------- -------------
{1, 0}      =>
{0, 1}      <=
```
In this case you could do:

```powershell
if( Compare-Object @(,(Get-Content $sItem_1 -AsByteStream)) @(,(Get-Content $sItem_2 -AsByteStream)) )
```

... although this still reads in the whole file even if the first byte is different.
Update - as suggested by @mklement0, using `-Raw` as well as `-AsByteStream` will improve performance, as the entire file contents are returned as a single byte array rather than as a drip-fed pipeline of individual bytes, one at a time, that have to be collected into an array anyway.

The updated code would look like:

```powershell
if( Compare-Object @(,(Get-Content $sItem_1 -AsByteStream -Raw)) @(,(Get-Content $sItem_2 -AsByteStream -Raw)) )
```
You could also take a completely different approach and compare the hashes of the files with `Get-FileHash`. That's presumably optimised to be memory-efficient (e.g. not storing the whole file in memory at once), and it properly treats the binary data as binary data.
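For example, a minimal sketch of the hash-based check, reusing the variable names from the question (`Get-FileHash` defaults to the SHA256 algorithm):

```powershell
# Compare the two files by hash; identical hashes mean identical content
# (for practical purposes - a SHA256 collision is vanishingly unlikely).
$sHash_1 = (Get-FileHash $sItem_1).Hash
$sHash_2 = (Get-FileHash $sItem_2).Hash
if ($sHash_1 -eq $sHash_2) { $sResult = "Same" } else { $sResult = "Dif content" }
```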
As with the other two approaches, though, it will still process the whole file before comparing the hashes. To fix that you might need to drop down to native dotnet methods, but this answer is already pretty long, so you could maybe search for "compare binary files in powershell" to research that...
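If you do want to go down that route, here's a rough sketch (not from the original answer - the function name and buffer size are arbitrary) of an early-exit binary comparison using .NET file streams:

```powershell
# Compare two files byte-by-byte, stopping at the first difference instead of
# reading both files completely.
function Test-FileBytesEqual {
    param([string]$Path1, [string]$Path2)

    $fs1 = [System.IO.File]::OpenRead($Path1)
    $fs2 = [System.IO.File]::OpenRead($Path2)
    try {
        # Different lengths can never be equal - the cheapest possible early exit.
        if ($fs1.Length -ne $fs2.Length) { return $false }

        $buf1 = New-Object byte[] 65536
        $buf2 = New-Object byte[] 65536
        while ($true) {
            $n1 = $fs1.Read($buf1, 0, $buf1.Length)
            $n2 = $fs2.Read($buf2, 0, $buf2.Length)
            if ($n1 -ne $n2) { return $false }
            if ($n1 -eq 0) { return $true }    # both streams exhausted: files match
            for ($i = 0; $i -lt $n1; $i++) {
                if ($buf1[$i] -ne $buf2[$i]) { return $false }   # first mismatch: stop early
            }
        }
    }
    finally {
        $fs1.Dispose()
        $fs2.Dispose()
    }
}
```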