
PowerShell: Why are these file comparison times so different?


Edit: I have changed the title of this question from "PowerShell: Why is this timing not working?" I originally thought the times reported had to be wrong, but I was wrong about that. The times reported were correct, and what I learned from the discussion of the question was why the times were so different. So the new title better describes what can be learned from this Q&A.


I'm writing a script to compare the contents of two folders, including a binary comparison if the size and timestamp are the same. I want to monitor how quickly it does the comparisons, but the results I'm getting are way out of whack.

Here is an excerpt of my code that just tests the monitoring of the speed of comparison.

$sFolder_1 = "<path of folder 1, including final \>"
$sFolder_2 = "<path of folder 2, including final \>"
$nLen_1    = $sFolder_1.Length   # folder-1 prefix length, used below to get each item's relative path

get-ChildItem -path $sFolder_1 -Recurse | ForEach-Object `
   {$oItem_1   = $_
    $sItem_1   = $oItem_1.FullName
    $sItem_rel = $sItem_1.Substring($nLen_1)
    $sItem_2   = join-path $sFolder_2 $sItem_rel
    if(Test-Path -Type Container $sItem_1) {$sFile = ""} else {$sFile = "F"}

    # Check for corresponding item in folder 2:
    if (-not (Test-Path $sItem_2)) `
       {$sResult = "Not in 2"}
      else
        # If it's a file, compare in both folders:
       {if ($sFile -eq "") `
           {$sResult = "Found"}
          else
           {$nSize_1 = $oItem_1.Length
            $dTimeStart = $(get-date)
            $nKb = ($nSize_1 / 1024)
            Write-Output "$dTimeStart : Checking file ($nKb kb)"
            if (Compare-Object (Get-Content $sItem_1) (Get-Content $sItem_2)) `
               {$sResult = "Dif content"}
              else
               {$sResult = "Same"}
            $nTimeElapsed = ($(get-date) - $dTimeStart).Ticks / 1e7
            $nSpeed = $nKb / $nTimeElapsed
            Write-Output "$nKb kb in $nTimeElapsed seconds, speed $nSpeed kb/sec."
        }   }
    Write-Output $sResult
    }

Here is the output from running that on a particular pair of folders. The four files in the two folders are all "gvi" files, which is a type of video file.

08/05/2023 08:58:41 : Checking file (75402.453125 kb)
75402.453125 kb in 37.389018 seconds, speed 2016.70054894194 kb/sec.
Same
08/05/2023 08:59:18 : Checking file (67386.28515625 kb)
67386.28515625 kb in 22.6866484 seconds, speed 2970.30588071573 kb/sec.
Same
08/05/2023 08:59:41 : Checking file (165559.28125 kb)
165559.28125 kb in 5.6360258 seconds, speed 29375.1815774158 kb/sec.
Same
08/05/2023 08:59:47 : Checking file (57776.244140625 kb)
57776.244140625 kb in 2.059942 seconds, speed 28047.5101437929 kb/sec.
Same

This says that the comparison ran ten times faster on the third and fourth files than on the first two. That doesn't make sense. I'm guessing that there's something about the way PowerShell is optimizing the process that is causing the difference. Is there a way to find out the actual time spent doing each comparison?


Solution

    If we rephrase your question to:

        "Why are these file comparison times so different?"

    Then the answer is easy:

    You're expecting the time taken to vary with the size of the file, but your code is actually doing something that means a significant part of the performance is based on the number of line-break character sequences in the files!


    So there are a couple of issues with your approach:

    Problem 1: With binary data files, the expressions (Get-Content $sItem_1) and (Get-Content $sItem_2) basically retrieve mangled arrays of stringified binary data, where the number of items is determined by the number of "line-break-like" sequences in the binary content.

    As the Get-Content documentation puts it, the content is read one line at a time and returned as a collection of objects, each representing a line of content.

    This means that any byte sequences in the binary file that happen to look like line breaks will be treated as line breaks regardless of their meaning in the binary file's native format. The number of strings in the return array from Get-Content will correlate with the number of accidental line break sequences in the binary file.

    In your sample data this ranges from hundreds of items through to hundreds of thousands of items, and doesn't really seem to relate to the size of the file.
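
    You can see this in miniature with a tiny hand-made "binary" file (the file name below is just for illustration): a handful of bytes containing accidental LF and CRLF sequences comes back from Get-Content as several separate strings.

    # Six bytes, two of which form "accidental" line breaks (LF and CRLF):
    $demoPath = Join-Path $env:TEMP 'linebreak-demo.bin'
    [System.IO.File]::WriteAllBytes($demoPath, [byte[]](0x41, 0x0A, 0x42, 0x0D, 0x0A, 0x43))

    # Get-Content splits on those sequences and returns three "lines": A, B, C
    (Get-Content $demoPath).Count   # -> 3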


    Problem 2: Performance of Compare-Object correlates loosely with the number of items in the input collections.

    From your own test data you can see the number of "line breaks" correlates with the processing time:

    File     Size      CR Count   LF Count   Time   kb/s
    ------   ------    --------   --------   ----   ------
    File 1    75 mb     272,652    291,178   37 s    2,016
    File 2    67 mb     189,941    197,111   22 s    2,970
    File 3   165 mb         398        721    5 s   29,375
    File 4    57 mb       3,130     28,847    2 s   28,047
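
    You can reproduce the effect synthetically as well: for roughly the same total volume of data, Compare-Object takes much longer when the data arrives as many small items (the item counts below are arbitrary, and absolute timings will vary by machine):

    # ~1 MB of data as 100,000 short strings vs. 100 long strings:
    $many = foreach ($i in 1..100000) { "line $i" }
    $few  = foreach ($i in 1..100)    { "line $i" * 1000 }

    # The first comparison takes noticeably longer despite the similar total size:
    (Measure-Command { Compare-Object $many $many }).TotalSeconds
    (Measure-Command { Compare-Object $few  $few  }).TotalSeconds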

    Possible fixes:

    Get-Content -Raw

    One option is to use the -Raw switch on Get-Content, which forces it to read the entire file content into a single string instead of splitting it at line breaks.

    You don't really need to use Compare-Object if you do this - you can just do a simple string comparison:

    if ((Get-Content $sItem_1 -Raw) -eq (Get-Content $sItem_2 -Raw))
    

    However, you're still creating mangled stringified representations of the binary data, which isn't ideal (for instance, byte sequences that aren't valid in the assumed text encoding may be replaced with the same substitution character when decoded, so in principle two different files could compare as equal), and you're processing the whole file even if the first byte is different.

    Get-Content -AsByteStream

    Yet another option is to use the -AsByteStream switch on Get-Content (available in PowerShell 6+; Windows PowerShell 5.1 uses -Encoding Byte instead) - this will return an array of bytes instead of a string, but you'll need to modify the call to Compare-Object as well:

    Compare-Object @(,(Get-Content $sItem_1 -AsByteStream)) @(,(Get-Content $sItem_2 -AsByteStream))
    

    Note the return value from Get-Content is wrapped in an outer array @(, ... ) - this forces Compare-Object to compare the two arrays as ordered lists, rather than as sets of values. See the two examples below:

    # nothing returned because the arrays are treated as *sets* containing the same 2 items, not as ordered lists
    PS> Compare-Object @(0, 1) @(1, 0)
    
    # inputs are treated as containing a single ordered-list item, and the lists are not the same
    PS> Compare-Object @(,@(0, 1)) @(,@(1, 0))
    
    InputObject SideIndicator
    ----------- -------------
    {1, 0}      =>
    {0, 1}      <=
    

    In this case you could do:

    if( Compare-Object @(,(Get-Content $sItem_1 -AsByteStream)) @(,(Get-Content $sItem_2 -AsByteStream)) )
    

    ... although this still reads in the whole file even if the first byte is different.


    Update: as suggested by @mklement0, using -Raw as well as -AsByteStream will improve performance, as the entire file content is returned as a single byte array rather than a drip-fed pipeline of individual bytes, one at a time, that have to be collected into an array anyway.

    The updated code would look like:

    if( Compare-Object @(,(Get-Content $sItem_1 -AsByteStream -Raw)) @(,(Get-Content $sItem_2 -AsByteStream -Raw)) )
    

    Get-FileHash

    You could also take a completely different approach and compare the hashes of the files with Get-FileHash. That's presumably optimised to be memory-efficient (e.g. not storing the whole file in memory at once) and properly treats the binary data as binary data.
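
    For example (the hash algorithm chosen here is arbitrary, and the variable names follow the question's script):

    $hash_1 = Get-FileHash $sItem_1 -Algorithm SHA256
    $hash_2 = Get-FileHash $sItem_2 -Algorithm SHA256
    if ($hash_1.Hash -eq $hash_2.Hash) {$sResult = "Same"} else {$sResult = "Dif content"}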

    As with the other two approaches though, it will still process the whole file before comparing the hashes. To fix that you might need to drop down to native dotnet methods, but this answer is already pretty long, so you could maybe search for "compare binary files in powershell" to research that...
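
    As a starting point for that research, a minimal sketch of such a dotnet-based comparison might look like the following (the function name and buffer size are arbitrary). It checks the lengths first and stops at the first mismatching chunk, so unlike the approaches above it can bail out early:

    function Test-FileBytesEqual {
        param([string]$Path1, [string]$Path2)

        $fs1 = [System.IO.File]::OpenRead($Path1)
        $fs2 = [System.IO.File]::OpenRead($Path2)
        try {
            # Files of different lengths can never be equal - cheapest check first:
            if ($fs1.Length -ne $fs2.Length) { return $false }

            $buf1 = New-Object byte[] 65536
            $buf2 = New-Object byte[] 65536
            while ($true) {
                $read1 = $fs1.Read($buf1, 0, $buf1.Length)
                $read2 = $fs2.Read($buf2, 0, $buf2.Length)
                if ($read1 -ne $read2) { return $false }
                if ($read1 -eq 0) { return $true }   # both files exhausted: equal

                # Compare this chunk and return as soon as any byte differs:
                for ($i = 0; $i -lt $read1; $i++) {
                    if ($buf1[$i] -ne $buf2[$i]) { return $false }
                }
            }
        }
        finally {
            $fs1.Dispose()
            $fs2.Dispose()
        }
    }

    # e.g. if (-not (Test-FileBytesEqual $sItem_1 $sItem_2)) { $sResult = "Dif content" }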