powershell, memory, memory-profiling, import-csv, pscustomobject

PowerShell: Importing a ~16MB CSV Into a PowerShell Variable Creates >600MB of Memory Usage


I'm trying to understand why PowerShell's memory usage balloons so much when I import a ~16MB file into a variable. I understand there's additional memory structure around that variable, but I'm trying to understand why it's THAT high. Here's what I'm doing below - just a stripped-down, simplistic snippet of another script that anyone can run.

Notes/Questions

  1. I'm not complaining - I'm trying to understand why memory usage is so high, and whether there's a better way of doing this or of managing memory more efficiently, to respect the system I'm running this on.
  2. The same behavior occurs in PowerShell 5.1 and in PowerShell 7 (RC3, just released). I don't think this is a bug, simply another opportunity for me to learn more.
  3. My overall objective is to run a foreach loop that checks another, much smaller array against this array for matches or lack thereof.

My Test Code

Invoke-WebRequest -Uri "http://s3.amazonaws.com/alexa-static/top-1m.csv.zip" -OutFile C:\top-1m.csv.zip

# Note: -DestinationPath is a *directory* to extract into; the archive contains top-1m.csv.
Expand-Archive -Path C:\top-1m.csv.zip -DestinationPath C:\

$alexaTopMillion = Import-Csv -Path C:\top-1m.csv

To anyone answering this: Thank you for your time and helping me learn more every day!


Solution

  • Generally speaking, iRon's advice in a comment on the question is worth heeding (the specific question is addressed in the section that follows this one):

    To keep memory usage low, use streaming of objects in the pipeline rather than collecting them in memory first - if feasible.

    That is, instead of doing this:

    # !! Collects ALL objects in memory, as an array.
    $rows = Import-Csv in.csv
    foreach ($row in $rows) { ... }
    

    do this:

    # Process objects ONE BY ONE.
    # As long as you stream to a *file* or some other output stream
    # (as opposed to assigning to a *variable*), memory use should remain constant,
    # except for temporarily held memory awaiting garbage collection.
    Import-Csv in.csv | ForEach-Object { ... } # pipe to Export-Csv, for instance
    

    However, even then you can seemingly run out of memory with very large files - see this question - possibly due to a buildup of memory from no-longer-needed objects that haven't yet been garbage-collected. Periodically calling [GC]::Collect() in the ForEach-Object script block may therefore solve the problem - see this answer for an example.
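    A minimal sketch of that periodic-collection idea (the every-50,000-rows interval is an arbitrary assumption; tune it for your workload):

    ```powershell
    $i = 0
    Import-Csv in.csv | ForEach-Object {
      # ... per-row processing goes here ...

      # Periodically force a garbage collection to release memory held
      # by already-processed rows that are no longer referenced.
      if ((++$i % 50000) -eq 0) { [GC]::Collect() }
    }
    ```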


    If you do need to collect all objects output by Import-Csv in memory at once:

    The inordinate memory use you observe comes from how [pscustomobject] instances (Import-Csv's output type) are implemented, as discussed in GitHub issue #7603 (emphasis added):

    The memory pressure most likely comes from the cost of PSNoteProperty [which is how [pscustomobject] properties are implemented]. Each PSNoteProperty has an overhead of 48 bytes, so when you just store a few bytes per property, that becomes massive.
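    To put that 48-byte figure in perspective, here is a back-of-the-envelope estimate (mine, not from the issue) for a 1,000,000-row, 2-column CSV, counting only the PSNoteProperty overhead - not the property-value strings themselves or the per-row PSObject wrappers:

    ```powershell
    # 48 bytes of PSNoteProperty overhead per property, per row.
    $rows = 1e6; $cols = 2; $bytesPerProp = 48
    '{0:N0} MB' -f ($rows * $cols * $bytesPerProp / 1MB)  # roughly 92 MB of overhead alone
    ```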

    The same issue proposes a workaround to reduce memory consumption (as also shown in Wasif Hasan's answer):

    Note: This workaround comes at the expense of considerably diminished execution speed.

    $csvFile = 'C:\top-1m.csv'
    
    # Dynamically define a custom class derived from the *first* row
    # read from the CSV file.
    # Note: While this is a legitimate use of Invoke-Expression, 
    #       it should generally be avoided.
    "class CsvRow { 
     $((Import-Csv $csvFile | Select-Object -first 1).psobject.properties.Name -replace '^', '[string] $$' -join ";") 
    }" | Invoke-Expression
    
    # Import all rows and convert them from [pscustomobject] instances 
    # to [CsvRow] instances to reduce memory consumption.
    # Note: Casting the Import-Csv call directly to [CsvRow[]] would be noticeably
    #       faster, but increases *temporary* memory pressure substantially.
    $alexaTopMillion = Import-Csv $csvFile | ForEach-Object { [CsvRow] $_ }
    

    Longer-term, a better solution that would also be faster is to make Import-Csv support outputting parsed rows with a given output type, say, via an -OutputType parameter, as proposed in GitHub issue #8862.
    If that is of interest to you, show your support for the proposal there.
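    As for the question's stated goal - checking a much smaller list against this data for matches or non-matches - a memory-friendlier alternative to collecting all rows is to stream just the domain column into a HashSet and test membership against that. This is a sketch, not from the original post; the -Header column names and $smallerList are assumptions (the real Alexa file is headerless, with rank and domain columns):

    ```powershell
    # Keep one string per row rather than a full [pscustomobject] per row.
    $domains = [System.Collections.Generic.HashSet[string]]::new(
      [System.StringComparer]::OrdinalIgnoreCase
    )
    Import-Csv C:\top-1m.csv -Header Rank, Domain |
      ForEach-Object { $null = $domains.Add($_.Domain) }

    # O(1) lookups for each entry in the much smaller list.
    foreach ($candidate in $smallerList) {
      if ($domains.Contains($candidate)) { "$candidate: match" }
      else                               { "$candidate: no match" }
    }
    ```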


    Memory-use benchmarks:

    The following code compares memory use with normal Import-Csv import (array of [pscustomobject]s) to the workaround (array of custom-class instances).

    The measurement isn't exact, as the PowerShell process's working-set memory is simply queried, which can reflect the influence of background activities, but it gives a rough sense of how much less memory using a custom class requires.

    Sample output, which shows that with the sample 10-column, ~166,000-row CSV input file used below, the custom-class workaround requires only about one fifth of the memory - the specific ratio depends on the number of input rows and columns:

    MB Used Command
    ------- -------
     384.50  # normal import…
      80.48  # import via custom class…
    

    Benchmark code:

    # Create a sample CSV file with 10 columns, about 16 MB in size.
    $tempCsvFile = [IO.Path]::GetTempFileName()
    ('"Col1","Col2","Col3","Col4","Col5","Col6","Col7","Col8","Col9","Col10"' + "`n") | Set-Content -NoNewline $tempCsvFile
    ('"Col1Val","Col2Val","Col3Val","Col4Val","Col5Val","Col6Val","Col7Val","Col8Val","Col9Val","Col10Val"' + "`n") * 1.662e5 |
      Add-Content $tempCsvFile
    
    try {
    
      { # normal import
        $all = Import-Csv $tempCsvFile
      },
      { # import via custom class
        "class CsvRow {
          $((Import-Csv $tempCsvFile | Select-Object -first 1).psobject.properties.Name -replace '^', '[string] $$' -join ";")
        }" | Invoke-Expression
        $all = Import-Csv $tempCsvFile | ForEach-Object { [CsvRow] $_ }
      } | ForEach-Object {
        [gc]::Collect(); [gc]::WaitForPendingFinalizers() # garbage-collect first.
        Start-Sleep 2 # Wait a little for the freed memory to be reflected in the process object.
        $before = (Get-Process -Id $PID).WorkingSet64
        # Execute the command.
        & $_
        # Measure memory consumption and output the result.
        [pscustomobject] @{
          'MB Used' = ('{0,4:N2}' -f (((Get-Process -Id $PID).WorkingSet64 - $before) / 1mb)).PadLeft(7)
          Command = $_
        }
      }
    
    } finally {
      Remove-Item $tempCsvFile
    }