I'm trying to understand why PowerShell's memory balloons so much when I import a ~16 MB file as a variable. I can understand that there's additional memory structure around that variable, but I'm just trying to understand why it's THAT high. Here's what I'm doing below - just a stripped-down, simplistic snippet of another script that anyone can run.
My Test Code
Invoke-WebRequest -uri "http://s3.amazonaws.com/alexa-static/top-1m.csv.zip" -OutFile C:\top-1m.csv.zip
Expand-Archive -Path C:\top-1m.csv.zip -DestinationPath C:\
$alexaTopMillion = Import-Csv -Path C:\top-1m.csv
To anyone answering this: Thank you for your time and helping me learn more every day!
Generally speaking, iRon's advice in a comment on the question is worth heeding (the specific question is addressed in the section that follows this one):
To keep memory usage low, use streaming of objects in the pipeline rather than collecting them in memory first - if feasible.
That is, instead of doing this:
# !! Collects ALL objects in memory, as an array.
$rows = Import-Csv in.csv
foreach ($row in $rows) { ... }
do this:
# Process objects ONE BY ONE.
# As long as you stream to a *file* or some other output stream
# (as opposed to assigning to a *variable*), memory use should remain constant,
# except for temporarily held memory awaiting garbage collection.
Import-Csv in.csv | ForEach-Object { ... } # pipe to Export-Csv, for instance
However, even then you can seemingly run out of memory with very large files - see this question - possibly related to a buildup of memory from no-longer-needed objects that haven't yet been garbage-collected; therefore, periodically calling [GC]::Collect() in the ForEach-Object script block may solve the problem - see this answer for an example.
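For example, a minimal sketch of such periodic collection - the interval of 10,000 rows and the processing placeholder are illustrative choices, not prescribed values:
$i = 0
Import-Csv in.csv | ForEach-Object {
  # ... process the current row here ...
  $i++
  # Every 10,000 rows, force a collection of no-longer-referenced objects.
  if ($i % 10000 -eq 0) { [GC]::Collect() }
}   # e.g., pipe on to Export-Csv out.csv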
As for why holding all of Import-Csv's output objects in memory at once requires so much memory: The inordinate memory use you observe comes from how [pscustomobject] instances (Import-Csv's output type) are implemented, as discussed in GitHub issue #7603 (emphasis added):
The memory pressure most likely comes from the cost of PSNoteProperty [which is how [pscustomobject] properties are implemented]. Each PSNoteProperty has an overhead of 48 bytes, so when you just store a few bytes per property, that becomes massive.
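To see this for yourself: each column of an imported row is backed by its own note property, which you can inspect as follows (an illustrative check, using the file path from the question):
# Each column of the first imported row is stored as a separate NoteProperty
# (an instance of System.Management.Automation.PSNoteProperty).
(Import-Csv C:\top-1m.csv | Select-Object -First 1).psobject.Properties |
  Select-Object Name, MemberType, TypeNameOfValue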
The same issue proposes a workaround to reduce memory consumption (as also shown in Wasif Hasan's answer):
Read the first CSV row and dynamically create a custom class that represents the rows, using Invoke-Expression.
Note: While its use is safe here, Invoke-Expression should generally be avoided.
If you know the column structure in advance, you can create a custom class the regular way, which also allows you to use appropriate data types for the properties (which are otherwise all strings by default); for instance, defining appropriate properties as [int] (System.Int32) further reduces memory consumption - see the sketch after the code below.
Pipe Import-Csv to a ForEach-Object call that converts each [pscustomobject] created to an instance of the dynamically created class, which stores the data more efficiently.
Note: This workaround comes at the expense of considerably diminished execution speed.
$csvFile = 'C:\top-1m.csv'
# Dynamically define a custom class derived from the *first* row
# read from the CSV file.
# Note: While this is a legitimate use of Invoke-Expression,
# it should generally be avoided.
"class CsvRow {
$((Import-Csv $csvFile | Select-Object -first 1).psobject.properties.Name -replace '^', '[string] $$' -join ";")
}" | Invoke-Expression
# Import all rows and convert them from [pscustomobject] instances
# to [CsvRow] instances to reduce memory consumption.
# Note: Casting the Import-Csv call directly to [CsvRow[]] would be noticeably
# faster, but increases *temporary* memory pressure substantially.
$alexaTopMillion = Import-Csv $csvFile | ForEach-Object { [CsvRow] $_ }
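For comparison, here is what the statically typed variant mentioned above could look like. This is only a sketch: it assumes, for illustration, that the CSV has two columns - a numeric rank and a domain name - and that the file has no header row (hence -Header); adapt the class name, property names, and types to your actual data.
# Statically defined row class; typing the rank as [int] (System.Int32)
# stores it more compactly than keeping it as a string.
class AlexaCsvRow {
  [int]    $Rank
  [string] $Domain
}
# Name the (header-less) columns via -Header and convert each row.
$alexaTopMillion = Import-Csv C:\top-1m.csv -Header Rank, Domain |
  ForEach-Object { [AlexaCsvRow] $_ }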
Longer-term, a better solution that would also be faster is to make Import-Csv support outputting parsed rows with a given output type, say, via an -OutputType parameter, as proposed in GitHub issue #8862. If that is of interest to you, show your support for the proposal there.
The following code compares memory use of a normal Import-Csv import (array of [pscustomobject] instances) with the workaround (array of custom-class instances).
The measurement isn't exact, as the PowerShell process's working set is simply queried, which can show the influence of background activities, but it gives a rough sense of how much less memory using a custom class requires.
Sample output, which shows that the custom-class workaround requires only about one fifth of the memory with the sample 10-column, ~166,000-row CSV input file used below - the specific ratio depends on the number of input rows and columns:
MB Used Command
------- -------
384.50 # normal import…
80.48 # import via custom class…
Benchmark code:
# Create a sample CSV file with 10 columns about 16 MB in size.
$tempCsvFile = [IO.Path]::GetTempFileName()
('"Col1","Col2","Col3","Col4","Col5","Col6","Col7","Col8","Col9","Col10"' + "`n") | Set-Content -NoNewline $tempCsvFile
('"Col1Val","Col2Val","Col3Val","Col4Val","Col5Val","Col6Val","Col7Val","Col8Val","Col9Val","Col10Val"' + "`n") * 1.662e5 |
Add-Content $tempCsvFile
try {
  { # normal import
    $all = Import-Csv $tempCsvFile
  },
  { # import via custom class
    "class CsvRow {
      $((Import-Csv $tempCsvFile | Select-Object -first 1).psobject.properties.Name -replace '^', '[string] $$' -join ";")
    }" | Invoke-Expression
    $all = Import-Csv $tempCsvFile | ForEach-Object { [CsvRow] $_ }
  } | ForEach-Object {
    [gc]::Collect(); [gc]::WaitForPendingFinalizers() # garbage-collect first.
    Start-Sleep 2 # Wait a little for the freed memory to be reflected in the process object.
    $before = (Get-Process -Id $PID).WorkingSet64
    # Execute the command.
    & $_
    # Measure memory consumption and output the result.
    [pscustomobject] @{
      'MB Used' = ('{0,4:N2}' -f (((Get-Process -Id $PID).WorkingSet64 - $before) / 1mb)).PadLeft(7)
      Command = $_
    }
  }
} finally {
  Remove-Item $tempCsvFile
}