I am trying to extract a representative sample of a 200 MB CSV file by writing the header and every 500th row to a new file for testers to use. My first attempt was knowingly sub-optimal, but seemed valid for a five-minute quick hack: I relied on `Out-File -Append` to add each row matching the modulus condition to a destination file on a network share. What I found is that the sample file had slightly fewer rows than expected, and repeated runs produced slightly different counts (expected 2014; actual ranged between 1992 and 2011).
I re-wrote the script to gather the results of the foreach into a variable and write it out once at the end. That worked as expected (2014 lines), but I'm curious as to the cause of the failure. I know the first version repeatedly opens and closes the destination file, but I'd have expected it to report an error.
This is the original version of the script:
$destfile = "\\UNCSHARE\Folder\Export_Sample_$(Get-Date -Format 'yyyyMMdd_HHmmss').txt"
$Original = Get-Content "\\UNCSHARE\Folder\200MB_Export_20231208_1545.txt"
[int64] $ln = 0
[int64] $SampleCount = 0
foreach ($line in $Original) {
    $ln++
    if ($ln -eq 1 -or $ln % 500 -eq 0) {
        $line | Out-File -FilePath $destfile -Append -ErrorAction Stop
        $SampleCount++
    }
}
Write-Host $SampleCount
(Get-Content $destfile).Count
The problem does NOT occur if I use a location on my local hard drive for the destination file.
I checked the 2nd (correct output) version's output against the first, and can see that the missing lines are irregularly spaced throughout the file (e.g. missing lines at 56, 359, 368, 405, 600, 700, 702, 788, 854...).
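For anyone wanting to reproduce the comparison, `Compare-Object` does the job; this is only a sketch, and the file names below are placeholders for the two runs' outputs:

```powershell
# Placeholder paths for the append-version and single-write-version outputs
$append = Get-Content '\\UNCSHARE\Folder\Export_Sample_append.txt'
$single = Get-Content '\\UNCSHARE\Folder\Export_Sample_single.txt'

# Lines present in the correct (single-write) file but absent from the append file
$missing = Compare-Object -ReferenceObject $single -DifferenceObject $append |
    Where-Object SideIndicator -eq '<='

# 1-based positions of those missing lines within the single-write sample
for ($i = 0; $i -lt $single.Count; $i++) {
    if ($missing.InputObject -contains $single[$i]) { $i + 1 }
}
```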
I'm running this in PowerShell Core 7.4.2 on a Windows 10 workstation joined to an AD domain.
Edit: I've tried replacing the cmdlet with a native .NET API call:
#$line | Out-File -FilePath $destfile -Append -ErrorAction Stop -Encoding utf8
[System.IO.File]::AppendAllText($destfile, "$line`n")
but I still get a variable number of missing lines and no errors reported.
Edit2: Switched to Windows PowerShell 5.1.19041.3803 and now I do get an error with the native API call (but not with Out-File):
Exception calling "AppendAllText" with "2" argument(s): "The process cannot access the file
'\\UNCSHARE\Export_sample20240424_164810.txt' because it is being used
by another process."
On my system `Get-Command Out-File` returns:

| Version | PS Version |
| --- | --- |
| 3.1.0.0 | 5.1 |
| 7.0.0.0 | 7.4.2 |
I've tested again in fresh shell sessions and the results remain consistent: Out-File doesn't report errors, whilst [System.IO.File]::AppendAllText does, but only in Windows PowerShell.
Edit 3: The replacement code block to avoid the issue (as suggested by @Santiago Squarzon) looks like this:
# Collect the line data in a variable
$Sample = foreach ($line in $Original) {
    $ln++
    if ($ln -eq 1 -or $ln % 500 -eq 0) {
        $line
        $SampleCount++
    }
}
# Single write to file
$Sample | Out-File -FilePath $destfile
It is hard to determine the cause of this issue. I do agree that both the cmdlet `Out-File` and the .NET API `File.AppendAllText` should report a write error, or a failure to open or close the stream, on your consecutive append operations. As we discovered later on, the .NET API in PowerShell 5.1 (.NET Framework) does report a write error due to a handle still being held on the file (a likely cause being that a previous append had not yet released its file stream), however the .NET 8 API (PowerShell 7.4.2), as well as the cmdlet in both versions, fails to report the problem. My advice in this case would be to open an issue in the .NET repo: https://github.com/dotnet/runtime/issues and / or the PowerShell repo: https://github.com/PowerShell/PowerShell/issues to seek an answer.

As for the solution: instead of appending to the file, which opens and closes the file stream over and over, it is far better to perform the write operation only once. Also, for a file as big as yours, I'd recommend `File.ReadLines` instead of `Get-Content` for efficiency: it enumerates the file lazily rather than reading it all up front.
You could also avoid storing the new content in memory (assigning it to a variable) by wrapping the loop in a scriptblock: invoking the scriptblock streams its output, which you can pipe straight to `Set-Content` or `Out-File` as you want:
$destfile = "\\UNCSHARE\Folder\Export_Sample_$(Get-Date -Format 'yyyyMMdd_HHmmss').txt"
& {
    try {
        $sourcefile = '\\UNCSHARE\Folder\200MB_Export_20231208_1545.txt'
        $reader = [System.IO.File]::ReadLines($sourcefile)
        $null = $reader.MoveNext()
        # output the first line, we can avoid the `-or` condition here
        $reader.Current
        [int64] $ln = 1
        [int64] $SampleCount = 0
        foreach ($line in $reader) {
            if ($ln++ % 500 -eq 0) {
                $line
                $SampleCount++
            }
        }
        Write-Host $SampleCount
    }
    finally {
        # null conditional is pwsh 7 only
        # use `if ($reader) { $reader.Dispose() }`
        # for pwsh 5.1
        ${reader}?.Dispose()
    }
} | Out-File $destfile -ErrorAction Stop
From feedback in the comments it seems that `$reader.MoveNext()` followed by `$reader.Current` is failing to get the first line of the file, so here is another alternative using `StreamReader` instead. Performance with this method should be just as good as with `File.ReadLines`, and hopefully more reliable.
$destfile = "\\UNCSHARE\Folder\Export_Sample_$(Get-Date -Format 'yyyyMMdd_HHmmss').txt"
& {
    try {
        $sourcefile = '\\UNCSHARE\Folder\200MB_Export_20231208_1545.txt'
        $reader = [System.IO.StreamReader]::new($sourcefile)
        # output the first line, we can avoid the `-or` condition here
        $reader.ReadLine()
        [int64] $ln = 1
        [int64] $SampleCount = 0
        while (-not $reader.EndOfStream) {
            $line = $reader.ReadLine()
            if ($ln++ % 500 -eq 0) {
                $line
                $SampleCount++
            }
        }
        Write-Host $SampleCount
    }
    finally {
        # null conditional is pwsh 7 only
        # use `if ($reader) { $reader.Dispose() }`
        # for pwsh 5.1
        ${reader}?.Dispose()
    }
} | Out-File $destfile -ErrorAction Stop