I have a delimited file with 350 columns. The delimiter is \034(Field separator)
.
I have to extract a particular column value and find out the count of each distinct value of that column in the file. If the count of distinct value is greater or equal to 2, I need to output it to a file.
The source file is 1GB. I have written the following command. It is very slow.
Get-Content E:\Test\test.txt | Foreach {($_ -split '\034')[117]} | Group-Object -Property { $_ } | %{ if($_.Count -ge 2) { Select-Object -InputObject $_ -Property Name,Count} } | Export-csv -Path "E:\Test\test2.csv" -NoTypeInformation
Please help!
I suggest using a switch
statement to process the input file quickly (by PowerShell standards):
# Get an array of all the column values of interest.
$allColValues = switch -File E:\Test\test.txt {
default { # each input line
# For better performance with *literal* separators,
# use the .Split() *method*.
# Generally, however, use of the *regex*-based -split *operator* is preferable.
$_.Split([char] 0x1c)[117] # hex 0x1c is octal 034
}
}
# Group the column values, and only output those that occur at least
# twice.
$allColValues | Group-Object -NoElement | Where-Object Count -ge 2 |
Select-Object Name, Count | Export-Csv E:\Test\test2.csv -NoTypeInformation
Tip of the hat to Mathias R. Jessen for suggesting the -NoElement
switch, which streamlines the Group-Object
call by only maintaining abstract group information; that is, only the grouping criteria (as reflected in .Name
, not also the individual objects that make up the group (as normally reflected in .Group
) are returned via the output objects.
As for what you tried:
Get-Content
with line-by-line streaming in the pipeline is slow, both generally (the object-by-object passing introduces overhead) and, specifically, because Get-Content
decorates each line it outputs with ETS (Extended Type System) metadata.
-Raw
switch reads the entire file as a single, multi-line string, which is much faster.Passing -Property { $_ }
to Group-Object
isn't necessary - just omit it. Without a -Property
argument, the input objects are grouped as a whole.
Chaining Where-Object
and Select-Object
- rather than filtering via an if
statement in a ForEach-Object
call combined with multiple Select-Object
calls - is not only conceptually clearer, but performs better.