Tags: powershell, file, delimited, group-object

PowerShell performance tuning for an aggregation operation on big delimited files


I have a delimited file with 350 columns. The delimiter is \034 (the ASCII field-separator character). I have to extract a particular column and count the occurrences of each distinct value in that column. If the count of a distinct value is greater than or equal to 2, I need to output it to a file. The source file is 1 GB. I have written the following command, but it is very slow.

    Get-Content E:\Test\test.txt | Foreach {($_ -split '\034')[117]} | Group-Object -Property { $_ } | %{ if($_.Count -ge 2) { Select-Object -InputObject $_ -Property Name,Count} } | Export-csv -Path "E:\Test\test2.csv" -NoTypeInformation

Please help!


Solution

  • I suggest using a switch statement to process the input file quickly (by PowerShell standards):

    # Get an array of all the column values of interest.
    $allColValues = switch -File E:\Test\test.txt {
      default {  # each input line
        # For better performance with *literal* separators, 
        # use the .Split() *method*.
        # Generally, however, use of the *regex*-based -split *operator* is preferable.
        $_.Split([char] 0x1c)[117] # hex 0x1c is octal 034
      }
    }
    
    # Group the column values, and only output those that occur at least 
    # twice.
    $allColValues | Group-Object -NoElement | Where-Object Count -ge 2 |
      Select-Object Name, Count | Export-Csv E:\Test\test2.csv -NoTypeInformation
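
    To get a feel for the performance difference between the literal .Split() method and the regex-based -split operator noted in the comments above, here is a minimal micro-benchmark sketch (the synthetic sample line and the iteration count are made up for illustration; absolute timings vary by machine):

    # Micro-benchmark sketch: literal .Split() method vs. regex-based -split operator.
    # The sample line is synthetic: 350 columns joined by the FS character.
    $line = 1..350 -join [char] 0x1c

    (Measure-Command {
      foreach ($i in 1..100000) { $null = $line.Split([char] 0x1c)[117] }
    }).TotalSeconds  # literal .Split() method

    (Measure-Command {
      foreach ($i in 1..100000) { $null = ($line -split '\034')[117] }
    }).TotalSeconds  # regex-based -split operator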
    

    Tip of the hat to Mathias R. Jessen for suggesting the -NoElement switch, which streamlines the Group-Object call by maintaining only abstract group information; that is, the output objects reflect only the grouping criteria (in .Name) and the count (in .Count), not also the individual objects that make up each group (as normally reflected in .Group).
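
    To see what -NoElement does in isolation, here is a quick sketch with made-up sample values:

    # With -NoElement, only .Name and .Count are populated; .Group stays empty.
    'a','b','a','c','a' | Group-Object -NoElement

    # Count Name
    # ----- ----
    #     3 a
    #     1 b
    #     1 c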


    As for what you tried: