powershell, performance

PowerShell version of cut -d is very slow on large files, am I missing the fast way to do it?


I have a very large (>100k lines) file that I want to split on ':'.
I then want to discard the first item and keep all the rest; for example, foo:bar:baz becomes bar:baz.
If I do cut -d ':' -f2- myfile.txt > newfile.txt it finishes in a matter of milliseconds.
I have tried a few methods in PowerShell, but I have yet to see one finish. After a couple of minutes I abort, because this script cannot afford to wait that long. Surely there is a better/faster way to do this, but I can't seem to find it.

The most promising method I found so far looks like this:

$reader = [System.IO.File]::OpenText("myfile.txt")
try {
    for() {
        $line = $reader.ReadLine()
        if ($line -eq $null) { break }
        $split = $line.Split(":")
        $join = $split[1..($split.Length-1)] -join ":"
        Add-Content -Path "newfile.txt" -Value "$join"
    }
}
finally {
    $reader.Close()
}

Please help/advise.


Solution

  • In both examples in this answer you can use regex instead of splitting; it is more efficient that way. The pattern `^.+?:` lazily matches everything up to and including the first colon, so replacing the match with an empty string drops the first field. The first example processes the whole file in memory, the second streams it line by line (see the timing sketch at the end of this answer if you want to compare them on your file). For the regex details you can check: https://regex101.com/r/iGfHWp/1.

    (Get-Content myfile.txt -Raw) -replace '(?m)^.+?:' |
        Set-Content newfile.txt
    
    try {
        # use an absolute path here; .NET resolves relative paths against the
        # process working directory, which can differ from PowerShell's current
        # location, i.e. `newfile.txt` should be `X:\path\to\newfile.txt`
        $writer = [System.IO.StreamWriter] 'newfile.txt'
        # compile the regex once up front and reuse it for every line
        $re = [regex]::new(
            '^.+?:', [System.Text.RegularExpressions.RegexOptions]::Compiled)
    
        # stream the input line by line instead of loading the whole file
        foreach ($line in [System.IO.File]::ReadLines('myfile.txt')) {
            $writer.WriteLine($re.Replace($line, ''))
        }
    }
    finally {
        if ($writer) {
            $writer.Dispose()
        }
    }
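
  • To compare the approaches on your own file, a rough timing sketch with Measure-Command could look like this (assuming the same myfile.txt / newfile.txt names as above; Measure-Command only reports the elapsed time, the output file is still written):

    # rough timing of the whole-file -replace variant; wrap any of the
    # snippets above in the same block to compare them
    (Measure-Command {
        (Get-Content myfile.txt -Raw) -replace '(?m)^.+?:' |
            Set-Content newfile.txt
    }).TotalMilliseconds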