powershell, text-files, large-files, pajek

How to split a very large text file (4 GB) at a pre-defined string in PowerShell, and do it fast


I have a large text file World.net (which is a Pajek file, but treat it as plain text) with the following content:

*Vertices 999999
    1 ""                                       0.2931    0.2107    0.5000 empty
    2 ""                                       0.2975    0.2214    0.5000
    3 ""                                       0.3083    0.2258    0.5000
    4 ""                                       0.3127    0.2406    0.5000
    5 ""                                       0.3083    0.2514    0.5000
    6 ""                                       0.3147    0.2578    0.5000
...
    999999 ""                                       0.3103    0.2622    0.5000
*Edges :2 "World contours"
    1     2 1 
    2     3 1 
    3     4 1 
    4     5 1 
    5     6 1 
    6     7 1 
...
    983725     8 1 

I would like to split it into different .txt files, at the lines that start with

*[Something]

The [Something] should go into the file name like World_Vertices.txt and World_Edges.txt.

Each output file should contain the lines (1, 2, 3, ...) that follow its category header (Vertices, Edges) in the original file, without the header line itself (the line starting with *).
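For example, from the sample above, World_Edges.txt should then start with:

    1     2 1 
    2     3 1 
    3     4 1 

and World_Vertices.txt should contain the corresponding vertex lines.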

I have code that (kind of) works:

$filename = "World"
echo $pwd\"$filename.net"
$file = New-Object System.IO.StreamReader -Arg "$pwd\$filename.net"
while (($line = $file.ReadLine()) -ne $null) {
    If ($line -match "^\*\w+") {
        $newfile = -join("$filename ","$($line.Split('\*')[1]).txt")
        echo $newfile
    }
    Else {
        $line | Out-File -Append $newfile
    }
}

But this code is very slow: it takes about 20 minutes on a 10 MB file, and I would like to be able to process a 4 GB file.

Hardware notes: the machine is decent: an i7 with a hybrid disk, 16 GB RAM, and I can install whichever .NET Framework version is needed to do the job.


Solution

  • In general, calling .NET classes directly from PowerShell is usually the fastest option when performance matters, so using a StreamReader is already a good approach.

    I changed your code to write the output files with a StreamWriter. Out-File -Append opens and closes the output file for every single line, while a StreamWriter keeps the file handle open and buffers the writes:

    $filename = "World"
    echo "$pwd\$filename.net"
    $file = New-Object System.IO.StreamReader -Arg "$pwd\$filename.net"
    $writer = $null
    while (($line = $file.ReadLine()) -ne $null) {
        If ($line -match "^\*\w+") {
            $newfile = -join("$filename ","$($line.Split('\*')[1]).txt")
            echo $newfile
            if ($null -ne $writer) {
                $writer.Dispose()
            }
            $writer = New-Object System.IO.StreamWriter "$pwd\$newfile"
        }
        Else {
            $writer.WriteLine($line)
        }
    }
    

    Try it.

    There are other ways to improve performance further. For instance, you can skip the comparatively expensive regex match and use a simple prefix check instead:

    if ($line.StartsWith("*"))
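
    Putting it all together, here is a minimal sketch that combines the StreamWriter output with the StartsWith check and passes explicit 64 KB buffers to the reader and writer. The buffer size, the ASCII encoding, and the World_Vertices.txt / World_Edges.txt naming (taking only the first word after the *) are assumptions on my part, so adjust them to your data:

    $filename = "World"
    $inPath = Join-Path $pwd "$filename.net"

    # Explicit 64 KB buffers reduce the number of disk round-trips;
    # the exact size is a guess worth benchmarking on your own hardware.
    $reader = New-Object System.IO.StreamReader($inPath, [System.Text.Encoding]::ASCII, $true, 65536)
    $writer = $null
    try {
        while (($line = $reader.ReadLine()) -ne $null) {
            if ($line.StartsWith("*")) {
                # Use the word after '*' (e.g. "Vertices", "Edges") as the file suffix
                $section = $line.Substring(1).Split(' ')[0]
                $newfile = Join-Path $pwd "${filename}_${section}.txt"
                echo $newfile
                if ($null -ne $writer) { $writer.Dispose() }
                $writer = New-Object System.IO.StreamWriter($newfile, $false, [System.Text.Encoding]::ASCII, 65536)
            }
            elseif ($null -ne $writer) {
                $writer.WriteLine($line)
            }
        }
    }
    finally {
        # Always flush and close the open file handles, even if something throws
        if ($null -ne $writer) { $writer.Dispose() }
        $reader.Dispose()
    }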