powershell

Extract email:password


I'm curious whether there's a way to extract email:password pairs from a big list. Each line contains the pair in that format, but with a few other unusable parts in front of it (such as first name, last name).

The format is mostly:

xx:Mxx:Support:xx:support@xx.com:x19000

But sometimes can be like this as well:

xxxx::gexrge@xxnt.com:111111

I have tried with EmEditor and if I search for

(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\]).*$

it does find it. I then have to replace with \1; however, this takes ages and eventually crashes (the file is 17 GB).

Knowing that PowerShell can do this too, I'm looking for the right command.


Solution

  • The switch statement allows combining efficient line-by-line processing of files (via the -File parameter), optionally combined with regex-matching (via the -Regex option):

    & { 
      switch -regex -file in.txt { 
       '(?<=:)[^@:]+@[^:]+:.*' { $Matches[0] } 
      }
    } | Set-Content -Encoding utf8 out.txt
    

    Adjust the -Encoding argument as needed; note that in Windows PowerShell utf8 creates a file with a BOM, whereas PowerShell [Core] v6+ creates one without a BOM. By default, Set-Content uses the system's active ANSI code page in Windows PowerShell, whereas PowerShell [Core] v6+ consistently defaults to BOM-less UTF-8, across all cmdlets.

    The above extracts the email-password pairs from file in.txt and writes them as individual lines to file out.txt.
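
    To see what the regex captures before running it against the full file, it can be tried on the two sample lines from the question with the -match operator. This is only a quick sketch; the variable names $samples, $regex and $results are illustrative and not part of the original command:

```powershell
# The two sample lines quoted in the question.
$samples = @(
  'xx:Mxx:Support:xx:support@xx.com:x19000'
  'xxxx::gexrge@xxnt.com:111111'
)

# Same regex as in the switch statement:
#  (?<=:)   the match must start right after a ":"
#  [^@:]+@  the local part: no "@" or ":" allowed, up to the "@"
#  [^:]+:   the domain, up to the next ":"
#  .*       the rest of the line (the password)
$regex = '(?<=:)[^@:]+@[^:]+:.*'

$results = foreach ($line in $samples) {
  # -match populates the automatic $Matches variable on success.
  if ($line -match $regex) { $Matches[0] }
}

$results
# support@xx.com:x19000
# gexrge@xxnt.com:111111
```

    Note that, because of the (?<=:) look-behind, a line that starts directly with the email address (with no ":" in front of it) would not match.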

    Note: Even though the above performs line-by-line processing, an out-of-memory exception can apparently still occur in Set-Content with very large input files; the .NET-based solution in the next section should fix that, while also significantly speeding up the operation.


    Performance caveat: While the above is memory-efficient, it will be slow with large files; to address that, you must make direct use of the .NET framework, via a System.IO.StreamWriter instance:

    # Create the output file.
    # Note:
    #  * Be sure to use a *full* path, because .NET's current dir. usually differs
    #    from PowerShell's
    #  * UTF-8 *without a BOM* is used as the character encoding by default,
    #    but you may pass a [System.Text.Encoding] instance as needed.
    $sw = [System.IO.StreamWriter]::new("$PWD/out.txt")
    
    switch -regex -file in.txt { 
       '(?<=:)[^@:]+@[^:]+:.*' { $sw.WriteLine($Matches[0]) } 
    }
    
    $sw.Close()
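
    A slightly more defensive variant of the same idea is sketched below, assuming PowerShell v5+ (for the ::new() syntax). It wraps the loop in try / finally so that the writer is closed even if the operation is interrupted, and it passes the encoding explicitly. The paths and the tiny sample input file are hypothetical and exist only to make the sketch self-contained; substitute the full paths of your real files:

```powershell
# Hypothetical paths for illustration: a small sample file in the temp directory.
$dir     = [System.IO.Path]::GetTempPath()
$inPath  = Join-Path $dir 'in.txt'
$outPath = Join-Path $dir 'out.txt'
Set-Content -LiteralPath $inPath -Value @(
  'xx:Mxx:Support:xx:support@xx.com:x19000'
  'xxxx::gexrge@xxnt.com:111111'
)

# Explicit encoding: UTF-8 without a BOM ($false = do not emit a BOM).
$enc = [System.Text.UTF8Encoding]::new($false)

# Constructor arguments: full path, append ($false = overwrite), encoding.
$sw = [System.IO.StreamWriter]::new($outPath, $false, $enc)
try {
  switch -Regex -File $inPath {
    '(?<=:)[^@:]+@[^:]+:.*' { $sw.WriteLine($Matches[0]) }
  }
}
finally {
  # Runs even if the switch statement fails or is interrupted with Ctrl+C,
  # so buffered output is flushed and the file handle is released.
  $sw.Close()
}
```

    To use a different output encoding, pass another [System.Text.Encoding] instance as the third constructor argument, e.g. [System.Text.Encoding]::Unicode for UTF-16LE.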