regex perl powershell replace text-files

PowerShell multiple string replacement efficiency


I'm trying to replace 600 different strings in a very large text file (30 MB+). I'm currently building a script that does this, following this question:

Script:

$string = gc $filePath 
$string = $string | % {
    $_ -replace 'something0','somethingelse0' `
       -replace 'something1','somethingelse1' `
       -replace 'something2','somethingelse2' `
       -replace 'something3','somethingelse3' `
       -replace 'something4','somethingelse4' `
       -replace 'something5','somethingelse5' `
       ...
       (600 More Lines...)
       ...
}
$string | ac "C:\log.txt"

But since this checks each line 600 times, and the text file has well over 150,000 lines, it takes a lot of processing time.

Is there a more efficient way to do this?


Solution

  • So, what you're saying is that you want to replace any of 600 strings in each of 150,000 lines, and you want to run one replace operation per line?

    Yes, there is a way to do it, but not in PowerShell, at least I can't think of one. It can be done in Perl.


    The Method:

    1. Construct a hash where the keys are the somethings and the values are the somethingelses.
    2. Join the keys of the hash with the | symbol, and use it as a match group in the regex.
    3. In the replacement, interpolate an expression that retrieves a value from the hash, using the match variable for the capture group as the key.
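
    As a rough illustration, the three steps might look like this in Perl when applied to an in-memory string instead of a file. The hash contents, the sample sentence, and the helper name replace_all are made up for this sketch, and it assumes the keys contain no regex metacharacters.

```perl
use strict;
use warnings;

# Hypothetical helper: applies every replacement in %$map to $text in one pass.
sub replace_all {
    my ($text, $map) = @_;
    # Step 2: join the keys with | to build a single alternation.
    # (Assumes the keys contain no regex metacharacters.)
    my $pattern = join '|', keys %$map;
    # Step 3: for each match, look up the captured key in the hash.
    $text =~ s/($pattern)/$map->{$1}/g;
    return $text;
}

# Step 1: keys are the search strings, values are their replacements.
my %map = (
    'cat'  => 'dog',
    'red'  => 'blue',
    'left' => 'right',
);

print replace_all("the red cat turned left\n", \%map);
```

    Each line is scanned once against the combined pattern, rather than once per search string.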

    The Problem:

    Frustratingly, PowerShell doesn't expose the match variables outside the regex replace call. It doesn't work with the -replace operator and it doesn't work with [regex]::replace.

    In Perl, you can do this, for example:

    $string =~ s/(1|2|3)/@{[$1 + 5]}/g;
    

    This will add 5 to the digits 1, 2, and 3 throughout the string, so if the string is "1224526123 [2] [6]", it turns into "6774576678 [7] [6]".
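
    To try this out, the substitution can be wrapped in a small script. The sketch below uses two hypothetical helper names (add_five_interpolated and add_five_eval); the second variant uses the /e modifier, which evaluates the replacement as Perl code and should give the same result here.

```perl
use strict;
use warnings;

# Hypothetical helper using the @{[ ... ]} interpolation trick from the text.
sub add_five_interpolated {
    my ($string) = @_;
    $string =~ s/(1|2|3)/@{[$1 + 5]}/g;
    return $string;
}

# Alternative: /e evaluates the replacement side as a Perl expression.
sub add_five_eval {
    my ($string) = @_;
    $string =~ s/(1|2|3)/$1 + 5/ge;
    return $string;
}

print add_five_interpolated("1224526123 [2] [6]"), "\n";
print add_five_eval("1224526123 [2] [6]"), "\n";
```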

    However, in PowerShell, both of these fail:

    $string -replace '(1|2|3)',"$($1 + 5)"
    
    [regex]::replace($string,'(1|2|3)',"$($1 + 5)")
    

    In both cases, $1 evaluates to null, and the expression evaluates to plain old 5. The match variables in replacements are only meaningful in the resulting string, i.e. a single-quoted string or whatever the double-quoted string evaluates to. They're basically just backreferences that look like match variables. Sure, you can quote the $ before the number in a double-quoted string, so it will evaluate to the corresponding match group, but that defeats the purpose - it can't participate in an expression.


    The Solution:

    [This answer has been modified from the original. It has been formatted to fit match strings with regex metacharacters. And your TV screen, of course.]

    If using another language is acceptable to you, the following Perl script works like a charm:

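    use strict;
    use warnings;

    my $filePath = $ARGV[0]; # Or hard-code it or whatever
    open my $input, '<', $filePath or die "Cannot open input file: $!";
    open my $output, '>', 'C:\log.txt' or die "Cannot open output file: $!";
    my %replacements = (
      'something0' => 'somethingelse0',
      'something1' => 'somethingelse1',
      'something2' => 'somethingelse2',
      'something3' => 'somethingelse3',
      'something4' => 'somethingelse4',
      'something5' => 'somethingelse5',
      'X:\Group_14\DACU' => '\\DACU$',
      '.*[^xyz]' => 'oO{xyz}',
      'moresomethings' => 'moresomethingelses'
    );
    my @strings;
    foreach (keys %replacements) {
      # \Q...\E escapes regex metacharacters so each key matches literally
      push @strings, qr/\Q$_\E/;
      # Double every backslash in the replacement value
      $replacements{$_} =~ s/\\/\\\\/g;
    }
    # One alternation that matches any of the keys
    my $pattern = join '|', @strings;
    while (<$input>) {
      # $1 holds the key that matched; substitute its value from the hash
      s/($pattern)/$replacements{$1}/g;
      print $output $_;
    }
    close $input;
    close $output;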
    

    It searches for the keys of the hash (left of the =>), and replaces them with the corresponding values. Here's what's happening:
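
    The individual steps can be observed on a single in-memory line. The following is only a sketch: the sample line and the subroutine name apply_replacements are hypothetical, and the loop body is copied from the script so that the effect of \Q...\E and of the backslash doubling becomes visible.

```perl
use strict;
use warnings;

# Hypothetical helper mirroring the loop in the script above.
sub apply_replacements {
    my ($line, %replacements) = @_;
    my @strings;
    foreach (keys %replacements) {
        # \Q...\E escapes regex metacharacters, so '.*[^xyz]' matches
        # only that literal text instead of acting as a pattern.
        push @strings, qr/\Q$_\E/;
        # Every backslash in the replacement value is doubled.
        $replacements{$_} =~ s/\\/\\\\/g;
    }
    # The escaped keys become one alternation: key1|key2|...
    my $pattern = join '|', @strings;
    # $1 is the key that matched; its value is substituted in.
    $line =~ s/($pattern)/$replacements{$1}/g;
    return $line;
}

my $result = apply_replacements(
    'copy X:\Group_14\DACU then .*[^xyz] then something0',
    'X:\Group_14\DACU' => '\\DACU$',
    '.*[^xyz]'         => 'oO{xyz}',
    'something0'       => 'somethingelse0',
);
print "$result\n";
```

    In a single-quoted Perl string '\\DACU$' holds one leading backslash, so after the doubling step the text written out starts with two. Note also that keys returns the hash keys in no particular order, and an alternation takes the first alternative that matches; if one search string is a prefix of another, sorting the keys longest-first before joining them may be worth considering.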


    BTW, you might have noticed several other modifications from the original script. My Perl has collected some dust during my recent PowerShell kick, and on a second look I noticed several things that could be done better.