How can I quickly perform a large number of recursive word substitutions in a large file using bash?

This question is based on "How do I fix sed commands becoming extremely slow when load is high?". On the advice of @markp-fuso, @jhnc, and @Jetchisel, I am asking it separately to avoid a chameleon question, as many of the answers there used hashing and maps for optimization.

I have the following bash script:

INPUT=`cat $INPUT_FILE`
while read line; do
   PROP_NAME="$(echo $line | cut -f1 -d'=')"
   PROP_VALUE="$(echo $line | cut -f2- -d'=' | sed 's/\$/\\\$/g' | sed 's/\&/\\\&/g')"
   INPUT="$(echo "$INPUT" | sed "s\`${PROP_NAME}\b\`${PROP_VALUE}\`g")"
done < "$PROPERTIES_FILE"
echo "$INPUT"

This script takes a properties file with a format that supports recursion and special characters:

$foo=$barname bar
$barname=Tom&Jerry
$hello=world

It then uses those properties to substitute into a text file with no set format. So

I went to the $foo and said hello to the $hello and they fined me $5.

becomes

I went to the Tom&Jerry bar and said hello to the world and they fined me $5.

The properties file and the text file are hundreds of lines long, so performance is important: the naive implementation takes many minutes, or even over an hour, depending on system load. Also note the use of \b in the sed command, which means that every reference is terminated by a punctuation mark or whitespace.

The script cannot recurse infinitely because it makes only one pass through the properties file. This also means that the order of the properties matters when recursion is used.

Solution

From the comments on the question, it seems you could preprocess the property file to expand property keys that appear in values.

As noted in my answer to the previous question, the original code takes O(m·n) time: each of m properties is searched for in text of size n. Preprocessing the properties, and making use of Perl's ability to match an alternation of literal strings in time that does not depend on the number of alternatives, can drop this to O(m+n), that is, one pass over the properties and one pass over the text:

perl -e '
    # load properties from first file
    while ( ($k,$v) = split "=",<<>>,2 ) {
        chomp $v;
        $k2v{$k} = $v;         # hash for value lookup
        unshift @propkeys, $k; # array of keys in reverse file order (last-listed first)
        last if eof;
    }

    # build single regex from all keys
    # \Q escapes regex metacharacters
    $re = join "|", map qr/\Q$_\E/, @propkeys;

    # walk properties (in reverse), expanding values as we go
    # %seen ensures a key is only substituted if it is defined later in the file
    for $k (@propkeys) {
        $k2v{$k} =~ s/($re)\b/ $seen{$1} ? $k2v{$1} : $1 /ge;
        $seen{$k} = 1;
    }

    # load input from second file
    undef $/;
    $_ = <<>>;

    # convert all properties simultaneously
    s/($re)\b/ $k2v{$1} /ge;

    # output the result
    print;

' propfile textfile

The preprocessing method I use produces property values that match the behaviour of the code in the question, where the order in which properties are listed affects the result:

$key1=foo_$key2
$key2=bar

(expands $key1 in textfile to foo_bar)

$key2=bar
$key1=foo_$key2

(expands $key1 in textfile to foo_$key2)

To expand property values until they no longer contain any keys, the code can become:


perl -e '
    while ( ($k,$v) = split "=",<<>>,2 ) {
        chomp $v;
        $k2v{$k} = $v;
        unshift @propkeys, $k;
        last if eof;
    }
    $re = join "|", map qr/\Q$_\E/, @propkeys;

    # loop trying to expand value of each property key in list
    # remove key from list once its value contains no key references
    while (@propkeys = grep $k2v{$_} =~ s/($re)\b/ $k2v{$1} /ge, @propkeys) {
        die "recursion depth exceeded (@propkeys)\n" if ++$ct > 10;
    }

    undef $/;
    $_ = <<>>;
    s/($re)\b/ $k2v{$1} /ge;
    print;

' propfile textfile

Depending on details not provided in the question, it may be possible and useful to cache or memoise the expanded properties for later reuse without having to recompute the values each time.