linux bash sed

How do I fix sed commands becoming extremely slow when load is high?


I have a bash script that reads a simple properties file and substitutes the values into another file. (The properties file is just lines of the form 'foo=bar'.)

INPUT=`cat $INPUT_FILE`
while read line; do
   PROP_NAME=`echo $line | cut -f1 -d'='`
   PROP_VALUE=`echo $line | cut -f2- -d'=' | sed 's/\$/\\\$/g'`
   time INPUT="$(echo "$INPUT" | sed "s\`${PROP_NAME}\b\`${PROP_VALUE}\`g")"
done <<<$(cat "$PROPERTIES_FILE")
# Do more stuff with INPUT

However, when my machine is under high load (upper forties), each sed call takes much longer:

real  0m0.169s
user  0m0.001s
sys  0m0.006s

Low load:

real  0m0.011s
user  0m0.002s
sys  0m0.004s

Normally, losing 0.1 seconds isn't a huge deal, but both the properties file and the input files are hundreds or thousands of lines long, and those 0.1 seconds add up to over an hour of wasted time.

What can I do to fix this? Do I just need more CPUs?

Sample properties (each line starts with a special character, which marks the places in the input that reference a property)

$foo=bar
$hello=world
^hello=goodbye

Sample input

This is a story about $hello. It starts at a $foo and ends in a park.

Bob said to Sally "^hello, see you soon"

Expected result

This is a story about world. It starts at a bar and ends in a park.

Bob said to Sally "goodbye, see you soon"

Solution

  • This will produce the output you show from the input you show, using any awk:

    $ cat tst.sh
    #!/usr/bin/env bash
    
    awk '
        NR == FNR {
            pos = index($0, "=")
            tag = substr($0, 1, pos - 1)
            val = substr($0, pos + 1)
    
            # Make any regexp metachars in the tag literal 
            gsub(/[^^\\[:alnum:]]/, "[&]", tag)
            gsub(/\\/, "&&", tag)
            gsub(/\^/, "\\\\&", tag)
    
            tags2vals[tag] = val
            next
        }
        {
            for ( tag in tags2vals ) {
                if ( match($0, tag) ) {
                    val = tags2vals[tag]
                    $0 = substr($0, 1, RSTART-1) val substr($0, RSTART+RLENGTH)
                }
            }
            print
        }
    ' props input
    
    $ ./tst.sh
    This is a story about world. It starts at a bar and ends in a park.
    
    Bob said to Sally "goodbye, see you soon"
    

    That was run against the sample input you provided:

    $ head props input
    ==> props <==
    $foo=bar
    $hello=world
    ^hello=goodbye
    
    ==> input <==
    This is a story about $hello. It starts at a $foo and ends in a park.
    
    Bob said to Sally "^hello, see you soon"
    

    However, if your real input can contain recursive property definitions (e.g. $foo=$hello), or tags that appear inside longer words that you do not want to match (e.g. "this is $foobar here"), then you'd need to enhance it to handle those cases however you want them handled.
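As a rough illustration of one such enhancement, the sketch below treats tags as literal strings (index() and substr() instead of regexps), replaces every occurrence on a line rather than only the first, and skips a tag that is immediately followed by a letter, digit, or underscore, which approximates the \b in the question's sed command. The function name subst and the choice of "word" characters are assumptions for illustration, not part of the original answer, and recursive definitions are still not handled in any defined order.

```shell
#!/usr/bin/env bash
# Hypothetical helper: subst PROPS_FILE INPUT_FILE
# Literal, whole-word substitution of every tag occurrence.
# Assumes PROPS_FILE is non-empty (the NR == FNR idiom relies on that).
subst() {
    awk '
        NR == FNR {
            # Split each property line at the first "=" only
            pos = index($0, "=")
            if ( pos > 1 )
                tags2vals[substr($0, 1, pos - 1)] = substr($0, pos + 1)
            next
        }
        {
            for ( tag in tags2vals ) {
                head = ""      # text already processed for this tag
                tail = $0      # text still to be searched
                while ( (i = index(tail, tag)) > 0 ) {
                    after = substr(tail, i + length(tag), 1)
                    if ( after ~ /[A-Za-z0-9_]/ )
                        # Tag is a prefix of a longer word: keep it unchanged
                        head = head substr(tail, 1, i + length(tag) - 1)
                    else
                        head = head substr(tail, 1, i - 1) tags2vals[tag]
                    tail = substr(tail, i + length(tag))
                }
                $0 = head tail
            }
            print
        }
    ' "$1" "$2"
}
```

It would be called as `subst props input`. Because the replacement is done by string concatenation, characters such as & or \ in a value need no special treatment.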

    See "Is it possible to escape regex metacharacters reliably with sed" (it's a sed question, but the issue of escaping regexp metachars applies to awk too) for what the gsub()s are doing in the script.
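To make the effect of those three gsub() calls easier to see, here is a small sketch that applies them on their own and prints the resulting regexp for a few literal tags. The wrapper function escape_tag is a hypothetical name introduced only for this demonstration; the gsub() calls are copied from the script above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read literal tags on stdin, print the regexp that
# the three gsub() calls from the answer build for each one.
escape_tag() {
    awk '{
        tag = $0
        gsub(/[^^\\[:alnum:]]/, "[&]", tag)   # wrap every char except ^, \ and alphanumerics in [...]
        gsub(/\\/, "&&", tag)                  # double each backslash
        gsub(/\^/, "\\\\&", tag)               # put a backslash before each ^
        print tag
    }'
}

printf '%s\n' '$foo' '^hello' 'a.b' | escape_tag
# [$]foo
# \^hello
# a[.]b
```

The ^ is handled separately because inside a bracket expression a leading ^ means negation, so [^] would not be a literal match; escaping it with a backslash avoids that.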