regex sed

Is it possible to escape regex metacharacters reliably with sed?


I'm wondering whether it is possible to write a 100% reliable sed command to escape any regex metacharacters in an input string so that it can be used in a subsequent sed command. Like this:

#!/bin/bash
# Trying to replace one regex by another in an input file with sed

search="/abc\n\t[a-z]\+\([^ ]\)\{2,3\}\3"
replace="/xyz\n\t[0-9]\+\([^ ]\)\{2,3\}\3"

# Sanitize input
search=$(sed 'script to escape' <<< "$search")
replace=$(sed 'script to escape' <<< "$replace")

# Use it in a sed command
sed "s/$search/$replace/" input

I know that there are better tools to work with fixed strings instead of patterns, for example awk, perl or python. I would just like to prove whether it is possible or not with sed. I would say let's concentrate on basic POSIX regexes to have even more fun! :)

I have tried a lot of things, but every time I could find an input that broke my attempt. I kept the example abstract ('script to escape') so that it would not lead anybody in the wrong direction.

Btw, the discussion came up here. I thought this could be a good place to collect solutions and possibly break and/or elaborate on them.


Solution

  • Note:


    SINGLE-line Solutions


    Escaping a string literal for use as a regex in sed:

    To give credit where credit is due: I found the regex used below in this answer.

    Assuming that the search string is a single-line string:

    search='abc\n\t[a-z]\+\([^ ]\)\{2,3\}\3'  # sample input containing metachars.
    
    searchEscaped=$(sed 's/[^^]/[&]/g; s/\^/\\^/g' <<<"$search") # escape it.
    
    sed -n "s/$searchEscaped/foo/p" <<<"$search" # Echoes 'foo'
    

    The approach is robust, but not efficient.

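    The robustness comes from not trying to anticipate every special regex character (the set varies across regex dialects) but from relying on only 2 features shared by all regex dialects, both visible in the command above: a character enclosed in a bracket expression, such as [.], is matched literally, and a literal ^ can be written as \^. The ^ needs the separate treatment because [^] is not a valid bracket expression.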
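
    As a small sketch of what the escaping command above produces, the following runs it on a short hypothetical literal (a.b^c, chosen only for illustration) and then uses the result as a sed address. It uses printf and a pipe instead of a here-string so that it also runs under plain sh; the sed script itself is unchanged.

```shell
#!/bin/sh
# Hypothetical sample: '.' and '^' are both regex metacharacters.
literal='a.b^c'

# Wrap every character except '^' in a bracket expression,
# then backslash-escape each '^'.
escaped=$(printf '%s\n' "$literal" | sed 's/[^^]/[&]/g; s/\^/\\^/g')
printf '%s\n' "$escaped"    # [a][.][b]\^[c]

# Only the line identical to $literal is printed. An unescaped 'a.b^c'
# used as a regex would treat '.' as "any character".
matched=$(printf '%s\n' 'axb^c' 'a.b^c' | sed -n "/$escaped/p")
printf '%s\n' "$matched"    # a.b^c
```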


    Escaping a string literal for use as the replacement string in sed's s/// command:

    The replacement string in a sed s/// command is not a regex, but it recognizes placeholders that refer to either the entire string matched by the regex (&) or specific capture-group results by index (\1, \2, ...), so these must be escaped, along with the (customary) regex delimiter, /.

    Assuming that the replacement string is a single-line string:

    replace='Laurel & Hardy; PS\2' # sample input containing metachars.
    
    replaceEscaped=$(sed 's/[&/\]/\\&/g' <<<"$replace") # escape it
    
    sed -n "s/.*/$replaceEscaped/p" <<<"foo" # Echoes $replace as-is
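
    To illustrate why the replacement string needs escaping at all, here is a minimal sketch that compares an unescaped and an escaped replacement. The sample value is shortened for illustration, and printf with a pipe stands in for the here-string so that the sketch also runs under plain sh.

```shell
#!/bin/sh
replace='Laurel & Hardy'

# Unescaped: '&' is expanded to the text matched by the regex ('foo').
unescaped=$(printf '%s\n' 'foo' | sed "s/.*/$replace/")
printf '%s\n' "$unescaped"    # Laurel foo Hardy

# Escaped with the command shown above: '&' is inserted literally.
replaceEscaped=$(printf '%s\n' "$replace" | sed 's/[&/\]/\\&/g')
escaped=$(printf '%s\n' 'foo' | sed "s/.*/$replaceEscaped/")
printf '%s\n' "$escaped"      # Laurel & Hardy
```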
    


    MULTI-line Solutions


    Escaping a MULTI-LINE string literal for use as a regex in sed:

    Note: This only makes sense if multiple input lines (possibly ALL) have been read before attempting to match.
    Since tools such as sed and awk operate on a single line at a time by default, extra steps are needed to make them read more than one line at a time.
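
    One such extra step for sed is a small loop that keeps appending the next input line to the pattern space until the last line is reached. The sketch below shows its effect on a three-line sample (the sample lines are made up for illustration):

```shell
#!/bin/sh
# ':a' defines a label; on every line but the last ($!), 'N' appends the
# next line to the pattern space and 'ba' branches back to the label.
joined=$(printf '%s\n' 'a' 'b' 'c' | sed -e ':a' -e '$!{N;ba' -e '}' -e 's/\n/,/g')
printf '%s\n' "$joined"    # a,b,c

# Without the loop, each line is processed separately, so the pattern
# space never contains a newline and nothing is replaced.
perLine=$(printf '%s\n' 'a' 'b' 'c' | sed 's/\n/,/g')
printf '%s\n' "$perLine"
```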

    # Define sample multi-line literal.
    search='/abc\n\t[a-z]\+\([^ ]\)\{2,3\}\3
    /def\n\t[A-Z]\+\([^ ]\)\{3,4\}\4'
    
    # Escape it.
    searchEscaped=$(sed -e 's/[^^]/[&]/g; s/\^/\\^/g; $!a\'$'\n''\\n' <<<"$search" | tr -d '\n')
    
    # Use in a Sed command that reads ALL input lines up front.
    # If ok, echoes 'foo'
    sed -n -e ':a' -e '$!{N;ba' -e '}' -e "s/$searchEscaped/foo/p" <<<"$search"
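
    To see what the escaping step produces, the following sketch applies it to a short two-line literal (a hypothetical sample). It is a variation rather than the exact command above: the text for the a command is passed as a separate -e fragment instead of via bash's $'\n', and printf with a pipe replaces the here-string, so that it runs under plain sh as well.

```shell
#!/bin/sh
search='a.b
c^d'

# Escape each line as in the single-line case, append the two characters
# '\n' after every line except the last, then delete the real newlines.
searchEscaped=$(printf '%s\n' "$search" |
  sed -e 's/[^^]/[&]/g; s/\^/\\^/g' -e '$!a\' -e '\\n' | tr -d '\n')
printf '%s\n' "$searchEscaped"    # [a][.][b]\n[c]\^[d]

# Read all input lines first, then match the escaped literal across lines.
result=$(printf '%s\n' "$search" |
  sed -n -e ':a' -e '$!{N;ba' -e '}' -e "s/$searchEscaped/foo/p")
printf '%s\n' "$result"           # foo
```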
    

    Escaping a MULTI-LINE string literal for use as the replacement string in sed's s/// command:

    # Define sample multi-line literal.
    replace='Laurel & Hardy; PS\2
    Masters\1 & Johnson\2'
    
    # Escape it for use as a Sed replacement string.
    IFS= read -d '' -r < <(sed -e ':a' -e '$!{N;ba' -e '}' -e 's/[&/\]/\\&/g; s/\n/\\&/g' <<<"$replace")
    replaceEscaped=${REPLY%$'\n'}
    
    # If ok, outputs $replace as is.
    sed -n "s/\(.*\) \(.*\)/$replaceEscaped/p" <<<"foo bar" 
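
    The escaped form itself can be inspected with a simplified sketch like the one below. It assumes a sample without trailing newlines, so a plain command substitution is used in place of the read -d '' construct above (command substitution would strip trailing newlines). The sample value is hypothetical.

```shell
#!/bin/sh
replace='A & B
C/D\1'

# Join all lines, backslash-escape '&', '/' and '\', then put a
# backslash in front of each embedded newline.
escaped=$(printf '%s\n' "$replace" |
  sed -e ':a' -e '$!{N;ba' -e '}' -e 's/[&/\]/\\&/g; s/\n/\\&/g')
printf '%s\n' "$escaped"
# A \& B\
# C\/D\\1

# Used as a replacement string, it reproduces $replace unchanged.
result=$(printf '%s\n' 'foo' | sed "s/.*/$escaped/")
printf '%s\n' "$result"
```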
    


    bash functions based on the above (for sed):

    # SYNOPSIS
    #   quoteRe <text>
    quoteRe() { sed -e 's/[^^]/[&]/g; s/\^/\\^/g; $!a\'$'\n''\\n' <<<"$1" | tr -d '\n'; }
    
    # SYNOPSIS
    #  quoteSubst <text>
    quoteSubst() {
      IFS= read -d '' -r < <(sed -e ':a' -e '$!{N;ba' -e '}' -e 's/[&/\]/\\&/g; s/\n/\\&/g' <<<"$1")
      printf %s "${REPLY%$'\n'}"
    }
    

    Example:

    from=$'Cost\(*):\n$3.' # sample input containing metachars. 
    to='You & I'$'\n''eating A\1 sauce.' # sample replacement string with metachars.
    
    # Should print the unmodified value of $to
    sed -e ':a' -e '$!{N;ba' -e '}' -e "s/$(quoteRe "$from")/$(quoteSubst "$to")/" <<<"$from" 
    

    Note the use of -e ':a' -e '$!{N;ba' -e '}' to read all input at once, so that the multi-line substitution works.



    perl solution:

    Perl has built-in support for escaping arbitrary strings for literal use in a regex: the quotemeta() function or its equivalent \Q...\E quoting.
    The approach is the same for both single- and multi-line strings; for example:

    from=$'Cost\(*):\n$3.' # sample input containing metachars.
    to='You owe me $1/$& for'$'\n''eating A\1 sauce.' # sample replacement string w/ metachars.
    
    # Should print the unmodified value of $to.
    # Note that the replacement value needs NO escaping.
    perl -s -0777 -pe 's/\Q$from\E/$to/' -- -from="$from" -to="$to" <<<"$from"