
linux sed expression to select text between markers


Here is a challenge for regex gurus: I need a simple sed expression to select the text between markers.

Here is an example text. Note that the text can contain any special characters, tabs and whitespace, even though this example doesn't show every possible combination.

^[[200~a^[[200~aaa aM1bb bbbM1ccc[$cM2ddddM2eeeeeM3ffffff fM3ggggggg M3hhhhh hhM3kkkkk~
  1. Select the text between the first matched start marker M1 and the last matched end marker M3. The text to select from the example is

bb bbbM1ccc[$cM2ddddM2eeeeeM3ffffff fM3ggggggg M3hhhhh hh

  2. Select the text between the last matched start marker M1 and the first matched end marker M3. The text to select from the example is

ccc[$cM2ddddM2eeeee

I tried the following, but it selects from the last start marker to the last end marker:

echo "^[[200~a^[[200~aaa aM1bb bbbM1ccc[\$cM2ddddM2eeeeeM3ffffff fM3ggggggg M3hhhhh hhM3kkkkk~"|sed -E "s|.*M1(.*)M3.*$|\1|g"

ccc[$cM2ddddM2eeeeeM3ffffff fM3ggggggg M3hhhhh hh

How can this be done? A single sed regex for each of the two requirements above (i.e. two regexes in total) would be best. I also need the equivalent Python re expressions.


Solution

  • The second case is easy, even with sed:

    $ a='^[[200~a^[[200~aaa aM1bb bbbM1ccc[$cM2ddddM2eeeeeM3ffffff fM3ggggggg M3hhhhh hhM3kkkkk~'
    $ sed -E 's/.*M1|M3.*//g' <<< "$a"
    ccc[$cM2ddddM2eeeee
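
The question also asks for a Python equivalent of this second case. Below is a minimal sketch, assuming the markers are passed as plain strings. The function name `last_start_to_first_end` is hypothetical and not part of the original answer:

```python
import re

def last_start_to_first_end(text, start="M1", end="M3"):
    """Return the text between the last `start` and the first `end` after it, or None."""
    # Greedy ".*" consumes as much as possible, so the match lands on the last
    # `start` that is still followed by an `end`; lazy "(.*?)" then stops at
    # the first `end` after it.
    # re.escape keeps markers that contain regex metacharacters literal.
    pattern = ".*" + re.escape(start) + "(.*?)" + re.escape(end)
    match = re.search(pattern, text)
    return match.group(1) if match else None

sample = "^[[200~a^[[200~aaa aM1bb bbbM1ccc[$cM2ddddM2eeeeeM3ffffff fM3ggggggg M3hhhhh hhM3kkkkk~"
print(last_start_to_first_end(sample))  # ccc[$cM2ddddM2eeeee
```

Like the sed command, this works within a single line: `.` does not match a newline unless `re.DOTALL` is passed.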
    

    The first case is more complex because sed's POSIX regular expressions are always greedy and have no lazy quantifier. If you can use python or perl instead of sed, you can use their non-greedy .*? operator to stop at the first M1, followed by a greedy capture group that extends to the last M3:

    $ python3 -c 'import sys,re; print("".join(re.sub(r".*?M1(.*)M3.*",r"\1",l) for l in sys.stdin),end="")' <<< "$a"
    bb bbbM1ccc[$cM2ddddM2eeeeeM3ffffff fM3ggggggg M3hhhhh hh
    $ perl -pe 's/.*?M1(.*)M3.*/$1/' <<< "$a"
    bb bbbM1ccc[$cM2ddddM2eeeeeM3ffffff fM3ggggggg M3hhhhh hh
    

    This is a bit shorter with python if you have only one line of text to process and pass it as an argument:

    $ python3 -c 'import sys,re; print(re.sub(r".*?M1(.*)M3.*",r"\1",sys.argv[1]))' "$a"
    bb bbbM1ccc[$cM2ddddM2eeeeeM3ffffff fM3ggggggg M3hhhhh hh
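
Which M1 and which M3 end up delimiting the selection depends only on where the greedy and lazy quantifiers sit. The sketch below uses a shortened, hypothetical input rather than the example from the question and runs the four combinations. The third pattern has the same shape as the attempt in the question, which is why that attempt selected from the last M1 to the last M3:

```python
import re

# Shortened, hypothetical input with two M1 and two M3 markers.
text = "xM1aM1bM3cM3y"

patterns = [
    ("first M1 to last M3", r"M1(.*)M3"),      # leftmost match, greedy capture
    ("first M1 to first M3", r"M1(.*?)M3"),    # leftmost match, lazy capture
    ("last M1 to last M3", r".*M1(.*)M3"),     # greedy prefix, greedy capture
    ("last M1 to first M3", r".*M1(.*?)M3"),   # greedy prefix, lazy capture
]

results = {label: re.search(rx, text).group(1) for label, rx in patterns}
for label, selected in results.items():
    print(f"{label}: {selected}")
```

Because `re.search` returns the leftmost match, a pattern that starts directly with M1 is already anchored on the first M1, even without a leading `.*?`.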
    

    With sed, one option is to first insert separator characters that cannot appear in the input, for instance newlines (sed reads its input line by line, so a line never contains one), and then keep only what lies between them. If your sed supports \n for newline in the replacement string of the substitute command (GNU sed does):

    $ sed -E 's/M1(.*)M3/\n\1\n/;s/.*\n(.*)\n.*/\1/' <<< "$a"
    bb bbbM1ccc[$cM2ddddM2eeeeeM3ffffff fM3ggggggg M3hhhhh hh
    

    Otherwise, with any sed, write the newline in the replacement as a backslash followed by a literal line break:

    $ sed -E 's/M1(.*)M3/\
    \1\
    /;s/.*\n(.*)\n.*/\1/' <<< "$a"
    bb bbbM1ccc[$cM2ddddM2eeeeeM3ffffff fM3ggggggg M3hhhhh hh
    

    Note: as your shell is bash, if you absolutely want a one-liner you can use its $'...' ANSI-C quoting, in which \\ stands for a backslash and \n for a newline:

    $ sed -E $'s/M1(.*)M3/\\\n\\1\\\n/;s/.*\\n(.*)\\n.*/\\1/' <<< "$a"