sed posix busybox gnu-sed

What does the range-operator in "sed" actually do, is it broken in GNU/busybox?


I wonder whether the GNU and BusyBox implementations of "sed" may be broken.

My default sed implementation is the one from GNU.

POSIX says:

An editing command with two addresses shall select the inclusive range from the first pattern space that matches the first address through the next pattern space that matches the second.

But then why does the following print

$ { echo ha; echo ha; echo ha; } | sed '0,/ha/ !d'
ha

instead of

ha
ha

? Clearly the 2nd "ha" here is the "next" pattern space which matches, so it should be output as well!

But even more strange,

$ { echo ha; echo ha; echo ha; } | busybox sed '0,/ha/ !d'

does not output anything at all!

But even if sed did what the POSIX definition says, it would still be unclear what happens when a range expression is actually checked.

Does every range-condition have its own internal state? Or is there a single global state for all range-conditions in a sed script?

Obviously, a range condition needs at least to remember whether it is currently in the "search for a match of the first address"-state or in the "search for a match of the second address"-state. Perhaps it even needs to remember a third state "I have already processed the range and will not match again, no matter what".

It certainly matters when those conditions are updated: Every time a new pattern space is read? Every time the pattern space is modified, say by an s-command? Or just if the control flow reaches a range condition?

So, what is it?

Until I know better, I will avoid range conditions in my sed-scripts and consider them to be a dubious feature.


Solution

  • Two answers:

    1. 0 is not a valid POSIX address (lines count from 1)
    2. 0,/re/ is a GNU extension

    The GNU sed man page includes:

    0,addr2

    Start out in "matched first address" state, until addr2 is found. This is similar to 1,addr2, except that if addr2 matches the very first line of input the 0,addr2 form will be at the end of its range, whereas the 1,addr2 form will still be at the beginning of its range. This works only when addr2 is a regular expression.

    Perhaps this will help clarify:

    $ { echo ha1; echo ha2; echo ha3; } | sed '0,/ha/ !d'
    ha1
    
    $ { echo ha1; echo ha2; echo ha3; } | sed '1,/ha/ !d'
    ha1
    ha2
    
    $ { echo ha1; echo ha2; echo ha3; } | sed --posix '0,/ha/ !d'
    sed: -e expression #1, char 8: invalid usage of line address 0
    

    The busybox code explicitly checks that addr1 is greater than 0, so with 0 as the first address it never enters the matching state. See the busybox source code, line 1121:

                || (sed_cmd->beg_line > 0
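
    Since 0 is rejected as a first address by "sed --posix" and never matches in busybox, one portable way to keep everything up to and including the first match (even when that match is on line 1) is to quit at the first match instead of using a range. This is only a sketch of an alternative; the helper name first_through_match is made up for illustration, and it assumes the regex contains no "/".

```shell
# Hypothetical helper: print stdin through the first line matching regex $1,
# then quit. Unlike 1,/re/ this also stops when the match is on line 1.
# Assumption: $1 contains no "/" (it is pasted into the sed address as-is).
first_through_match() {
    sed "/$1/q"
}

{ echo ha1; echo ha2; echo ha3; } | first_through_match ha
# prints: ha1
```

    Note that "q" stops reading the input altogether, whereas '0,/ha/ !d' keeps reading and deleting the remaining lines; the printed result is the same.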
    

    As for state: each range maintains its own state, since several ranges can be active simultaneously.

    POSIX says:

    An editing command with two addresses shall select the inclusive range from the first pattern space that matches the first address through the next pattern space that matches the second. (If the second address is a number less than or equal to the line number first selected, only one line shall be selected.) Starting at the first line following the selected range, sed shall look again for the first address. Thereafter, the process shall be repeated.
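
    The two rules in that paragraph (the range restarts after it ends, and a numeric second address that is not greater than the first selected line selects just one line) can be checked with two small experiments. This is a sketch with made-up input lines, not an exhaustive test of any implementation.

```shell
# The range restarts: after "end" closes a range, sed looks for "start" again.
printf '%s\n' start x end y start z end | sed -n '/start/,/end/p'
# prints (one per line): start x end start z end
# "y" is skipped because it lies between the two ranges.

# The second address is a number not greater than the first selected line,
# so only that one line is selected.
printf '%s\n' l1 l2 l3 l4 | sed -n '3,2p'
# prints: l3
```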

    The range test is evaluated each time control flow reaches the command, against the pattern space as it is at that moment:

    $ { echo ..a; echo ..b; echo ..c; } |\
      sed -n '
                 =;
                 y/cba/ba:/;
         1 ,/b/  s/$/ 1/p;
        /a/,/c/  s/$/ 2/p;
         2,  3   s/$/ 3/p;
      '
    1
    ..: 1
    2
    ..a 1
    ..a 1 2
    ..a 1 2 3
    3
    ..b 1
    ..b 1 2
    ..b 1 2 3
    

    This is also demonstrated by, for example, the busybox source code - see the sed_cmd_s typedef.
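
    That a range is only tested when it is actually encountered can also be seen by branching around a range command. In this sketch (made-up input; the output in the comment is for GNU sed, and other implementations might differ), the line that would end the range never reaches the range command, so the range stays open.

```shell
# Line "B" would close the range /A/,/B/, but the "b" command jumps to the end
# of the script first, so the range command never sees that line.
printf '%s\n' A B C D | sed -n '
  /B/b
  /A/,/B/p
'
# GNU sed prints A, C and D: the range opened on A is still open on C and D.
```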