awk

Recognising backslash in awk field separator


Input is

AZE D11/879\x0Dabc\x0D\x0A\x1E!DEF F11/999

The awk script sets the field separator to "\x0D" (I have tried with and without escaping the backslash).

awk script is

BEGIN {FS="\\x0D"}
   {print NF}

It should output 3 because there are 2 occurrences of the field separator, but it outputs 1, which indicates the separator is not being recognized.


Solution

  • There are 2 ways to provide a regexp in awk: a static regexp (aka regexp literal) written as /regexp/, and a dynamic regexp (aka computed regexp) written as "regexp" and used in a regexp context. A field separator is just a regexp with some additional behavior, so let's just consider regexps in general to explain what's going on in your example.

    The split() function takes a field separator (a regexp for our purposes) as its third argument, so it provides a good test bed:

    Using a static regexp:

    $ awk '{print split($0,a,/\x0D/)}' file
    1
    

    The \ above is escaping the x (so \x0D is read as an escape sequence, a carriage return in the awk used here); it's not a literal \. For that you need to escape the \ itself:

    $ awk '{print split($0,a,/\\x0D/)}' file
    3
    

    What if we used a dynamic regexp instead of the above static regexp?

    $ awk '{print split($0,a,"\x0D")}' file
    1
    $ awk '{print split($0,a,"\\x0D")}' file
    1
    $ awk '{print split($0,a,"\\\x0D")}' file
    ' is not a known regexp operator FNR=1) warning: regexp escape sequence `\
    1
    $ awk '{print split($0,a,"\\\\x0D")}' file
    3
    

    The behavior above is because awk first parses the string to convert it into a regexp (using up one layer of escape chars) and then parses it a second time when using it as a regexp (using up a second layer of escape chars).
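
    The two passes can be observed separately. A minimal sketch (the shell variable names after_string_parse and matches, and the awk variable text, are just for illustration): printing the string constant shows what is left after the first pass, and matching against it shows the second pass treating what remains as an escaped backslash.

```shell
# Pass 1: awk's string parser collapses each pair of backslashes into one,
# so the four backslashes in the source become two.
after_string_parse=$(awk 'BEGIN { print "\\\\x0D" }')
printf '%s\n' "$after_string_parse"    # \\x0D

# Pass 2: used as a dynamic regexp, the remaining pair matches one literal
# backslash, so the pattern matches the 4-character text \x0D.
matches=$(awk 'BEGIN { text = "abc\\x0Ddef"; print (text ~ "\\\\x0D") }')
printf '%s\n' "$matches"               # 1
```

    printf '%s\n' is used instead of echo because some shells' echo would itself interpret the backslashes and hide the effect.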

    Unfortunately, when you specify FS there is no option to give it as a literal regexp. It is always specified using a string, so it is a dynamic regexp and needs the extra layer of escaping:

    $ awk -v FS='\x0D' '{print NF}' file
    1
    $ awk -v FS='\\x0D' '{print NF}' file
    1
    $ awk -v FS='\\\x0D' '{print NF}' file
    ' is not a known regexp operatorence `\
    1
    $ awk -v FS='\\\\x0D' '{print NF}' file
    3
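
    To see why four backslashes are needed here, it can help to print what FS actually holds once the -v assignment has been processed. A small sketch (the shell variable name fs_value is only for illustration):

```shell
# -v assignments go through the same escape processing as string constants,
# so FS ends up holding two backslashes followed by x0D.
fs_value=$(awk -v FS='\\\\x0D' 'BEGIN { print FS }')
printf '%s\n' "$fs_value"    # \\x0D
```

    That remaining value is the regexp awk uses to split each record: an escaped backslash followed by x0D.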
    

    Now - what if you were using the wrong type of quotes in the shell part of the script, i.e. " instead of '? Then you introduce even more pain because now you're inviting the shell to also parse the string even before awk gets to see and parse it twice:

    $ awk -v FS="\\\\x0D" '{print NF}' file
    1
    $ awk -v FS="\\\\\x0D" '{print NF}' file
    ' is not a known regexp operatorence `\
    1
    $ awk -v FS="\\\\\\x0D" '{print NF}' file
    ' is not a known regexp operatorence `\
    1
    $ awk -v FS="\\\\\\\x0D" '{print NF}' file
    3
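
    One way to check how much the shell itself strips is to look at the argument before awk ever sees it. A sketch (variable names are illustrative) comparing the double-quoted form that worked above with the single-quoted one:

```shell
# Inside double quotes the shell turns each pair of backslashes into one and
# leaves a backslash before x untouched, so seven backslashes arrive as four.
double_quoted="\\\\\\\x0D"
# Inside single quotes nothing is interpreted: four backslashes stay four.
single_quoted='\\\\x0D'

printf '%s\n' "$double_quoted"    # \\\\x0D
printf '%s\n' "$single_quoted"    # \\\\x0D
```

    Both spellings hand awk the same text, which is why they produce the same result.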
    

    That's different from the case where the double quotes are used inside awk, because that's all wrapped inside single quotes and so is already protected from the shell:

    $ awk 'BEGIN{FS="\\\\x0D"} {print NF}' file
    3
    

    So - in the shell always use the most restrictive quotes (' over " over none) unless you have a very specific reason not to, and when using regexps or field separators always use literal /.../ rather than dynamic "...", again unless you have a very specific reason not to.
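
    As one illustration of preferring a regexp literal: when the separator is awkward to express through FS, split() with a /.../ literal needs only one level of escaping. A sketch using the sample line from the question (the array name parts and the shell variable names are illustrative):

```shell
line='AZE D11/879\x0Dabc\x0D\x0A\x1E!DEF F11/999'

# /\\x0D/ is a regexp literal: an escaped backslash, then the text x0D.
second=$(printf '%s\n' "$line" | awk '{ n = split($0, parts, /\\x0D/); print parts[2] }')
printf '%s\n' "$second"    # abc
```

    The pieces land in parts[1] through parts[n] rather than in $1..$NF, so the rest of the script would refer to the array instead of the fields.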

    The odd, truncated-looking error messages above are caused by the \rs the tool is trying to print as part of the escape sequence we're providing. They're really all: warning: regexp escape sequence '\^M' is not a known regexp operator