bash awk

AWK: backslash as one of many field separators


How do I escape the backslash? I have tried 2, 3, and 4 backslashes to escape it. What am I doing wrong?

echo "email@gmail.com:aaa/bbb\ccc ddd" |awk -F"[/@: \]" '
{
  print $1 " | " $2 " | " $3 " | " $4 " | " $5
}'

Expected output:

email | gmail.com | aaa | bbb | ccc | ddd

Solution

  • Here's how to write your script:

    $ printf '%s\n' 'email@gmail.com:aaa/bbb\ccc ddd' |
    awk -F'[/@: \\\\]' -v OFS=' | ' '
        {
            $1 = $1
            print
        }
    '
    email | gmail.com | aaa | bbb | ccc | ddd
    

    The difference between quotes in shell (note: this has nothing to do with awk or any other tool you might call from shell, this is all about shell):

    1. 'foo' = "hey, shell, stay the hell away from this, do NOT look at it"
    2. "foo" = "hey, shell, please interpret this to expand variables, etc."
    3. foo = "hey, shell, please interpret this to do the same stuff as for double quotes but also do word splitting, globbing (file name expansion), etc".

    See https://mywiki.wooledge.org/Quotes for all the gory details.

    So the shell quoting rule is:

    Always use single quotes around all strings and scripts, unless you need the shell to expand variables, in which case use double quotes, unless you also need the shell to do globbing, in which case use no quotes.
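
    As a minimal sketch of the first two quoting forms (the variable `name` is hypothetical and exists only for this demonstration), the same characters come out differently depending on the quotes. The unquoted form is left out because its result depends on which files are in the current directory:

```shell
# Hypothetical variable, used only to show what the shell expands.
name='world'

# Single quotes: the shell passes every character through untouched.
single=$(printf '%s\n' 'hello $name \\ *')

# Double quotes: $name is expanded and \\ collapses to \, but * stays literal.
double=$(printf '%s\n' "hello $name \\ *")

printf '%s\n' "$single"
printf '%s\n' "$double"
```

    The first line printed still contains `$name` and both backslashes; the second contains `world` and a single backslash.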

    Now for the awk part - when you specify an FS as a string you are actually writing a dynamic (aka computed) regexp (see https://www.gnu.org/software/gawk/manual/gawk.html#Computed-Regexps) and as such awk has to parse that string twice, once to convert it to a regexp and then to use it as a regexp. That means that any backslash you want to be present in the regexp has to appear twice in the string so that after the first pass of converting the string to a regexp there's still 1 backslash left.

    So if you want a regexp that lets you find, say, | or \ in the input then you'd like to just write:

    $ echo 'ab\c' | awk '/[|\]/'
    awk: cmd. line:1: /[|\]/
    awk: cmd. line:1:  ^ unterminated regexp
    awk: cmd. line:1: error: Unmatched [, [^, [:, [., or [=: /[|\]//
    

    but you can't because \ is the escape character in a regexp so you need to escape IT to make it literal:

    $ echo 'ab\c' | awk '/[|\\]/'
    ab\c
    

    Now if you wanted to use a dynamic instead of literal regexp (this applies to setting FS too) you'd like to do:

    $ echo 'ab\c' | awk -v re='[|\\]' '$0 ~ re'
    awk: cmd. line:1: (FILENAME=- FNR=1) fatal: invalid regexp: Unmatched [, [^, [:, [., or [=: /[|\]/
    

    but you can't because the string that represents the dynamic regexp has to be converted into a literal and that uses up one set of backslashes so you have to write:

    $ echo 'ab\c' | awk -v re='[|\\\\]' '$0 ~ re'
    ab\c
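
    One way to see that first pass on its own is to print the string instead of using it as a regexp. This is only a sketch: nothing is matched here, so neither command can fail with a regexp error, and the variable names `two` and `four` are made up for the example:

```shell
# Two backslashes in the assignment: only one survives the string pass,
# leaving [|\] which is not a valid regexp.
two=$(awk -v re='[|\\]' 'BEGIN { print re }')

# Four backslashes in the assignment: two survive, leaving [|\\],
# which as a regexp matches | or a literal backslash.
four=$(awk -v re='[|\\\\]' 'BEGIN { print re }')

printf '%s\n' "$two" "$four"
```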
    

    If you instead wrote (now about to [incorrectly] use double quotes instead of single):

    $ echo 'ab\c' | awk -v re="[|\\\\]" '$0 ~ re'
    awk: cmd. line:1: (FILENAME=- FNR=1) fatal: invalid regexp: Unmatched [, [^, [:, [., or [=: /[|\]/
    

    then you'd be asking the shell to interpret the string before awk even sees it and so you'd need yet another layer of backslashes for THAT pass to consume:

    $ echo 'ab\c' | awk -v re="[|\\\\\\\\]" '$0 ~ re'
    ab\c
    

    So - just don't do that unless you need to invite the shell to interpret the string. Simply follow the shell quoting rules I gave above.

    Now, remember - awk is not shell, so double quotes inside an awk script are part of the awk language, not part of the shell language, and so do not have the same semantics. When you write "foo" in awk you're not inviting the shell to interpret it, you're writing an awk string literal that the shell never touches, much like 'foo' in shell, so you don't need any escapes beyond the four backslashes a dynamic regexp already requires:

    $ echo 'ab\c' | awk 'BEGIN{re="[|\\\\]"} $0 ~ re'
    ab\c
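
    The same count carries over to FS. As a sketch using the input from the question, these three ways of setting FS all need four backslashes when the surrounding shell quotes are single (the variable names are illustrative only):

```shell
input='email@gmail.com:aaa/bbb\ccc ddd'

# FS given with -F on the command line.
via_F=$(printf '%s\n' "$input" | awk -F'[/@: \\\\]' '{ print $5 }')

# FS given as a -v assignment.
via_v=$(printf '%s\n' "$input" | awk -v FS='[/@: \\\\]' '{ print $5 }')

# FS given as a string literal inside the awk script.
via_begin=$(printf '%s\n' "$input" | awk 'BEGIN { FS = "[/@: \\\\]" } { print $5 }')

printf '%s\n' "$via_F" "$via_v" "$via_begin"
```

    Each command prints the fifth field, ccc, which is the text between the backslash and the space.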
    

    That last statement assumes your awk script is stored in a file or is inside single quotes when invoked from the shell. If you choose to use double quotes around the awk script instead of single then you are inviting a world of pain having to escape $ signs, double up on backslashes, etc., as you're asking the shell to interpret the whole awk script before awk sees it, so don't do that. On Windows I understand you have to, which is why the standard advice there is to save the awk script in a file instead of quoting it on the command line.
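
    A minimal sketch of the script-in-a-file approach, assuming `mktemp` is available and using a throwaway temporary file (any path would do). The here-document delimiter is quoted so the shell copies the script verbatim:

```shell
# Hypothetical temporary location for the awk script.
script=$(mktemp)

# Quoted delimiter ('EOF'): the shell does not touch $ or backslashes below.
cat > "$script" <<'EOF'
BEGIN { FS = "[/@: \\\\]"; OFS = " | " }
{ $1 = $1; print }
EOF

from_file=$(printf '%s\n' 'email@gmail.com:aaa/bbb\ccc ddd' | awk -f "$script")
rm -f "$script"

printf '%s\n' "$from_file"
```

    Because the script lives in a file, only awk's own two passes apply, so the four backslashes are the same as in the single-quoted command-line version.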