bash awk

Can printf "%x\n" \'a be performed in awk?


All printable characters' hex code values can be displayed this way in bash.

printf "%x\n"  \'a
61

awk 'BEGIN{printf("%x\n",\\'a)}'
awk 'BEGIN{printf("%x\n",\'a)}'

Neither of these works in awk. Is there no way to do this in awk?
Doesn't awk provide the same kind of printf format that bash does?

awk -v var="a"  'BEGIN{printf("%x\n", var)}'
0
echo -n  a|xxd
0000000: 61   

Getting a printable character's hex code value with echo -n a | xxd is simple. My question is whether awk provides the same kind of printf format that bash does, not how to get the hex code value by some other method in awk.

awk -v var="a"  'BEGIN{printf("%x\n", \'var)}'
bash: syntax error near unexpected token `)'
debian8@debian:~$ awk -v var="a"  "BEGIN{printf("%x\n", \'var)}"
awk: cmd. line:1: BEGIN{printf(%xn, \'var)}
awk: cmd. line:1:              ^ syntax error
awk: cmd. line:1: BEGIN{printf(%xn, \'var)}
awk: cmd. line:1:                   ^ backslash not last character on line
awk: cmd. line:1: BEGIN{printf(%xn, \'var)}
awk: cmd. line:1:                   ^ syntax error

Conclusion: awk doesn't support this kind of printf format.


Solution

  • Here's a command that shows that awk's printf function indeed does not support the '-prefixed syntax for getting a character's code point (applies to GNU Awk, Mawk, and BSD/macOS Awk):

    $ awk -v char="'a" 'BEGIN { printf "%x\n", char }'
    0  # The string 'a has no leading numeric part, so awk converts it to the number 0
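
    For contrast, the quote-prefix rule belongs to the shell's printf utility, not to the format string: for a numeric conversion such as %x or %d, an argument that starts with ' or " is replaced by the numeric value of the character that follows. The \'a in the question is just one way of quoting the two-character argument 'a for the shell. A short sketch with ASCII input only, so the result does not depend on the locale:

```shell
# All three calls pass the two-character argument 'a to printf.
printf '%x\n' \'a     # prints 61
printf '%x\n' "'a"    # prints 61
printf '%d\n' "'a"    # prints 97 (the same value in decimal)
```

    awk's printf has no such rule, so the same argument is subject only to awk's ordinary string-to-number conversion.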
    

    Note that Bash v4+'s printf builtin is Unicode-aware:

    $ printf '%x\n' \'€
    20ac  # U+20AC is the Unicode code point of the EURO symbol
    

    A hex-dump utility such as xxd shows only the byte representation of a character, which equals the code point only in the 7-bit ASCII range.
    In a UTF-8-based locale (which is typical these days), a character beyond the ASCII range is shown as the bytes that make up its UTF-8-encoded form:

    $ xxd <<<€
    00000000: e282 ac0a # 0xe2 0x82 0xac is the UTF-8 encoding of U+20AC; the trailing 0x0a is the newline that <<< appends
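
    If xxd is not installed, the POSIX od utility gives the same byte-level view. The sketch below writes the three UTF-8 bytes of € as octal escapes, so it does not depend on the terminal or file encoding, and it strips od's padding for easier comparison:

```shell
# \342\202\254 are the bytes 0xe2 0x82 0xac, i.e. the UTF-8 form of U+20AC.
# od -An suppresses the offset column; -tx1 prints one hex byte per unit.
bytes=$(printf '\342\202\254' | od -An -tx1 | tr -d '[:space:]')
echo "$bytes"    # prints e282ac
```

    As with xxd, this is the encoded byte sequence, not the code point 20ac.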
    

    The ord() function used with GNU Awk in Ed Morton's helpful answer is limited to ASCII characters. Any character with a code point beyond 0x7f results in a negative value.

    The create-a-map-of-all-characters workaround from James Brown's helpful answer:
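
    A minimal sketch of that idea, not necessarily the exact command from that answer. The helper name char_hex and its optional second argument are hypothetical. With the default upper bound of 127 it covers ASCII only and should work in any POSIX awk; passing 65535 to cover the BMP assumes GNU Awk in a UTF-8 locale, where sprintf("%c", n) is expected to yield the multibyte character for code point n.

```shell
# Hypothetical helper: print the hex code of the single character in $1.
# $2 is the highest code point to map (default 127 = ASCII only).
char_hex() {
  awk -v char="$1" -v max="${2:-127}" 'BEGIN {
    # Build a character-to-code map. Iterating in descending order means
    # that if two code points produce the same string, the lower one wins.
    for (i = max + 0; i >= 1; --i) ord[sprintf("%c", i)] = i
    if (char in ord) printf "%x\n", ord[char]
    else exit 1    # character is not in the mapped range
  }'
}

char_hex a    # prints 61
```

    Building the map costs one loop iteration per code point on every invocation, so for many lookups it is better to build it once in BEGIN and reuse it across input records.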

    Tip of the hat to RARE Kpop Manifesto, who suggested iterating over the BMP code points in descending order, "otherwise the ASCII-duplicating section in 0xD800-0xDFFF would overwrite ASCII ordinals with these meaningless values in the UTF-16 surrogate exclusion range."
    That is, with iteration in ascending order, ASCII-range characters such as char=a as input would mistakenly yield a surrogate code point.