
How do you convert a Unicode string to escape sequences in bash?


I need a tool that translates a Unicode string into escape sequences like \u0230.

For example,

echo ãçé | convert-unicode-tool
\u00e3\u00e7\u00e9

Solution

  • A pure-bash method:

    echo ãçé |
       while read -n 1 u
       do [[ -n "$u" ]] && printf '\\u%04x' "'$u"
       done
    

    The leading apostrophe in "'$u" tells printf to treat the argument as a character and use its numeric value instead of parsing it as a number.

    From the GNU man page online:

    If the leading character of a numeric argument is ‘"’ or ‘'’ then its value is the numeric value of the immediately following character. Any remaining characters are silently ignored if the POSIXLY_CORRECT environment variable is set; otherwise, a warning is printed. For example, ‘printf "%d" "'a"’ outputs ‘97’ on hosts that use the ASCII character set, since ‘a’ has the numeric value 97 in ASCII.

    That lets us pass the character to printf for numeric interpretations such as %d or %03o, or here, %04x.
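
    A minimal sketch of that behavior, using only ASCII so the results do not depend on the locale. The helper name char_code is hypothetical and not part of the original answer:

    ```bash
    # char_code is a hypothetical helper name used only for illustration.
    char_code() {
        # The leading ' makes printf use the numeric value of the next character.
        printf '%d\n' "'$1"
    }

    char_code a            # prints 97
    printf '%03o\n' "'a"   # prints 141  (octal)
    printf '%04x\n' "'a"   # prints 0061 (hex, zero-padded to 4 digits)
    ```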

    The [[ -n "$u" ]] test is needed because echo appends a trailing newline. read -n 1 treats that newline as its delimiter and leaves u empty, so printf receives a lone ' and prints \u0000.

    Output:

    $:     echo ãçé |
    >        while read -n 1 u
    >        do [[ -n "$u" ]] && printf '\\u%04x' "'$u"
    >        done
    \u00e3\u00e7\u00e9
    

    Without the [[ -n "$u" ]] check:

    $: echo ãçé | while read -n 1 u; do printf '\\u%04x' "'$u"; done
    \u00e3\u00e7\u00e9\u0000
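
    As a hedged sketch, the same loop can be wrapped in a reusable function. The name to_unicode_escapes is hypothetical. It assumes bash and, for non-ASCII input, a UTF-8 locale so that read -n 1 returns whole characters rather than single bytes. The IFS= and -r settings are additions to the original loop: without them read would drop spaces and backslashes. Code points above U+FFFF come out with more than four hex digits (for example \u1f600), which many \u parsers do not accept.

    ```bash
    # to_unicode_escapes is a hypothetical name; it reads text from stdin.
    to_unicode_escapes() {
        local ch
        # IFS= keeps spaces and tabs; -r keeps backslashes literal.
        while IFS= read -r -n 1 ch; do
            # An empty read means a newline was consumed as the delimiter; skip it.
            [[ -n $ch ]] && printf '\\u%04x' "'$ch"
        done
        printf '\n'
    }

    printf 'a b' | to_unicode_escapes   # prints \u0061\u0020\u0062
    ```

    In a UTF-8 locale, echo ãçé | to_unicode_escapes should give the same \u00e3\u00e7\u00e9 shown above, followed by a newline.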