awk, character-encoding, diacritics

Awk tolower a string that starts with an accent - support for foreign characters


I have a file with this string in a line: "Ávila"

And I want to get this output: "ávila".

The problem is that awk's tolower function only works when the string does not start with an accented character, and I must use awk.

For example, if I run awk 'BEGIN { print tolower("Ávila") }', I get "Ávila" instead of "ávila", which is what I expect.

But if I run awk 'BEGIN { print tolower("Castellón") }', I get "castellón", as expected.


Solution

  • For a given awk implementation to work properly with non-ASCII characters (foreign letters), it must respect the active locale's character encoding, as reflected in the (effective) LC_CTYPE setting (run locale to see it).

    These days, most locales use UTF-8, a multi-byte-on-demand encoding that uses a single byte for characters in the ASCII range and 2 to 4 bytes for all other Unicode characters.
    Thus, for a given awk implementation to recognize non-ASCII (accented, foreign) letters, it must be able to recognize multiple bytes as a single character.
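
To make the byte-versus-character distinction concrete, here is a minimal sketch that counts the bytes in a string. The helper name `utf8_byte_count` is hypothetical (it is not part of the original answer), and the octal escapes are used only so that the script itself stays pure ASCII.

```shell
# Hypothetical helper: print the number of bytes in its first argument.
# wc -c counts bytes, not characters; tr strips the padding some wc
# implementations add.
utf8_byte_count() {
  printf '%s' "$1" | wc -c | tr -d ' '
}

# \303\201 is the UTF-8 encoding of Á (U+00C1), written as octal escapes.
A_ACUTE=$(printf '\303\201')

utf8_byte_count "A"          # ASCII letter: 1 byte
utf8_byte_count "$A_ACUTE"   # Á: 2 bytes
```

An awk that processes input byte by byte sees those two bytes as two separate characters, neither of which is an ASCII uppercase letter, so tolower leaves them alone.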

    Among the major awk implementations, only GNU Awk (gawk) properly handles UTF-8-encoded characters (and presumably any other encoding, if specified in the locale):

    $ echo ÁvilA | gawk '{print tolower($0)}'
    ávila  # both Á and A lowercased
    

    Conversely, if you expressly want to limit character processing to ASCII only, prepend LC_CTYPE=C:

    $ echo ÁvilA | LC_CTYPE=C gawk '{print tolower($0)}'
    Ávila  # only ASCII char. A lowercased
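
If ASCII-only lowercasing is needed repeatedly, the command above can be wrapped in a small function. This is only a sketch: the name `ascii_tolower` is hypothetical, and it sets LC_ALL rather than LC_CTYPE on the assumption that LC_ALL may already be set in the environment, in which case it would take precedence over LC_CTYPE.

```shell
# Hypothetical wrapper: lowercase each line of stdin in the C locale,
# so that only the ASCII letters A-Z are changed.
# LC_ALL is used because, when set, it overrides LC_CTYPE.
ascii_tolower() {
  LC_ALL=C awk '{ print tolower($0) }'
}

# \303\201 is the UTF-8 encoding of Á (U+00C1), written as octal escapes.
# The Á bytes should pass through unchanged; only the final A is lowercased.
printf '\303\201vilA\n' | ascii_tolower
```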
    

    Practical advice:


    [1] BSD Awk and Mawk on OS X (the latter curiously not on Linux) treat UTF-8-encoded characters as follows:

    In the case at hand, this means: