I have a file with this string in a line: "Ávila"
And I want to get this output: "ávila".
The problem is that awk's tolower function only works when the string does not start with an accented character, and I must use awk.
For example, if I run awk 'BEGIN { print tolower("Ávila") }' then I get "Ávila" instead of "ávila", which is what I expect.
But if I run awk 'BEGIN { print tolower("Castellón") }' then I get "castellón", as expected.
For a given awk implementation to work properly with non-ASCII characters (foreign letters), it must respect the active locale's character encoding, as reflected in the (effective) LC_CTYPE setting (run locale to see it).
These days, most locales use UTF-8 encoding, a multi-byte-on-demand encoding that is single-byte in the ASCII range, and uses 2 to 4 bytes to represent all other Unicode characters.
Thus, for a given awk implementation to recognize non-ASCII (accented, foreign) letters, it must be able to recognize multiple bytes as a single character.
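To make the single-byte versus multi-byte distinction concrete, here is a minimal sketch that counts the bytes in an ASCII letter and in Á. The helper name byte_count is hypothetical, and the string is passed as printf octal escapes so the result does not depend on the terminal's encoding.

```shell
# Hypothetical helper: count the bytes produced by a printf format string.
byte_count() {
  printf "$1" | wc -c | tr -d ' '   # tr strips the padding some wc versions add
}

byte_count 'A'          # ASCII letter: prints 1
byte_count '\303\201'   # Á (U+00C1) as UTF-8 bytes 0xC3 0x81: prints 2
```

An awk that works byte by byte sees the second case as two separate characters, which is why it cannot lowercase Á.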
Among the major awk implementations,
- GNU Awk (gawk), the default on some Linux distros
- BSD Awk (awk), as also used on OS X
- Mawk (mawk), the default on Debian-based Linux distros such as Ubuntu
only GNU Awk properly handles UTF-8-encoded characters (and presumably any other encoding if specified in the locale):
$ echo ÁvilA | gawk '{print tolower($0)}'
ávila # both Á and A lowercased
Conversely, if you expressly want to limit character processing to ASCII only, prepend LC_CTYPE=C:
$ echo ÁvilA | LC_CTYPE=C gawk '{print tolower($0)}'
Ávila # only ASCII char. A lowercased
Practical advice:
- To determine what implementation your default awk is, run awk --version.
  - Mawk will report an error, because it only supports -W version, but that error message will contain the word mawk.
- If possible, install and use GNU Awk (and optionally make it the default awk); it is available for most Unix-like platforms; e.g.:
  - On Debian-based distros such as Ubuntu: sudo apt-get install gawk
  - On OS X, using Homebrew: brew install gawk
- If you must use either BSD Awk or Mawk, use the above LC_CTYPE=C approach to ensure that the multi-byte UTF-8 characters are at least passed through without modification [1], but foreign letters will NOT be recognized as letters (and thus won't be lowercased, in this case).
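The version check above could be scripted roughly as follows. This is only a sketch: the helper name classify_awk is hypothetical, and the matching assumes that GNU Awk's banner contains "GNU Awk" and that Mawk's output (or error message) contains "mawk". Anything else is reported as unknown rather than positively identified.

```shell
# Hypothetical helper: guess the awk implementation from its version banner.
classify_awk() {
  case "$1" in
    *"GNU Awk"*) echo "gawk" ;;
    *mawk*)      echo "mawk" ;;
    *)           echo "other (possibly BSD Awk)" ;;
  esac
}

# First line of whatever the default awk prints for --version;
# 2>&1 also captures the error message that Mawk may emit.
banner=$(awk --version 2>&1 | head -n 1)
classify_awk "$banner"
```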
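[1] BSD Awk and Mawk on OS X (the latter curiously not on Linux) treat UTF-8-encoded characters as follows:
- Each byte is treated as a character of its own.
- The high bit of each byte is dropped to map it into the ASCII range.
- If the result is an ASCII uppercase letter, 32 is added to the original byte value to obtain the lowercase counterpart.
In the case at hand, this means: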
Á is Unicode codepoint U+00C1, whose UTF-8 encoding is the 2-byte sequence: 0xC3 0x81.
0xC3: Dropping the high bit (0xC3 & 0x7F) yields 0x43, which is interpreted as ASCII letter C, and 32 (0x20) is therefore added to the original value, yielding 0xE3 (0xC3 + 0x20).
0x81: Dropping the high bit (0x81 & 0x7F) yields 0x1, which is not in the range of ASCII uppercase letters (65-90, 0x41-0x5a), so the byte is left as-is.
Effectively, the first byte is modified from 0xC3 to 0xE3, while the 2nd byte is left untouched; since 0xE3 0x81 is not a valid UTF-8 sequence (0xE3 starts a 3-byte sequence, but only one continuation byte follows), the terminal will print ? instead to signal that.
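The per-byte arithmetic in this footnote can be reproduced with plain shell arithmetic. This is a sketch of the described behavior, not the actual BSD Awk or Mawk source; the function name naive_tolower_byte is hypothetical.

```shell
# Hypothetical helper: apply the footnote's per-byte rule to one byte value.
naive_tolower_byte() {
  b=$(( $1 ))              # original byte value, e.g. 0xC3
  low7=$(( b & 0x7F ))     # drop the high bit
  # If the 7-bit value is an ASCII uppercase letter (65-90),
  # add 32 to the ORIGINAL byte, not to the 7-bit value.
  if [ "$low7" -ge 65 ] && [ "$low7" -le 90 ]; then
    b=$(( b + 32 ))
  fi
  printf '0x%X\n' "$b"
}

naive_tolower_byte 0xC3   # first byte of Á:  prints 0xE3
naive_tolower_byte 0x81   # second byte of Á: prints 0x81
```

The two results together form the byte sequence 0xE3 0x81, which matches the corrupted output described above.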