I am writing a C application, to be distributed as a single static binary (so it doesn't have a chance to do anything like install message catalog files). It only needs to support a few natural languages; I want to get the "current language" the program is supposed to be speaking, and then print one thing if it is English, another if it is, say, Japanese, and a third if it is Polish, and maybe something else if it isn't any of those.
I know that a C program starts up with the "C" locale. This is not the "correct" language for the program to speak in its output, based on the standard environment variables like LANG. To get the program to consult its environment and determine the correct locale for messages to the user, I need to do something like:
char* message_locale = setlocale(LC_MESSAGES, "");
This has the side effect of configuring the message-catalog system to the correct locale as well, which is fine.
But, if I'm not using the message catalog system to get my message text from an installed catalog on disk, I need to take this string and figure out what language it actually means. The man page on my system covers return values of NULL (locale configured is invalid), "C", and "POSIX". From poking around the internet, it seems like I should expect strings like "zh_Hans_HK.UTF-8", "it_IT.ISO8859-15", "en_IE", "saq", or "pl", but I haven't been able to find documentation on the format of these strings or how to parse them.
Is there any guarantee on or documentation of the format of these strings, or are they meant to be opaque to applications? When I don't have NULL, "C" or "POSIX", can I always take the part up to the first underscore (if any) and get a two- or three-letter language code, or will some users have locales configured that use a different name structure? What standardized set of language codes is used? Will the language code part always be in lower-case even if the user set up the environment with something like LANG=EN?
can I always take the part up to the first underscore (if any) and get a two- or three-letter language code
No.
What standardized set of language codes is used?
ISO 639 language codes are a useful guide, but be prepared for locale names that do not conform.
The C standard does not constrain the format of the locale string: apart from "C" and "", the accepted names are implementation-defined.
In practice, though, common patterns do appear, as in the roughly 600 names printed by the locale -a program on one *nix system (excerpt below).
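This suggests something like:
// Needs <locale.h> and <string.h>.
const char *message_locale = setlocale(LC_MESSAGES, "");
// Check for the ISO 639 two-letter code and for the spelled-out language name.
// Do not modify the returned string (so avoid strtok() on message_locale).
if (message_locale == NULL) {
    // The environment names a locale this system cannot honor.
} else if (strncmp(message_locale, "ru_", 3) == 0 || strncmp(message_locale, "russian", 7) == 0) {
    // Do Russian
    // Note: "ru_" does not match a bare "ru" or "ru.UTF-8".
} else if ( ...) {
    // ...
} else {
    // Important handling here. May need to ask the user.
    // ...
}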
It remains important to handle the case where no language match is found.
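As a rough sketch of the matching described above, the helper below takes everything before the first '_', '.', or '@' as the language part, compares it case-insensitively (ASCII only), and falls back to "other" for anything unrecognized, including NULL, "C", and "POSIX". The names app_lang, lang_is, and classify_locale are hypothetical, and the accepted spellings (two-letter code, three-letter ISO 639-2 code, English language name) are illustrative assumptions rather than a list any system guarantees.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical names for this sketch; not part of any standard API. */
enum app_lang { APP_LANG_ENGLISH, APP_LANG_JAPANESE, APP_LANG_POLISH, APP_LANG_OTHER };

/* ASCII-only lower-casing, so the result does not depend on LC_CTYPE. */
static int ascii_lower(int c) {
    return (c >= 'A' && c <= 'Z') ? c + ('a' - 'A') : c;
}

/* True if the first len bytes of name equal key (lower-case), ignoring ASCII case. */
static int lang_is(const char *name, size_t len, const char *key) {
    size_t i;
    assert(name != NULL && key != NULL);
    if (strlen(key) != len) return 0;
    for (i = 0; i < len; i++) {
        if (ascii_lower((unsigned char)name[i]) != key[i]) return 0;
    }
    return 1;
}

/* Map a locale name (as returned by setlocale) to one of the supported languages. */
enum app_lang classify_locale(const char *name) {
    size_t len;
    if (name == NULL) return APP_LANG_OTHER;      /* setlocale failed */
    len = strcspn(name, "_.@");                   /* length of the leading language part */
    /* The spellings below are assumptions; extend them for the systems you target. */
    if (lang_is(name, len, "en") || lang_is(name, len, "eng") || lang_is(name, len, "english"))
        return APP_LANG_ENGLISH;
    if (lang_is(name, len, "ja") || lang_is(name, len, "jpn") || lang_is(name, len, "japanese"))
        return APP_LANG_JAPANESE;
    if (lang_is(name, len, "pl") || lang_is(name, len, "pol") || lang_is(name, len, "polish"))
        return APP_LANG_POLISH;
    return APP_LANG_OTHER;                        /* "C", "POSIX", or anything unrecognized */
}
```

A caller would pass the result of setlocale(LC_MESSAGES, "") straight in and decide for itself what APP_LANG_OTHER means, for example defaulting to English.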
C
C.utf8
POSIX
...
en_AE
en_AG
en_AI
en_AS
en_AT
en_AU
en_AU.utf8
en_BB
en_BE
en_BI
en_BM
en_BS
en_BW
en_BW.utf8
en_BZ
en_CA
en_CA.utf8
en_CC
en_CH
en_CK
en_CM
en_CX
en_CY
en_DE
en_DK
en_DK.utf8
en_DM
en_ER
en_FI
en_FJ
en_FK
en_FM
en_GB
en_GB.utf8
en_GD
en_GG
en_GH
en_GI
en_GM
en_GU
en_GY
en_HK
en_HK.utf8
en_ID
en_IE
en_IE.utf8
en_IE@euro
en_IL
en_IM
en_IN
en_IO
en_JE
en_JM
en_KE
en_KI
en_KN
en_KY
en_LC
en_LR
en_LS
en_MG
en_MH
en_MO
en_MP
en_MS
en_MT
en_MU
en_MW
en_MY
en_NA
en_NF
en_NG
en_NL
en_NR
en_NU
en_NZ
en_NZ.utf8
en_PG
en_PH
en_PH.utf8
en_PK
en_PN
en_PR
en_PW
en_RW
en_SB
en_SC
en_SD
en_SE
en_SG
en_SG.utf8
en_SH
en_SI
en_SL
en_SS
en_SX
en_SZ
en_TC
en_TK
en_TO
en_TT
en_TV
en_TZ
en_UG
en_UM
en_US
en_US.utf8
en_VC
en_VG
en_VI
en_VU
en_WS
en_ZA
en_ZA.utf8
en_ZM
en_ZW
en_ZW.utf8
...
ru_BY
ru_KG
ru_KZ
ru_MD
ru_RU
ru_RU.utf8
ru_UA
ru_UA.utf8
russian
...
xh_ZA
xh_ZA.utf8
xog_UG
yav_CM
yo_BJ
yo_NG
zh_CN
zh_CN.utf8
zh_CN.utf8@cjknarrow
zh_CN@cjknarrow
zh_HK
zh_HK.utf8
zh_HK.utf8@cjknarrow
zh_HK@cjknarrow
zh_MO
zh_MO@cjknarrow
zh_SG
zh_SG.utf8
zh_SG.utf8@cjknarrow
zh_SG@cjknarrow
zh_TW
zh_TW.utf8
zh_TW.utf8@cjknarrow
zh_TW@cjknarrow
zu_ZA
zu_ZA.utf8
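Most names in this listing fit the pattern language[_territory][.codeset][@modifier], which is the convention glibc documents but not something C itself promises. The following is a minimal sketch of splitting a name along those separators; struct locale_parts, copy_span, and split_locale_name are hypothetical names, and the fixed buffer sizes are arbitrary assumptions (longer fields are silently truncated).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical container for the pieces of a locale name. */
struct locale_parts {
    char language[16];
    char territory[16];
    char codeset[32];
    char modifier[32];
};

/* Copy at most cap-1 bytes of src into dst and NUL-terminate (truncates if needed). */
static void copy_span(char *dst, size_t cap, const char *src, size_t len) {
    assert(dst != NULL && src != NULL && cap > 0);
    if (len >= cap) len = cap - 1;
    memcpy(dst, src, len);
    dst[len] = '\0';
}

/* Split name on '_', '.', '@' in that order. Returns 0 on success, -1 for NULL or "". */
int split_locale_name(const char *name, struct locale_parts *out) {
    const char *p;
    size_t n;
    assert(out != NULL);
    memset(out, 0, sizeof *out);                 /* all fields start as empty strings */
    if (name == NULL || *name == '\0') return -1;

    p = name;
    n = strcspn(p, "_.@");                       /* language runs to the first separator */
    copy_span(out->language, sizeof out->language, p, n);
    p += n;
    if (*p == '_') {                             /* optional territory */
        p++;
        n = strcspn(p, ".@");
        copy_span(out->territory, sizeof out->territory, p, n);
        p += n;
    }
    if (*p == '.') {                             /* optional codeset */
        p++;
        n = strcspn(p, "@");
        copy_span(out->codeset, sizeof out->codeset, p, n);
        p += n;
    }
    if (*p == '@') {                             /* optional modifier */
        p++;
        copy_span(out->modifier, sizeof out->modifier, p, strlen(p));
    }
    return 0;
}
```

Names that do not fit the pattern still parse without error: "russian" ends up entirely in the language field, and a name with two underscores keeps everything after the first one in the territory field, so the caller still has to treat unrecognized results as "no match".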