Tags: c, internationalization, posix, locale, setlocale

How do I select the right language for my C program's output?


I am writing a C application, to be distributed as a single static binary (so it doesn't have a chance to do anything like install message catalog files). It only needs to support a few natural languages; I want to get the "current language" the program is supposed to be speaking, and then print one thing if it is English, another if it is, say, Japanese, and a third if it is Polish, and maybe something else if it isn't any of those.

I know that a C program starts up with the "C" locale. This is not the "correct" language for the program to speak in its output, based on the standard environment variables like LANG. To get the program to consult its environment and determine the correct locale for messages to the user, I need to do something like:

char* message_locale = setlocale(LC_MESSAGES, "");

This has the side effect of configuring the message-catalog system to the correct locale as well, which is fine.

But, if I'm not using the message catalog system to get my message text from an installed catalog on disk, I need to take this string and figure out what language it actually means. The man page on my system covers return values of NULL (locale configured is invalid), "C", and "POSIX". From poking around the internet, it seems like I should expect strings like "zh_Hans_HK.UTF-8", "it_IT.ISO8859-15", "en_IE", "saq", or "pl", but I haven't been able to find documentation on the format of these strings or how to parse them.

Is there any guarantee on or documentation of the format of these strings, or are they meant to be opaque to applications? When I don't have NULL, "C" or "POSIX", can I always take the part up to the first underscore (if any) and get a two- or three-letter language code, or will some users have locales configured that use a different name structure? What standardized set of language codes is used? Will the language code part always be in lower-case even if the user set up the environment with something like LANG=EN?


Solution

  • can I always take the part up to the first underscore (if any) and get a two- or three-letter language code

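    No. Names such as "russian" or "C.utf8" (both appear in the listing below) carry no underscore-delimited language code.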

    What standardized set of language codes is used?

    ISO 639 language codes are a useful guide, but be prepared for non-conforming locale names.


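    The C standard places no constraints on the format of locale names: apart from "C" and "", the strings setlocale() accepts and returns are implementation-defined.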

    In practice, though, common patterns show up, such as those among the roughly 600 names that the locale -a program prints on a *nix system (excerpt below).

    These patterns suggest something like:

    const char *message_locale = setlocale(LC_MESSAGES, "");
    // Check the ISO 639 2-letter name and the ISO language name.
    // Do not modify message_locale (avoid strtok(message_locale, ...)).
    if (message_locale == NULL) {
      // The environment names a locale this system cannot honor.
    } else if (strncmp(message_locale, "ru_", 3) == 0 || strncmp(message_locale, "russian", 7) == 0) {
      // Do Russian
    } else if ( ...) {
      // ...
    } else {
      // Important handling here.  May need to ask the user.
      // ...
    }
    

    It remains important to handle the case where no language match is found.
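
The prefix checks above can be generalized a little. Most of the names in the listing fit the shape language[_territory][.codeset][@modifier], which is also the form POSIX describes for LANG values, so one option is to cut the name at the first '_', '.' or '@', lowercase the result, and look it up in a small table that also holds alias names such as "polish". What follows is only a sketch under those assumptions: the enum, both function names, and the alias list are hypothetical, not part of any standard API, and nothing guarantees every system follows this naming shape.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical language IDs for this sketch. */
enum lang { LANG_UNKNOWN, LANG_ENGLISH, LANG_JAPANESE, LANG_POLISH };

/* Copy the leading part of a locale name (everything before the first
   '_', '.' or '@') into out, lowercased.  Returns the number of characters
   copied, or 0 if name is NULL or the token does not fit in out. */
static size_t locale_language(const char *name, char *out, size_t outsize)
{
    size_t n = 0;

    if (name == NULL || outsize == 0)
        return 0;
    while (name[n] != '\0' && name[n] != '_' && name[n] != '.' && name[n] != '@') {
        if (n + 1 >= outsize) {      /* too long: treat as unrecognized */
            out[0] = '\0';
            return 0;
        }
        out[n] = (char)tolower((unsigned char)name[n]);
        n++;
    }
    out[n] = '\0';
    return n;
}

/* Map a locale name to one of the supported languages.  "C", "POSIX" and
   anything not in the table come back as LANG_UNKNOWN, so the caller must
   choose a fallback. */
static enum lang pick_language(const char *locale_name)
{
    /* ISO 639-1 and 639-2 codes plus alias names (an assumed, partial list). */
    static const struct { const char *token; enum lang id; } table[] = {
        { "en", LANG_ENGLISH },  { "eng", LANG_ENGLISH },  { "english",  LANG_ENGLISH },
        { "ja", LANG_JAPANESE }, { "jpn", LANG_JAPANESE }, { "japanese", LANG_JAPANESE },
        { "pl", LANG_POLISH },   { "pol", LANG_POLISH },   { "polish",   LANG_POLISH },
    };
    char token[16];
    size_t i;

    if (locale_language(locale_name, token, sizeof token) == 0)
        return LANG_UNKNOWN;
    for (i = 0; i < sizeof table / sizeof table[0]; i++) {
        if (strcmp(token, table[i].token) == 0)
            return table[i].id;
    }
    return LANG_UNKNOWN;
}
```

A caller would pass the setlocale() result straight in, for example pick_language(setlocale(LC_MESSAGES, "")), and treat LANG_UNKNOWN as the "something else" case. The NULL check covers an invalid locale, and the returned string is never modified. Lowercasing is purely defensive, since whether a value like LANG=EN is accepted at all depends on the implementation.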


    C
    C.utf8
    POSIX
    ...
    en_AE
    en_AG
    en_AI
    en_AS
    en_AT
    en_AU
    en_AU.utf8
    en_BB
    en_BE
    en_BI
    en_BM
    en_BS
    en_BW
    en_BW.utf8
    en_BZ
    en_CA
    en_CA.utf8
    en_CC
    en_CH
    en_CK
    en_CM
    en_CX
    en_CY
    en_DE
    en_DK
    en_DK.utf8
    en_DM
    en_ER
    en_FI
    en_FJ
    en_FK
    en_FM
    en_GB
    en_GB.utf8
    en_GD
    en_GG
    en_GH
    en_GI
    en_GM
    en_GU
    en_GY
    en_HK
    en_HK.utf8
    en_ID
    en_IE
    en_IE.utf8
    en_IE@euro
    en_IL
    en_IM
    en_IN
    en_IO
    en_JE
    en_JM
    en_KE
    en_KI
    en_KN
    en_KY
    en_LC
    en_LR
    en_LS
    en_MG
    en_MH
    en_MO
    en_MP
    en_MS
    en_MT
    en_MU
    en_MW
    en_MY
    en_NA
    en_NF
    en_NG
    en_NL
    en_NR
    en_NU
    en_NZ
    en_NZ.utf8
    en_PG
    en_PH
    en_PH.utf8
    en_PK
    en_PN
    en_PR
    en_PW
    en_RW
    en_SB
    en_SC
    en_SD
    en_SE
    en_SG
    en_SG.utf8
    en_SH
    en_SI
    en_SL
    en_SS
    en_SX
    en_SZ
    en_TC
    en_TK
    en_TO
    en_TT
    en_TV
    en_TZ
    en_UG
    en_UM
    en_US
    en_US.utf8
    en_VC
    en_VG
    en_VI
    en_VU
    en_WS
    en_ZA
    en_ZA.utf8
    en_ZM
    en_ZW
    en_ZW.utf8
    ...
    ru_BY
    ru_KG
    ru_KZ
    ru_MD
    ru_RU
    ru_RU.utf8
    ru_UA
    ru_UA.utf8
    russian
    ...
    xh_ZA
    xh_ZA.utf8
    xog_UG
    yav_CM
    yo_BJ
    yo_NG
    zh_CN
    zh_CN.utf8
    zh_CN.utf8@cjknarrow
    zh_CN@cjknarrow
    zh_HK
    zh_HK.utf8
    zh_HK.utf8@cjknarrow
    zh_HK@cjknarrow
    zh_MO
    zh_MO@cjknarrow
    zh_SG
    zh_SG.utf8
    zh_SG.utf8@cjknarrow
    zh_SG@cjknarrow
    zh_TW
    zh_TW.utf8
    zh_TW.utf8@cjknarrow
    zh_TW@cjknarrow
    zu_ZA
    zu_ZA.utf8