c++unicodelocaleutfucs

UTF usage in C++ code


What is the difference between UTF and UCS.

What are the best ways to represent not European character sets (using UTF) in C++ strings. I would like to know your recommendations for:


Solution

  • What is the difference between UTF and UCS.

    UCS encodings are fixed width, and are marked by how many bytes are used for each character. For example, UCS-2 requires 2 bytes per character. Characters with code points outside the available range can't be encoded in a UCS encoding.

    UTF encodings are variable width, and marked by the minimum number of bits to store a character. For example, UTF-16 requires at least 16 bits (2 bytes) per character. Characters with large code points are encoded using a larger number of bytes -- 4 bytes for astral characters in UTF-16.

    • Internal representation inside the code
    • Best storage representation (i.e. In file)
    • Best on wire transport format (Transfer between application that may be on different architectures and have a different standard locale)

    For modern systems, the most reasonable storage and transport encoding is UTF-8. There are special cases where others might be appropriate -- UTF-7 for old mail servers, UTF-16 for poorly-written text editors -- but UTF-8 is most common.

    Preferred internal representation will depend on your platform. In Windows, it is UTF-16. In UNIX, it is UCS-4. Each has its good points:

    Finally, some systems use UTF-8 as an internal format. This is good if you need to inter-operate with existing ASCII- or ISO-8859-based systems because NULL bytes are not present in the middle of UTF-8 text -- they are in UTF-16 or UCS-4.