c++ unicode language-lawyer unicode-normalization canonicalization

May a C++ compiler normalize Unicode identifiers?


In C++, we can use a wide variety of Unicode characters in identifiers. For example, you could name a variable résumé.

Those accented es can be represented in different ways: either as a precomposed character or as a plain e with a combining accent character. Many applications normalize such strings so that seemingly identical strings actually match.
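To make the two representations concrete, here is a minimal sketch that spells out both forms of é as UTF-8 bytes and compares them byte for byte. The helper names (`precomposed_e_acute`, `decomposed_e_acute`, `same_bytes`) are made up for illustration and are not part of any library.

```cpp
#include <string>

// "é" as one precomposed code point, U+00E9, encoded in UTF-8 (2 bytes).
inline std::string precomposed_e_acute() { return "\xC3\xA9"; }

// "é" as plain 'e' (U+0065) followed by COMBINING ACUTE ACCENT (U+0301),
// encoded in UTF-8 (1 + 2 = 3 bytes).
inline std::string decomposed_e_acute() { return "e\xCC\x81"; }

// Plain byte-wise comparison, i.e. what a tool that does not normalize
// effectively performs when it matches names.
inline bool same_bytes(const std::string& a, const std::string& b) {
    return a == b;
}
```

The two strings render identically in most fonts, yet `same_bytes(precomposed_e_acute(), decomposed_e_acute())` is false, which is why normalization matters whenever names are compared.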

Looking at the C++ standard, I don't see anything that requires the compiler to normalize identifiers, so a variable named résumé (with precomposed é, U+00E9) could be distinct from a variable named résumé (with e followed by a combining acute accent, U+0301). (In my tests, neither MSVC nor Clang appears to normalize identifiers.)

Is there anything that prohibits the compiler from choosing a normal form? If not, at what phase of translation should normalization occur?

[To be clear: I'm talking about identifiers, not string literals.]


Solution

  • I believe the compiler is permitted to perform this normalization in translation phase 1:

    Physical source file characters are mapped, in an implementation-defined manner, to the basic source character set (introducing new-line characters for end-of-line indicators) if necessary. The set of physical source file characters accepted is implementation-defined. Any source file character not in the basic source character set (5.3) is replaced by the universal-character-name that designates that character. An implementation may use any internal encoding, so long as an actual extended character encountered in the source file, and the same extended character expressed in the source file as a universal-character-name (e.g., using the \uXXXX notation), are handled equivalently except where this replacement is reverted (5.4) in a raw string literal.

    Since the mapping of source file characters to the basic source character set and to universal-character-names is implementation-defined, the implementation may choose to convert whatever byte sequences represent either the precomposed or the decomposed lowercase-e-with-acute-accent to the same universal-character-name. As with any implementation-defined behavior, it must document this choice.
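As a rough illustration of that idea (not how any real compiler does it), the sketch below maps both UTF-8 spellings of é to the same universal-character-name during a phase-1-style pass. The function name `map_e_acute_to_ucn` is hypothetical, and the sketch assumes UTF-8 input and handles only this one character; a real implementation would apply a full Unicode normalization form to every character.

```cpp
#include <cstddef>
#include <string>

// Hypothetical phase-1-style mapping, restricted to "é" for illustration.
// Both the precomposed and the decomposed UTF-8 spellings are replaced by
// the same universal-character-name, so later phases see one identifier.
inline std::string map_e_acute_to_ucn(const std::string& src) {
    const std::string precomposed = "\xC3\xA9";   // U+00E9
    const std::string decomposed  = "e\xCC\x81";  // U+0065 U+0301
    const std::string ucn         = "\\u00E9";    // the six characters \u00E9

    std::string out;
    std::size_t i = 0;
    while (i < src.size()) {
        if (src.compare(i, decomposed.size(), decomposed) == 0) {
            out += ucn;                 // decomposed form -> \u00E9
            i += decomposed.size();
        } else if (src.compare(i, precomposed.size(), precomposed) == 0) {
            out += ucn;                 // precomposed form -> \u00E9
            i += precomposed.size();
        } else {
            out += src[i++];            // every other byte is copied as-is
        }
    }
    return out;
}
```

With this mapping, `map_e_acute_to_ucn("r\xC3\xA9sum\xC3\xA9")` and `map_e_acute_to_ucn("re\xCC\x81sume\xCC\x81")` both yield the text `r\u00E9sum\u00E9`, so the two spellings would name the same variable in every later translation phase.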