golanguage-specifications

How to understand this spec text?


I want to improve my knowledge of Golang by reading the Golang specification, but English isn't my native language, and I do not fully understand what the following text means:

Source code is Unicode text encoded in UTF-8. The text is not canonicalized, so a single accented code point is distinct from the same character constructed from combining an accent and a letter; those are treated as two code points. For simplicity, this document will use the unqualified term character to refer to a Unicode code point in the source text.

With reference to the above text, what do the following parts mean?

  1. The text is not canonicalized
  2. Single accented code
  3. Unqualified term character to refer to a Unicode code point in the source text

If questions of this type are not suitable for this site, please advise a more suitable place to ask such questions.


Solution

  • It's important that you understand a particular facet of the Unicode standard first. There are essentially two ways to represent an accented character like ë. One is the single code point U+00EB (Latin Small Letter E with Diaeresis). The other is a sequence of two code points, ë: the plain code point U+0065 (Latin Small Letter E, a regular letter e) followed by the code point U+0308 (Combining Diaeresis).

    To a reader, these two characters are the same. They are merely constructed differently. This leads to a concept called Unicode equivalence, which defines those two sequences of code points as equivalent. Normalizing (or canonicalizing) text means converting it to one agreed form, so that equivalent sequences become identical.

    The text is not canonicalized, so a single accented code point is distinct from the same character constructed from combining an accent and a letter

    This means that the two accented letters ë (U+00EB) and ë (U+0065 U+0308) above are not treated as equivalent in Go source code, because the text is never normalized. The first one is the "single accented code point" U+00EB, and the second is the letter e followed by a combining diacritic, which counts as two code points.
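
    The difference can be checked with a short program. This is only a minimal sketch using the standard library; the names precomposed, decomposed, and codePoints are made up for this illustration and do not come from the spec.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// precomposed is the single code point U+00EB.
const precomposed = "\u00EB"

// decomposed is U+0065 (letter e) followed by U+0308 (combining diaeresis).
const decomposed = "e\u0308"

// codePoints returns the Unicode code points of s,
// i.e. its "characters" in the sense the spec uses the word.
func codePoints(s string) []rune {
	return []rune(s)
}

func main() {
	// Go does not normalize strings, so the two forms compare as different.
	fmt.Println(precomposed == decomposed) // false

	// One code point versus two code points.
	fmt.Println(utf8.RuneCountInString(precomposed)) // 1
	fmt.Println(utf8.RuneCountInString(decomposed))  // 2

	fmt.Printf("%U\n", codePoints(precomposed)) // [U+00EB]
	fmt.Printf("%U\n", codePoints(decomposed))  // [U+0065 U+0308]
}
```

    Both strings usually look identical when printed, which is why the spec points out that they are nevertheless distinct.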


    For simplicity, this document will use the unqualified term character to refer to a Unicode code point in the source text

    It's just saying "In this document, the plain word 'character' means a single Unicode code point." "Unqualified" means the word is used on its own, without a qualifier in front of it (such as "Unicode" or "combining"). This is a shorthand for ease of reading, not an additional rule of the language.