
unicodedata.decomposition() vs. unicodedata.normalize(NFD/NFKD)?


According to the Python 3 documentation:

  • unicodedata.decomposition(chr)

    Returns the character decomposition mapping assigned to the character chr as string. An empty string is returned in case no such mapping is defined.

What I don't quite understand is how the character decomposition mapping is defined, and how unicodedata.decomposition() relates to (or differs from) unicodedata.normalize() with NFD/NFKD.

See the following examples:

$ python3
>>> import unicodedata
>>> unicodedata.decomposition('⑴')
'<compat> 0028 0031 0029'              <-- why not just '0028 0031 0029'?
>>> unicodedata.normalize('NFKD', '⑴')
'(1)'
>>> unicodedata.decomposition('①')
'<circle> 0031'                        <-- why not just '0031'?
>>> unicodedata.normalize('NFKD', '①')
'1'
>>> unicodedata.decomposition('è')
'0065 0300'                            <-- like this?
>>> unicodedata.normalize('NFD', 'è') == '\u0065\u0300'
True
>>>

Solution

  • unicodedata.decomposition returns the decomposition type and mapping of a single code point in the format used in the Unicode Character Database. From UAX #44:

    Decomposition_Type, Decomposition_Mapping: This field contains both values, with the type in angle brackets.

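    If there's no type in angle brackets, the code point has a canonical decomposition, which is used by NFC and NFD. If there's a type in angle brackets (such as <compat> or <circle>), the code point has a compatibility decomposition, which is used by NFKC and NFKD in addition to the canonical decompositions. The tag is part of the database field, which is why it appears in the returned string.

    unicodedata.normalize implements the Unicode Normalization algorithms for whole strings: it applies these mappings recursively to every code point and then puts combining marks into canonical order, so its result can differ from a single decomposition() lookup.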
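
To make the relationship concrete, here is a minimal sketch that parses the string returned by unicodedata.decomposition and applies the mapping recursively. The helper names parse_decomposition and full_decompose are made up for this illustration and are not part of the unicodedata module. The sketch deliberately skips canonical reordering of combining marks and the algorithmic Hangul decomposition, so it only agrees with unicodedata.normalize for simple cases like the ones shown.

```python
import unicodedata


def parse_decomposition(ch):
    """Split unicodedata.decomposition(ch) into (type, mapped string).

    type is None for a canonical mapping (or when there is no mapping),
    otherwise the tag without its angle brackets, e.g. 'compat'.
    """
    fields = unicodedata.decomposition(ch).split()
    if not fields:
        return None, ""
    dtype = None
    if fields[0].startswith("<"):
        dtype = fields[0].strip("<>")
        fields = fields[1:]
    return dtype, "".join(chr(int(f, 16)) for f in fields)


def full_decompose(ch, compat=False):
    """Recursively expand one code point (hypothetical helper).

    Canonical mappings are always followed; tagged (compatibility)
    mappings are followed only when compat=True. No reordering of
    combining marks is done here, unlike unicodedata.normalize.
    """
    dtype, mapped = parse_decomposition(ch)
    if not mapped or (dtype is not None and not compat):
        return ch
    return "".join(full_decompose(c, compat) for c in mapped)


print(parse_decomposition("\u2474"))   # ('compat', '(1)')
print(parse_decomposition("\u2460"))   # ('circle', '1')

# U+01D6 maps to U+00FC U+0304 in one step; NFD expands U+00FC further.
print(unicodedata.decomposition("\u01d6"))                    # 00FC 0304
print(full_decompose("\u01d6") == unicodedata.normalize("NFD", "\u01d6"))  # True
```

The last two lines show the main difference: decomposition() reports a single step from the database, while normalize() keeps expanding until nothing decomposes further.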