According to the Python 3 docs:
unicodedata.decomposition(chr)
Returns the character decomposition mapping assigned to the character
chr
as string. An empty string is returned in case no such mapping is defined.
I don't quite understand how the character decomposition mapping is defined, or what the relationship/difference is between unicodedata.decomposition()
and unicodedata.normalize() with NFD/NFKD.
See the following examples:
$ python3
>>> import unicodedata
>>> unicodedata.decomposition('⑴')
'<compat> 0028 0031 0029' <-- why not just '0028 0031 0029'?
>>> unicodedata.normalize('NFKD', '⑴')
'(1)'
>>> unicodedata.decomposition('①')
'<circle> 0031' <-- why not just '0031'?
>>> unicodedata.normalize('NFKD', '①')
'1'
>>> unicodedata.decomposition('è')
'0065 0300' <-- like this?
>>> unicodedata.normalize('NFD', 'è') == '\u0065\u0300'
True
>>>
unicodedata.decomposition
returns the decomposition type and mapping of a single code point in the format used in the Unicode Character Database. From UAX #44:
Decomposition_Type, Decomposition_Mapping: This field contains both values, with the type in angle brackets.
If there's no type in angle brackets, the code point has a canonical decomposition, which is used in NFC and NFD. If there is a type in angle brackets, the code point has a compatibility decomposition, which is used by NFKC and NFKD in addition to the canonical decompositions.
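You can see this distinction directly with the examples from the question: NFD only applies canonical decompositions, so a character with a compatibility decomposition (angle-bracket type) passes through NFD unchanged but is decomposed by NFKD:

```python
import unicodedata

# '①' has a compatibility decomposition (type '<circle>'), so NFD
# leaves it alone while NFKD applies the mapping.
print(unicodedata.decomposition('①'))      # '<circle> 0031'
print(unicodedata.normalize('NFD', '①'))   # '①' (unchanged)
print(unicodedata.normalize('NFKD', '①'))  # '1'

# 'è' has a canonical decomposition (no angle-bracket type), so both
# NFD and NFKD decompose it.
print(unicodedata.decomposition('è'))      # '0065 0300'
print(unicodedata.normalize('NFD', 'è') == '\u0065\u0300')  # True
```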
unicodedata.normalize
implements the Unicode Normalization algorithms for whole strings.
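Another difference worth noting: decomposition() reports only the single-level mapping stored in the Unicode Character Database, while normalize() applies decompositions recursively (and reorders combining marks) across the whole string. A small illustration, using 'Ǻ' (U+01FA), whose one-level decomposition contains 'Å' (U+00C5), which itself decomposes further:

```python
import unicodedata

# One level only: 'Ǻ' maps to 'Å' (00C5) + combining acute (0301).
print(unicodedata.decomposition('\u01FA'))  # '00C5 0301'

# normalize() keeps decomposing: 'Å' in turn maps to 'A' (0041) +
# combining ring above (030A), so NFD yields three code points.
nfd = unicodedata.normalize('NFD', '\u01FA')
print([hex(ord(c)) for c in nfd])           # ['0x41', '0x30a', '0x301']
```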