Some Chinese characters have multiple Unicode representations. For example, although the characters 金 and 金 are usually rendered the same, they actually have different underlying code points (see https://www.compart.com/en/unicode/U+91D1 and https://www.compart.com/en/unicode/U+F90A#UNC_DB). Evidently, the site www.compart.com knows about the link between these two code points, since the U+F90A page links to the U+91D1 page.
Is there a public database where I can query these kinds of correspondences between Unicode characters that are actually the "same"?
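For example, checking the code points in Python shows they differ even though the glyphs look alike:

>>> a, b = '\u91d1', '\uf90a'          # 金 (U+91D1) and 金 (U+F90A)
>>> a == b                             # rendered the same, but not equal
False
>>> [hex(ord(c)) for c in (a, b)]
['0x91d1', '0xf90a']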
The solution is Unicode Normalization.

Is there a public database where I can query these kinds of correspondences between Unicode characters that are actually the "same"? Yes, there is the Unicode Character Database; in particular, look at UnicodeData.txt: technically a CSV-like file, semicolon-separated and without a header line (the fields are described in UAX #44). We are interested in field 5 (Decomposition_Type and Decomposition_Mapping):

(5) This field contains both values, with the type in angle brackets. The decomposition mappings exactly match the decomposition mappings published with the character names in the Unicode Standard. For more information, see Character Decomposition Mappings.
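As a side note, if you only need this field programmatically, Python's standard unicodedata module exposes the same data through unicodedata.decomposition(); a small sketch:

from unicodedata import decomposition, name

for ch in '\u2FA6\u328E\uF90A\u322E\u91D1':
    decomp = decomposition(ch)     # field 5 of UnicodeData.txt, '' if there is none
    print(f'U+{ord(ch):04X} {name(ch)}: {decomp or "(no decomposition)"}')

For U+F90A this prints the mapping 91D1 with no type in angle brackets (a canonical, singleton decomposition); for U+91D1 itself there is no decomposition at all.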
Search the file manually or semi-automatically: use findstr on Windows, or the equivalent grep on Linux, to look for your code point 91D1:
findstr /I /R "\<91d1\>" "\Utils\CodePages\UnicodeData.txt"
2FA6;KANGXI RADICAL GOLD;So;0;ON;<compat> 91D1;;;;N;;;;;
322E;PARENTHESIZED IDEOGRAPH METAL;So;0;L;<compat> 0028 91D1 0029;;;;N;;;;;
328E;CIRCLED IDEOGRAPH METAL;So;0;L;<circle> 91D1;;;;N;;;;;
F90A;CJK COMPATIBILITY IDEOGRAPH-F90A;Lo;0;L;91D1;;;;N;;;;;
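The same search can also be scripted in Python; a minimal sketch, assuming UnicodeData.txt has been downloaded into the current directory (it is published under https://www.unicode.org/Public/UCD/latest/ucd/):

TARGET = '91D1'   # find every character whose decomposition mentions this code point

with open('UnicodeData.txt', encoding='utf-8') as ucd:
    for line in ucd:
        fields = line.rstrip('\n').split(';')
        code, charname, decomp = fields[0], fields[1], fields[5]
        # field 5 = Decomposition_Type (in <...>) and Decomposition_Mapping
        if TARGET in decomp.split():
            print(code, charname, '->', decomp)

This prints the same four records as the findstr call above.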
The characters found above are:

⾦ (U+2FA6, Kangxi Radical Gold)
㊎ (U+328E, Circled Ideograph Metal)
金 (U+F90A, CJK Compatibility Ideograph-F90A)
㈮ (U+322E, Parenthesized Ideograph Metal)
金 (U+91D1, CJK Ideograph; missing from the findstr output above because individual CJK ideographs are not listed in UnicodeData.txt, the whole block being covered by the <CJK Ideograph, First>..<CJK Ideograph, Last> range entries, code points 4E00..9FFF)

The following Python script illustrates some aspects of normalization:
import sys
from unicodedata import normalize

def encodeuni(s):
    '''
    Returns input string encoded to escape sequences as in a string literal.
    Output is similar to
    str(s.encode('unicode_escape')).lstrip('b').strip("'").replace('\\\\','\\');
    but even every ASCII character is encoded as a \\xNN escape sequence
    (except a space character). For instance:
    s = 'A á ř 🌈';
    encodeuni(s); # '\\x41 \\xe1 \\u0159 \\U0001f308' while
    str(s.encode('unicode_escape')).lstrip('b').strip("'").replace('\\\\','\\');
    #             'A \\xe1 \\u0159 \\U0001f308'
    '''
    def encodechar(ch):
        ordch = ord(ch)
        return ( ch                if ordch == 0x20   else
                 f"\\x{ordch:02x}" if ordch <= 0xFF   else
                 f"\\u{ordch:04x}" if ordch <= 0xFFFF else
                 f"\\U{ordch:08x}" )
    return ''.join([encodechar(ch) for ch in s])

if len(sys.argv) >= 2 and sys.argv[1] != '':
    letters = (' '.join(
        [sys.argv[i] for i in range(1, len(sys.argv))])).strip()
    # .\SO\59979037.py ÅÅÅ🌈
else:
    letters = '\u212B \u00C5 \u0041\u030A \U0001f308'
    # \u212B     Å  Angstrom Sign
    # \u00C5     Å  Latin Capital Letter A With Ring Above
    # \u0041     A  Latin Capital Letter A
    # \u030A     ̊  Combining Ring Above
    # \U0001f308 🌈 Rainbow

print('\t'.join(['raw',
                 letters.ljust(10),
                 str(len(letters)),
                 encodeuni(letters), '\n']))

for form in ['NFC', 'NFKC', 'NFD', 'NFKD']:
    letnorm = normalize(form, letters)
    print('\t'.join([form,
                     letnorm.ljust(10),
                     str(len(letnorm)),
                     encodeuni(letnorm)]))
Output (command line: encodeuni.py ⾦㊎金金㈮):
raw ⾦㊎金金㈮ 5 \u2fa6\u328e\uf90a\u91d1\u322e
NFC ⾦㊎金金㈮ 5 \u2fa6\u328e\u91d1\u91d1\u322e
NFKC 金金金金(金) 7 \u91d1\u91d1\u91d1\u91d1\x28\u91d1\x29
NFD ⾦㊎金金㈮ 5 \u2fa6\u328e\u91d1\u91d1\u322e
NFKD 金金金金(金) 7 \u91d1\u91d1\u91d1\u91d1\x28\u91d1\x29
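Note how the canonical forms (NFC, NFD) replace only the compatibility ideograph U+F90A with U+91D1, while the compatibility forms (NFKC, NFKD) fold all five characters onto U+91D1 (the parenthesized ideograph keeping its parentheses). So, to test whether two strings are the "same" in this sense, compare their normalized forms; a minimal sketch:

from unicodedata import normalize

def canonically_equivalent(a, b):
    # same abstract characters, e.g. 金 (U+F90A) vs. 金 (U+91D1)
    return normalize('NFC', a) == normalize('NFC', b)

def compatibility_equivalent(a, b):
    # looser: also folds compatibility characters such as ⾦ and ㊎
    return normalize('NFKC', a) == normalize('NFKC', b)

print(canonically_equivalent('\uF90A', '\u91D1'))    # True
print(compatibility_equivalent('\u2FA6', '\u91D1'))  # True
print(canonically_equivalent('\u2FA6', '\u91D1'))    # False (only a compatibility mapping)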
Further resources (required reading): Unicode® Technical Reports, in particular UAX #15, Unicode Normalization Forms, and UAX #44, Unicode Character Database.