unicode cjk chinese-locale

Chinese characters with multiple unicode representations


Some Chinese characters have multiple Unicode representations. For example, although the characters 金 (U+91D1) and 金 (U+F90A) are usually rendered identically, they have different underlying code points (see https://www.compart.com/en/unicode/U+91D1 and https://www.compart.com/en/unicode/U+F90A#UNC_DB). Evidently, www.compart.com knows about the link between these two code points, since the U+F90A page links to the U+91D1 page.

Is there a public database where I can query these kinds of correspondences between Unicode characters that are actually the "same"?


Solution

  • The solution is Unicode Normalization.

    "Is there a public database where I can query these kinds of correspondences between Unicode characters that are actually the "same"?"

    Yes, there is: the Unicode Character Database. Look at UnicodeData.txt, which is technically a semicolon-separated file without a header line (fields described here). The field of interest is field 5, counting from 0 (Decomposition_Type and Decomposition_Mapping):

    (5) This field contains both values, with the type in angle brackets. The decomposition mappings exactly match the decomposition mappings published with the character names in the Unicode Standard. For more information, see Character Decomposition Mappings.
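    As a rough illustration of how field 5 is laid out, the sketch below splits one UnicodeData.txt record into its optional type tag and its mapping. The names parse_record and Decomposition are hypothetical helpers introduced here, not part of any library; the sample records are copied from UnicodeData.txt.

```python
from typing import NamedTuple, Optional, Tuple


class Decomposition(NamedTuple):
    """Hypothetical container for the decomposition data of one record."""
    code_point: int
    name: str
    kind: Optional[str]        # "compat", "circle", ...; None means canonical
    mapping: Tuple[int, ...]   # code points the character decomposes to


def parse_record(record: str) -> Optional[Decomposition]:
    """Parse one semicolon-separated UnicodeData.txt line.

    Returns None when field 5 is empty (no decomposition).
    """
    fields = record.rstrip("\n").split(";")
    parts = fields[5].split()
    if not parts:
        return None
    kind = None
    if parts[0].startswith("<"):
        # A tag in angle brackets marks a compatibility decomposition.
        kind = parts[0].strip("<>")
        parts = parts[1:]
    mapping = tuple(int(p, 16) for p in parts)
    return Decomposition(int(fields[0], 16), fields[1], kind, mapping)


samples = [
    "322E;PARENTHESIZED IDEOGRAPH METAL;So;0;L;<compat> 0028 91D1 0029;;;;N;;;;;",
    "F90A;CJK COMPATIBILITY IDEOGRAPH-F90A;Lo;0;L;91D1;;;;N;;;;;",
    "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;",
]
for line in samples:
    print(parse_record(line))
```

    A record without a tag in angle brackets, such as F90A, carries a canonical decomposition; a tagged one, such as 322E, carries a compatibility decomposition.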

    Search the file manually or semi-automatically for your code point 91d1, for example with findstr on Windows or the equivalent grep command on Linux:

    findstr /I /R "\<91d1\>" "\Utils\CodePages\UnicodeData.txt"
    
    2FA6;KANGXI RADICAL GOLD;So;0;ON;<compat> 91D1;;;;N;;;;;
    322E;PARENTHESIZED IDEOGRAPH METAL;So;0;L;<compat> 0028 91D1 0029;;;;N;;;;;
    328E;CIRCLED IDEOGRAPH METAL;So;0;L;<circle> 91D1;;;;N;;;;;
    F90A;CJK COMPATIBILITY IDEOGRAPH-F90A;Lo;0;L;91D1;;;;N;;;;;
    

    The characters found above are ⾦ (U+2FA6), ㈮ (U+322E), ㊎ (U+328E) and 金 (U+F90A).
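    If downloading UnicodeData.txt is inconvenient, a similar reverse lookup can probably be done with Python's built-in unicodedata module, which exposes the same field through unicodedata.decomposition. This is only a sketch: decomposes_to is a hypothetical helper name, and the result depends on the Unicode version bundled with your Python build.

```python
import sys
import unicodedata


def decomposes_to(target: str) -> list:
    """List (code point, name, decomposition) for every character whose
    decomposition mapping mentions the code point of `target`."""
    wanted = f"{ord(target):04X}"
    hits = []
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        decomposition = unicodedata.decomposition(ch)  # '' if there is none
        if wanted in decomposition.split():
            hits.append((cp, unicodedata.name(ch, "<unnamed>"), decomposition))
    return hits


for cp, name, decomposition in decomposes_to("\u91d1"):
    print(f"U+{cp:04X}  {name}  [{decomposition}]")
```

    The loop visits every code point once, so it takes a moment, but it needs no external file.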

    The following Python script illustrates some aspects of normalization…

    import sys
    from unicodedata import normalize
    
    def encodeuni(s):
        '''
        Returns input string encoded to escape sequences as in a string literal.
        Output is similar to
          str(s.encode('unicode_escape')).lstrip('b').strip("'").replace('\\\\','\\');
        but even every ASCII character is encoded as a \\xNN escape sequence
        (except a space character). For instance: 
        
        s = 'A á ř 🌈';
        encodeuni(s);       # '\\x41 \\xe1 \\u0159 \\U0001f308'     while 
        str(s.encode('unicode_escape')).lstrip('b').strip("'").replace('\\\\','\\');
        #                   #    'A \\xe1 \\u0159 \\U0001f308'
        '''
        def encodechar(ch):
            ordch = ord(ch)
            return ( ch                if ordch == 0x20   else 
                     f"\\x{ordch:02x}" if ordch <= 0xFF   else
                     f"\\u{ordch:04x}" if ordch <= 0xFFFF else
                     f"\\U{ordch:08x}" )
                     
        return ''.join([encodechar(ch) for ch in s]) 
    
    if len(sys.argv) >= 2 and sys.argv[1] != '':
        letters = ' '.join(sys.argv[1:]).strip()
        # .\SO\59979037.py  ÅÅÅ🌈
    else:
        letters = '\u212B \u00C5 \u0041\u030A \U0001f308'
        #          \u212B                     Å Angstrom Sign
        #                 \u00C5              Å Latin Capital Letter A With Ring Above
        #                        \u0041       A Latin Capital Letter A
        #                              \u030A ̊  Combining Ring Above
        #                                     \U0001f308 🌈 Rainbow
    
    print('\t'.join( ['raw' ,
                      letters.ljust(10),
                      str(len(letters)),
                      encodeuni(letters),'\n']))
    for form in ['NFC','NFKC','NFD','NFKD']:
        letnorm = normalize(form, letters)
        print( '\t'.join( [form,
                          letnorm.ljust(10),
                          str(len(letnorm)),
                          encodeuni(letnorm)]))
    

    Output: encodeuni.py ⾦㊎金金㈮

    raw     ⾦㊎金金㈮      5       \u2fa6\u328e\uf90a\u91d1\u322e
    
    NFC     ⾦㊎金金㈮      5       \u2fa6\u328e\u91d1\u91d1\u322e
    NFKC    金金金金(金)    7       \u91d1\u91d1\u91d1\u91d1\x28\u91d1\x29
    NFD     ⾦㊎金金㈮      5       \u2fa6\u328e\u91d1\u91d1\u322e
    NFKD    金金金金(金)    7       \u91d1\u91d1\u91d1\u91d1\x28\u91d1\x29
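
    To turn the table above into a yes/no test, one option is to compare two strings after normalizing both to the same form. The helper name same_character is an assumption made for this sketch; only unicodedata.normalize comes from the standard library.

```python
from unicodedata import normalize


def same_character(a: str, b: str, form: str = "NFC") -> bool:
    """True if `a` and `b` are identical after normalization to `form`."""
    return normalize(form, a) == normalize(form, b)


unified = "\u91d1"  # the unified ideograph U+91D1
for cp in (0xF90A, 0x2FA6, 0x328E, 0x322E):
    ch = chr(cp)
    print(f"U+{cp:04X}",
          "NFC:", same_character(ch, unified, "NFC"),
          "NFKC:", same_character(ch, unified, "NFKC"))
```

    Only U+F90A matches under NFC, because its decomposition is canonical. U+2FA6 and U+328E match only under the compatibility form NFKC, and U+322E matches under neither, since its NFKC form keeps the surrounding parentheses.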
    

    Further resources (required reading): Unicode® Technical Reports