unicode python-unicode unicode-normalization

How to ignore space in Sound Mark during Unicode Composition/Decomposition in Japanese text?


I have two different tables of data. In one of them the Katakana-Hiragana sound mark is part of the previous character; in the other it is a separate symbol. I need to match values between the two tables. Unicode equivalence should handle these cases, but unexpectedly U+309B (Katakana-Hiragana Voiced Sound Mark) decomposes into U+0020 (space) and U+3099 (Combining Katakana-Hiragana Voiced Sound Mark). The space prevents U+3099 from combining with the previous character.

Example:

From one table I get value ジ (U+30B8). I perform the NFKC transformation: U+30B8 is decomposed as U+30B7 and U+3099 and then composed back to U+30B8.

From the other table I get value シ゛(U+30B7 and U+309B). I perform the NFKC transformation: (U+30B7 U+309B) is decomposed as (U+30B7 U+0020 U+3099) and (U+30B7 U+3099) is not composed back to U+30B8 because of the space in between. So I'm left with シ ゙ (U+30B7 U+0020 U+3099) and I can't match this value with ジ (U+30B8) from the previous table.

How can I get rid of the space in decomposition of U+309B and why is it even there?

Here is the Python code:

import unicodedata2


print(f"Unicode code points: {[hex(ord(c)) for c in unicodedata2.normalize('NFKC', 'シ゛')]}")
# Result: Unicode code points: ['0x30b7', '0x20', '0x3099']
print(f"Unicode code points: {[hex(ord(c)) for c in unicodedata2.normalize('NFKC', 'ジ')]}")
# Result: Unicode code points: ['0x30b8']

Solution

  • U+309B is a spacing character, not a combining character, so it legitimately occurs after, and separate from, the previous character.

    https://www.fileformat.info/info/unicode/char/309b/index.htm explicitly lists its decomposition as a space and a combining accent. (This is just one of many popular browsable versions of the actual Unicode database, so I would regard it as reasonably authoritative.)

    For what it's worth, for me, in Chrome on a Mac, it renders as if the space was after the accent, but it's clearly a different string from your other example:

    [Screenshot: detail of the OP's code as rendered]

    As for how to solve this, what's acceptable really depends on your use case. As a quick hack, you might simply want to discard a space immediately before a combining character and see what you get, although that then means you cannot correctly process text which is supposed to represent the combining character in isolation.

    If you can use the third-party regex library, it allows you to match on Unicode properties; so try something like

    >>> import unicodedata, regex
    >>> regex.sub(r' (?=\p{Joining_Type=Transparent})', '', unicodedata.normalize(
    ...          'NFKD', 'the name of the bank is シ゛ヨウホクシンキン'))
    'the name of the bank is ジヨウホクシンキン'
    

    Installing an additional third-party library might be overkill, though. If you really only care about the Japanese joining characters, you can simply enumerate them. https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=[:Joining_Type=Transparent:] has a listing; search for "Hiragana".

    >>> import re
    >>> re.sub(r' (?=[\u3099\u309A])', '', unicodedata.normalize(
    ...       'NFKD', 'the name of the bank is シ゛ヨウホクシンキン'))
    'the name of the bank is ジヨウホクシンキン'
    

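    (Yes, there are only two: the combining dakuten U+3099 and the combining handakuten U+309A.) Note that the output of NFKD is still decomposed, i.e. U+30B7 followed by U+3099, even though it renders as ジ. If you need to compare against the precomposed U+30B8, pass the result through NFC afterwards.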
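
Putting the pieces together, here is a minimal sketch using only the standard library. The helper name `normalize_kana` is made up for illustration, and the approach assumes you never need to keep a space directly in front of a combining dakuten or handakuten. It decomposes with NFKD, drops the space left behind by U+309B or U+309C, and recomposes with NFC so that both spellings end up as the same precomposed code point.

```python
import re
import unicodedata

# A space immediately followed by the combining dakuten (U+3099)
# or combining handakuten (U+309A).
_SPACE_BEFORE_KANA_MARK = re.compile(r' (?=[\u3099\u309A])')


def normalize_kana(text: str) -> str:
    """Hypothetical helper: NFKD, strip the stray space, then recompose with NFC."""
    decomposed = unicodedata.normalize('NFKD', text)
    joined = _SPACE_BEFORE_KANA_MARK.sub('', decomposed)
    return unicodedata.normalize('NFC', joined)


# Why the space is there: U+309B carries a compatibility decomposition
# to SPACE (U+0020) followed by the combining mark U+3099.
print(unicodedata.decomposition('\u309B'))

separate = '\u30B7\u309B'   # シ followed by the spacing voiced sound mark
precomposed = '\u30B8'      # ジ as a single code point
print(normalize_kana(separate) == normalize_kana(precomposed))
```

Applying the same function to the values from both tables before comparing them should make the two spellings match. Text that intentionally shows a sound mark in isolation after a space would lose that space, so this remains a trade-off rather than a general fix.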