python unicode hindi

I want to know how many Unicode code points make up one Hindi character.


I used Python:

import regex  # third-party module (pip install regex); \X matches one grapheme cluster

for m in regex.findall(r"\X", 'ल्लील्ली', regex.UNICODE):
    for i in m:
        print(i, i.encode('unicode-escape'))
    print('--------')

The results show ल्ली as 2 Hindi characters:

ल b'\\u0932'
् b'\\u094d'
--------
ल b'\\u0932'
ी b'\\u0940'
--------

That's wrong: ल्ली is actually one Hindi character. How can I find out how many Unicode code points make up a Hindi character such as ल्ली?

In short, I want to split 'कृपयाल्ली' into 'कृ', 'प', 'या', 'ल्ली'.


Solution

  • I'm not quite sure if this is correct, being Finnish and not well versed in Hindi, but this would merge characters with any subsequent Unicode Mark characters:

    import unicodedata
    
    
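    def merge_compose(s: str):
        """Yield lists of code points: a base character plus its trailing marks."""
        current = []
        for c in s:
            # A non-Mark character (category not M*) starts a new group.
            if current and not unicodedata.category(c).startswith("M"):
                yield current
                current = []
            current.append(c)
        # Emit the final group, if any.
        if current:
            yield current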
    
    
    for group in merge_compose("कृपयाल्ली"):
        print(group, len(group), "->", "".join(group))
    
    

    The output is

    ['क', 'ृ'] 2 -> कृ
    ['प'] 1 -> प
    ['य', 'ा'] 2 -> या
    ['ल', '्'] 2 -> ल्
    ['ल', 'ी'] 2 -> ली
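
The output above still splits ल्ली into ल् and ली, because the second ल is a letter, not a Mark. One possible refinement, sketched here under the assumption that a letter directly following the Devanagari virama (U+094D) belongs to the same conjunct, keeps such clusters together. The names `split_aksharas` and `VIRAMA` are made up for this illustration, and ZWJ/ZWNJ or other scripts are not handled, so treat it as a starting point rather than a complete Hindi segmenter:

```python
import unicodedata

VIRAMA = "\u094d"  # DEVANAGARI SIGN VIRAMA (assumption: Devanagari only)


def split_aksharas(s: str):
    """Yield clusters: base letter, trailing marks, and virama-joined letters."""
    current = ""
    for c in s:
        category = unicodedata.category(c)
        is_mark = category.startswith("M")
        # A letter right after a virama continues the conjunct (e.g. ल + ् + ल).
        joins_conjunct = current.endswith(VIRAMA) and category.startswith("L")
        if current and not is_mark and not joins_conjunct:
            yield current
            current = ""
        current += c
    if current:
        yield current


for cluster in split_aksharas("कृपयाल्ली"):
    # len() of a str counts Unicode code points.
    print(cluster, len(cluster))
```

For 'कृपयाल्ली' this yields 'कृ', 'प', 'या', 'ल्ली', with 2, 1, 2 and 4 code points respectively.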