I used Python with the third-party regex module:
import regex

for m in regex.findall(r"\X", 'ल्लील्ली', regex.UNICODE):
    for i in m:
        print(i, i.encode('unicode-escape'))
    print('--------')
The results show ल्ली being split into 2 Hindi characters:
ल b'\\u0932'
् b'\\u094d'
--------
ल b'\\u0932'
ी b'\\u0940'
--------
This is wrong: ल्ली is actually one Hindi character. How can I split a string into Hindi characters (such as ल्ली), regardless of how many Unicode code points each one is composed of?
In short, I want to split 'कृपयाल्ली' into 'कृ', 'प', 'या', 'ल्ली'.
I'm not quite sure if this is correct, being Finnish and not well versed in Hindi, but this would merge each character with any subsequent Unicode Mark characters (general categories starting with "M"):
import unicodedata

def merge_compose(s: str):
    current = []
    for c in s:
        # A non-mark character starts a new group, so flush the previous one.
        if current and not unicodedata.category(c).startswith("M"):
            yield current
            current = []
        current.append(c)
    # Emit the final group, if any.
    if current:
        yield current

for group in merge_compose("कृपयाल्ली"):
    print(group, len(group), "->", "".join(group))
The output is:
['क', 'ृ'] 2 -> कृ
['प'] 1 -> प
['य', 'ा'] 2 -> या
['ल', '्'] 2 -> ल्
['ल', 'ी'] 2 -> ली
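In that output ल्ली still comes out as two groups, because the virama (U+094D) is itself a Mark and the following ल is a letter. A minimal sketch of one possible extension, assuming it is enough to treat "virama followed by a letter" as part of the same cluster: the function name `split_aksharas` and the `VIRAMA` constant are hypothetical, and this is not a full implementation of Devanagari syllable rules (it ignores ZWJ/ZWNJ, for instance).

```python
import unicodedata

VIRAMA = "\u094d"  # DEVANAGARI SIGN VIRAMA (assumed to be the only joiner we need)

def split_aksharas(s: str) -> list[str]:
    """Group each base character with its marks and virama-joined consonants."""
    clusters: list[str] = []
    for c in s:
        category = unicodedata.category(c)
        is_mark = category.startswith("M")
        # A letter right after a virama continues the conjunct (e.g. ल + ् + ल).
        joins_conjunct = (
            bool(clusters)
            and clusters[-1].endswith(VIRAMA)
            and category.startswith("L")
        )
        if clusters and (is_mark or joins_conjunct):
            clusters[-1] += c
        else:
            clusters.append(c)
    return clusters

print(split_aksharas("कृपयाल्ली"))
```

For the string from the question this should give the four pieces asked for: 'कृ', 'प', 'या', 'ल्ली'. Whether a virama should always glue the next consonant is a linguistic judgment call, so treat the rule as an assumption to check against real text.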