Tags: python, string, python-3.x, utf, devanagari

Syllabification of Devanagari


I am trying to syllabify Devanagari words, for example:

धर्मक्षेत्रे -> धर् मक् षेत् रे (dharmakshetre -> dhar mak shet re)

I tried splitting on the '्' character:

wd.split('्')

I get the result:

['धर', 'मक', 'षेत', 'रे']

which is only partially correct.

I tried another word: कुरुक्षेत्रे -> कु रुक् षेत् रे (kurukshetre -> ku ruk shet re)

['कुरुक', 'षेत', 'रे']

The result is obviously wrong.

How do I extract the syllables effectively?


Solution

  • If you look at your strings character by character:

    >>> import re
    >>> data = "कुरुक्षेत्र"
    >>> re.findall(".", data)
    ['क', 'ु', 'र', 'ु', 'क', '्', 'ष', 'े', 'त', '्', 'र']
    

    And your other string:

    >>> data = "धर्मक्षेत्रे"
    >>> re.findall(".", data)
    ['ध', 'र', '्', 'म', 'क', '्', 'ष', 'े', 'त', '्', 'र', 'े']
    

    So what you probably want is to split around characters like '्' (the virama). Let's call them notation characters for now. If you print ord(data[2]) for the first notation character, you get 2381. Now probe around this value:

    >>> for i in range(2350, 2400):
    ...     print(i, chr(i))
    ...
    2350 म
    2351 य
    2352 र
    2353 ऱ
    2354 ल
    2355 ळ
    2356 ऴ
    2357 व
    2358 श
    2359 ष
    2360 स
    2361 ह
    2362 ऺ
    2363 ऻ
    2364 ़
    2365 ऽ
    2366 ा
    2367 ि
    2368 ी
    2369 ु
    2370 ू
    2371 ृ
    2372 ॄ
    2373 ॅ
    2374 ॆ
    2375 े
    2376 ै
    2377 ॉ
    2378 ॊ
    2379 ो
    2380 ौ
    2381 ्
    2382 ॎ
    2383 ॏ
    2384 ॐ
    2385 ॑
    2386 ॒
    2387 ॓
    2388 ॔
    2389 ॕ
    2390 ॖ
    2391 ॗ
    2392 क़
    2393 ख़
    2394 ग़
    2395 ज़
    2396 ड़
    2397 ढ़
    2398 फ़
    2399 य़
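
To see what each of these code points actually is, the standard unicodedata module can report its official name. This is only a small sketch; the helper name describe is made up for illustration and is not part of the original answer.

```python
import unicodedata

def describe(ch):
    """Return (code point, Unicode name) for a single character."""
    # The second argument is a fallback for unnamed code points.
    return ord(ch), unicodedata.name(ch, "<unnamed>")

# Inspect the first three characters of the second example word.
for ch in "धर्":
    print(describe(ch))

print(describe(chr(2381)))  # (2381, 'DEVANAGARI SIGN VIRAMA')
```

The names make it easier to tell letters apart from the signs that attach to them.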
    

    We are mostly interested in the values from 2362 to 2391, which cover the dependent vowel signs and the virama. So we build a string of those characters:

    >>> split = ""
    >>> for i in range(2362, 2392):
    ...     split += chr(i)
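
As a possible alternative to the hard-coded range, the same kind of string could be built from Unicode general categories. The sketch below assumes that "notation characters" means combining marks (categories Mn and Mc) in the Devanagari block; the names DEVANAGARI and marks are illustrative only. Unlike the range above, this also picks up signs below 2362 such as the anusvara (2306), and it leaves out letters such as ऽ (2365) and ॐ (2384).

```python
import unicodedata

# Code points of the Devanagari Unicode block, U+0900 to U+097F.
DEVANAGARI = range(0x0900, 0x0980)

# Keep only combining marks: Mn (nonspacing) and Mc (spacing combining).
marks = "".join(
    chr(cp)
    for cp in DEVANAGARI
    if unicodedata.category(chr(cp)) in ("Mn", "Mc")
)

print(len(marks), [hex(ord(c)) for c in marks[:5]])
```

Whether this broader set is what you want depends on the words you need to handle, so treat it as a starting point rather than a drop-in replacement.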
    

    Next we match every character together with an optional notation character that follows it:

    >>> re.findall(".[" + split + "]?", "धर्मक्षेत्रे")
    ['ध', 'र्', 'म', 'क्', 'षे', 'त्', 'रे']
    >>> re.findall(".[" + split + "]?", "कुरुक्षेत्र")
    ['कु', 'रु', 'क्', 'षे', 'त्', 'र']
    

    This should get you close to what you are probably looking for. If you need more complex handling, you will have to go with the link @OphirYoktan posted.
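
To get from these clusters to the split asked for in the question (धर् मक् षेत् रे), one possible follow-up step is to glue every cluster that ends in the virama onto the cluster before it. This is a minimal sketch under that assumption, not a full syllabifier; the names VIRAMA, SIGNS, CLUSTER and syllabify are hypothetical.

```python
import re

VIRAMA = chr(2381)  # '्', the first notation character found above
# Same character range as in the answer: code points 2362 to 2391.
SIGNS = "".join(chr(i) for i in range(2362, 2392))
# One character plus an optional trailing sign, as in the answer.
CLUSTER = re.compile(".[" + SIGNS + "]?")

def syllabify(word):
    """Group clusters so a virama-final consonant closes the previous syllable."""
    syllables = []
    for cluster in CLUSTER.findall(word):
        if cluster.endswith(VIRAMA) and syllables:
            # A vowel-less consonant: attach it to the preceding syllable.
            syllables[-1] += cluster
        else:
            syllables.append(cluster)
    return syllables

print(syllabify("धर्मक्षेत्रे"))  # ['धर्', 'मक्', 'षेत्', 'रे']
print(syllabify("कुरुक्षेत्र"))   # ['कु', 'रुक्', 'षेत्', 'र']
```

Because the pattern allows only one sign per character, sequences such as a nukta followed by a vowel sign, or signs outside the 2362 to 2391 range, would still need the more complete handling mentioned above.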