python-3.x, unicode, python-unicode

Unable to print Dependent vowels


I am reading a text file consisting of Bengali words, but I am unable to print the dependent vowels, as in KA, KI, etc.


Here is my sample code and its output:

import unicodedata

bengali_phoneme_maplist = {
    u'অ':'A',u'আ':'AA',u'ই':'I',u'ঈ':'II',u'উ':'U',u'ঊ ':'UU',u'ঋ ':'R',u'ঌ ':'L',u'এ ':'E',
    u'ঐ ':'AI',u'ও ':'O',u'ঔ ':'AU',u'ক':'KA',u'খ ':'KHA',u'গ ':'GA',u'ঘ':'GHA',u'ঙ ':'NGA',
    u'চ ':'CA',u'ছ':'CHA',u'জ ':'JA',u'ঝ':'JHA',u'ঞ':'NYA',u'ট ':'TTA',u'ঠ':'TTHA',
    u'ড ':'DDA',u'ঢ':'DDHA',u'ণ ':'NNA',u'ত ':'TA',u'ত ':'THA',u'দ':'DA',u'ধ':'DHA',
    u'ন':'NA',u'প':'PA',u'ফ':'PHA',u'ব':'BA',u'ভ':'BHA',u'ম ':'MA',u'য ':'YA',u'র':'RA',
    u'ল ':'LA',u'শ ':'SHA',u'ষ':'SSA',u'স ':'SA',u'হ':'ha',u' া ':'AAV',u' ি':'IV',
    u'ী':'IIV',u'ু':'UV',u'ূ':'UUV',u'ৃ':'RRV',u'ৄ ':'RR',u'ৄ':'EV',u' ৈ':'EV',u'়':'NUKTHA',
    u'ঽ':'AVAGRAHA'
}
bengali_phoneme_maplist_normalise = {
    unicodedata.normalize('NFKD', k): v for k, v in bengali_phoneme_maplist.items()
}

with open('bengali.txt', 'r') as infile:
    lines=infile.readlines()
    for index, line in enumerate(lines):
        print('Phonemes in line{0}.total{1} symbols'.format(index, len(line)))
        unknown=[]
        words=line.split()
        for word in words:
            print(word, ':', sep=' ', end='')
            for character in word:
                c = unicodedata.normalize('NFKD', character).casefold()
                try:
                    print(bengali_phoneme_maplist_normalise[c], sep='', end='')
                except KeyError:
                    print('_', sep='', end='')
                    if c not in unknown:
                        unknown.append(c)
            print()

if unknown:
    print('Unrecognised symbols: {0} (total {1} symbols)'.format(','.join(unknown), len(unknown) ) )

Sample input file bengali.txt:

text_000002 "শিল্পাঞ্চলে ঢোকার মুখে, স্ন্যাক্সবারে খাবার কিনছিলেন, বহুজাতিক তথ্যপ্রযুক্তি সংস্থার কর্মী, শুভময় বন্দ্যোপাধ্যায়

Sample output:

Phonemes in line0.total126 symbols
text_000002 :___________
"শিল্পাঞ্ :_____PA_NYA____
ঢোক :DDHA_KA_RA
মুখ, :_UV___
স্ন্যাক্সব :__NA___KA__BA_RA_
খাব :__BA_RA
কিনছি, :KA_NACHA___NA_
বহুজাত :BAhaUV____KA
তথ্যপ্রযুক্ত :____PA_RA_UVKA___
সংস্থ :______RA
কর্ম, :KARA__IIV_
শুভময় :_UVBHA__
বন্দ্যোপাধ্ :BANA_DA___PA_DHA____
Unrecognised symbols: t,e,x,_,0,2,",শ,ি,ল,্,া,চ,ে,,ো,ম,খ,,,স,য,জ, (total 25 symbols)

Solution

  • (Note that I know nothing about Bengali. :)

    There are a few problems in your code:

    1. There are many extra SPACE characters in the keys of bengali_phoneme_maplist. For example, u'ঊ ' should be u'ঊ'. Characters like u'া' also seem hard to type in a text editor, so I suggest writing them as Unicode escapes in the code, like '\u09be':'AAV'. (Actually, I'd suggest using '\uxxxx' escapes for all characters and putting the real characters in comments.)
    2. u'ত':'TA',u'ত':'THA' should be changed to u'ত':'TA',u'থ':'THA'. As written, the key is duplicated, so the second entry silently overwrites the first and থ has no mapping at all.
    3. The characters in bengali_phoneme_maplist are not complete. For example, there are no entries for ো, ৌ, ্ and ং.

    After fixing these errors you will get the correct result.
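
    Problems 1 and 2 are hard to spot by eye, so a small check can help. The following is only a sketch: audit_pairs and sample_pairs are hypothetical names, not part of the original code, and it assumes the mapping is written as a list of (key, value) pairs, because a dict literal would silently drop duplicate keys before they could be detected.

```python
import unicodedata

def audit_pairs(pairs):
    """Return (clean_mapping, problems) for a list of (key, value) pairs."""
    clean = {}
    problems = []
    for key, value in pairs:
        stripped = key.strip()  # removes stray spaces, keeps combining marks
        if stripped != key:
            problems.append('stray whitespace in {!r}'.format(key))
        if stripped in clean:
            problems.append('duplicate key {!r}: {} vs {}'.format(
                stripped, clean[stripped], value))
            continue  # keep the first mapping, report the clash
        clean[stripped] = value
    return clean, problems

# Escapes make invisible differences visible; the real character is in the comment.
sample_pairs = [
    ('\u098a ', 'UU'),   # ঊ followed by a stray space
    ('\u09a4 ', 'TA'),   # ত followed by a stray space
    ('\u09a4 ', 'THA'),  # ত again; this should have been \u09a5 (থ)
    ('\u09be', 'AAV'),   # া, the dependent vowel sign AA
]

clean, problems = audit_pairs(sample_pairs)
for problem in problems:
    print(problem)
for key in clean:
    # Print code point and Unicode name for every character in the key.
    print(' '.join('U+{:04X} {}'.format(ord(ch), unicodedata.name(ch, '?'))
                   for ch in key))
```

    Printing the Unicode name of each key is a cheap way to confirm that an entry really is the letter you meant to type.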


    File: foo.py

    import unicodedata
    
    bengali_phoneme_maplist = {
        u'অ':'A',u'আ':'AA',u'ই':'I',u'ঈ':'II',u'উ':'U',u'ঊ':'UU',u'ঋ':'R',u'ঌ':'L',u'এ':'E',
        u'ঐ':'AI',u'ও':'O',u'ঔ':'AU',u'ক':'KA',u'খ':'KHA',u'গ':'GA',u'ঘ':'GHA',u'ঙ':'NGA',
        u'চ':'CA',u'ছ':'CHA',u'জ':'JA',u'ঝ':'JHA',u'ঞ':'NYA',u'ট':'TTA',u'ঠ':'TTHA',
        u'ড':'DDA',u'ঢ':'DDHA',u'ণ':'NNA',u'ত':'TA',u'থ':'THA',u'দ':'DA',u'ধ':'DHA',
        u'ন':'NA',u'প':'PA',u'ফ':'PHA',u'ব':'BA',u'ভ':'BHA',u'ম':'MA',u'য':'YA',u'র':'RA',
        u'ল':'LA',u'শ':'SHA',u'ষ':'SSA',u'স':'SA',u'হ':'ha',u'া':'AAV',u'ি':'IV',
        u'ী':'IIV',u'ু':'UV',u'ূ':'UUV',u'ৃ':'RRV',
        u'ৄ':'RR',u'ৈ':'EV',u'়':'NUKTHA',u'ঽ':'AVAGRAHA',
        u'ো': 'O', u'ৌ': 'AU', u'্': 'VIRAMA', u'ে': 'E', u'ং': 'Anusvara', u'য়': 'Yya',
    }
    bengali_phoneme_maplist_normalise = {
        unicodedata.normalize('NFKD', k): v for k, v in bengali_phoneme_maplist.items()
    }
    
    with open('bengali.txt', 'r', encoding='utf-8') as infile:
        lines=infile.readlines()
        unknown=[]  # collected across all lines, reported once at the end
        for index, line in enumerate(lines):
            print('Phonemes in line{0}.total{1} symbols'.format(index, len(line)))
            words=line.split()
            for word in words:
                print(word, ':', sep=' ', end='')
                for character in word:
                    c = unicodedata.normalize('NFKD', character).casefold()
                    try:
                        print(bengali_phoneme_maplist_normalise[c], sep='', end='')
                    except KeyError:
                        print('_', sep='', end='')
                        if c not in unknown:
                            unknown.append(c)
                print()
    
    if unknown:
        print('Unrecognised symbols: {0} (total {1} symbols)'.format(','.join(unknown), len(unknown) ) )
    

    Output:

    $ python3 foo.py
    Phonemes in line0.total126 symbols
    text_000002 :___________
    "শিল্পাঞ্ :_SHAIVLAVIRAMAPAAAVNYAVIRAMACALAE
    ঢোক :DDHAOKAAAVRA
    মুখ, :MAUVKHAE_
    স্ন্যাক্সব :SAVIRAMANAVIRAMAYAAAVKAVIRAMASABAAAVRAE
    খাব :KHAAAVBAAAVRA
    কিনছি, :KAIVNACHAIVLAENA_
    বহুজাত :BAhaUVJAAAVTAIVKA
    তথ্যপ্রযুক্ত :TATHAVIRAMAYAPAVIRAMARAYAUVKAVIRAMATAIV
    সংস্থ :SAAnusvaraSAVIRAMATHAAAVRA
    কর্ম, :KARAVIRAMAMAIIV_
    শুভময় :SHAUVBHAMAYya
    বন্দ্যোপাধ্ :BANAVIRAMADAVIRAMAYAOPAAAVDHAVIRAMAYAAAVYya
    Unrecognised symbols: t,e,x,_,0,2,",, (total 8 symbols)
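
    To see why the dependent vowels need their own entries, it can help to look at the individual code points. The sketch below uses a hypothetical helper named describe (not part of the original code) to show that a syllable such as কি is stored as a consonant followed by a separate vowel sign, and that NFKD normalisation splits ো into two vowel signs, which is why the lookup table has to be normalised the same way as the input.

```python
import unicodedata

def describe(text):
    """Return one (code point, Unicode name) tuple per character in text."""
    return [('U+{:04X}'.format(ord(ch)), unicodedata.name(ch, 'UNKNOWN'))
            for ch in text]

word = '\u0995\u09bf'  # কি: the letter KA followed by the vowel sign I
for code_point, name in describe(word):
    print(code_point, name)
# U+0995 BENGALI LETTER KA
# U+09BF BENGALI VOWEL SIGN I

vowel_o = '\u09cb'  # ো, a single code point in composed form
decomposed = unicodedata.normalize('NFKD', vowel_o)
print([code_point for code_point, _ in describe(decomposed)])
# ['U+09C7', 'U+09BE']
```

    Because the loop in foo.py looks up one character at a time, every vowel sign, the virama and the anusvara must each appear as a key, otherwise they end up in the list of unrecognised symbols.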