pythonpyqtpyqt5tamil

How to Create a Tamil Phonetic keyboard inputs in python?


My target: create a Phonetic Keyboard in the Tamil language, using dictionary key mapping. My struggle: How to replace my keys with values and set that value to my textbox. For Example: If I press "K" in textbox1, then my textbox1.text will change into the Tamil letter "க்", if I press "Ku" then textbox1.text will be replaced by the Tamil letter "கு",, if I press "kuu" then textbox1.text will be replaced by the Tamil letter "கூ" And then If I press "m" then the Tamil letter "ம்" will be added to the previous letter "கூ" and now textbox1.text becomes "கூம்"

import sys

from PyQt5.QtGui import *
from PyQt5.QtWidgets import *

tamil_dict = {"a":{'a':'அ','aa':'ஆ'},
              "k":{'k':'க்','ka':'க','kaa':'கா','ki':'கி','kii':'கீ','ku':'கு','kuu':'கூ'},
              "m":{'m':'ம்','ma':'ம','maa':'மா','mi':'மி','mii':'மீ','mu':'மு','muu':'மூ'},
              "i":{"i":"இ"},
              "e":"ஈ "}


class Keyboard_Dict(QWidget):

    def __init__(self):
        super().__init__()
        self.setWindowTitle("Tamil InPut ")

        self.tbox1 = QLineEdit()
        self.tbox1.setFont(QFont('Arial Unicode MS', 10, QFont.Bold))
        self.tbox1.textChanged.connect(self.func_textbox_textchanged)

        self.tbox2 = QLineEdit()
        self.tbox2.setFont(QFont('Arial Unicode MS', 10, QFont.Bold))

        self.vbox = QVBoxLayout()
        self.vbox.addWidget(self.tbox1)
        self.vbox.addWidget(self.tbox2)
        self.setLayout(self.vbox)

        self.process_txt_temp_letter = ""
        self.process_txt_temp_position = 0
        self.process_letter_found = False
        self.process_letter_temp = False
        self.processed_text =""
    def func_textbox_textchanged(self,txt):
        self.txt_len = len(self.tbox1.text())
        if self.txt_len >= 1:
            self.process_txt_position = self.txt_len-1
            self.process_text_letter = (txt[self.process_txt_position])


            if self.process_letter_found == False:
                if (txt[self.txt_len-1]) in tamil_dict:
                    self.process_letter_found = True
                    self.process_txt_temp_position = (self.txt_len-1)
                    self.process_txt_temp_letter = (txt[self.txt_len-1])
                    self.process_letter_temp = True

            if self.process_letter_temp == True :
                if (txt[self.process_txt_temp_position:]) in tamil_dict[self.process_txt_temp_letter]:
                    self.processed_text =  tamil_dict[self.process_txt_temp_letter][txt[self.process_txt_temp_position:]]
                    print(self.processed_text)
                    # print("jjjjjjjjj",tamil_dict[self.process_txt_temp_letter][txt[self.process_txt_temp_position:]])

                elif (txt[self.process_txt_temp_position:]) not in tamil_dict[self.process_txt_temp_letter]:
                    self.process_txt_temp_position = self.txt_len - 1
                    self.process_txt_temp_letter = (txt[self.process_txt_temp_position])

                    if self.process_txt_temp_letter not in tamil_dict:
                        self.process_letter_temp = False
                        self.process_letter_found = False
                    else:
                        self.processed_text = tamil_dict[self.process_txt_temp_letter][txt[self.process_txt_temp_position:]]
                        print(self.processed_text)
                        # print("ffffff", tamil_dict[self.process_txt_temp_letter][txt[self.process_txt_temp_position:]])
        self.tbox2.setText(self.processed_text)





def main():
    app = QApplication(sys.argv)

    mainscreen = Keyboard_Dict()
    app.setStyle("Fusion")
    mainscreen.show()
    sys.exit(app.exec_())

if __name__ == '__main__':
    main()

update : 1 I Have a keys Maps as follows :

      *Main character/processed first character : "k"* 
        k     : "Apple"
        ka    : "Bannana"
        kaa   : "Citrus:
        ki    : "Orange"
        kii   : "Pine" 
        *Main character/processed first character : "M"* 
         m     : "is" 
         ma    : "more"
         maa   : "taste"
 *Main character/processed first character : "i"* 
         i     : "world"
         ii    : "earth"

tamil_dict ={"k":{"k":"apple","ka":"bannana","kaa":"citrus","ki":"orange","kii":"pine"},
             "m":{"m":"is","ma":"more","maa":"taste"},
             "i":{"i":"world","ii":"earth"}
            }


     

In my textbox.text, my first character is "A", not found in dictionary, so need not process that character and display as it is.

In my textbox.text, my second input word is "k", which is found in my dictionery and also "k" is the main Word. Now My processed word is "A"+"k", my textbox is replaced as follows, "A" + "Apple" = AApple.

In my textbox.text, my third charcter is "a", Now my processed word is "ka", its found in dict . so replaced with that equivalent value "Banana". Now My processed word is "A"+"ka", and now my textbox.text as follows : "A"+"banana" = Abanana.

In my textbox.text, my fourth charcter is "a", my processed word becomes "Kaa" its equvilient value is "citrus", now my processed word is "A"+"kaa" , its equvlient value is "A"+"citrus" = "Acitrus"

In my textbox.text my fifth input character is "m", now my processed word is "kaam",not found in dictionery. Now we split character "m" and check it in dict, whether its found or not, if Found, then we replace its equvilent value. Now My processed word is "A"+'kaa'+"m" and its vaule is "A"+"citurs"+"is", my textbox.text as follows "Acitrusis"

sixth input word is "a", now my processed character is "A"+"kaa"+"ma" equvilaent value is "A"+"citrus"+"more". Textbox.text become "Acitrusmore"

Now, if I press charcter "i" not found in "m" set. so we split "i" sepreately and check it in dict, if found, replace that value or leave as it is.

If 7th input character is " space bar" then process will end.

if my 8th input charcter is some english alphabet, once agin start process and replace equvilaent value and so on

update : 2 For more cleareance : enter image description here


Solution

  • It seems to me that all you need is this:

    import re
    
    [...]
    
    TAMIL_DICT = {
        'a':'அ', 'aa':'ஆ',
        'k':'க்', 'ka':'க', 'kaa':'கா', 'ki':'கி', 'kii':'கீ', 'ku':'கு', 'kuu':'கூ',
        'm':'ம்', 'ma':'ம', 'maa':'மா', 'mi':'மி', 'mii':'மீ', 'mu':'மு', 'muu':'மூ',
        'i':'இ',
        'e':'ஈ',  # [...]
    }
    TAMIL_KEYS = [re.escape(key) for key in sorted(TAMIL_DICT, key=len, reverse=True)]
    TAMIL_REGEX = re.compile('|'.join(TAMIL_KEYS))
    
    def convert_to_tamil(s):
        return TAMIL_REGEX.sub(lambda match: TAMIL_DICT[match.group(0)], s)
    
    [...]
    
        def func_textbox_textchanged(self, txt):
            self.tbox2.setText(convert_to_tamil(txt))
    

    Test:

    def test_convert_to_tamil():
        test_vectors = [
            ('', ''),
            ('k', 'க்'),
            ('ku', 'கு'),
            ('kuu', 'கூ'),
            ('kuum', 'கூம்'),
            ('VAkuum!', 'VAகூம்!'),
            ('!?', '!?'),
            ('ami', 'அமி'),
            ('aami', 'ஆமி'),
            ('kaaa.', 'காஅ.'),
            ('xka-aau', 'xக-ஆu')
        ]
        for v in test_vectors:
            c = convert_to_tamil(v[0])
            if c != v[1]:
                return f"'{v[0]}': expected '{v[1]}', found '{c}'"
        return "OK"
    
    print(test_convert_to_tamil())
    

    Test output:

    OK
    

    How it works: The regex scans the input string replacing as it goes. Since what it searches is sorted by decreasing length (key=len, reverse=True), it will always replace the longest match, and then continue scanning from the first character following it.

    To confirm that sorting by decreasing length is important, try to replace reverse=True with reverse=False. If you do, and run test_convert_to_tamil(), you will get this output:

    'ku': expected 'கு', found 'க்u'
    

    Please note that the format of TAMIL_DICT is different from the one of your tamil_dict.


    UPDATE

    The above was based on the understanding that you wanted to use two fields, like you do in your code. It's possible to modify the above code for one field, but at the cost of lots of head scratching caused by the handling of the cursor position, that must also play well with QT's cursor position logic. Anyway, here is an attempt:

    TAMIL_DICT = {
        'a':'அ', 'aa':'ஆ',
        'k':'க்', 'ka':'க', 'kaa':'கா', 'ki':'கி', 'kii':'கீ', 'ku':'கு', 'kuu':'கூ',
        'm':'ம்', 'ma':'ம', 'maa':'மா', 'mi':'மி', 'mii':'மீ', 'mu':'மு', 'muu':'மூ',
        'i':'இ',
        'e':'ஈ',  # [...]
    }
    LATIN_DICT = {val: key for key, val in TAMIL_DICT.items()}
    
    TAMIL_DICT_WITH_CURSOR = {}
    for latin, tamil in TAMIL_DICT.items():
        for i in range(1, len(latin)):
            TAMIL_DICT_WITH_CURSOR[f"{latin[:i]}\a{latin[i:]}"] = f"{tamil}\a"
    TAMIL_DICT.update(TAMIL_DICT_WITH_CURSOR)
    
    LATIN_DICT_WITH_CURSOR = {}
    for tamil, latin in LATIN_DICT.items():
        for i in range(1, len(tamil)):
            LATIN_DICT_WITH_CURSOR[f"{tamil[:i]}\a{tamil[i:]}"] = f"{latin}\a"
    LATIN_DICT.update(LATIN_DICT_WITH_CURSOR)
    
    TAMIL_KEYS = [re.escape(key) for key in sorted(TAMIL_DICT, key=len, reverse=True)]
    LATIN_KEYS = [re.escape(key) for key in sorted(LATIN_DICT, key=len, reverse=True)]
    TAMIL_REGEX = re.compile('|'.join(TAMIL_KEYS))
    LATIN_REGEX = re.compile('|'.join(LATIN_KEYS))
    
    def convert_to_tamil(s):
        return TAMIL_REGEX.sub(lambda match: TAMIL_DICT[match.group(0)], s)
    
    def convert_to_latin(s):
        return LATIN_REGEX.sub(lambda match: LATIN_DICT[match.group(0)], s)
    
    
    class Keyboard_Dict(QWidget):
    
        def __init__(self):
            super().__init__()
            self.setWindowTitle("Tamil Input")
            self.tbox1 = QLineEdit()
            self.tbox1.setFont(QFont('Arial Unicode MS', 10, QFont.Bold))
            self.tbox1.textEdited.connect(self.func_textbox_textedited)
            self.vbox = QVBoxLayout()
            self.vbox.addWidget(self.tbox1)
            self.setLayout(self.vbox)
    
        def func_textbox_textedited(self, mixed):
            at = self.tbox1.cursorPosition()
            mixed_a = f"{mixed[:at]}\a{mixed[at:]}"
            latin_a = convert_to_latin(mixed_a)
            tamil_a = convert_to_tamil(latin_a)
            tamil = tamil_a.replace('\a', '')
            if tamil != mixed:
                self.tbox1.setText(tamil)
                self.tbox1.setCursorPosition(tamil_a.find('\a'))
    

    Note the use of textEdited instead of textChanged. This is because textChanged would be called on setText().

    How it works: On every edit, the text is converted to Latin characters and back to Tamil script (as much as possible) again. The cursor position is temporarily marked in the text as an "alarm" or "bell" character, \a, in order to be able to position the cursor after the double conversion. Each conversion either leaves the \a where it is, or, if it is in the middle of a recognized group, moves it to the end of the converted group. It could also be moved to another position, but other choices seem to be more problematic, in particular because QT allows the user to delete a modifier, but not to move the cursor to a position between the modified character and the modifier.

    Care must be taken to build the mapping in a way that works "back and forth". Remember that at each edit step (which could also be the pasting of a string into the textbox!) the text, in the most common case consisting of all Tamil characters and one Latin character, will be converted to an intermediate representation in all Latin characters, and then back to Tamil characters. The conversion Latin→Tamil→Latin could, with certain mappings, not return to the initial representation, and this could cause problems (but not necessarily, since the intermediate full Latin representation is not visible to the user). See the comments for an example.


    ADDENDUM

    Looking at this table, pointed out by @Bala (N.B.: don't use it as it is!), there seems to be the need for entering digits as both ASCII and Tamil digits. There is a simple solution to this:

    TAMIL_DICT = {
        [...]
        '`0': '௦',
        '`1': '௧',
        '`2': '௨',
        [...]
        '`9': '௯',
        '`10': '௰',
        '`100': '௱',
        '`1000': '௲',
        [...]
    

    All other entries in the linked table containing backticks or digits should not be added to TAMIL_DICT. The effect will be: ASCII digits not preceded by ` will remain unconverted, ASCII digits preceded by ` will be converted to Tamil digits (returning to a `–ASCII digit sequence in the intermediate Latin representation), ` not followed by an ASCII digit will remain unconverted. The only downside is that it won't be possible to have a `–ASCII digit sequence as a final result (i.e., in the Tamil representation).

    This handling of digits can be improved by modifying the two regexes and the code that uses them, for the conversion of a single backtick followed by ASCII digits into a sequence of Tamil digits and vice versa (not forgetting the special cases of 10, 100, 1000, and possibly others). To produce ௧௨௩௪௫ you would type `12345 instead of `1`2`3`4`5.