python, nlp, text-segmentation, symspell

How to get the best merged result from symspellpy word segmentation of many languages in Python?


The following code uses SymSpell in Python; see the symspellpy guide on word_segmentation.

It uses "de-100k.txt" and "en-80k.txt" frequency dictionaries from a github repo, you need to save them in your working directory. As long as you do not want to use any SymSpell logic, you do not need to install and run this script to answer the question, take just the output of the two language's word segmentations and go on.

import pkg_resources
from symspellpy.symspellpy import SymSpell

input_term = "sonnenempfindlichkeitsunoil farbpalettesuncreme"

# German:
# Set max_dictionary_edit_distance to 0 to avoid spelling correction
sym_spell = SymSpell(max_dictionary_edit_distance=0, prefix_length=7)
dictionary_path = pkg_resources.resource_filename(
    "symspellpy", "de-100k.txt"
)
# term_index is the column of the term and count_index is the
# column of the term frequency
sym_spell.load_dictionary(dictionary_path, term_index=0, count_index=1)
result = sym_spell.word_segmentation(input_term)
print(f"{result.corrected_string}, {result.distance_sum}, {result.log_prob_sum}")

# English:
# Reset the sym_spell object
sym_spell = SymSpell(max_dictionary_edit_distance=0, prefix_length=7)
dictionary_path = pkg_resources.resource_filename(
    "symspellpy", "en-80k.txt"
)
sym_spell.load_dictionary(dictionary_path, term_index=0, count_index=1)
result = sym_spell.word_segmentation(input_term)
print(f"{result.corrected_string}, {result.distance_sum}, {result.log_prob_sum}")

Out:

sonnen empfindlichkeit s uno i l farb palette sun creme, 8, -61.741842760725255
sonnen empfindlichkeit sun oil farb palette sun creme, 6, -45.923471400632884

The aim is to find the most relevant words by some logic: most frequent n-gram neighbours and/or word frequency, longest word, and the like. The choice of logic is up to you.

In this example with two languages, the two outputs need to be compared so that only the best segments are kept and the rest are dropped, without overlapping parts of words. In the result, each letter is used exactly once (see the check sketched after the next paragraph).

If there are spaces between words in the input_term, these words should not be joined into a new segment. For example, if the input contains 'cr eme' with a wrong space in it, that must still not become 'creme'. It is simply more likely that an existing space is correct than that joining neighbouring letters across it would fix an error.
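
As an illustration, a minimal check (a hypothetical helper, not part of symspellpy) that a merged word list uses every letter of the input exactly once and never joins letters across an original space could look like this:

def is_valid_merge(input_term, words):
    """Check that `words` reuse every letter of `input_term` exactly once,
    in order, and never join letters across an original space."""
    # The concatenated words must equal the input with all spaces removed.
    if "".join(words) != input_term.replace(" ", ""):
        return False
    # Positions of the original spaces, counted in letters only.
    space_breaks, pos = set(), 0
    for chunk in input_term.split(" "):
        pos += len(chunk)
        space_breaks.add(pos)
    # Word boundaries of the merged result, counted the same way.
    word_breaks, pos = set(), 0
    for word in words:
        pos += len(word)
        word_breaks.add(pos)
    # Every original space must coincide with a word boundary.
    return space_breaks <= word_breaks

print(is_valid_merge(
    "sonnenempfindlichkeitsunoil farbpalettesuncreme",
    ['sonnen', 'empfindlichkeit', 'sun', 'oil', 'farb', 'palette', 'sun', 'creme']))
# True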

array(['sonnen', 'empfindlichkeit', 'sun', 'oil', 'farb', 'palette', 'sun', 'creme'])
array([['DE'], ['DE'], ['EN'], ['EN'], ['DE'], ['DE', 'EN'], ['EN'], ['DE', 'EN']])

The 'DE'/'EN' tags are just an optional idea to show whether a word exists in German, English, or both; you could also prefer 'EN' over 'DE' in this example. The language tags are a bonus; you can also answer without them.
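
(If you do want the tags, one way is to keep one SymSpell object per language and check each loaded vocabulary; a small sketch, assuming sym_spell_de and sym_spell_en were each loaded with a single dictionary:)

def lang_tags(word, sym_spell_de, sym_spell_en):
    # `words` is symspellpy's term -> frequency mapping of the loaded dictionary.
    tags = []
    if word in sym_spell_de.words:
        tags.append('DE')
    if word in sym_spell_en.words:
        tags.append('EN')
    return tags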

There is probably a fast solution that uses NumPy arrays and/or dictionaries instead of lists or DataFrames, but choose whatever you like.

How can many languages be used in SymSpell word segmentation and combined into one chosen merged result? The aim is a sentence of words built from all letters, using each letter exactly once and keeping all original spaces.


Solution

  • SymSpell way

    This is the recommended way; I found it only after doing the manual way. You can use the same frequency logic that works for one language for two or more languages: just load two or more dictionaries into the same sym_spell object!

    import pkg_resources
    from symspellpy.symspellpy import SymSpell
    
    input_term = "sonnenempfindlichkeitsunoil farbpalettesuncreme"
    
    # Set max_dictionary_edit_distance to 0 to avoid spelling correction
    sym_spell = SymSpell(max_dictionary_edit_distance=0, prefix_length=7)
    dictionary_path = pkg_resources.resource_filename(
        "symspellpy", "de-100k.txt"
    )
    
    # term_index is the column of the term and count_index is the
    # column of the term frequency
    sym_spell.load_dictionary(dictionary_path, term_index=0, count_index=1)
    
    result = sym_spell.word_segmentation(input_term)
    print(f"{result.corrected_string}, {result.distance_sum}, {result.log_prob_sum}")
    
    # Do NOT reset the sym_spell object here, so that the English entries
    # are added on top of the already loaded German frequency dictionary.
    # In other words, do NOT repeat:
    #   sym_spell = SymSpell(max_dictionary_edit_distance=0, prefix_length=7)
    dictionary_path = pkg_resources.resource_filename(
        "symspellpy", "en-80k.txt"
    )
    sym_spell.load_dictionary(dictionary_path, term_index=0, count_index=1)
    result = sym_spell.word_segmentation(input_term)
    print(f"{result.corrected_string}, {result.distance_sum}, {result.log_prob_sum}")
    

    Out:

    sonnen empfindlichkeit s uno i l farb palette sun creme, 8, -61.741842760725255
    sonnen empfindlichkeit sun oil farb palette sun creme, 6, -45.923471400632884
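
    The same pattern extends to further languages: keep calling load_dictionary on the same object before segmenting. A minimal sketch (the file name "fr-100k.txt" is only a hypothetical example of another frequency list in the same format):

    # Hypothetical third dictionary; any SymSpell-format frequency list works.
    sym_spell.load_dictionary("fr-100k.txt", term_index=0, count_index=1)
    result = sym_spell.word_segmentation(input_term)
    print(result.corrected_string)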
    

  • Manual way

    In this manual way, the logic is: the longer word of the two languages wins, and the winning language tag is logged. If both words have the same length, both language tags are logged.

    As in the question, segmenting input_term = "sonnenempfindlichkeitsunoil farbpalettesuncreme" with a freshly reset SymSpell object per language leads to s1 for German and s2 for English.

    import numpy as np
    
    s1 = 'sonnen empfindlichkeit s uno i l farb palette sun creme'
    s2 = 'son ne ne mp find li ch k e it sun oil far b palette sun creme'
    
    num_letters = len(s1.replace(' ',''))
    list_w1 = s1.split()
    list_w2 = s2.split()
    list_w1_len = [len(x) for x in list_w1]
    list_w2_len = [len(x) for x in list_w2]
    
    # Each tuple: (word, length, word_idx, lang,
    #              letter_idx_with_spaces, letter_idx_no_spaces)
    lst_de = [(x[0], x[1], x[2], 'de', x[3], x[4])
              for x in zip(list_w1, list_w1_len, range(len(list_w1)),
                           np.cumsum([0] + [len(x)+1 for x in list_w1][:-1]),
                           np.cumsum([0] + [len(x) for x in list_w1][:-1]))]
    lst_en = [(x[0], x[1], x[2], 'en', x[3], x[4])
              for x in zip(list_w2, list_w2_len, range(len(list_w2)),
                           np.cumsum([0] + [len(x)+1 for x in list_w2][:-1]),
                           np.cumsum([0] + [len(x) for x in list_w2][:-1]))]
    
    idx_word_de = 0
    idx_word_en = 0
    lst_words = []
    idx_letter = 0
    
    # stop at num_letters-1, else you check the last word 
    # also on the last idx_letter and get it twice
    while idx_letter <= num_letters-1:
        # advance each language's word pointer to the word covering idx_letter
        while lst_de[idx_word_de][5] < idx_letter:
            idx_word_de += 1
        while lst_en[idx_word_en][5] < idx_letter:
            idx_word_en += 1

        if lst_de[idx_word_de][1] > lst_en[idx_word_en][1]:
            # the German word is longer -> German wins
            lst_word_stats = lst_de[idx_word_de]
            str_word = lst_word_stats[0]
            idx_letter += len(str_word)
        elif lst_de[idx_word_de][1] == lst_en[idx_word_en][1]:
            # same length -> keep the word once and log both languages
            lst_word_stats = (lst_de[idx_word_de][0], lst_de[idx_word_de][1],
                              (lst_de[idx_word_de][2], lst_en[idx_word_en][2]),
                              (lst_de[idx_word_de][3], lst_en[idx_word_en][3]),
                              (lst_de[idx_word_de][4], lst_en[idx_word_en][4]),
                              lst_de[idx_word_de][5])
            str_word = lst_word_stats[0]
            idx_letter += len(str_word)
        else:
            # the English word is longer -> English wins
            lst_word_stats = lst_en[idx_word_en]
            str_word = lst_word_stats[0]
            idx_letter += len(str_word)
        lst_words.append(lst_word_stats)
    

    Out lst_words:

    [('sonnen', 6, 0, 'de', 0, 0),
     ('empfindlichkeit', 15, 1, 'de', 7, 6),
     ('sun', 3, 10, 'en', 31, 21),
     ('oil', 3, 11, 'en', 35, 24),
     ('farb', 4, 6, 'de', 33, 27),
     ('palette', 7, (7, 14), ('de', 'en'), (38, 45), 31),
     ('sun', 3, (8, 15), ('de', 'en'), (46, 53), 38),
     ('creme', 5, (9, 16), ('de', 'en'), (50, 57), 41)]
    

    Legend of the output:

    chosen word | len | word_idx_of_lang | lang | letter_idx_lang_with_spaces | letter_idx_no_spaces
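
    If you also want the word and tag arrays from the question, a small follow-up sketch (not part of the original answer) can post-process lst_words:

    # Pull out the chosen words and uppercase language tags per word.
    chosen_words = [w[0] for w in lst_words]
    chosen_tags = [[t.upper() for t in w[3]] if isinstance(w[3], tuple)
                   else [w[3].upper()] for w in lst_words]
    print(chosen_words)
    # ['sonnen', 'empfindlichkeit', 'sun', 'oil', 'farb', 'palette', 'sun', 'creme']
    print(chosen_tags)
    # [['DE'], ['DE'], ['EN'], ['EN'], ['DE'], ['DE', 'EN'], ['DE', 'EN'], ['DE', 'EN']]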