The following code uses SymSpell in Python; see the symspellpy guide on word_segmentation.
It uses the "de-100k.txt" and "en-80k.txt" frequency dictionaries from a GitHub repo; you need to save them in your working directory. If you do not want to run any SymSpell logic yourself, you do not need to install or run this script to answer the question: just take the output of the two languages' word segmentations below and go on from there.
from symspellpy.symspellpy import SymSpell

input_term = "sonnenempfindlichkeitsunoil farbpalettesuncreme"

# German:
# Set max_dictionary_edit_distance to 0 to avoid spelling correction
sym_spell = SymSpell(max_dictionary_edit_distance=0, prefix_length=7)
# the frequency dictionary file is expected in the working directory
dictionary_path = "de-100k.txt"
# term_index is the column of the term and count_index is the
# column of the term frequency
sym_spell.load_dictionary(dictionary_path, term_index=0, count_index=1)
result = sym_spell.word_segmentation(input_term)
print(f"{result.corrected_string}, {result.distance_sum}, {result.log_prob_sum}")
# English:
# Reset the sym_spell object
sym_spell = SymSpell(max_dictionary_edit_distance=0, prefix_length=7)
dictionary_path = "en-80k.txt"
sym_spell.load_dictionary(dictionary_path, term_index=0, count_index=1)
result = sym_spell.word_segmentation(input_term)
print(f"{result.corrected_string}, {result.distance_sum}, {result.log_prob_sum}")
Out:
sonnen empfindlichkeit s uno i l farb palette sun creme, 8, -61.741842760725255
sonnen empfindlichkeit sun oil farb palette sun creme, 6, -45.923471400632884
The aim is to find the most relevant words by some logic: most frequent n-gram neighbours and/or word frequency, longest word, and the like. The choice of logic is free.
In this example with two languages, the two outputs need to be compared so that only the best segments are kept and the rest is dropped, without overlapping parts of words. In the outcome, each letter is used exactly once.
If there are spaces between words in the input_term, these words should not be joined into a new segment. For example, 'cr eme' with a wrong space in it should still not be allowed to become 'creme'. A given space is more likely to be correct than the errors that would arise from joining neighbouring letters across it.
The desired outcome looks like this:
array(['sonnen', 'empfindlichkeit', 'sun', 'oil', 'farb', 'palette', 'sun', 'creme'])
array([['DE'], ['DE'], ['EN'], ['EN'], ['DE'], ['DE', 'EN'], ['EN'], ['DE', 'EN']])
The 'DE'/'EN' tag is just an optional idea to show that a word exists in German and/or English; you could also choose 'EN' over 'DE' in this example. The language tags are a bonus, you can also answer without them.
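One way to enforce the space rule above, just as an illustration and not part of the code in this question, is to segment each space-separated chunk on its own and rejoin the results:

# Sketch: keep original spaces by segmenting each chunk separately,
# so letters on different sides of an existing space can never be joined.
# Assumes sym_spell has been loaded with a dictionary as in the code above.
chunks = input_term.split(' ')
segmented = ' '.join(
    sym_spell.word_segmentation(chunk).corrected_string for chunk in chunks
)
print(segmented)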
There is probably a fast solution that uses numpy arrays and/or dictionaries instead of lists or DataFrames, but choose as you like.
How can I use many languages in SymSpell word segmentation and combine them into one chosen merger? The aim is a sentence of words built from all letters, using each letter exactly once and keeping all original spaces.
This is the recommended way; I found it only after doing the manual way below. You can easily apply the same frequency logic that is used for one language to two or more languages: just load two or more language dictionaries into the same sym_spell object.
from symspellpy.symspellpy import SymSpell

input_term = "sonnenempfindlichkeitsunoil farbpalettesuncreme"

# Set max_dictionary_edit_distance to 0 to avoid spelling correction
sym_spell = SymSpell(max_dictionary_edit_distance=0, prefix_length=7)
# the frequency dictionary files are expected in the working directory
dictionary_path = "de-100k.txt"
# term_index is the column of the term and count_index is the
# column of the term frequency
sym_spell.load_dictionary(dictionary_path, term_index=0, count_index=1)
result = sym_spell.word_segmentation(input_term)
print(f"{result.corrected_string}, {result.distance_sum}, {result.log_prob_sum}")
# Do NOT reset the sym_spell object here, so that the English entries
# are added on top of the already loaded German frequency dictionary.
# In other words, do NOT repeat:
# sym_spell = SymSpell(max_dictionary_edit_distance=0, prefix_length=7)
dictionary_path = "en-80k.txt"
sym_spell.load_dictionary(dictionary_path, term_index=0, count_index=1)
result = sym_spell.word_segmentation(input_term)
print(f"{result.corrected_string}, {result.distance_sum}, {result.log_prob_sum}")
Out:
sonnen empfindlichkeit s uno i l farb palette sun creme, 8, -61.741842760725255
sonnen empfindlichkeit sun oil farb palette sun creme, 6, -45.923471400632884
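If you also want the optional language tags from the question, one possible sketch (my addition, not part of the combined approach itself) is to keep one extra SymSpell object per language and use lookup() with edit distance 0 as a membership test:

from symspellpy.symspellpy import SymSpell, Verbosity

sym_de = SymSpell(max_dictionary_edit_distance=0, prefix_length=7)
sym_de.load_dictionary("de-100k.txt", term_index=0, count_index=1)
sym_en = SymSpell(max_dictionary_edit_distance=0, prefix_length=7)
sym_en.load_dictionary("en-80k.txt", term_index=0, count_index=1)

def language_tags(word):
    # a word gets a tag for every language whose dictionary contains it
    tags = []
    if sym_de.lookup(word, Verbosity.TOP, max_edit_distance=0):
        tags.append('DE')
    if sym_en.lookup(word, Verbosity.TOP, max_edit_distance=0):
        tags.append('EN')
    return tags

print([(word, language_tags(word)) for word in result.corrected_string.split()])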
The manual way's logic is: the longer word of the two languages wins, and the winning language tag is logged. If both words have the same length, both languages are logged.
As in the question, input_term = "sonnenempfindlichkeitsunoil farbpalettesuncreme", segmented with a freshly reset SymSpell object per language, leads to s1 for German and s2 for English.
import numpy as np

s1 = 'sonnen empfindlichkeit s uno i l farb palette sun creme'
s2 = 'son ne ne mp find li ch k e it sun oil far b palette sun creme'

num_letters = len(s1.replace(' ', ''))
list_w1 = s1.split()
list_w2 = s2.split()
list_w1_len = [len(x) for x in list_w1]
list_w2_len = [len(x) for x in list_w2]
# one tuple per word:
# (word, length, word_idx_of_lang, lang, letter_idx_with_spaces, letter_idx_no_spaces)
lst_de = [(x[0], x[1], x[2], 'de', x[3], x[4]) for x in zip(list_w1, list_w1_len, range(len(list_w1)), np.cumsum([0] + [len(x)+1 for x in list_w1][:-1]), np.cumsum([0] + [len(x) for x in list_w1][:-1]))]
lst_en = [(x[0], x[1], x[2], 'en', x[3], x[4]) for x in zip(list_w2, list_w2_len, range(len(list_w2)), np.cumsum([0] + [len(x)+1 for x in list_w2][:-1]), np.cumsum([0] + [len(x) for x in list_w2][:-1]))]
idx_word_de = 0
idx_word_en = 0
lst_words = []
idx_letter = 0
# stop at num_letters-1, else you check the last word
# also on the last idx_letter and get it twice
while idx_letter <= num_letters - 1:
    # print(lst_de[idx_word_de][5], idx_letter)  # debug
    # advance each language to the word that starts at the current letter
    while lst_de[idx_word_de][5] < idx_letter:
        idx_word_de += 1
    while lst_en[idx_word_en][5] < idx_letter:
        idx_word_en += 1
    if lst_de[idx_word_de][1] > lst_en[idx_word_en][1]:
        # German word is longer -> German wins
        lst_word_stats = lst_de[idx_word_de]
        str_word = lst_word_stats[0]
        # print('de:', lst_de[idx_word_de])
        idx_letter += len(str_word)
    elif lst_de[idx_word_de][1] == lst_en[idx_word_en][1]:
        # same length -> log both languages
        lst_word_stats = (lst_de[idx_word_de][0], lst_de[idx_word_de][1], (lst_de[idx_word_de][2], lst_en[idx_word_en][2]), (lst_de[idx_word_de][3], lst_en[idx_word_en][3]), (lst_de[idx_word_de][4], lst_en[idx_word_en][4]), lst_de[idx_word_de][5])
        str_word = lst_word_stats[0]
        # print('de:', lst_de[idx_word_de], 'en:', lst_en[idx_word_en])
        idx_letter += len(str_word)
    else:
        # English word is longer -> English wins
        lst_word_stats = lst_en[idx_word_en]
        str_word = lst_word_stats[0]
        # print('en:', lst_en[idx_word_en][0])
        idx_letter += len(str_word)
    lst_words.append(lst_word_stats)
Out lst_words:
[('sonnen', 6, 0, 'de', 0, 0),
('empfindlichkeit', 15, 1, 'de', 7, 6),
('sun', 3, 10, 'en', 31, 21),
('oil', 3, 11, 'en', 35, 24),
('farb', 4, 6, 'de', 33, 27),
('palette', 7, (7, 14), ('de', 'en'), (38, 45), 31),
('sun', 3, (8, 15), ('de', 'en'), (46, 53), 38),
('creme', 5, (9, 16), ('de', 'en'), (50, 57), 41)]
Legend of the output:
chosen word | len | word_idx_of_lang | lang | letter_idx_lang_with_spaces | letter_idx_no_spaces
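To turn lst_words into the two arrays asked for in the question, a small post-processing step could look like this (a sketch, not part of the algorithm above; the tags come out lowercase here):

import numpy as np

words = np.array([w[0] for w in lst_words])
# field 3 is either a single lang string or a tuple of langs
langs = [list(w[3]) if isinstance(w[3], tuple) else [w[3]] for w in lst_words]
print(words)
# ['sonnen' 'empfindlichkeit' 'sun' 'oil' 'farb' 'palette' 'sun' 'creme']
print(langs)
# [['de'], ['de'], ['en'], ['en'], ['de'], ['de', 'en'], ['en'], ['de', 'en']]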