I have a huge dictionary/dataframe of German words and how often each of them appeared in a large text corpus. For example:
der 23245
die 23599
das 23959
eine 22000
dass 18095
Buch 15988
Büchern 1000
Arbeitsplatz-Management 949
Arbeitsplatz-Versicherung 800
Since words like "Buch" (book) and "Büchern" (books, in a different declension form) have essentially the same meaning, I want to add up their frequencies. The same goes for the articles "der, die, das", but not for the last two words, which have completely different meanings even though they are built from the same word.
I tried the Levenshtein distance, which is "the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other." But I get a larger Levenshtein distance between "Buch" and "Büchern" than between "das" and "dass", which have completely different meanings:
import enchant

string1 = "das"
string2 = "dass"
string3 = "Buch"
string4 = "Büchern"

# minimum number of single-character edits between each pair
print(enchant.utils.levenshtein(string1, string2))
print(enchant.utils.levenshtein(string3, string4))
>>>> 1
>>>> 4
Is there any other way to cluster such words efficiently?
First, Buch and Büchern are pretty simple, as they are just different morphological forms of the same word. For both of them there is only one entry in the dictionary, called a lemma. As it happens, der, die and das are also just different forms of the lemma der. So we just need to count the dictionary forms of the words (the lemmas). spaCy has an easy way to access the lemma of a word, for example:
import spacy
from collections import Counter

# with spaCy v3+, load the full model name instead, e.g. spacy.load('de_core_news_sm')
nlp = spacy.load('de')

words = ['der', 'die', 'das', 'eine', 'dass', 'Buch', 'Büchern',
         'Arbeitsplatz-Management', 'Arbeitsplatz-Versicherung']

# look up the lemma (dictionary form) of each word and count the lemmas
lemmas = [nlp(word)[0].lemma_ for word in words]
counter = Counter(lemmas)
which results in the following counter:
Counter({'der': 3, 'einen': 1, 'dass': 1, 'Buch': 2, 'Arbeitsplatz-Management': 1, 'Arbeitsplatz-Versicherung': 1})
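To get the summed corpus frequencies you asked for (rather than just counting how many surface forms map to each lemma), you can group the original counts by lemma. Here is a minimal sketch; word_counts is a hypothetical dict re-creating the sample numbers from your question, and the exact lemma strings depend on the German model you load:

import spacy
from collections import defaultdict

nlp = spacy.load('de')  # with spaCy v3+, use e.g. spacy.load('de_core_news_sm')

# word -> corpus frequency, re-creating the sample from the question
word_counts = {
    'der': 23245, 'die': 23599, 'das': 23959, 'eine': 22000, 'dass': 18095,
    'Buch': 15988, 'Büchern': 1000,
    'Arbeitsplatz-Management': 949, 'Arbeitsplatz-Versicherung': 800,
}

# sum the frequencies of all surface forms that share a lemma
lemma_counts = defaultdict(int)
for word, count in word_counts.items():
    lemma = nlp(word)[0].lemma_
    lemma_counts[lemma] += count

print(dict(lemma_counts))

With the model used above, der, die and das collapse onto one lemma and Buch and Büchern onto another, so their frequencies are added together, while the two Arbeitsplatz compounds keep separate entries.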