python nlp lemmatization

Failed lemmatization


I'm trying to lemmatize German texts stored in a DataFrame. I use the german-preprocessing library, which handles German-specific grammatical structure: https://github.com/jfilter/german-preprocessing

My code:

from german import preprocess

df = pd.read_csv('Afd.csv', sep=',')

Lemma = open('MessageAFD_lemma.txt', 'w')
for i in df['message']:
    preprocess (i, remove_stop=True)
    Lemma.write(i)
Lemma.close()

The lemmatization appears to run successfully and the terminal shows no errors, but when I open the file "MessageAFD_lemma.txt", it contains the original text: nothing was lemmatized.

The expected result is like:

Input:

preprocess(['Johpannes war einer von vielen guten Schülern.', 'Julia trinkt gern Tee.'], remove_stop=True)

Output: ['johannes gut schüler', 'julia trinken tee']

What goes wrong?


Solution

  • The preprocess function returns a processed copy of the texts instead of modifying its input. So you need to write the result of preprocess to the file, not the original message i.

    Furthermore, preprocess accepts a list of texts to process, so you must wrap each message as [message] and extract the single result from the returned list with result, = ...

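    import pandas as pd
    from german import preprocess
    
    df = pd.read_csv('Afd.csv', sep=',')
    
    # Process one message at a time.
    with open('MessageAFD_lemma.txt', 'w') as lemma_file:
        for message in df['message']:
            # preprocess takes a list and returns a list; unpack the single result
            result, = preprocess([message], remove_stop=True)
            # newline added so the results do not run together in the file
            lemma_file.write(result + '\n')
    
    # Or, to process all messages in one go:
    with open('MessageAFD_lemma.txt', 'w') as f:
        for result in preprocess(df['message'].tolist(), remove_stop=True):
            f.write(result + '\n')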
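
The underlying pitfall does not depend on the library: Python strings are immutable, so a function can only hand back a new string, and discarding that return value leaves the original untouched. Here is a minimal sketch of this, using a hypothetical stand-in called fake_preprocess (it is not part of german-preprocessing and only lowercases and strips a trailing period) so that it runs without the library installed:

```python
def fake_preprocess(texts):
    """Hypothetical stand-in for preprocess: takes a list of strings
    and returns a new list, leaving the inputs unchanged."""
    return [t.lower().rstrip('.') for t in texts]


message = 'Julia trinkt gern Tee.'

# Bug from the question: the return value is thrown away,
# so `message` still holds the original text.
fake_preprocess([message])
unchanged = message

# Fix: keep the return value and unpack the single result from the list.
result, = fake_preprocess([message])

print(unchanged)
print(result)
```

The trailing comma in result, = ... unpacks a one-element list and raises a ValueError if the list has any other length, which makes it a cheap sanity check.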