python · pandas · nlp · spacy · lemmatization

With spaCy, how can I get all lemmas from a string?


I have a pandas data frame with a column of text values (documents). I want to lemmatize these values with the spaCy library, using the pandas apply function. I've defined a to_lemma function that iterates through the words in a document and concatenates the corresponding lemmas into an output string; however, this is very slow. Is there a faster way to extract the lemmatized form of a document in spaCy?

import spacy

# nlp has to be a loaded spaCy pipeline, e.g. the large English model
nlp = spacy.load("en_core_web_lg")

def to_lemma(text):
    tp = nlp(text)
    line = ""
    for word in tp:
        # concatenate the lemma of every token
        line = line + word.lemma_ + " "
    return line

Solution

  • There are many ways to speed up spaCy processing. Which of them make sense for you depends mostly on the size of your input.

    1. The most obvious one is not to apply the model to every single row individually, but to use batch processing instead: pass an iterable of strings to nlp.pipe(). This also means it is easier not to use apply at all.
    2. Disable components that you do not use. For token-level processing where you only need the lemmas, these are 'parser' (the dependency parser) and 'ner' (the named entity recognition component); you can check which components remain active with the snippet after this list.
    3. Increase the batch_size (the number of texts to buffer) in pipe(). The default is 1000; raising it only makes sense if you have the memory to increase it substantially.
    4. Increase the number of processes with n_process. This increases the time it takes to initially load the model but decreases the processing time. In my experience this starts to pay off at around 500k+ texts. Note that multiprocessing also requires the code to be run inside an if __name__ == '__main__': guard.
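
    To see what is actually left running after 2. (and what the lemmatizer still depends on), you can inspect nlp.pipe_names after loading; the exact output depends on your spaCy version and model:

    import spacy

    nlp = spacy.load('en_core_web_lg', disable=['parser', 'ner'])
    # for a v3 en_core_web_lg pipeline this prints something like
    # ['tok2vec', 'tagger', 'attribute_ruler', 'lemmatizer']
    print(nlp.pipe_names)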

    Basic example with 1. and 2.:

    import spacy

    # 2.: load the model without the components we do not need
    nlp = spacy.load('en_core_web_lg', disable=['parser', 'ner'])

    texts = df["column_name"]
    lemmas = []
    # 1.: batch-process all texts with nlp.pipe() instead of calling nlp() per row
    for processed_doc in nlp.pipe(texts):
        lemmas.append(" ".join([token.lemma_ for token in processed_doc]))
    df["column_name_lemmas"] = lemmas
    

    Advanced example for all four:

    import spacy

    # 4.: multiprocessing requires the __main__ guard
    if __name__ == '__main__':
        texts = df["column_name"]
        nlp = spacy.load('en_core_web_lg', disable=['parser', 'ner'])
        lemmas = []
        # 3. and 4.: larger batches and several worker processes
        for processed_doc in nlp.pipe(texts, batch_size=10000, n_process=4):
            lemmas.append(" ".join([token.lemma_ for token in processed_doc]))
        df["column_name_lemmas"] = lemmas