python · nlp · spacy · lemmatization

Lemmatization taking forever with spaCy


I'm trying to lemmatize chat logs stored in a dataframe using spaCy. My code is:

import spacy

nlp = spacy.load("es_core_news_sm")
df["text_lemma"] = df["text"].apply(lambda row: " ".join([w.lemma_ for w in nlp(row)]))

I have approximately 600,000 rows, and the apply takes more than two hours to execute. Is there a faster package/way to lemmatize? (I need a solution that works for Spanish.)

So far I have only tried the spaCy package.


Solution

  • The slowdown comes from calling the spaCy pipeline separately for every row via nlp(). The faster way to process large texts is to process them as a stream using the nlp.pipe() command. When I tested this on 5000 rows of dummy text, it offered a ~3.874x improvement in speed (~9.759 s vs ~2.519 s) over the original method. There are ways to improve this further if required; see this checklist for spaCy optimisation I made, and the minimal sketch after the timing code below.

    Solution

    # Assume the dataframe (df) already contains a column "text" with the raw text
    
    import spacy
    
    # Load spaCy pipeline
    nlp = spacy.load("es_core_news_sm")
    
    # Process large text as a stream via `nlp.pipe()` and iterate over the results, extracting lemmas
    lemma_text_list = []
    for doc in nlp.pipe(df["text"]):
        lemma_text_list.append(" ".join(token.lemma_ for token in doc))
    df["text_lemma"] = lemma_text_list
    

    Full code for testing timings

    import spacy
    import pandas as pd
    import time
    
    # Random Spanish sentences
    rand_es_sentences = [
        "Tus drafts influirán en la puntuación de las cartas según tu número de puntos DCI.",
        "Información facilitada por la División de Conferencias de la OMI en los cuestionarios enviados por la DCI.",
        "Oleg me ha dicho que tenías que decirme algo.",
        "Era como tú, muy buena con los ordenadores.",
        "Mas David tomó la fortaleza de Sion, que es la ciudad de David."]
    
    # Duplicate the 5 sentences 1000 times to create 5000 rows
    es_text = [sent for i in range(1000) for sent in rand_es_sentences]
    # Create data-frame
    df = pd.DataFrame({"text": es_text})
    # Load spaCy pipeline
    nlp = spacy.load("es_core_news_sm")
    
    
    # Original method (very slow due to multiple calls to `nlp()`)
    t0 = time.time()
    df["text_lemma_1"] = df["text"].apply(lambda row: " ".join([w.lemma_ for w in nlp(row)]))
    t1 = time.time()
    print("Total time: {}".format(t1-t0))  # ~9.759 seconds on 5000 rows
    
    
    # Faster method processing rows as stream via `nlp.pipe()`
    t0 = time.time()
    lemma_text_list = []
    for doc in nlp.pipe(df["text"]):
        lemma_text_list.append(" ".join(token.lemma_ for token in doc))
    df["text_lemma_2"] = lemma_text_list
    t1 = time.time()
    print("Total time: {}".format(t1-t0))  # ~2.519 seconds on 5000 rows