I'm trying to lemmatize chat records in a dataframe using spaCy. My code is:
nlp = spacy.load("es_core_news_sm")
df["text_lemma"] = df["text"].apply(lambda row: " ".join([w.lemma_ for w in nlp(row)]))
I have approximately 600,000 rows, and the apply takes more than two hours to execute. Is there a faster package/way to lemmatize? (I need a solution that works for Spanish.)
So far I have only tried the spaCy package.
The slow-down comes from the repeated calls to the spaCy pipeline via nlp(). The faster way to handle large amounts of text is to process it as a stream using nlp.pipe(). When I tested this on 5,000 rows of dummy text, it gave a ~3.874x speed-up (~9.759 s vs. ~2.519 s) over the original method. There are ways to improve this further if required; see the checklist for spaCy optimisation I made, and the hedged sketches after the code examples below.
# Assume dataframe (df) already contains a column "text" with the raw text
import spacy

# Load spaCy pipeline
nlp = spacy.load("es_core_news_sm")

# Process the texts as a stream via `nlp.pipe()` and iterate over the results, extracting lemmas
lemma_text_list = []
for doc in nlp.pipe(df["text"]):
    lemma_text_list.append(" ".join(token.lemma_ for token in doc))
df["text_lemma"] = lemma_text_list
# Timing comparison on 5,000 rows of dummy Spanish text
import spacy
import pandas as pd
import time
# Random Spanish sentences
rand_es_sentences = [
"Tus drafts influirán en la puntuación de las cartas según tu número de puntos DCI.",
"Información facilitada por la División de Conferencias de la OMI en los cuestionarios enviados por la DCI.",
"Oleg me ha dicho que tenías que decirme algo.",
"Era como tú, muy buena con los ordenadores.",
"Mas David tomó la fortaleza de Sion, que es la ciudad de David."]
# Duplicate the sentences to create 5,000 rows of dummy text
es_text = [sent for i in range(1000) for sent in rand_es_sentences]
# Create data-frame
df = pd.DataFrame({"text": es_text})
# Load spaCy pipeline
nlp = spacy.load("es_core_news_sm")
# Original method (very slow due to multiple calls to `nlp()`)
t0 = time.time()
df["text_lemma_1"] = df["text"].apply(lambda row: " ".join([w.lemma_ for w in nlp(row)]))
t1 = time.time()
print("Total time: {}".format(t1-t0)) # ~9.759 seconds on 5000 rows
# Faster method processing rows as stream via `nlp.pipe()`
t0 = time.time()
lemma_text_list = []
for doc in nlp.pipe(df["text"]):
lemma_text_list.append(" ".join(token.lemma_ for token in doc))
df["text_lemma_2"] = lemma_text_list
t1 = time.time()
print("Total time: {}".format(t1-t0)) # ~2.519 seconds on 5000 rows