I have a pandas DataFrame with a column of text values (documents). I want to lemmatize these values with the spaCy library using the pandas apply function. I've defined a to_lemma function that iterates over the tokens of a document and concatenates the corresponding lemmas into an output string, but this is very slow. Is there a faster way to extract the lemmatized form of a document in spaCy?
def to_lemma(text):
    tp = nlp(text)
    line = ""
    for word in tp:
        line = line + word.lemma_ + " "
    return line
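For context, the function is applied row by row, roughly like this (the column names and the model are placeholders, not part of my actual setup):

import spacy

nlp = spacy.load('en_core_web_lg')  # assumed model; any spaCy pipeline
df["column_name_lemmas"] = df["column_name"].apply(to_lemma)  # one nlp() call per row, hence slow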
There are many ways to speed up spaCy processing. Which of them make sense for you depends mostly on the size of your input.

1. Process the texts with nlp.pipe() on an Iterable of strings. This means it is easier to not use apply.
2. Disable the pipeline components you don't need for lemmatization, e.g. 'parser' (the dependency parser) and 'ner' (the Named Entity Recognition component).
3. Increase batch_size (objects to buffer) in pipe(). The default is 1000. Obviously this only makes sense to touch if you have the memory to increase it a lot.
4. Use multiprocessing via n_process. This will increase the time it takes to initially load the model but decrease the processing time. In my experience this starts making sense at about 500k+ texts. Note that this also requires the code to be run in an if __name__ == '__main__': wrapper.

Basic example with 1. and 2.:
texts = df["column_name"]
nlp = spacy.load('en_core_web_lg', disable=['parser', 'ner'])
lemmas = []
for processed_doc in nlp.pipe(texts):
lemmas.append(" ".join([token.lemma_ for token in processed_doc]))
df["column_name_lemmas"] = lemmas
Advanced example for all four:
if __name__ == '__main__':
    texts = df["column_name"]
    nlp = spacy.load('en_core_web_lg', disable=['parser', 'ner'])

    lemmas = []
    for processed_doc in nlp.pipe(texts, batch_size=10000, n_process=4):
        lemmas.append(" ".join([token.lemma_ for token in processed_doc]))
    df["column_name_lemmas"] = lemmas