python dataframe optimization series lemmatization

How to speed up the lemmatization of a Series in a pandas DataFrame


I have this line that lemmatizes a Series of a pandas DataFrame in Python.

res = serie.parallel_apply(lambda x: ' '.join([d.lemma_ for d in self.nlp_spacy(x)]))

The DataFrame has 200,000 rows, and other processing is done on this Series as well. All of these steps are very slow. Is there a way to speed up this specific step?

I heard that vectorized operations are faster with DataFrames. Is there a way to do that here? The apply method is also very slow because it goes through every value. How can I avoid using it?


Solution

  • I found a way to speed this up. In my case, I only use spaCy to lemmatize a DataFrame, so I don't need every component of the spaCy pipeline, as explained here: https://spacy.io/usage/processing-pipelines#disabling.

    I disabled all the components I don't need. But in French, lemmatization needs these components: "tagger", "parser", "lemmatizer".

    So I changed this line, self.nlp_spacy = spacy.load(spacy_model), to this: self.nlp_spacy = spacy.load(spacy_model, disable=["parser","tok2vec","ner"])

    The processing is now about 2 to 4 times faster.
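
For the part of the question about avoiding apply, here is a minimal sketch of one possible approach, not something measured in this answer: load the model with the disable list from above, then stream the texts through spaCy's nlp.pipe, which processes them in batches instead of calling the pipeline once per row. The helper names load_light_pipeline and lemmatize_texts and the batch size of 256 are hypothetical choices for illustration. Whether a given model still produces correct lemmas with these components disabled is an assumption to check on a few rows first (nlp.pipe_names shows what is still active).

```python
from typing import Iterable, List, Sequence


def load_light_pipeline(model_name: str,
                        disabled: Sequence[str] = ("parser", "tok2vec", "ner")):
    """Load a spaCy model without the components listed in `disabled`.

    The default list is the one used in the answer above; whether a given
    model still lemmatizes correctly without them is an assumption to verify.
    """
    import spacy  # imported lazily so the helper below has no hard dependency

    return spacy.load(model_name, disable=list(disabled))


def lemmatize_texts(nlp, texts: Iterable[str], batch_size: int = 256) -> List[str]:
    """Return one space-joined string of lemmas per input text.

    nlp.pipe processes the texts in batches rather than one call per row.
    """
    return [
        " ".join(token.lemma_ for token in doc)
        for doc in nlp.pipe(texts, batch_size=batch_size)
    ]
```

With the names from the question, this could be called as res = pd.Series(lemmatize_texts(self.nlp_spacy, serie.tolist()), index=serie.index), assuming every value in serie is a string. nlp.pipe also accepts an n_process argument to spread the work over several processes, which may make parallel_apply unnecessary for this step.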