I have this line that lemmatizes a pandas Series in Python:
res = serie.parallel_apply(lambda x: ' '.join([d.lemma_ for d in self.nlp_spacy(x)]))
I have 200,000 rows of data in this DataFrame, and other treatments are applied to this Series as well. All of these treatments are very slow. Is there a way to speed up this specific one?
I have heard that vectorized operations are faster with DataFrames. Is there a way to do something like that here? The apply method is also very slow because it runs the function row by row in Python. How can I avoid using it?
I found a way to speed this up. In my case, I am only using spaCy to lemmatize, so I don't need all the components of the pipeline, as explained here: https://spacy.io/usage/processing-pipelines#disabling.
I disabled every component I don't need. Note that for French, the lemmatizer depends on other components (the tagger/morphologizer in my model), so those have to stay enabled.
So I changed this line:
self.nlp_spacy = spacy.load(spacy_model)
to this :
self.nlp_spacy = spacy.load(spacy_model, disable=["parser", "tok2vec", "ner"])
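To check which components are still going to run, you can inspect nlp.pipe_names after loading. A quick sanity check (fr_core_news_sm is just a stand-in here, since in my code the model name comes from the spacy_model variable):

import spacy

# Stand-in model name; substitute whatever `spacy_model` points to.
nlp = spacy.load("fr_core_news_sm", disable=["parser", "tok2vec", "ner"])

# Prints only the components that will actually run,
# e.g. the morphologizer/tagger, attribute_ruler and lemmatizer.
print(nlp.pipe_names)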
The treatment is now about 2 to 4 times faster.
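On top of that, spaCy's own batching can replace the per-row apply entirely: nlp.pipe processes the texts in batches instead of invoking the pipeline once per row. A minimal sketch, assuming the same disabled components and fr_core_news_sm as a stand-in for spacy_model:

import spacy
import pandas as pd

nlp = spacy.load("fr_core_news_sm", disable=["parser", "tok2vec", "ner"])

serie = pd.Series([
    "Les chats mangent les souris.",
    "Nous avons marché longtemps.",
])

# nlp.pipe batches the documents internally, so the Python-level
# overhead of calling the pipeline for each row disappears.
lemmas = [
    " ".join(tok.lemma_ for tok in doc)
    for doc in nlp.pipe(serie, batch_size=256)
]
res = pd.Series(lemmas, index=serie.index)

For very large Series, nlp.pipe also accepts an n_process argument to spread the work across cores, similar to what parallel_apply was doing.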