I made the following function to clean the text notes of my dataset:
import spacy

nlp = spacy.load("en")

def clean(text):
    """
    Text preprocessing for English text.
    """
    # Apply the spaCy pipeline to the text
    doc = nlp(text)
    # Lemmatize and remove noise (stop words, punctuation and digits)
    tokens = [token.lemma_.strip() for token in doc
              if not token.is_stop
              and not nlp.vocab[token.lemma_].is_stop  # also drop stop-word lemmas
              and not token.is_punct  # remove punctuation
              and not token.is_digit  # remove digits
              ]
    # Rebuild the text from the kept tokens
    text = " ".join(tokens)
    return text.lower()
The problem is that when I want to clean my whole dataset, it takes hours and hours (my dataset has about 70k rows, with 100 to 5,000 words per row).
I tried to use swifter to run the apply method on multiple threads, like this: data.note_line_comment.swifter.apply(clean). But it didn't really help, as it still took almost an hour.
I was wondering if there is any way to write a vectorized form of my function, or maybe another way to speed up the process. Any ideas?
Short answer
This type of problem inherently takes time.
Long answer
The more information about the strings you need to make a decision, the longer it will take.
The good news is, if your cleaning of the text is relatively simple, a few regular expressions might do the trick.
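For instance, here is a minimal sketch of a regex-based cleaner, assuming plain ASCII English text and that the goal is only to drop digits, punctuation, and stop words. The stop-word list is borrowed from spaCy, and there is no lemmatization, so it is not a drop-in replacement for your function:

import re
from spacy.lang.en.stop_words import STOP_WORDS

def clean_fast(text):
    # Replace anything that is not a letter or whitespace, then lowercase
    text = re.sub(r"[^A-Za-z\s]", " ", text).lower()
    # Drop stop words (note: no lemmatization, unlike the spaCy version)
    return " ".join(w for w in text.split() if w not in STOP_WORDS)

Because this never builds a Doc object, it avoids the whole pipeline and should be orders of magnitude faster, at the cost of cruder output.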
Otherwise you are using the spaCy pipeline to help remove bits of text, which is costly since it does many things by default: tokenization, lemmatization, part-of-speech tagging, dependency parsing, and named entity recognition.
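You can check which components your loaded pipeline actually runs (the exact names may vary by model version):

import spacy

nlp = spacy.load("en")
# List the components that run on every call to nlp(text)
print(nlp.pipe_names)  # e.g. ['tagger', 'parser', 'ner']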
Alternatively, you can try your task again with the parts of the spaCy pipeline you don't need turned off, which may speed it up quite a bit. For example, turn off named entity recognition, tagging, and dependency parsing:
nlp = spacy.load("en", disable=["parser", "tagger", "ner"])
Then try again; it should run noticeably faster.
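As a rough sketch of how that could be applied to your whole column, assuming the DataFrame and column are named data and note_line_comment as in your question: nlp.pipe streams texts through the pipeline in batches, which is much faster than calling nlp(text) once per row. Note that with the tagger disabled, spaCy falls back to simpler lookup-based lemmas:

import spacy

nlp = spacy.load("en", disable=["parser", "tagger", "ner"])

def clean_docs(texts):
    # Batch the texts through the trimmed pipeline instead of
    # calling nlp(text) row by row
    for doc in nlp.pipe(texts, batch_size=100):
        tokens = [token.lemma_.strip() for token in doc
                  if not token.is_stop
                  and not token.is_punct
                  and not token.is_digit]
        yield " ".join(tokens).lower()

data["note_line_comment_clean"] = list(clean_docs(data.note_line_comment))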