python, multithreading, pandas, spacy, swifter

Vectorized form of cleaning function for NLP


I made the following function to clean the text notes of my dataset:

import spacy

nlp = spacy.load("en")

def clean(text):
    """
    Text preprocessing for English text
    """
    # Apply spaCy to the text
    doc = nlp(text)
    # Lemmatization and removal of noise (stopwords, digits, punctuation and single characters)
    tokens = [token.lemma_.strip() for token in doc if
              not token.is_stop and not nlp.vocab[token.lemma_].is_stop  # Remove stopwords
              and not token.is_punct  # Remove punctuation
              and not token.is_digit  # Remove digits
             ]
    # Recreate the text from the kept tokens
    text = " ".join(tokens)

    return text.lower()

The problem is that when I want to clean all the text in my dataset, it takes hours and hours (my dataset has 70k rows, with between 100 and 5,000 words per row).

I tried using swifter to run the apply method on multiple threads, like this: data.note_line_comment.swifter.apply(clean)

But it didn't really make things better, as it still took almost an hour.

I was wondering if there is any way to make a vectorized form of my function, or maybe another way to speed up the process. Any ideas?


Solution

  • Short answer

    This type of problem inherently takes time.

    Long answer

    The more information about the strings you need to make a decision, the longer it will take.

    The good news is that if your text cleaning is relatively simple, a few regular expressions might do the trick.
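
    For instance, a minimal regex-based sketch might look like the following. This assumes the goal is only dropping digits, punctuation, single characters and a fixed stopword list; the STOPWORDS set below is purely illustrative, and lemmatization is not something regexes can do:

    import re

    # Illustrative stopword list, not spaCy's full set
    STOPWORDS = {"the", "a", "an", "and", "or", "is", "are", "to", "of", "in"}

    def clean_regex(text):
        # Lowercase, then replace anything that is not a letter or whitespace
        text = re.sub(r"[^a-z\s]", " ", text.lower())
        # Keep tokens longer than one character that are not stopwords
        tokens = [t for t in text.split() if len(t) > 1 and t not in STOPWORDS]
        return " ".join(tokens)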

    Otherwise, you are using the spaCy pipeline to help remove bits of text, which is costly since it does many things by default:

    1. Tokenisation
    2. Lemmatisation
    3. Dependency parsing
    4. NER
    5. Chunking
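
    You can check which components your loaded model actually runs via its pipeline names (the exact names vary by model and spaCy version):

    print(nlp.pipe_names)
    # e.g. ['tagger', 'parser', 'ner'] for a default English model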

    Alternatively, you can try your task again with the parts of the spaCy pipeline you don't need turned off, which may speed it up quite a bit.

    For example, maybe turn off named entity recognition, tagging and dependency parsing...

    nlp = spacy.load("en", disable=["parser", "tagger", "ner"])
    

    Then try again; it should speed up.
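
    As a rough usage sketch, reusing the clean function and the note_line_comment column from the question (the result name cleaned is just for illustration, and this assumes the lemmas you get with those components disabled are acceptable for your use case):

    import spacy

    # Reload with the heavy components switched off, as suggested above
    nlp = spacy.load("en", disable=["parser", "tagger", "ner"])

    # Same call as in the question, now backed by the slimmer pipeline
    cleaned = data.note_line_comment.swifter.apply(clean)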