python, nlp, cosine-similarity, stop-words, tfidfvectorizer

Can stop phrases be removed while doing text processing in Python?


The task I'm working on involves finding the cosine similarity, using TF-IDF, between a base transcript and other sample transcripts.

I am removing stop words for this. But I would also like to remove certain stop phrases that are unique to the sample transcripts.

For example, I would like to retain the individual words 'sounds' and 'like', but remove the phrase 'sounds like' when the two occur together.

I am currently using sklearn's TfidfVectorizer. Is there an efficient way to do the above?


Solution

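  • Yes, you can achieve this by defining a function, custom_preprocessor, that removes the stop phrases, and passing it to the TfidfVectorizer constructor through the preprocessor argument. A custom preprocessor replaces the vectorizer's default preprocessing (lowercasing and accent stripping), so lowercase the text inside the function. Here stop_phrases is a list you define yourself.

    from sklearn.feature_extraction.text import TfidfVectorizer

    stop_phrases = ['sounds like']  # phrases to drop; the single words are kept

    def custom_preprocessor(text):
        text = text.lower()  # the default lowercasing is bypassed, so do it here
        for stop_phrase in stop_phrases:
            text = text.replace(stop_phrase, '')
        return text
    vectorizer = TfidfVectorizer(preprocessor=custom_preprocessor)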
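
One caveat with plain str.replace is that it matches substrings, so 'sounds like' would also be cut out of 'sounds likely'. A possible variant, sketched below using only the standard re module, matches whole phrases at word boundaries instead. The names STOP_PHRASES, build_stop_phrase_pattern and remove_stop_phrases are made up for this illustration, and the phrase list is a placeholder, not something taken from the question.

```python
import re

# Hypothetical example phrases; replace with the ones from your transcripts.
# Assumes the list is non-empty and each phrase starts and ends with a word character.
STOP_PHRASES = ["sounds like", "you know"]

def build_stop_phrase_pattern(phrases):
    # Try longer phrases first so a longer phrase wins over a shorter one it contains.
    ordered = sorted(phrases, key=len, reverse=True)
    # Allow any run of whitespace between the words of a phrase.
    alternatives = [r"\s+".join(re.escape(w) for w in p.split()) for p in ordered]
    # \b keeps "sounds like" from matching inside "sounds likely".
    return re.compile(r"\b(?:" + "|".join(alternatives) + r")\b", re.IGNORECASE)

STOP_PHRASE_RE = build_stop_phrase_pattern(STOP_PHRASES)

def remove_stop_phrases(text):
    # Lowercase explicitly, since a custom preprocessor bypasses the default lowercasing.
    text = STOP_PHRASE_RE.sub(" ", text.lower())
    # Collapse the whitespace left behind by removed phrases.
    return " ".join(text.split())
```

This function can be passed as preprocessor=remove_stop_phrases in the same way as custom_preprocessor. Since the vectorizer applies stop_words after tokenization, the usual single-word stop list should still work alongside it, while 'sounds' and 'like' on their own are kept unless that list contains them.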