[SOLVED] Can stop phrases be removed while doing text processing in Python?

The task I'm working on involves finding the cosine similarity, using TF-IDF, between a base transcript and other sample transcripts.

I am removing stop words for this, but I would also like to remove certain stop phrases that are unique to the sample transcripts.

For example, I would like to keep the words 'sounds' and 'like' when they appear on their own, but remove the phrase 'sounds like' when the two occur together.

I am currently using sklearn's TfidfVectorizer. Is there an efficient way to do the above?

Solution

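Yes. Define a function, custom_preprocessor, that removes the stop phrases, and pass it to the TfidfVectorizer constructor through the preprocessor argument. Note that supplying your own preprocessor overrides the default preprocessing stage (lowercasing and accent stripping), so the function should lowercase the text itself if you rely on that. Tokenizing and stop-word removal still run afterwards on whatever the function returns.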

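from sklearn.feature_extraction.text import TfidfVectorizer

stop_phrases = ['sounds like']  # example list; fill in your own phrases

def custom_preprocessor(text):
    text = text.lower()  # a custom preprocessor replaces the default lowercasing
    for stop_phrase in stop_phrases:
        text = text.replace(stop_phrase, '')  # plain substring match
    return text
vectorizer = TfidfVectorizer(preprocessor=custom_preprocessor)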
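One caveat with plain str.replace is that it matches substrings, so 'sounds like' would also be cut out of 'sounds likely'. A minimal sketch of a word-boundary variant follows, using only the standard re module. The names STOP_PHRASES and remove_stop_phrases, and the second phrase in the list, are made up for illustration and are not part of sklearn.

```python
import re

# Hypothetical stop phrases for illustration; replace with your own list.
STOP_PHRASES = ["sounds like", "you know"]

# Try longer phrases first so a long phrase wins over a shorter one it contains.
_alternatives = "|".join(
    re.escape(p) for p in sorted(STOP_PHRASES, key=len, reverse=True)
)
# \b anchors keep the match to whole words only.
_pattern = re.compile(r"\b(?:" + _alternatives + r")\b")


def remove_stop_phrases(text):
    """Lowercase the text and delete whole-phrase matches only."""
    text = _pattern.sub(" ", text.lower())
    # Collapse the extra whitespace left behind by removed phrases.
    return " ".join(text.split())


print(remove_stop_phrases("That Sounds like a plan"))  # that a plan
print(remove_stop_phrases("It sounds likely"))         # it sounds likely
```

This function should work as a drop-in for the preprocessor argument, i.e. TfidfVectorizer(preprocessor=remove_stop_phrases). Compiling all phrases into one pattern also means each document is scanned once rather than once per phrase, which may help if the phrase list is long.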