The task I'm working on involves computing the cosine similarity, using TF-IDF, between a base transcript and other sample transcripts.
I am removing stop words for this, but I would also like to remove certain stop phrases that are unique to the sample transcripts.
For example, I would like to retain words like 'sounds' and 'like', but remove the phrase 'sounds like' when the two words occur together.
I am currently using sklearn's TfidfVectorizer. Is there an efficient way to do the above?
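Yes, you can achieve this by defining a function, custom_preprocessor, that removes the stop phrases, and passing it to the TfidfVectorizer constructor via the preprocessor argument. Note that a custom preprocessor replaces the default preprocessing step (lowercasing and accent stripping), so lowercase the text yourself if you need that.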
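from sklearn.feature_extraction.text import TfidfVectorizer

stop_phrases = ['sounds like']  # example list; replace with your own phrases

def custom_preprocessor(text):
    text = text.lower()  # a custom preprocessor replaces the default lowercasing
    for stop_phrase in stop_phrases:
        text = text.replace(stop_phrase, '')
    return text

vectorizer = TfidfVectorizer(preprocessor=custom_preprocessor)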
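One caveat with plain str.replace is that it also matches inside longer words (for example, 'sounds like' inside 'sounds likely'). A possible variation, sketched below, compiles the phrases into a single regular expression with word boundaries and then computes the cosine similarity against the base transcript. The names STOP_PHRASES and remove_stop_phrases, the second stop phrase, and the sample transcripts are all made up for illustration; they are not from the original question.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical stop phrases, for illustration only.
STOP_PHRASES = ["sounds like", "you know"]

# Longest phrases first so a longer phrase wins over a shorter overlapping one;
# \b keeps "sounds like" from matching inside "sounds likely".
_PATTERN = re.compile(
    r"\b(?:"
    + "|".join(re.escape(p) for p in sorted(STOP_PHRASES, key=len, reverse=True))
    + r")\b",
    flags=re.IGNORECASE,
)


def remove_stop_phrases(text):
    # Lowercase explicitly: a custom preprocessor replaces the default one.
    text = text.lower()
    # Replace each phrase with a space so neighbouring words do not merge.
    return _PATTERN.sub(" ", text)


# Made-up transcripts, for illustration only.
base = "That sounds like a good plan"
samples = ["Sounds like a plan to me", "The engine sounds odd"]

vectorizer = TfidfVectorizer(preprocessor=remove_stop_phrases)
tfidf = vectorizer.fit_transform([base] + samples)

# Row 0 is the base transcript; compare it against every sample.
similarities = cosine_similarity(tfidf[0], tfidf[1:])
```

Here 'sounds' survives as a feature because it appears on its own in the last sample, while 'like' only ever appears inside the stop phrase and so never reaches the vocabulary. Since the pattern is compiled once, the per-document cost should stay close to a single pass over the text, though that is worth measuring on your own data.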