In a dataframe, I have rows which include sentences like "machine learning, data, ia, segmentation, analysis" or "big data, data lake, data visualisation, marketing, seo".
I want to use TF-IDF and k-means to create clusters based on each term.
My problem is that TfidfVectorizer tokenizes the sentences incorrectly: I get terms like "analyse analyse" or "english excel", words that are not supposed to be put together.
Instead, I would like the sentences to be tokenized on the commas, so the terms would be "analyse", "big data", "data lake", "english", etc.
I guess I should change something in the TfidfVectorizer parameters, but I don't understand what.
Does anyone have an idea how to achieve this?
Use the Keras library to tokenize the sentences in the dataframe. Before tokenizing, remove the punctuation from the dataset, then apply TfidfVectorizer.
I am attaching the link, check it.
Check the example code; it helps with tokenizing the sentences.
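Note that stripping all punctuation, as suggested above, would also remove the commas the asker relies on to delimit terms. A minimal sketch of punctuation removal that keeps the commas (this adaptation is my assumption, based on the question's requirement):

```python
import string

# characters to strip: all punctuation except the comma, which
# delimits the terms in these sentences (plain string.punctuation
# would remove the commas too)
to_strip = string.punctuation.replace(",", "")
table = str.maketrans("", "", to_strip)

sentence = "big data, data lake, data visualisation, marketing, seo!"
cleaned = sentence.translate(table)
tokens = [t.strip() for t in cleaned.split(",")]
print(tokens)
# ['big data', 'data lake', 'data visualisation', 'marketing', 'seo']
```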