Tags: python, tokenize, tfidfvectorizer

Tokenize sentence based on existing punctuation (TF-IDF vectorizer)


In a dataframe, I have rows which include sentences like "machine learning, data, ia, segmentation, analysis" or "big data, data lake, data visualisation, marketing, seo".

I want to use TF-IDF and k-means to create clusters based on each term.

My problem is that when I use TfidfVectorizer, it tokenizes the sentences incorrectly: I get terms like "analyse analyse" or "english excel", words which are not supposed to be put together.
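For instance, with bigrams enabled (the ngram_range below is only an assumption, but it is one way to reproduce pairs like these), the default word tokenization builds terms across the comma boundaries:

    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = [
        "machine learning, data, ia, segmentation, analysis",
        "big data, data lake, data visualisation, marketing, seo",
    ]

    # The default token pattern splits on whitespace and punctuation,
    # so bigrams pair words across the commas:
    vec = TfidfVectorizer(ngram_range=(1, 2))
    vec.fit(docs)
    print(vec.get_feature_names_out())
    # ...includes spurious terms such as 'data data' and 'ia segmentation'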

Instead, I would like the sentences to be tokenized on the commas, so the terms would be "analyse", "big data", "data lake", "english", etc.

I guess I should change something in the TfidfVectorizer parameters, but I don't understand what.

Does anyone have an idea how to achieve this?


Solution

  • Use the Keras library to tokenize the sentences in the dataframe. Before tokenization, remove the punctuation from the dataset in the dataframe, then use TfidfVectorizer.

    I am attaching the link, check it:

    Keras

    Check the example code; it shows how the sentences can be tokenized. A sketch of the same idea follows below.
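    A minimal sketch of that route, assuming the goal is to split the terms on the commas as the question asks: the Keras helper text_to_word_sequence (from keras.preprocessing.text; also available under tensorflow.keras.preprocessing.text) does the tokenization, and its output is handed to TfidfVectorizer through its tokenizer parameter. The sample documents and the cluster count are illustrative.

        from keras.preprocessing.text import text_to_word_sequence
        from sklearn.cluster import KMeans
        from sklearn.feature_extraction.text import TfidfVectorizer

        docs = [
            "machine learning, data, ia, segmentation, analysis",
            "big data, data lake, data visualisation, marketing, seo",
        ]

        def comma_tokenize(text):
            # Keras tokenization: lowercase the text and split it on commas
            # instead of whitespace; filters="" leaves multi-word terms intact.
            return [t.strip() for t in text_to_word_sequence(text, filters="", split=",")]

        # Hand the Keras tokenizer to TfidfVectorizer; token_pattern=None
        # silences the warning about the unused default pattern.
        vec = TfidfVectorizer(tokenizer=comma_tokenize, token_pattern=None)
        X = vec.fit_transform(docs)
        print(vec.get_feature_names_out())
        # ['analysis' 'big data' 'data' 'data lake' 'data visualisation' ...]

        # Cluster the TF-IDF vectors with k-means, as the question intends.
        labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
        print(labels)

    Splitting on the commas keeps multi-word terms such as "big data" together, which is exactly what the default token pattern breaks apart.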