python, regex, scikit-learn, tfidfvectorizer

How to match any word in TfidfVectorizer using token_pattern


I'd like TfidfVectorizer to treat every space-separated word as a token, including words like "0", "a", "x", "0?0" and so on. I wrote the code below for this purpose.

However, this code doesn't seem to work well.

from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(smooth_idf=False, token_pattern=r"[^ ]+")

P.S.

I got the right pattern matching by using '\b'. Thanks a lot.


Solution

  • You may be looking for word boundaries:

    \b\S+\b
    

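    Explanation:

    \b matches a word boundary (the position between a word character and a non-word character, or the start or end of the string), and \S+ matches one or more non-whitespace characters. Each match therefore starts and ends on a word character.

    Usage:

    For the string "Greetings from Spain" it matches "Greetings", "from" and "Spain".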
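To see how the patterns differ, here is a minimal sketch using only Python's re module. It assumes that TfidfVectorizer tokenizes by collecting every match of token_pattern, which is how scikit-learn documents the parameter. The tokenize helper and the sample strings are made up for illustration, and DEFAULT_PATTERN is the default token_pattern given in the scikit-learn documentation.

```python
import re

# scikit-learn's documented default: tokens of two or more word characters.
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"
# Pattern from the question: any run of characters that are not a space.
QUESTION_PATTERN = r"[^ ]+"
# Pattern from the answer: non-whitespace run that starts and ends on a word character.
BOUNDARY_PATTERN = r"\b\S+\b"


def tokenize(pattern, text):
    """Hypothetical helper: return every non-overlapping match of pattern in text."""
    return re.findall(pattern, text)


sample = "Greetings from Spain 0 a x 0?0"
punctuated = "hello, world!"

# The default pattern drops the single-character tokens and "0?0".
print(tokenize(DEFAULT_PATTERN, sample))
# Both alternatives keep "0", "a", "x" and "0?0".
print(tokenize(QUESTION_PATTERN, sample))
print(tokenize(BOUNDARY_PATTERN, sample))
# They differ on punctuation at the edge of a word.
print(tokenize(QUESTION_PATTERN, punctuated))  # keeps "hello," and "world!"
print(tokenize(BOUNDARY_PATTERN, punctuated))  # keeps "hello" and "world"
```

Either pattern string can be passed as token_pattern. Note that \b\S+\b strips leading and trailing punctuation and skips tokens with no word character at all (a lone "?" for example), while [^ ]+ keeps them but splits only on the space character, not on tabs or newlines.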