python regex tfidfvectorizer

Regular expression that accepts tokens of three or more alphabetical characters


I'm trying to build a TfidfVectorizer that only accepts tokens of three or more alphabetical characters, using TfidfVectorizer(token_pattern="(?u)\\b\\D\\D\\D+\\b").

But it doesn't behave correctly. I know that token_pattern="(?u)\\b\\w\\w\\w+\\b" accepts tokens of three or more alphanumeric characters, so I don't understand why the former doesn't work.

What am I missing?


Solution

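  • The problem lies in the \D metacharacter: it matches any non-digit character, not just alphabetical ones. The Python re documentation describes \D as matching any character that is not a decimal digit (the opposite of \d). Spaces and punctuation are non-digits too, so a single match can run across several words.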
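To see the effect of \D, here is a minimal sketch using only Python's re module. It assumes the vectorizer applies token_pattern in a findall-like way; the sample string is made up for illustration.

```python
import re

# Pattern from the question: \D matches any non-digit, including spaces.
non_digit_pattern = r"(?u)\b\D\D\D+\b"

# Hypothetical sample text, chosen only for illustration.
sample = "foo bar 42 baz"

# Assumption: tokenization behaves like re.findall on the raw text.
tokens = re.findall(non_digit_pattern, sample)

# The "tokens" contain whitespace, because \D happily matches spaces.
print(tokens)
```

Each match runs until it hits a digit or the end of the string, which is why whole phrases come back as single tokens.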


    Instead, you can use:
    token_pattern="(?i)[a-z]{3,}"
    

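    Explanation:

    (?i) makes the match case-insensitive, [a-z] matches a single letter from a to z, and {3,} requires three or more of them in a row.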
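The suggested pattern can be tried out with Python's re module before handing it to the vectorizer. This is a rough sketch: the sample string is made up, and the second pattern with \b word boundaries is an optional variant that is not part of the answer above. It may be useful if mixed tokens such as abc123 should be rejected entirely rather than contributing their letter run.

```python
import re

# Pattern suggested in the answer: runs of three or more ASCII letters.
letters_only = r"(?i)[a-z]{3,}"

# Assumed variant (not from the answer): adds word boundaries so that
# letter runs embedded in alphanumeric tokens are skipped.
letters_bounded = r"(?i)\b[a-z]{3,}\b"

# Hypothetical sample text, chosen only for illustration.
sample = "An old cat sat on abc123 mats"

unbounded_tokens = re.findall(letters_only, sample)
bounded_tokens = re.findall(letters_bounded, sample)

print(unbounded_tokens)  # includes 'abc' taken from 'abc123'
print(bounded_tokens)    # skips 'abc123' altogether
```

Either string can then be passed as token_pattern to TfidfVectorizer, as in the question. Note that [a-z] only covers unaccented ASCII letters.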

    I hope this answers your question. :)