I'm trying to build a TFIDVectorizer that only accepts tokens of 3 or more alphabetical characters using TFIdfVectorizer(token_pattern="(?u)\\b\\D\\D\\D+\\b")
But it doesn't behave correctly, I know token_pattern="(?u)\\b\\w\\w\\w+\\b"
accepts tokens of 3 or more alphanumerical characters, so I just don't understand why the former is not working.
What am I missing?
The problem lies in using the \D
metacharacter, as it's actually for matching any non-digit character, rather than any alphabetical character. From Python docs:
token_pattern="(?i)[a-z]{3,}"
Explanation:
(?i)
— inline flag to make matching case-insensitive,[a-z]
— matches any Latin letter,{3,}
— makes the previous token match three or more times (greedily, i.e., as many times as possible).I hope this answers your question. :)