I use
singleTFIDF = TfidfVectorizer(
    analyzer='char_wb',
    ngram_range=(4, 6),
    stop_words=my_stop_words,
    max_features=50
).fit([text])
and wonder why there are whitespaces in my features, like 'chaft '.
How can I avoid this? Do I need to tokenize and preprocess this myself?
Use analyzer='word'.
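For example, a minimal sketch of the fix (text and my_stop_words below are placeholders for your own data; note that with analyzer='word' the ngram_range counts words rather than characters, so it is adjusted here):

from sklearn.feature_extraction.text import TfidfVectorizer

text = 'This is the first document.'  # placeholder for your text
my_stop_words = ['is', 'the']         # placeholder for your stop word list

singleTFIDF = TfidfVectorizer(
    analyzer='word',           # word tokens instead of character n-grams
    ngram_range=(1, 2),        # unigrams and bigrams of words
    stop_words=my_stop_words,  # stop words only take effect with analyzer='word'
    max_features=50
).fit([text])

print(singleTFIDF.get_feature_names())  # no space-padded features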
When you use analyzer='char_wb', the vectorizer pads with whitespace because it does not tokenize into words; it builds character n-grams, and the n-grams at the edges of words are padded with space.
According to the documentation for the analyzer argument:
analyzer{‘word’, ‘char’, ‘char_wb’} or callable, default=’word’
Whether the feature should be made of word or character n-grams. Option ‘char_wb’ creates character n-grams only from text inside word boundaries; n-grams at the edges of words are padded with space.
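You can see that padding directly by calling the vectorizer's analyzer on a single word; a small sketch:

from sklearn.feature_extraction.text import TfidfVectorizer

# build_analyzer() returns the tokenization callable the vectorizer will use
analyzer = TfidfVectorizer(analyzer='char_wb', ngram_range=(4, 4)).build_analyzer()
print(analyzer('first'))
# [' fir', 'firs', 'irst', 'rst ']  (the word is padded to ' first ' before slicing)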
Look at the following example:
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
vectorizer = TfidfVectorizer(
    analyzer='char_wb',
    ngram_range=(4, 6)
)
X = vectorizer.fit_transform(corpus)
# on scikit-learn >= 1.2, use get_feature_names_out() instead
print([(len(w), w) for w in vectorizer.get_feature_names()])
[(4, ' and'), (5, ' and '), (4, ' doc'), (5, ' docu'), (6, ' docum'), (4, ' fir'), (5, ' firs'), (6, ' first'), (4, ' is '), (4, ' one'), (5, ' one.'), (6, ' one. '), (4, ' sec'), (5, ' seco'), (6, ' secon'), (4, ' the'), (5, ' the '), (4, ' thi'), (5, ' thir'), (6, ' third'), (5, ' this'), (6, ' this '), (4, 'and '), (4, 'cond'), (5, 'cond '), (4, 'cume'), (5, 'cumen'), (6, 'cument'), (4, 'docu'), (5, 'docum'), (6, 'docume'), (4, 'econ'), (5, 'econd'), (6, 'econd '), (4, 'ent '), (4, 'ent.'), (5, 'ent. '), (4, 'ent?'), (5, 'ent? '), (4, 'firs'), (5, 'first'), (6, 'first '), (4, 'hird'), (5, 'hird '), (4, 'his '), (4, 'ird '), (4, 'irst'), (5, 'irst '), (4, 'ment'), (5, 'ment '), (5, 'ment.'), (6, 'ment. '), (5, 'ment?'), (6, 'ment? '), (4, 'ne. '), (4, 'nt. '), (4, 'nt? '), (4, 'ocum'), (5, 'ocume'), (6, 'ocumen'), (4, 'ond '), (4, 'one.'), (5, 'one. '), (4, 'rst '), (4, 'seco'), (5, 'secon'), (6, 'second'), (4, 'the '), (4, 'thir'), (5, 'third'), (6, 'third '), (4, 'this'), (5, 'this '), (4, 'umen'), (5, 'ument'), (6, 'ument '), (6, 'ument.'), (6, 'ument?')]
Notice:

- ' this' is padded at the start by an extra space that is not there in the original text (the sentence starts with 'This').
- 'ment. ' is padded at the end by an extra space that is not there in the original text (the sentence ends with 'document.').
- there is no 'is the', because that n-gram crosses a word boundary, and the 'char_wb' analyzer only creates n-grams "inside word boundaries".
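For comparison, plain analyzer='char' does build n-grams across word boundaries, so 'is the' shows up there; a quick check (same corpus as above):

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
vectorizer_char = TfidfVectorizer(analyzer='char', ngram_range=(6, 6)).fit(corpus)
print('is the' in vectorizer_char.get_feature_names())  # True: 'char' crosses word boundaries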