I use
singleTFIDF = TfidfVectorizer(
    analyzer='char_wb',
    ngram_range=(4, 6),
    stop_words=my_stop_words,
    max_features=50
).fit([text])
and wonder why there are whitespaces in my features, like 'chaft '.
How can I avoid this? Do I need to tokenize and preprocess this myself?
Use analyzer='word'.
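For example, a minimal sketch of the fix (text and my_stop_words below are placeholders for your own data; note that with analyzer='word' the ngram_range counts words rather than characters, so it is adjusted here):

from sklearn.feature_extraction.text import TfidfVectorizer

text = 'This is the first document.'  # placeholder for your text
my_stop_words = ['is', 'the']         # placeholder for your stop word list

singleTFIDF = TfidfVectorizer(
    analyzer='word',           # word tokens instead of character n-grams
    ngram_range=(1, 2),        # unigrams and bigrams of words
    stop_words=my_stop_words,  # stop words only take effect with analyzer='word'
    max_features=50
).fit([text])

print(singleTFIDF.get_feature_names())  # no space-padded features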
When you use analyzer='char_wb', the vectorizer pads with whitespace because it does not tokenize into words; it builds character n-grams, and the n-grams at the edges of words are padded with space.
According to the documentation for the analyzer argument:
analyzer{‘word’, ‘char’, ‘char_wb’} or callable, default=’word’
Whether the feature should be made of word or character n-grams. Option ‘char_wb’ creates character n-grams only from text inside word boundaries; n-grams at the edges of words are padded with space.
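You can see that padding directly by calling the vectorizer's analyzer on a single word; a small sketch:

from sklearn.feature_extraction.text import TfidfVectorizer

# build_analyzer() returns the tokenization callable the vectorizer will use
analyzer = TfidfVectorizer(analyzer='char_wb', ngram_range=(4, 4)).build_analyzer()
print(analyzer('first'))
# [' fir', 'firs', 'irst', 'rst ']  (the word is padded to ' first ' before slicing)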
Look at the following example:
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
vectorizer = TfidfVectorizer(
    analyzer='char_wb',
    ngram_range=(4, 6)
)
X = vectorizer.fit_transform(corpus)
# on scikit-learn >= 1.2, use get_feature_names_out() instead
print([(len(w), w) for w in vectorizer.get_feature_names()])
[(4, ' and'), (5, ' and '), (4, ' doc'), (5, ' docu'), (6, ' docum'), (4, ' fir'), (5, ' firs'), (6, ' first'), (4, ' is '), (4, ' one'), (5, ' one.'), (6, ' one. '), (4, ' sec'), (5, ' seco'), (6, ' secon'), (4, ' the'), (5, ' the '), (4, ' thi'), (5, ' thir'), (6, ' third'), (5, ' this'), (6, ' this '), (4, 'and '), (4, 'cond'), (5, 'cond '), (4, 'cume'), (5, 'cumen'), (6, 'cument'), (4, 'docu'), (5, 'docum'), (6, 'docume'), (4, 'econ'), (5, 'econd'), (6, 'econd '), (4, 'ent '), (4, 'ent.'), (5, 'ent. '), (4, 'ent?'), (5, 'ent? '), (4, 'firs'), (5, 'first'), (6, 'first '), (4, 'hird'), (5, 'hird '), (4, 'his '), (4, 'ird '), (4, 'irst'), (5, 'irst '), (4, 'ment'), (5, 'ment '), (5, 'ment.'), (6, 'ment. '), (5, 'ment?'), (6, 'ment? '), (4, 'ne. '), (4, 'nt. '), (4, 'nt? '), (4, 'ocum'), (5, 'ocume'), (6, 'ocumen'), (4, 'ond '), (4, 'one.'), (5, 'one. '), (4, 'rst '), (4, 'seco'), (5, 'secon'), (6, 'second'), (4, 'the '), (4, 'thir'), (5, 'third'), (6, 'third '), (4, 'this'), (5, 'this '), (4, 'umen'), (5, 'ument'), (6, 'ument '), (6, 'ument.'), (6, 'ument?')]
Notice:

- ' this' is padded at the start by an extra space that is not there in the original text (the sentence starts with 'This').
- 'ment. ' is padded at the end by an extra space that is not there in the original text (the sentence ends with 'document.').
- there is no 'is the', because that n-gram crosses a word boundary, and the 'char_wb' analyzer only creates n-grams "inside word boundaries".
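For comparison, plain analyzer='char' does build n-grams across word boundaries, so 'is the' shows up there; a quick check (same corpus as above):

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
vectorizer_char = TfidfVectorizer(analyzer='char', ngram_range=(6, 6)).fit(corpus)
print('is the' in vectorizer_char.get_feature_names())  # True: 'char' crosses word boundaries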