python scikit-learn tfidfvectorizer

Why does TfidfVectorizer produce features containing whitespace with char_wb?


I use

singleTFIDF = TfidfVectorizer(
    analyzer='char_wb', 
    ngram_range=(4,6),
    stop_words=my_stop_words, 
    max_features=50
).fit([text])

I wonder why some of my features contain whitespace, for example 'chaft '.

How can I avoid this? Do I need to tokenize and preprocess this myself?


Solution

  • Use analyzer='word'.

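    With analyzer='char_wb', the vectorizer builds character n-grams instead of word tokens. It pads each word with a space on both sides before extracting the n-grams, so every n-gram that touches the start or end of a word contains that space. Note also that stop_words only applies when analyzer='word', so it has no effect in the call from the question.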

    According to the documentation for the analyzer argument:

    analyzer : {‘word’, ‘char’, ‘char_wb’} or callable, default=’word’

    Whether the feature should be made of word or character n-grams. Option ‘char_wb’ creates character n-grams only from text inside word boundaries; n-grams at the edges of words are padded with space.
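    To make the padding described in the quote concrete, here is a simplified pure-Python illustration of char_wb-style extraction. It is not scikit-learn's source code; the function name char_wb_ngrams is hypothetical, and details such as whitespace normalization are reduced to a plain split().

```python
def char_wb_ngrams(text, min_n, max_n):
    """Simplified illustration of char_wb-style n-gram extraction."""
    ngrams = []
    for word in text.lower().split():
        # Pad each word with one space per side; this is where the
        # whitespace in the resulting features comes from.
        padded = " " + word + " "
        for n in range(min_n, max_n + 1):
            if n >= len(padded):
                # The padded word is no longer than the window:
                # emit it once and stop trying larger n.
                ngrams.append(padded)
                break
            # Slide a window of width n over the padded word.
            for start in range(len(padded) - n + 1):
                ngrams.append(padded[start:start + n])
    return ngrams


print(char_wb_ngrams("the first", 4, 6))
# → [' the', 'the ', ' the ', ' fir', 'firs', 'irst', 'rst ',
#    ' firs', 'first', 'irst ', ' first', 'first ']
```

    Every n-gram that touches a word edge carries the padding space, while n-grams from the interior of a word (such as 'firs' or 'irst') do not.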

    Look at the following example:

    from sklearn.feature_extraction.text import TfidfVectorizer
    corpus = [
        'This is the first document.',
        'This document is the second document.',
        'And this is the third one.',
        'Is this the first document?',
    ]
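    vectorizer = TfidfVectorizer(
        analyzer='char_wb',
        ngram_range=(4, 6))
    X = vectorizer.fit_transform(corpus)
    # get_feature_names() was removed in scikit-learn 1.2;
    # get_feature_names_out() is available from version 1.0 onward.
    print([(len(w), w) for w in vectorizer.get_feature_names_out()])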
    

    [(4, ' and'), (5, ' and '), (4, ' doc'), (5, ' docu'), (6, ' docum'), (4, ' fir'), (5, ' firs'), (6, ' first'), (4, ' is '), (4, ' one'), (5, ' one.'), (6, ' one. '), (4, ' sec'), (5, ' seco'), (6, ' secon'), (4, ' the'), (5, ' the '), (4, ' thi'), (5, ' thir'), (6, ' third'), (5, ' this'), (6, ' this '), (4, 'and '), (4, 'cond'), (5, 'cond '), (4, 'cume'), (5, 'cumen'), (6, 'cument'), (4, 'docu'), (5, 'docum'), (6, 'docume'), (4, 'econ'), (5, 'econd'), (6, 'econd '), (4, 'ent '), (4, 'ent.'), (5, 'ent. '), (4, 'ent?'), (5, 'ent? '), (4, 'firs'), (5, 'first'), (6, 'first '), (4, 'hird'), (5, 'hird '), (4, 'his '), (4, 'ird '), (4, 'irst'), (5, 'irst '), (4, 'ment'), (5, 'ment '), (5, 'ment.'), (6, 'ment. '), (5, 'ment?'), (6, 'ment? '), (4, 'ne. '), (4, 'nt. '), (4, 'nt? '), (4, 'ocum'), (5, 'ocume'), (6, 'ocumen'), (4, 'ond '), (4, 'one.'), (5, 'one. '), (4, 'rst '), (4, 'seco'), (5, 'secon'), (6, 'second'), (4, 'the '), (4, 'thir'), (5, 'third'), (6, 'third '), (4, 'this'), (5, 'this '), (4, 'umen'), (5, 'ument'), (6, 'ument '), (6, 'ument.'), (6, 'ument?')]

    Notice: