Tags: python, scikit-learn, tfidfvectorizer, countvectorizer

Remove features with whitespace from sklearn CountVectorizer with char_wb


I am trying to build character-level n-grams using sklearn's CountVectorizer. When using analyzer='char_wb', the vocabulary contains features with whitespace around them. I want to exclude the features/words that contain whitespace.

from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(binary=True, analyzer='char_wb', ngram_range=(4, 5))
vectorizer.fit(['this is a plural'])
vectorizer.vocabulary_

The keys of vectorizer.vocabulary_ produced by the above code are

[' thi', 'this', 'his ', ' this', 'this ', ' is ', ' a ', ' plu', 'plur', 'lura', 'ural', 'ral ', ' plur', 'plura', 'lural', 'ural ']

I have tried other analyzers, e.g. word and char, but none of them gives the kind of features I need.
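
For reference, the whitespace comes from char_wb itself: it pads each word with a space on either side before slicing n-grams. You can see this by calling the built analyzer directly (a small sketch using build_analyzer()):

```python
from sklearn.feature_extraction.text import CountVectorizer

# char_wb wraps every word as " word " before extracting n-grams,
# which is where the leading/trailing-space features come from
analyzer = CountVectorizer(analyzer='char_wb', ngram_range=(4, 5)).build_analyzer()
print(analyzer('this is a plural'))
# [' thi', 'this', 'his ', ' this', 'this ', ' is ', ' a ', ' plu', 'plur',
#  'lura', 'ural', 'ral ', ' plur', 'plura', 'lural', 'ural ']
```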


Solution

  • I hope you get a better answer, because this one is a bit of a hack: I'm not sure it does exactly what you want, and it is not very efficient. It does produce your vocabulary, though!

    import re
    
    def my_analyzer(s):
        out = []
        # split on runs of non-word characters; \W+ can leave empty strings
        # at the edges (e.g. after a trailing "."), so skip those
        for w in re.split(r"\W+", s):
            if not w:
                continue
            if len(w) < 5:
                # keep short words whole
                out.append(w)
            else:
                # lookahead regexes extract the overlapping 4- and 5-grams
                for l4 in re.findall(r"(?=(\w{4}))", w):
                    out.append(l4)
                for l5 in re.findall(r"(?=(\w{5}))", w):
                    out.append(l5)
        return out
    
    from sklearn.feature_extraction.text import CountVectorizer
    
    vectorizer = CountVectorizer(binary=True, analyzer=my_analyzer)
    
    vectorizer.fit(['this is a plural'])
    print(vectorizer.vocabulary_)
    # {'this': 6, 'is': 1, 'a': 0, 'plur': 4, 'lura': 2, 'ural': 7, 'plura': 5, 'lural': 3}
    
    corpus = [
        'This is the first document.',
        'This document is the second document.',
        'And this is the third one.',
        'Is this the first document?',
    ]
    # note: a callable analyzer bypasses preprocessing, so no lowercasing
    # happens -- 'This' and 'this' end up as separate features
    vectorizer.fit(corpus)
    print(vectorizer.vocabulary_)
    # {'This': 2, 'is': 14, 'the': 21, 'firs': 10, 'irst': 13, 'first': 11, 'docu': 6, 'ocum': 16, 'cume': 4, 'umen': 25, 'ment': 15, 'docum': 7, 'ocume': 17, 'cumen': 5, 'ument': 26, 'seco': 19, 'econ': 8, 'cond': 3, 'secon': 20, 'econd': 9, 'And': 0, 'this': 24, 'thir': 22, 'hird': 12, 'third': 23, 'one': 18, 'Is': 1}
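
An alternative that stays closer to char_wb (a sketch, not necessarily what you need): fit once with char_wb, drop every feature that contains a space, and pass the surviving terms back in via CountVectorizer's vocabulary parameter:

```python
from sklearn.feature_extraction.text import CountVectorizer

text = ['this is a plural']

# First pass: let char_wb build the full vocabulary, spaces and all
cv = CountVectorizer(binary=True, analyzer='char_wb', ngram_range=(4, 5))
cv.fit(text)

# Keep only the n-grams with no whitespace in them
kept = sorted(t for t in cv.vocabulary_ if ' ' not in t)

# Second pass: a vectorizer restricted to the filtered vocabulary
cv2 = CountVectorizer(binary=True, analyzer='char_wb', ngram_range=(4, 5),
                      vocabulary=kept)
cv2.fit(text)
print(sorted(cv2.vocabulary_))
# ['lura', 'lural', 'plur', 'plura', 'this', 'ural']
```

Note that this silently drops any word shorter than 4 characters ('is', 'a') entirely, since all of its char_wb n-grams contain padding spaces, whereas the custom analyzer above keeps short words whole.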