python, scikit-learn, naive-bayes, tfidfvectorizer

How to get rid of URLs while using TfidfVectorizer


I'm using TfidfVectorizer to extract features from my samples, which are all text. However, my samples contain many URLs, so http and https end up as important features. This in turn leads to inaccurate predictions from my Naive Bayes model.

The features I extracted are shown below; as you can see, https has high weights in almost every row.

      good     got   great   happy  http   https
0   0.1803  0.0565  0.0000  0.1349   0.0  0.7206
1   0.0621  0.0000  0.0335  0.0348   0.0  0.8266
2   0.0661  0.0000  0.0357  0.0371   0.0  0.9686
3   0.0306  0.0288  0.0000  0.0000   0.0  0.9782
4   0.0000  0.0380  0.0000  0.0000   0.0  0.9965
5   0.0000  0.0000  0.0000  0.0772   0.0  0.9386
6   0.1769  0.0333  0.0764  0.1589   0.0  0.8719
7   0.0000  0.0000  0.0229  0.0000   0.0  0.9604
8   0.0854  0.3215  0.0922  0.0959   0.0  0.3984
9   0.0949  0.0298  0.0683  0.0000   0.0  0.9273
10  0.0689  0.0648  0.1859  0.0000   0.0  0.8493
11  0.0641  0.0201  0.1383  0.0240   0.0  0.9390
12  0.0000  0.0635  0.0364  0.0000   0.0  0.8772
13  0.0000  0.0000  0.0441  0.0306   0.0  0.3449

Could anyone help me get rid of these tokens when extracting keywords with TF-IDF?

This is the vectorizer I initialized:

vectorizer = TfidfVectorizer(input='content', lowercase=True, stop_words='english', analyzer='word', max_features=50)

Solution

  • You can pass a custom list of stop words to TfidfVectorizer via the stop_words parameter:

    vectorizer = TfidfVectorizer(input='content', lowercase=True, stop_words=['http', 'https'], analyzer='word', max_features=50)
    

    These tokens will be ignored when the texts are vectorized. Note that passing your own list replaces the built-in 'english' list, so the usual English stop words would no longer be filtered out.
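    As a quick sanity check, here is a minimal sketch with a hypothetical toy corpus (the docs list is made up, not the asker's data). The default tokenizer splits a URL like https://example.com into the tokens https, example, and com, so the custom stop word list drops the scheme tokens:

    ```python
    from sklearn.feature_extraction.text import TfidfVectorizer

    # Hypothetical toy corpus standing in for the real samples
    docs = [
        "great product, details at https://example.com",
        "happy I got it from http://example.org",
    ]

    vectorizer = TfidfVectorizer(lowercase=True, stop_words=['http', 'https'])
    X = vectorizer.fit_transform(docs)

    # "http" and "https" no longer appear among the features
    print(vectorizer.get_feature_names_out())
    ```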

    To keep the default English stop words and add your own on top, take the union of the two lists:

    from sklearn.feature_extraction import text
    from sklearn.feature_extraction.text import TfidfVectorizer
    
    my_stop_words = text.ENGLISH_STOP_WORDS.union(['http', 'https'])
    
    vectorizer = TfidfVectorizer(input='content', lowercase=True, stop_words=my_stop_words, analyzer='word', max_features=50)
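    Putting that together on a hypothetical toy corpus (again, the docs below are made up for illustration), both the English stop words ("the", "is", ...) and the URL scheme tokens are filtered out:

    ```python
    from sklearn.feature_extraction import text
    from sklearn.feature_extraction.text import TfidfVectorizer

    # Hypothetical toy corpus standing in for the real samples
    docs = [
        "the product is great, see https://example.com",
        "happy with it, got it from http://example.org",
    ]

    my_stop_words = text.ENGLISH_STOP_WORDS.union(['http', 'https'])

    vectorizer = TfidfVectorizer(lowercase=True, stop_words=list(my_stop_words))
    X = vectorizer.fit_transform(docs)

    # No English stop words and no "http"/"https"; note that domain
    # tokens such as "example" and "com" still survive as features.
    print(sorted(vectorizer.get_feature_names_out()))
    ```

    One caveat: this only removes the http/https tokens, not the rest of the URL. If you want the whole URL gone, you could instead strip URLs with a regular expression in a custom function passed as TfidfVectorizer's preprocessor argument before tokenization happens.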