I'm using TfidfVectorizer to extract features from my samples, which are all text. However, the samples contain many URLs, and as a result http and https become important features. This also causes inaccurate predictions later with my Naive Bayes model.
The features I got are as follows. As you can see, https has high values.
good got great happy http https
0 0.18031992253877868 0.056537832999741425 0.0 0.13494772859235538 0.0 0.7206169458767526
1 0.062052081178508904 0.0 0.03348108448960768 0.03482887785597041 0.0 0.8266008657388199
2 0.066100442981558 0.0 0.03566543577965484 0.03710116101033473 0.0 0.9685823681046619
3 0.030596521808766947 0.028779865519712563 0.0 0.0 0.0 0.9781890670696571
4 0.0 0.03803344358481952 0.0 0.0 0.0 0.9964607105785932
5 0.0 0.0 0.0 0.07716693868942119 0.0 0.938602085540054
6 0.17689804723173405 0.033278959234969596 0.07635828939724364 0.15886424082427333 0.0 0.8718951596544265
7 0.0 0.0 0.02288252957804802 0.0 0.0 0.9603936784408945
8 0.08544543470034431 0.3214885842670747 0.09220660336028486 0.09591841408082484 0.0 0.39837897672993183
9 0.09492740119653752 0.02976370819366948 0.06829257573052833 0.0 0.0 0.9273261812039216
10 0.06892455146463301 0.0648321836892671 0.1859461187415361 0.0 0.0 0.8492883859345594
11 0.06407942255789043 0.02009157746015972 0.13829986166195216 0.023977862240478147 0.0 0.938967971292072
12 0.0 0.06353009389659953 0.03644231525495783 0.0 0.0 0.8772167495025313
13 0.0 0.0 0.044113599370101265 0.030592939021541497 0.0 0.34488252084969045
Could anyone help me get rid of these tokens when I extract keywords using TF-IDF?
This is the vectorizer I initialized:
vectorizer = TfidfVectorizer(input='content', lowercase=True, stop_words='english', analyzer='word', max_features=50)
You can pass a list of stop words to TfidfVectorizer:
vectorizer = TfidfVectorizer(input='content', lowercase=True, stop_words=['http', 'https'], analyzer='word', max_features=50)
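These words will be ignored when vectorizing the texts. Note that passing your own list replaces the built-in 'english' list used in the question, so common English words are no longer filtered.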
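To check the effect, here is a minimal sketch on a few made-up sentences (the texts and the example.com / example.org URLs are hypothetical, not the asker's data). It fits the vectorizer with the custom stop word list and inspects the learned vocabulary:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical sample texts, invented for illustration only.
docs = [
    "great product, see https://example.com/review",
    "happy with it http://example.org",
    "good service and a great price",
]

# Same idea as above: drop the URL scheme tokens via stop_words.
vectorizer = TfidfVectorizer(lowercase=True, stop_words=["http", "https"], analyzer="word")
vectorizer.fit(docs)

# vocabulary_ maps each kept token to its column index.
vocab = set(vectorizer.vocabulary_)
print(sorted(vocab))
```

In this sketch http and https no longer appear in the vocabulary, but other pieces of the URLs (such as com or org) are still tokenized as ordinary words, so you may want to add those to the list as well if they show up among your top features.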
And you can add your words to the default list like this:
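from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import TfidfVectorizer
# Extend the built-in English stop words with the URL scheme tokens.
# union() returns a frozenset; recent scikit-learn versions expect a list, so convert it.
my_stop_words = list(text.ENGLISH_STOP_WORDS.union(['http', 'https']))
vectorizer = TfidfVectorizer(input='content', lowercase=True, stop_words=my_stop_words, analyzer='word', max_features=50)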
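As a complementary idea that goes beyond the answer above, you could also strip whole URLs from the text before it is tokenized, so that fragments like domain names never reach the vocabulary. The following is only a rough sketch: the strip_urls helper and its regular expression are assumptions made for illustration, and the pattern is deliberately simple rather than a complete URL matcher.

```python
import re

# Simple, hypothetical pattern: http:// or https:// links, or tokens starting with www.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)

def strip_urls(text: str) -> str:
    """Lowercase the text and replace every matched URL with a single space."""
    return URL_PATTERN.sub(" ", text.lower())

# Made-up example sentence.
print(strip_urls("Great read: https://example.com/a?b=1 and http://example.org"))
```

A function like this can be passed to TfidfVectorizer through its preprocessor parameter, for example preprocessor=strip_urls. A custom preprocessor overrides the built-in lowercasing step, which is why the helper lowercases the text itself.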