I am using sklearn.feature_extraction.text.TfidfTransformer to get the TF-IDF values for my corpus.
This is what my code looks like:
import json
from collections import OrderedDict

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

X = dataset[:, 0]
Y = dataset[:, 1]

for index, item in enumerate(X):
    reqJson = json.loads(item, object_pairs_hook=OrderedDict)
    X[index] = json.dumps(reqJson, separators=(',', ':'))

count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X)

tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

# (58720, 167216) is the size of my sparse matrix
for i in range(0, 58720):
    for j in range(0, 167216):
        print(i, j)
        if X_train_tfidf[i, j] > 0.35:
            X_train_tfidf[i, j] = 0
As you can see, I want to filter out tf-idf values greater than 0.35 so that I can reduce my feature set and make my model more time-efficient, but using a nested for loop only makes things worse. I have looked through the documentation of TfidfTransformer but cannot find a way to do this any better. Any ideas or tips? Thank you.
It sounds like this question is trying to ignore frequent words.
The TfidfVectorizer (not TfidfTransformer) implementation includes a max_df parameter:
When building the vocabulary ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words).
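Since your code already builds a CountVectorizer followed by a TfidfTransformer, and TfidfVectorizer is documented as equivalent to chaining those two steps, the filter can slot straight into your existing pipeline. A minimal sketch, reusing the X prepared in your question:

from sklearn.feature_extraction.text import TfidfVectorizer

# Equivalent to CountVectorizer + TfidfTransformer, but with
# corpus-specific stop words dropped at vocabulary-building time
vect = TfidfVectorizer(max_df=0.5)
X_train_tfidf = vect.fit_transform(X)  # X as prepared in the question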
In the following example, word1 and word3 occur in >50% of documents, so setting max_df=0.5 means the resulting array only includes word2:
from sklearn.feature_extraction.text import TfidfVectorizer

raw_data = [
    "word1 word2 word3",
    "word1 word1 word1",
    "word2 word2 word3",
    "word1 word1 word3",
]

vect = TfidfVectorizer(max_df=0.5)
X = vect.fit_transform(raw_data)

print(vect.get_feature_names_out())
print(X.todense())
['word2']
[[1.]
 [0.]
 [1.]
 [0.]]
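As an aside: if you really do need to zero out individual tf-idf values above a threshold (rather than dropping whole terms from the vocabulary), the nested loop can be replaced with one vectorized operation on the sparse matrix's .data array, which holds only the stored nonzero values. A minimal sketch, with a small hypothetical matrix standing in for your X_train_tfidf:

import numpy as np
from scipy.sparse import csr_matrix

# Small stand-in for the (58720, 167216) matrix from the question
X_train_tfidf = csr_matrix(np.array([[0.1, 0.4],
                                     [0.5, 0.2]]))

# One boolean mask over .data replaces the double loop over every (i, j)
X_train_tfidf.data[X_train_tfidf.data > 0.35] = 0
X_train_tfidf.eliminate_zeros()  # drop the explicit zeros just created

print(X_train_tfidf.toarray())

This touches only the stored entries instead of all 58720 * 167216 positions, so it runs in time proportional to the number of nonzeros.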