As I understand TF-IDF, the IDF value of the word "art" should be log_e(3/1) + 1, because there are 3 documents in the data set and "art" appears in only one of them. By that formula the IDF of "art" is ~2.098612, but when I print the vectorizer.idf_ attribute, the value the vectorizer calculated is 1.693147.
And one more question: I calculated the TF value for the word "art" in the first document as 0.125, because it appears once out of a total of 8 words. Does that match how TfidfVectorizer calculates TF?
This is my data and code:
from sklearn.feature_extraction.text import TfidfVectorizer

data = ['Souvenir shop|Architecture and art|Culture and history',
        'Souvenir shop|Resort|Diverse cuisine|Fishing|Shop games|Beautiful scenery',
        'Diverse cuisine|Resort|Beautiful scenery']
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(data)
As I understand TF-IDF, the IDF value of the word "art" should be log_e(3/1) + 1, because there are 3 documents in the data set and "art" appears in only one of them. By that formula the IDF of "art" is ~2.098612, but when I print the vectorizer.idf_ attribute, the value the vectorizer calculated is 1.693147.
By default, TfidfVectorizer has a parameter smooth_idf set to True. The effect of this is that it adds one to both the numerator and the denominator of the fraction inside the logarithm. If you turn off smooth_idf, you get your expected value.
Here is the formula with smooth_idf turned on:
idf("art") = ln((3 + 1)/(1 + 1)) + 1 = 1.6931
Here is the part of the code responsible for this calculation.
# perform idf smoothing if required
df += int(self.smooth_idf)
n_samples += int(self.smooth_idf)
# log+1 instead of log makes sure terms with zero idf don't get suppressed entirely.
idf = np.log(n_samples / df) + 1
(Source.)
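Both values can be checked directly. The following is a minimal sketch that reuses the data from the question; the helper name idf_of is made up for illustration and is not part of scikit-learn.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

data = ['Souvenir shop|Architecture and art|Culture and history',
        'Souvenir shop|Resort|Diverse cuisine|Fishing|Shop games|Beautiful scenery',
        'Diverse cuisine|Resort|Beautiful scenery']

def idf_of(term, smooth):
    # Hypothetical helper: fit a vectorizer and look up one term's IDF.
    vec = TfidfVectorizer(smooth_idf=smooth).fit(data)
    return vec.idf_[vec.vocabulary_[term]]

idf_smooth = idf_of("art", True)    # default behaviour: ln((3 + 1) / (1 + 1)) + 1
idf_plain = idf_of("art", False)    # textbook formula:  ln(3 / 1) + 1

print(idf_smooth, np.log(4 / 2) + 1)
print(idf_plain, np.log(3 / 1) + 1)
```

Each printed pair should agree, which shows that the difference between the two numbers in the question comes only from the smoothing term.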
And one more question: I calculated the TF value for the word "art" in the first document as 0.125, because it appears once out of a total of 8 words. Does that match how TfidfVectorizer calculates TF?
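No. The TF that TfidfVectorizer uses is just the raw number of times the term appears in the document, so for "art" in the first document it is 1, not 0.125. It is not normalized by document length at this step. There is a normalization step (L2 normalization of each row by default), but it is applied after multiplying by IDF.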
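To see the order of operations, here is a rough sketch that rebuilds the matrix by hand under the default settings (raw counts, smoothed IDF, L2 row normalization). It uses CountVectorizer for the counts; variable names such as weighted and manual are only for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

data = ['Souvenir shop|Architecture and art|Culture and history',
        'Souvenir shop|Resort|Diverse cuisine|Fishing|Shop games|Beautiful scenery',
        'Diverse cuisine|Resort|Beautiful scenery']

# Step 1: raw term counts (this is the TF that TfidfVectorizer starts from).
count_vec = CountVectorizer()
counts = count_vec.fit_transform(data).toarray()
tf_art = counts[0, count_vec.vocabulary_["art"]]
print(tf_art)  # raw count of "art" in the first document

# Step 2: multiply each column by its IDF.
tfidf_vec = TfidfVectorizer()
tfidf = tfidf_vec.fit_transform(data).toarray()
weighted = counts * tfidf_vec.idf_

# Step 3: only now normalize each row to unit Euclidean length.
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

print(np.allclose(manual, tfidf))
```

Both vectorizers use the same default tokenizer and sort their vocabularies, so the columns line up and the manual result can be compared with the library output element by element.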