I am learning NLP and wanted to understand the TF-IDF model using the sklearn library and its TfidfVectorizer class. I have pasted the sample code below.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names())
The feature names:
vectorizer.get_feature_names()
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
And the tf-idf values are:
array([[0. , 0.43877674, 0.54197657, 0.43877674, 0. ,
0. , 0.35872874, 0. , 0.43877674],
[0. , 0.27230147, 0. , 0.27230147, 0. ,
0.85322574, 0.22262429, 0. , 0.27230147],
[0.55280532, 0. , 0. , 0. , 0.55280532,
0. , 0.28847675, 0.55280532, 0. ],
[0. , 0.43877674, 0.54197657, 0.43877674, 0. ,
0. , 0.35872874, 0. , 0.43877674]])
I wanted to calculate the tf-idf value of the term "document" for the above corpus, which comes out to be 0.43877674 in the first document.
I tried using the formula below with both base 10 and base e (the natural logarithm), since smooth_idf=True by default, as described in the documentation at https://scikit-learn.org/stable/modules/feature_extraction.html#tfidf-term-weighting:

Using the TfidfTransformer's default settings, TfidfTransformer(norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False), the term frequency, the number of times a term occurs in a given document, is multiplied with the idf component, which is computed as

    idf(t) = ln((1 + n) / (1 + df(t))) + 1

where n is the total number of documents in the document set, and df(t) is the number of documents in the document set that contain term t.
According to the output of the program, the value should be 0.43877674, but neither logarithm base reproduces that value.
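This is roughly the calculation I tried (the names n, df, and tf below are just my shorthand for the quantities in the documentation); neither result matches 0.43877674:

import math

n = 4    # total number of documents in the corpus
df = 3   # number of documents that contain the term "document"
tf = 1   # "document" occurs once in the first document

# smoothed idf from the documentation: idf(t) = ln((1 + n) / (1 + df(t))) + 1
idf_e = math.log((1 + n) / (1 + df)) + 1       # natural logarithm
idf_10 = math.log10((1 + n) / (1 + df)) + 1    # base-10 logarithm

print(tf * idf_e)    # ~1.22314355
print(tf * idf_10)   # ~1.09691001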
Your calculation is correct; you are just missing the normalization. With the default parameters each document vector is normalized so that its Euclidean length equals 1. You can disable the normalization with the parameter norm=None:
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]
vectorizer = TfidfVectorizer(norm=None)  # keep the raw tf-idf values, no L2 normalization
X = vectorizer.fit_transform(corpus)
X.toarray()
results in:
array([[0. , 1.22314355, 1.51082562, 1.22314355, 0. ,
0. , 1. , 0. , 1.22314355],
[0. , 1.22314355, 0. , 1.22314355, 0. ,
3.83258146, 1. , 0. , 1.22314355],
[1.91629073, 0. , 0. , 0. , 1.91629073,
0. , 1. , 1.91629073, 0. ],
[0. , 1.22314355, 1.51082562, 1.22314355, 0. ,
0. , 1. , 0. , 1.22314355]])
This is exactly the tf-idf value you calculated for the token 'document' in the first document.
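If you want to verify the link between the two outputs yourself, you can apply the L2 normalization manually with numpy (a small sketch; the names raw and normalized are just for illustration):

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

# unnormalized tf-idf matrix, as above
raw = TfidfVectorizer(norm=None).fit_transform(corpus).toarray()

# divide each document vector by its Euclidean (L2) length
normalized = raw / np.linalg.norm(raw, axis=1, keepdims=True)

print(normalized[0])  # matches the default TfidfVectorizer output for the first document,
                      # e.g. ~0.43877674 for the token 'document'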