Tags: machine-learning, scikit-learn, nlp, tf-idf, tfidfvectorizer

TF-IDF value is not matching the output of TfidfVectorizer


I am learning NLP and wanted to understand the TF-IDF model using the sklearn library and its TfidfVectorizer class. I have pasted the sample code below.


import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
# get_feature_names_out() replaces get_feature_names(), removed in scikit-learn 1.2
pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())

The feature names from vectorizer.get_feature_names_out():

['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']

And the tf-idf values are:

array([[0.        , 0.43877674, 0.54197657, 0.43877674, 0.        ,
        0.        , 0.35872874, 0.        , 0.43877674],
       [0.        , 0.27230147, 0.        , 0.27230147, 0.        ,
        0.85322574, 0.22262429, 0.        , 0.27230147],
       [0.55280532, 0.        , 0.        , 0.        , 0.55280532,
        0.        , 0.28847675, 0.55280532, 0.        ],
       [0.        , 0.43877674, 0.54197657, 0.43877674, 0.        ,
        0.        , 0.35872874, 0.        , 0.43877674]])

I was interested in calculating the tf-idf value of the term "document" for the above mentioned corpus, which comes out to be 0.43877674 for the first document.

I tried the formula below, with both base 10 and base e (natural logarithm), since smooth_idf=True by default, following the documentation at https://scikit-learn.org/stable/modules/feature_extraction.html#tfidf-term-weighting:

Using the TfidfTransformer’s default settings, TfidfTransformer(norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False), the term frequency (the number of times a term occurs in a given document) is multiplied with the idf component, which is computed as

idf(t) = ln((1 + n) / (1 + df(t))) + 1

where n is the total number of documents in the document set, and df(t) is the number of documents in the document set that contain term t.
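As a quick sanity check, the smoothed idf for the term "document" can be worked out by hand: n = 4 documents, and 3 of them contain the term. This is my own sketch of the arithmetic, not sklearn output; note that the raw tf-idf product is about 1.223 rather than 0.43877674, and that gap is exactly what the question is about:

```python
import math

n = 4   # total number of documents in the corpus
df = 3  # documents containing the term 'document'
tf = 1  # occurrences of 'document' in the first document

# smoothed idf, as in the sklearn documentation: ln((1 + n) / (1 + df)) + 1
idf = math.log((1 + n) / (1 + df)) + 1

raw_tfidf = tf * idf
print(raw_tfidf)  # ≈ 1.22314355, before any normalization
```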


According to the program's output it should be 0.43877674, but my manual calculation does not match this value.


Solution

  • Your calculation is correct; you are just missing the normalization. With the default parameters, each document vector is normalized so that its Euclidean (L2) length equals 1. You can disable the normalization with the parameter norm=None:

    from sklearn.feature_extraction.text import TfidfVectorizer

    corpus = [
        'This is the first document.',
        'This is the second second document.',
        'And the third one.',
        'Is this the first document?',
    ]

    vectorizer = TfidfVectorizer(norm=None)
    X = vectorizer.fit_transform(corpus)
    X.toarray()

    results in:

    array([[0.        , 1.22314355, 1.51082562, 1.22314355, 0.        ,
            0.        , 1.        , 0.        , 1.22314355],
           [0.        , 1.22314355, 0.        , 1.22314355, 0.        ,
            3.83258146, 1.        , 0.        , 1.22314355],
           [1.91629073, 0.        , 0.        , 0.        , 1.91629073,
            0.        , 1.        , 1.91629073, 0.        ],
           [0.        , 1.22314355, 1.51082562, 1.22314355, 0.        ,
            0.        , 1.        , 0.        , 1.22314355]])
    

    Exactly the tf-idf value you calculated for the token 'document' in the first document.
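To close the loop, here is a minimal sketch of the L2 normalization that sklearn applies by default. Dividing the first document's raw (norm=None) vector by its Euclidean length recovers the value from the original output; the raw values are copied from the array above:

```python
import math

# raw tf-idf values for the first document (norm=None output above)
raw = [0.0, 1.22314355, 1.51082562, 1.22314355, 0.0,
       0.0, 1.0, 0.0, 1.22314355]

# Euclidean (L2) length of the vector
length = math.sqrt(sum(v * v for v in raw))

# dividing each component by the length gives the default TfidfVectorizer output
normalized = [v / length for v in raw]
print(normalized[1])  # 'document' -> ≈ 0.43877674
```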