Tags: python, scikit-learn, tfidfvectorizer

How does TfidfVectorizer calculate the TF-IDF number for each word?


As I understand TF-IDF, the IDF value of the word "art" should be log_e(3/1) + 1 ≈ 2.098612, because there are 3 documents in the data set and the word "art" appears in one of them. But when I print vectorizer.idf_, the IDF value of "art" is ~1.693147 instead. Why do my calculation and the vectorizer's result differ?

And one more question: I calculated the TF value for the word "art" in document 1 as 0.125, because it appears once among a total of 8 words. Is that the same calculation TfidfVectorizer does?

This is my data and code:

from sklearn.feature_extraction.text import TfidfVectorizer

data = ['Souvenir shop|Architecture and art|Culture and history',
        'Souvenir shop|Resort|Diverse cuisine|Fishing|Shop games|Beautiful scenery',
        'Diverse cuisine|Resort|Beautiful scenery']

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(data)

Solution

  • As I understand TF-IDF, the IDF value of the word "art" should be log_e(3/1) + 1 ≈ 2.098612, because there are 3 documents in the data set and the word "art" appears in one of them. But when I print vectorizer.idf_, the IDF value of "art" is ~1.693147 instead. Why do my calculation and the vectorizer's result differ?

    By default, TfidfVectorizer has a parameter smooth_idf set to True. Its effect is to add one to both the numerator and the denominator of the fraction inside the logarithm, as if an extra document containing every term in the vocabulary exactly once had been added to the corpus. If you pass smooth_idf=False, you get the value you expected.

    Here is the formula with smooth_idf turned on:

    idf("art") = ln((3 + 1)/(1 + 1)) + 1 ≈ 1.6931

    and with smooth_idf turned off:

    idf("art") = ln(3/1) + 1 ≈ 2.0986


    Here is the part of the code responsible for this calculation.

    # perform idf smoothing if required
    df += int(self.smooth_idf)
    n_samples += int(self.smooth_idf)
    
    # log+1 instead of log makes sure terms with zero idf don't get suppressed entirely.
    idf = np.log(n_samples / df) + 1
    

    (Source: TfidfTransformer.fit in sklearn/feature_extraction/text.py.)
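
    A quick check of both settings on the question's data (the variable names below are mine):

    from sklearn.feature_extraction.text import TfidfVectorizer

    data = ['Souvenir shop|Architecture and art|Culture and history',
            'Souvenir shop|Resort|Diverse cuisine|Fishing|Shop games|Beautiful scenery',
            'Diverse cuisine|Resort|Beautiful scenery']

    # Default smooth_idf=True: idf = ln((3 + 1)/(1 + 1)) + 1
    smooth = TfidfVectorizer()
    smooth.fit(data)
    print(smooth.idf_[smooth.vocabulary_['art']])            # 1.6931471805599454

    # smooth_idf=False: idf = ln(3/1) + 1, the value you expected
    unsmoothed = TfidfVectorizer(smooth_idf=False)
    unsmoothed.fit(data)
    print(unsmoothed.idf_[unsmoothed.vocabulary_['art']])    # 2.09861228866811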

    See also the scikit-learn documentation for TfidfVectorizer.

  • And one more question: I calculated the TF value for the word "art" in document 1 as 0.125, because it appears once among a total of 8 words. Is that the same calculation TfidfVectorizer does?

    No. In TfidfVectorizer, the TF is just the raw count of the term in the document; it is not divided by the document length at this step. There is a normalization step, but it is applied to the whole row after multiplying TF by IDF (by default an L2 normalization, since norm='l2').
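
    A sketch that reproduces the vectorizer's first row by hand may make the order of operations clearer; it assumes the defaults smooth_idf=True and norm='l2':

    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

    data = ['Souvenir shop|Architecture and art|Culture and history',
            'Souvenir shop|Resort|Diverse cuisine|Fishing|Shop games|Beautiful scenery',
            'Diverse cuisine|Resort|Beautiful scenery']

    tfidf = TfidfVectorizer()                  # defaults: smooth_idf=True, norm='l2'
    X = tfidf.fit_transform(data).toarray()

    # TF is the raw term count -- the same counts CountVectorizer produces
    # (both vectorizers build the same alphabetically sorted vocabulary).
    counts = CountVectorizer().fit_transform(data).toarray()

    row0 = counts[0] * tfidf.idf_              # step 1: tf * idf
    row0 = row0 / np.linalg.norm(row0)         # step 2: L2-normalize the whole row

    print(np.allclose(row0, X[0]))             # True

    So before normalization, "art" contributes a raw count of 1 times its IDF of ~1.6931, not 0.125 times the IDF.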