python scikit-learn tfidfvectorizer

Export features to Excel after fit_transform of the TfidfVectorizer


Python Version: 3.7

Hi everyone:

I am using the TfidfVectorizer from the scikit-learn library as follows:

vec_body = TfidfVectorizer(**vectorizer_parameters)
X_train_features = vec_body.fit_transform(X_train)

X_train contains the email bodies. If I understand correctly, X_train_features is a sparse matrix. My goal is to create an Excel report to check which features (words) were identified for each email after the transformation, laid out like this:

email_body                 email_features
this is an example         example
this is another example    another example
...                        ...

The column "email_body" should hold the body of each email in X_train. The column "email_features" should hold a string with all the features (words) that remain for that particular email after fit_transform. In the vectorizer I removed all stop words and also applied lemmatization, which is why I want to export the result to Excel and check which words were actually used in the transformation. I do not know how to do that when the result is a sparse matrix.

Please forgive me if I explained something incorrectly; I am very new to this library. Thanks so much in advance for any advice or solution.


Solution

  • You can do this by combining tfidf.get_feature_names_out() (available from scikit-learn 1.0; older releases use get_feature_names()) with np.flatnonzero, which finds the indices of the nonzero entries in each row of the transformed matrix. Like this:

    X = tfidf.fit_transform(documents)
    words = tfidf.get_feature_names_out()
    features = [" ".join(words[np.flatnonzero(row)]) for row in X.todense()]
    

    Here's a simple example based on the examples in the question:

    import pandas as pd
    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    
    documents = [
        "this is an example",
        "this is another great example",
    ]
    
    tfidf = TfidfVectorizer(stop_words="english")
    
    X = tfidf.fit_transform(documents)
    words = tfidf.get_feature_names_out()
    features = [" ".join(words[np.flatnonzero(row)]) for row in X.todense()]
    
    print(pd.DataFrame({"email_body": documents, "email_features": features}))
    

    This produces the output below. To get an Excel file, call to_excel on the resulting DataFrame.

                          email_body    email_features
    0             this is an example           example
    1  this is another great example     example great
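
One caveat with the approach above is that X.todense() builds the full dense matrix, which can use a lot of memory for a large set of emails. The following is a minimal sketch of an alternative that keeps the matrix sparse, assuming the vectorizer's inverse_transform method (which returns, for each document, the terms with nonzero entries) fits your use case. The helper name build_feature_report is hypothetical and used only for illustration; the terms are sorted here only to make the output order predictable.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer


def build_feature_report(documents, vectorizer):
    """Hypothetical helper: pair each document with the terms kept for it."""
    # The result of fit_transform stays sparse; no dense copy is made.
    X = vectorizer.fit_transform(documents)
    # inverse_transform returns one array of terms per document,
    # containing only the terms with a nonzero weight in that row.
    terms_per_doc = vectorizer.inverse_transform(X)
    # Sort so the term order within each row is deterministic.
    features = [" ".join(sorted(terms)) for terms in terms_per_doc]
    return pd.DataFrame(
        {"email_body": list(documents), "email_features": features}
    )


documents = [
    "this is an example",
    "this is another great example",
]
report = build_feature_report(documents, TfidfVectorizer(stop_words="english"))
print(report)
```

From there, report.to_excel("report.xlsx", index=False) should write the file, provided an Excel engine such as openpyxl is installed; the file name is just a placeholder.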