Python Version: 3.7
Hi everyone:
I am using the TfidfVectorizer from the scikit-learn library as follows:
vec_body = TfidfVectorizer(**vectorizer_parameters)
X_train_features = vec_body.fit_transform(X_train)
X_train contains the email bodies. If I understood correctly, X_train_features is a sparse matrix. My objective is to create an Excel report to validate which features (words) were identified for each email after the transformation, using a table like the following:
email_body | email_features |
---|---|
this is an example | example |
this is another example | another example |
... | ... |
The column "email_body" should contain the body of each email in X_train. The column "email_features" should contain a string with all the features (words) that remain for that particular email after the transformation (fit_transform). In the vectorizer, I removed all stop words and applied lemmatization too. That is why I want to export the result to Excel: to validate which words were actually used in the transformation. I do not know how to achieve that when my result is a sparse matrix.
Please forgive me if I explained something incorrectly; I am very new to this library. Thanks so much in advance for any advice or solution.
You could do something like that with tfidf.get_feature_names_out() and np.flatnonzero to find the nonzero indices for each row of the transformed matrix. Like this:
X = tfidf.fit_transform(documents)
words = tfidf.get_feature_names_out()
features = [" ".join(words[np.flatnonzero(row)]) for row in X.todense()]
Here's a simple example based on the examples in the question:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
documents = [
"this is an example",
"this is another great example",
]
tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(documents)
words = tfidf.get_feature_names_out()
features = [" ".join(words[np.flatnonzero(row)]) for row in X.todense()]
print(pd.DataFrame({"email_body": documents, "email_features": features}))
This produces the output below. You should be able to call to_excel on the DataFrame if you want an Excel file.
                      email_body email_features
0             this is an example        example
1  this is another great example  example great
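One caveat with the approach above is that X.todense() builds the full dense matrix, which may use a lot of memory for a large set of emails. As a rough sketch of an alternative, the vectorizer's inverse_transform method returns, for each document, the terms that have a nonzero entry, so the matrix can stay sparse. The file name in the commented-out to_excel call is only a placeholder, and it assumes an Excel writer such as openpyxl is installed:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "this is an example",
    "this is another great example",
]

tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(documents)  # X stays a SciPy sparse matrix

# inverse_transform gives one array of terms per document, containing
# only the terms with a nonzero weight in that document's row.
terms_per_doc = tfidf.inverse_transform(X)
features = [" ".join(terms) for terms in terms_per_doc]

report = pd.DataFrame({"email_body": documents, "email_features": features})
print(report)

# Placeholder file name (an assumption); requires an Excel writer
# such as openpyxl to be installed.
# report.to_excel("tfidf_report.xlsx", index=False)
```

With your own data you would replace documents with X_train and tfidf with vec_body; the rest should carry over unchanged.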