[SOLVED] Text Classification with scikit-learn: how to get a new document's representation from a pickle model

I have a binary document classifier that builds a tf-idf representation of a training set of documents and applies logistic regression to it:

lr_tfidf = Pipeline([('vect', tfidf),('clf', LogisticRegression(random_state=0))])

lr_tfidf.fit(X_train, y_train)

I saved the model with pickle and use it to classify new documents:

# Load the pickled pipeline; the with-block closes the file afterwards
with open('text_model.pkl', 'rb') as f:
    text_model = pickle.load(f)
results = text_model.predict_proba(new_document)

How can I get the representation (features + frequencies) used by the model for this new document without explicitly computing it?

EDIT: Let me explain better what I want to get. When I use predict_proba, I guess that the new document is represented as a vector of term frequencies (according to the rules stored in the model), and that those frequencies are multiplied by the coefficients learnt by the logistic regression model to predict the class. Am I right? If so, how can I get the terms and term frequencies of this new document, as used by predict_proba?

I am using scikit-learn v0.19.

Solution

As I understand from the comments, you need to access the TfidfVectorizer inside the pipeline. You can retrieve it by its step name:

tfidfVect = text_model.named_steps['vect']

Now you can use the vectorizer's transform() method to get the tf-idf values. It expects an iterable of documents, the same input you pass to predict_proba:

tfidf_vals = tfidfVect.transform(new_document)

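tfidf_vals will be a sparse matrix with one row per document (a single row here) holding the tf-idf weights of the terms found in new_document. Terms that were not seen during training are ignored. To check which term each column corresponds to, use tfidfVect.get_feature_names(). In scikit-learn 1.0 and later, this method is called get_feature_names_out().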
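
Putting the pieces together, here is a minimal sketch of how the non-zero entries of that row could be paired with their terms. The toy training texts, labels, and the names term_weights and manual_proba are made up for illustration; the pipeline is fitted in place rather than loaded from your pickle so the snippet runs on its own. The last lines also check the guess from the question for the binary case: the tf-idf row multiplied by the learnt coefficients, plus the intercept, passed through a sigmoid, should reproduce predict_proba.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy data (hypothetical), standing in for X_train / y_train.
X_train = [
    "the movie was great and fun",
    "a great and touching film",
    "the plot was dull and boring",
    "boring film with a dull plot",
]
y_train = [1, 1, 0, 0]

# Fitted here instead of unpickled, only to keep the example self-contained.
text_model = Pipeline([
    ("vect", TfidfVectorizer()),
    ("clf", LogisticRegression(random_state=0)),
])
text_model.fit(X_train, y_train)

# A single document still goes in a list: transform() expects an iterable.
new_document = ["a great film with a boring plot and robots"]

vect = text_model.named_steps["vect"]
clf = text_model.named_steps["clf"]
tfidf_vals = vect.transform(new_document)  # sparse, shape (1, n_features)

# get_feature_names_out() exists from scikit-learn 1.0 onwards;
# older releases such as 0.19 only provide get_feature_names().
if hasattr(vect, "get_feature_names_out"):
    feature_names = vect.get_feature_names_out()
else:
    feature_names = vect.get_feature_names()

# Map each non-zero column of the single row to its term.
row = tfidf_vals[0]
term_weights = {feature_names[j]: w for j, w in zip(row.indices, row.data)}
for term, weight in sorted(term_weights.items(), key=lambda kv: -kv[1]):
    print(term, round(weight, 3))

# Binary case only: sigmoid(x . coef + intercept) vs. predict_proba.
z = tfidf_vals.dot(clf.coef_.T) + clf.intercept_
manual_proba = 1.0 / (1.0 + np.exp(-z[0, 0]))
pipeline_proba = text_model.predict_proba(new_document)[0, 1]
print(manual_proba, pipeline_proba)
```

Note that the values are tf-idf weights rather than raw term counts, and that "robots" does not appear in the output because it is not in the fitted vocabulary.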