python machine-learning email-spam

The number of features of the model must match the input. Model n_features is 7985 and input n_features is 1


I built a spam classifier with a random forest and wanted a separate function that classifies a single text message as spam or ham. I tried:

def predict_message(pred_text):
    pred_text=[pred_text]
    pred_text2 = tfidf_vect.fit_transform(pred_text)
    pred_features = pd.DataFrame(pred_text2.toarray())
    prediction = rf_model.predict(pred_features)
    return (prediction)

pred_text = "how are you doing today?"

prediction = predict_message(pred_text)
print(prediction)

but it gives me the error:

The number of features of the model must match the input.
Model n_features is 7985 and input n_features is 1 

I can't see the problem. How can I make it work?


Solution

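  • Calling tfidf_vect.fit_transform(pred_text) refits the vectorizer on the single new message, so it discards the vocabulary learned from your original training corpus and builds a new one from that message alone. That is why the input has 1 feature while the model expects 7985.

    At prediction time, call transform only, which reuses the already fitted vocabulary. The intermediate DataFrame is not needed either, since predict accepts the sparse matrix directly. If you have already run the old function, tfidf_vect has been overwritten, so refit it on the training corpus before predicting.

    The changes below should help: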

    def predict_message(pred_text):
        pred_text = [pred_text]  # the vectorizer expects an iterable of documents
        pred_text2 = tfidf_vect.transform(pred_text)  # Changed: transform, not fit_transform
        prediction = rf_model.predict(pred_text2)
        return prediction
    
    pred_text = "how are you doing today?"
    
    prediction = predict_message(pred_text)
    print(prediction)
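
For a self-contained illustration of the fit/transform difference, here is a minimal sketch. The toy corpus, labels, and classifier settings are made up for the example and stand in for the asker's real training data; only the names tfidf_vect, rf_model, and predict_message come from the question.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus standing in for the real training data.
train_texts = [
    "win a free prize now",
    "claim your free cash reward",
    "are we still meeting for lunch",
    "how are you doing today",
]
train_labels = ["spam", "spam", "ham", "ham"]

# Fit the vectorizer once, on the training corpus only.
tfidf_vect = TfidfVectorizer()
X_train = tfidf_vect.fit_transform(train_texts)

rf_model = RandomForestClassifier(n_estimators=10, random_state=0)
rf_model.fit(X_train, train_labels)


def predict_message(pred_text):
    # transform() reuses the fitted vocabulary, so the number of columns
    # matches what the model was trained on.
    features = tfidf_vect.transform([pred_text])
    return rf_model.predict(features)


message = "how are you doing today?"

# Same vectorizer, transform only: column count matches the training matrix.
new_features = tfidf_vect.transform([message])

# A separate vectorizer fitted on the message alone (used here so the
# trained one is not overwritten) yields a different column count.
refit_features = TfidfVectorizer().fit_transform([message])

print("trained on:", X_train.shape[1], "features")
print("transform:", new_features.shape[1], "features")
print("refit on one message:", refit_features.shape[1], "features")
print(predict_message(message))
```

Bundling the vectorizer and the classifier in a sklearn.pipeline.Pipeline is another way to avoid this mistake, because the pipeline's predict only ever calls transform on the text step.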