Tags: python, machine-learning, text-classification, naivebayes, machine-learning-model

Machine learning model predicts training labels themselves as result


I am trying to build a model that predicts "species" from the features "message", "fingers", and "tail" (see the first few rows of data.csv below):

message                                   fingers  tail  species
pluvia arbor aquos                        4        no    Aquari
cosmix xeno nebuz odbitaz                 5        yes   Zorblax
solarix glixx novum galaxum quasar        5        yes   Zorblax
arbor insectus pesros ekos dootix nimbus  2        yes   Florian

My code is:

import warnings
warnings.simplefilter("ignore")
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

df = pd.read_csv("data.csv")
X = np.asarray(df[["message", "fingers", "tail"]])
X = [str (item) for item in X]
y = df["species"]

le = LabelEncoder()
y = le.fit_transform(y)

cv = CountVectorizer()
X = cv.fit_transform(X).toarray()

model = MultinomialNB()
model.fit(X, y)

test_data = pd.read_csv('test.csv')
test_data_array = np.asarray(df[["message", "fingers", "tail"]])
test_data_array = [str (item) for item in test_data_array]
test_data_array = cv.fit_transform(test_data_array).toarray()

y_prediction = model.predict(test_data_array)
y_prediction = le.inverse_transform(y_prediction)

print(y_prediction)

I followed this tutorial for the same.

The problem is that when I run it, the output is just the species column of the original training data, almost word for word. There are 493 results, even though the test data has 299 entries and the training data has 500. It doesn't actually predict anything for the test data. I don't understand why the code doesn't work. Could someone help?


Solution

  • The problem is that you read the test data into test_data, but then build the test set from the original DataFrame, df, which contains the training data.

    Change this line:

    test_data_array = np.asarray(df[["message", "fingers", "tail"]])
    

    To:

    test_data_array = np.asarray(test_data[["message", "fingers", "tail"]])
    

    You should then get the correct number of predictions (one per row of test.csv).

    Remember to also compare y_prediction to test_data['species'].
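
One more line in the question's code is likely to need attention once the test rows really come from test_data: cv.fit_transform(test_data_array) refits the vectorizer on the test set, so the columns no longer correspond to the vocabulary the model was trained on. Using cv.transform there keeps the training vocabulary. The sketch below shows the corrected flow end to end. The two small DataFrames are made-up stand-ins for data.csv and test.csv, rows_to_text is a hypothetical helper (not from the question), and test.csv is assumed to contain a species column, as the comparison above implies.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import LabelEncoder

FEATURES = ["message", "fingers", "tail"]


def rows_to_text(frame):
    """Hypothetical helper: join each row's feature columns into one string."""
    return frame[FEATURES].astype(str).apply(" ".join, axis=1).tolist()


# Made-up stand-ins for pd.read_csv("data.csv") and pd.read_csv("test.csv").
df = pd.DataFrame({
    "message": ["pluvia arbor aquos",
                "cosmix xeno nebuz odbitaz",
                "solarix glixx novum galaxum quasar",
                "arbor insectus pesros ekos dootix nimbus"],
    "fingers": [4, 5, 5, 2],
    "tail": ["no", "yes", "yes", "yes"],
    "species": ["Aquari", "Zorblax", "Zorblax", "Florian"],
})
test_data = pd.DataFrame({
    "message": ["pluvia aquos", "cosmix quasar galaxum"],
    "fingers": [4, 5],
    "tail": ["no", "yes"],
    "species": ["Aquari", "Zorblax"],  # assumed to exist in test.csv
})

le = LabelEncoder()
y = le.fit_transform(df["species"])

cv = CountVectorizer()
X = cv.fit_transform(rows_to_text(df))  # learn the vocabulary from training rows only

model = MultinomialNB()
model.fit(X, y)

# Build the test matrix from test_data (not df), and reuse the fitted
# vocabulary with transform() instead of fit_transform().
X_test = cv.transform(rows_to_text(test_data))

y_prediction = le.inverse_transform(model.predict(X_test))

# Evaluate against the test labels, not the training labels.
accuracy = accuracy_score(test_data["species"], y_prediction)

print(len(y_prediction), "predictions for", len(test_data), "test rows")
print("accuracy:", accuracy)
```

Because X_test is produced by transform, it has the same number of columns as X, which is what model.predict expects. Words that appear only in the test data are simply ignored.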